Quickstart

Five minutes from zero to running.

The fast track. Install one NuGet, load one model, ship three working examples on your machine: a chat session, a function-calling agent, and a RAG pipeline. No telemetry, no API key, no signup. Copy-paste each block in order; total wall-clock time is dominated by the first model download.

1 NuGet · 3 working examples · 0 cloud calls
Prerequisites

What you need before you start.

.NET 8 or newer

.NET 8 SDK, or 9 / 10 if you have them. .NET Standard 2.0 is also supported. dotnet --version should report 8.0 or higher.

~5 GB free disk

For the first model download (the Qwen 4B chat model used in this guide). Smaller models exist; larger models exist; this is a comfortable middle.

Any modern CPU

A GPU helps but is not required. The model in this guide runs on commodity laptops. The CUDA, Vulkan, Metal, and AVX2 backends are selected automatically at load time.

Step 1

Create a project, add the NuGet.

A new console project plus one package reference. The package is self-contained: native runtimes for every supported backend ship inside.

terminal
# Create a new console project.
dotnet new console -n MyFirstLMKit
cd MyFirstLMKit

# Add the LM-Kit package.
dotnet add package LM-Kit.NET

# Optional: target .NET 8+ explicitly in the .csproj if you need to.
# <TargetFramework>net8.0</TargetFramework>

That is the only required install step. The default package already ships CPU (AVX / AVX2), Vulkan, and Metal runtimes. Auto-detect picks the fastest one when you load a model. If you have an NVIDIA GPU, the optional step below unlocks the CUDA path.

Optional, NVIDIA GPUs

Plug in the CUDA 13 backend.

If your machine has an NVIDIA GPU, add the CUDA 13 backend package. Inference moves to the GPU automatically on the next model load. No code change. Skip this step on CPU-only machines, Apple Silicon (Metal handles it), or AMD / Intel GPUs (Vulkan handles it).

The CUDA dependency package is pulled in transitively, no need to add it separately.

terminal, Windows
# Add the CUDA 13 runtime for Windows.
dotnet add package LM-Kit.NET.Backend.Cuda13.Windows

When to use it

Any modern NVIDIA GPU with a recent driver. Significantly faster than CPU on 4B+ models, and unlocks larger models that will not fit comfortably in RAM.

Driver requirement

A recent NVIDIA driver compatible with CUDA 13 is required. Run nvidia-smi to confirm the GPU is visible. The CUDA runtime libraries themselves ship in the NuGet; you do not need to install the full CUDA toolkit.

Verify it loaded

After running step 2, check model.Runtime.BackendName. It should report CUDA. If it falls back to Vulkan or CPU, the driver or GPU was not detected.

Older GPUs? An LM-Kit.NET.Backend.Cuda12.* variant is also published for hardware that does not support CUDA 13. macOS users skip this section entirely: the Metal backend in the base package handles Apple Silicon natively. See the local inference page for the full backend matrix.

Step 2

Hello, model.

Replace Program.cs with the snippet below. LM.LoadFromModelID handles the download and cache automatically: the first run downloads the model (a few minutes); subsequent runs are instant because the catalogue caches it.

Program.cs
using LMKit.Model;
using LMKit.TextGeneration;

// Load a 4B-parameter chat model from the catalogue.
// First run downloads ~3 GB; subsequent runs read the local cache.
var model = LM.LoadFromModelID("qwen3.5:4b");

// Single-turn conversation: prompt in, completion out.
var chat = new SingleTurnConversation(model);

string reply = await chat.SubmitAsync("Two-line bio of Ada Lovelace.");
Console.WriteLine(reply);

That is local inference. No API key. No outbound call. The bytes that produced that reply live on your disk and ran on your CPU or GPU. Inspect model.Runtime.BackendName if you want to see which backend the runtime selected.
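
For example, a one-liner (the property path is the one named above; the exact string it reports depends on your hardware and installed backend packages):

// The backend chosen at load time, e.g. CUDA, Vulkan, Metal, or AVX2.
Console.WriteLine($"Backend: {model.Runtime.BackendName}");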

Step 3

Streaming multi-turn chat.

Real chat needs history and streaming. MultiTurnConversation handles both: history is preserved across turns; the KV-cache survives between calls; tokens stream as they generate.

StreamingChat.cs
using LMKit.Model;
using LMKit.TextGeneration;

var model = LM.LoadFromModelID("qwen3.5:4b");
var chat  = new MultiTurnConversation(model)
{
    SystemMessage = "You are a concise expert in .NET and AI."
};

// Stream the response token-by-token.
await foreach (var token in chat.StreamAsync("What is a KV-cache?"))
{
    Console.Write(token.Text);
}
Console.WriteLine();

// Follow-up. The model has the previous turn in context.
var followUp = await chat.SubmitAsync("How does it relate to context length?");
Console.WriteLine(followUp);

Want to pause the chat and free GPU memory? Cast it to IKVCache and call HibernateAsync. The next SubmitAsync call rehydrates transparently. See the context hibernation page.
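
A rough sketch of that flow, assuming the cast-and-call shape described here; the context hibernation page has the actual contract:

// Sketch: pause the session and release GPU memory, as described above.
if (chat is IKVCache cache)
{
    await cache.HibernateAsync();
}

// The next SubmitAsync is described as rehydrating the context transparently,
// so the conversation continues where it left off.
var resumed = await chat.SubmitAsync("And how large can a KV-cache get?");
Console.WriteLine(resumed);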

Step 4

An agent with tools.

A simple agent that can call your C# methods. Annotate them with [LMFunction]; the agent picks the right one and runs it. No glue code required.

FirstAgent.cs
using LMKit.Model;
using LMKit.Agents;
using LMKit.Agents.Tools;

// 1. Build an agent that has access to the tools defined below.
var model = LM.LoadFromModelID("qwen3.5:4b");
var agent = Agent.CreateBuilder(model)
    .WithTools(t => t.AddFromInstance(new DateTimeTools()))
    .Build();

// 2. Ask. The agent picks tools, calls them, and integrates results.
var result = await agent.RunAsync(
    "What is today's date, and what date will it be in 100 days?");

Console.WriteLine(result.Content);

// 3. Annotate the methods you want the agent to call. In C#, type
// declarations must come after top-level statements, so the class
// sits at the bottom of the file.
public class DateTimeTools
{
    [LMFunction(Description = "Returns the current date and time in ISO format.")]
    public string Now() => DateTime.UtcNow.ToString("o");

    [LMFunction(Description = "Adds the given number of days to today and returns the result.")]
    public string AddDays(int days) => DateTime.UtcNow.AddDays(days).ToString("yyyy-MM-dd");
}

That is a working agent. Add more tools, swap in built-in tools (HTTP, file system, web search, PDF, OCR), or compose multiple agents into a workflow. The full picture is on the AI agents page.

Step 5

RAG over a PDF.

Index a PDF, ask a question, get a grounded answer with citations. The same primitives used by every document workflow in LM-Kit.NET.

FirstRag.cs
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;
using LMKit.TextGeneration;

// 1. Models. Chat for answers, embeddings for retrieval.
var chatModel  = LM.LoadFromModelID("qwen3.5:4b");
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");

// 2. RAG engine. File-based vector store, persists across runs.
var store = new FileSystemVectorStore("./embeddings");
var rag   = new DocumentRag(embedModel, store)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize   = 512
};

// 3. Import a PDF. Ingestion handles text-layer pages and scans automatically.
await rag.ImportDocumentAsync(
    Attachment.FromFile("sample.pdf"),
    new("sample") { Name = "Sample document" });

// 4. Query. The response includes the source pages it cited.
var chat   = new SingleTurnConversation(chatModel);
var result = await rag.QueryPartitionsAsync(
    "Summarise the document in three sentences.",
    chat);

Console.WriteLine(result.Response.Completion);
foreach (var r in result.SourceReferences)
{
    Console.WriteLine($"  cited: {r.Name}, page {r.PageNumber}");
}

That is local RAG. PDF goes in, structured passages get extracted, embedded, indexed, retrieved, and used to ground the answer. The Document RAG page covers the full surface: query strategies, vision-grounded retrieval, lifecycle by ID, custom vector stores.
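
Because the vector store above is file-based, later runs can skip the import. A sketch, assuming the engine picks up the embeddings already written to ./embeddings (chatModel and embedModel loaded as in FirstRag.cs):

// Sketch: second run against the same on-disk store, no re-import.
var store = new FileSystemVectorStore("./embeddings");
var rag   = new DocumentRag(embedModel, store);

var followUp = await rag.QueryPartitionsAsync(
    "Which section defines the key terms?",
    new SingleTurnConversation(chatModel));

Console.WriteLine(followUp.Response.Completion);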

Common gotchas

Six things people hit on day one.

First load is slow

The first LoadFromModelID downloads the model. Subsequent loads are instant. To control where the cache lives, set Configuration.ModelStorageDirectory before any load call.
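
For example (a sketch; the property name is the one quoted above, check the API reference for its containing class and namespace):

// Sketch: redirect the model cache before the first load. The path is illustrative.
Configuration.ModelStorageDirectory = @"D:\lmkit\models";

var model = LM.LoadFromModelID("qwen3.5:4b");   // downloads into the directory above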

Pick a smaller model on small machines

If 4B is tight on your hardware, swap to qwen3:1.7b or gemma3:1b. Drop-in replacement; the rest of the code does not change.
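
The swap is one line; the catalogue IDs are the ones named above:

// Drop-in replacement: everything downstream of the load is unchanged.
var model = LM.LoadFromModelID("qwen3:1.7b");   // or "gemma3:1b"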

Backend not what you expected

Inspect model.Runtime.BackendName. To pin a specific backend (CUDA / Vulkan / Metal / AVX2), set Configuration.PreferredBackend before loading.
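
A sketch of the pinning pattern; the property name is quoted above, but the exact enum type and member names are assumptions, so check the API reference:

// Sketch: pin the backend before loading. Backend.Cuda is a hypothetical
// member name; the real enum lives in the LM-Kit API reference.
Configuration.PreferredBackend = Backend.Cuda;

var model = LM.LoadFromModelID("qwen3.5:4b");
Console.WriteLine(model.Runtime.BackendName);   // confirm the pin took effect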

Slow on first inference call

The first SubmitAsync warms the context. Subsequent calls reuse the KV-cache and are dramatically faster. Long sessions amortise the warmup.
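
One way to see the effect, timing two consecutive calls on the chat object from step 3 (plain Stopwatch, nothing LM-Kit-specific):

// First call pays the context warmup; the second reuses the KV-cache.
var sw = System.Diagnostics.Stopwatch.StartNew();
await chat.SubmitAsync("In one sentence, what is a KV-cache?");
Console.WriteLine($"First call:  {sw.Elapsed}");

sw.Restart();
await chat.SubmitAsync("And in one sentence, why does it speed up decoding?");
Console.WriteLine($"Second call: {sw.Elapsed}");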

Out of memory on a small GPU

Quantise the model (Q4 / Q5 variants in the catalogue), or use the multi-GPU and tensor-override path to put expensive weights on CPU. See the multi-GPU page.

Want it on iOS / Android / MAUI

Same NuGet, same code. Pre-bundle a small model and disable network model loading. See the edge deployment page.

You shipped something. What is next?

Get the Community Edition from the download page.