The fast track. Install one NuGet, load one model, ship three working examples on your machine: a chat session, a function-calling agent, and a RAG pipeline. No telemetry, no API key, no signup. Copy-paste each block in order; total wall-clock is dominated by the first model download.
.NET 8 or newer
.NET 8 SDK, or 9 / 10 if you have them; .NET Standard 2.0 is also supported. dotnet --version should report 8.0 or higher.
Roughly 3 GB of free disk space for the first model download (Qwen3 4B in this guide). Smaller models exist; larger models exist; this is a comfortable middle.
A GPU helps but is not required. The model in this guide runs on commodity laptops. CUDA, Vulkan, Metal, AVX2 backends auto-select.
A new console project plus one package reference. The package is self-contained: native runtimes for every supported backend ship inside.
# Create a new console project.
dotnet new console -n MyFirstLMKit
cd MyFirstLMKit

# Add the LM-Kit package.
dotnet add package LM-Kit.NET

# Optional: target .NET 8+ explicitly in the .csproj if you need to.
# <TargetFramework>net8.0</TargetFramework>
That is the only required install step. The default package already ships CPU (AVX / AVX2), Vulkan, and Metal runtimes. Auto-detect picks the fastest one when you load a model. If you have an NVIDIA GPU, the optional step below unlocks the CUDA path.
If your machine has an NVIDIA GPU, add the CUDA 13 backend package. Inference moves to the GPU automatically on the next model load. No code change. Skip this step on CPU-only machines, Apple Silicon (Metal handles it), or AMD / Intel GPUs (Vulkan handles it).
The CUDA dependency package is pulled in transitively; there is no need to add it separately.
# Add the CUDA 13 runtime for Windows. The CUDA dependency
# package is pulled in transitively, no need to add it.
dotnet add package LM-Kit.NET.Backend.Cuda13.Windows
A compatible NVIDIA driver must be installed at the OS level (the CUDA runtime libraries themselves ship inside the NuGet). Run nvidia-smi to confirm.
# Add the CUDA 13 runtime for Linux. The NVIDIA driver
# is installed at the OS level (nvidia-smi must work).
dotnet add package LM-Kit.NET.Backend.Cuda13.Linux
Any modern NVIDIA GPU with a recent driver. Significantly faster than CPU on 4B+ models, and unlocks larger models that will not fit comfortably in RAM.
A recent NVIDIA driver compatible with CUDA 13 is required. Run nvidia-smi to confirm the GPU is visible. The CUDA runtime libraries themselves ship in the NuGet; you do not need to install the full CUDA toolkit.
After running step 2, check model.Runtime.BackendName. It should report CUDA. If it falls back to Vulkan or CPU, the driver or GPU was not detected.
Older GPUs? An LM-Kit.NET.Backend.Cuda12.* variant is also published for hardware that does not support CUDA 13. macOS users skip this section entirely: the Metal backend in the base package handles Apple Silicon natively. See the local inference page for the full backend matrix.
Replace Program.cs with the snippet below. LM.LoadFromModelID handles the download and cache automatically: the first run downloads the model (a few minutes); subsequent runs are instant because the catalogue caches it.
using LMKit.Model;
using LMKit.TextGeneration;

// Load a 4B-parameter chat model from the catalogue.
// First run downloads ~3 GB; subsequent runs read the local cache.
var model = LM.LoadFromModelID("qwen3.5:4b");

// Single-turn conversation: prompt in, completion out.
var chat = new SingleTurnConversation(model);
string reply = await chat.SubmitAsync("Two-line bio of Ada Lovelace.");
Console.WriteLine(reply);
What you should see in the terminal on the first run. The model download line appears only once.
# First run only. Subsequent runs skip this section.
[catalog] downloading qwen3-4b-instruct.lmk ... 100% (3.2 GB)
[runtime] backend: CUDA 13 (or Vulkan / Metal / AVX2 depending on host)

# The reply (your exact wording will vary):
Ada Lovelace was a 19th-century English mathematician known for her work
on Charles Babbage's Analytical Engine, where she described the first
algorithm intended for a machine.
That is local inference. No API key. No outbound call. The bytes that
produced that reply live on your disk and ran on your CPU or GPU.
Inspect model.Runtime.BackendName if you want to see which
backend the runtime selected.
Real chat needs history and streaming. MultiTurnConversation
handles both: history is preserved across turns; the KV-cache
survives between calls; tokens stream as they generate.
using LMKit.Model;
using LMKit.TextGeneration;

var model = LM.LoadFromModelID("qwen3.5:4b");
var chat = new MultiTurnConversation(model)
{
    SystemMessage = "You are a concise expert in .NET and AI."
};

// Stream the response token-by-token.
await foreach (var token in chat.StreamAsync("What is a KV-cache?"))
{
    Console.Write(token.Text);
}
Console.WriteLine();

// Follow-up. The model has the previous turn in context.
var followUp = await chat.SubmitAsync("How does it relate to context length?");
Console.WriteLine(followUp);
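Why follow-up turns get cheaper: a dependency-free toy sketch of the KV-cache idea. This is illustrative code, not LM-Kit internals; every name in it is made up.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a KV-cache. Computing a token's key/value pair is the
// expensive part of attention; a cache keyed by position lets each
// turn pay only for the tokens appended since the last call.
var cache = new List<string>();   // one "KV entry" per token position
int computations = 0;

List<string> Encode(IReadOnlyList<string> tokens)
{
    // Only positions beyond what the cache already holds are computed.
    for (int i = cache.Count; i < tokens.Count; i++)
    {
        computations++;           // stand-in for the expensive attention math
        cache.Add($"kv({tokens[i]})");
    }
    return cache;
}

var turn1 = new List<string> { "What", "is", "a", "KV-cache", "?" };
Encode(turn1);
Console.WriteLine(computations);  // 5: every token computed once

turn1.AddRange(new[] { "How", "does", "it", "relate", "?" });
Encode(turn1);
Console.WriteLine(computations);  // 10: only the 5 new tokens computed
```

MultiTurnConversation keeps the real equivalent of that cache alive between calls, which is why only newly appended tokens cost anything on a follow-up.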
Want to pause the chat and free GPU memory? Cast it to
IKVCache and call HibernateAsync. The next
SubmitAsync call rehydrates transparently. See the
context hibernation page.
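A minimal sketch of that flow. Only IKVCache, HibernateAsync, and SubmitAsync are named in this guide; treat everything else here as an assumption rather than the library's exact surface:

```csharp
// Hedged sketch: persists the KV-cache and frees GPU memory.
await ((IKVCache)chat).HibernateAsync();

// ... later: the next call rehydrates the context transparently.
var reply = await chat.SubmitAsync("Pick up where we left off.");
Console.WriteLine(reply);
```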
A simple agent that can call your C# methods. Annotate them with
[LMFunction]; the agent picks the right one and runs it.
No glue code required.
using LMKit.Model;
using LMKit.Agents;
using LMKit.Agents.Tools;

// 1. Build an agent that has access to the tools defined below.
var model = LM.LoadFromModelID("qwen3.5:4b");
var agent = Agent.CreateBuilder(model)
    .WithTools(t => t.AddFromInstance(new DateTimeTools()))
    .Build();

// 2. Ask. The agent picks tools, calls them, and integrates results.
var result = await agent.RunAsync(
    "What is today's date, and what date will it be in 100 days?");
Console.WriteLine(result.Content);

// 3. Annotate the methods you want the agent to call. (Type declarations
//    must come after top-level statements, so the class sits at the bottom.)
public class DateTimeTools
{
    [LMFunction(Description = "Returns the current date and time in ISO format.")]
    public string Now() => DateTime.UtcNow.ToString("o");

    [LMFunction(Description = "Adds the given number of days to today and returns the result.")]
    public string AddDays(int days) => DateTime.UtcNow.AddDays(days).ToString("yyyy-MM-dd");
}
That is a working agent. Add more tools, swap in built-in tools (HTTP, file system, web search, PDF, OCR), or compose multiple agents into a workflow. The full picture is on the AI agents page.
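Under the hood, the dispatch step is conceptually simple: map the tool name the model emits to a registered callable. A dependency-free toy sketch (not LM-Kit's actual mechanism; names and dates are made up):

```csharp
using System;
using System.Collections.Generic;

// Toy model of tool dispatch: at its core, dispatch maps the tool name
// the model emits to a registered callable and invokes it with the
// parsed arguments.
var tools = new Dictionary<string, Func<object[], object>>
{
    ["Now"]     = _    => "2024-01-01T00:00:00Z",
    ["AddDays"] = args => $"2024-01-{1 + (int)args[0]:00}",
};

// Pretend the model emitted: { "tool": "AddDays", "args": [ 9 ] }
var result = (string)tools["AddDays"](new object[] { 9 });
Console.WriteLine(result);   // 2024-01-10
```

In the real package the registry is built for you from the [LMFunction] annotations; the lookup-and-invoke shape is the part this sketch shows.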
Index a PDF, ask a question, get a grounded answer with citations. The same primitives used by every document workflow in LM-Kit.NET.
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;
using LMKit.TextGeneration;

// 1. Models. Chat for answers, embeddings for retrieval.
var chatModel = LM.LoadFromModelID("qwen3.5:4b");
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");

// 2. RAG engine. File-based vector store, persists across runs.
var store = new FileSystemVectorStore("./embeddings");
var rag = new DocumentRag(embedModel, store)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize = 512
};

// 3. Import a PDF. Ingestion handles text-layer pages and scans automatically.
await rag.ImportDocumentAsync(
    Attachment.FromFile("sample.pdf"),
    new("sample") { Name = "Sample document" });

// 4. Query. The response includes the source pages it cited.
var chat = new SingleTurnConversation(chatModel);
var result = await rag.QueryPartitionsAsync(
    "Summarise the document in three sentences.",
    chat);
Console.WriteLine(result.Response.Completion);
foreach (var r in result.SourceReferences)
{
    Console.WriteLine($"  cited: {r.Name}, page {r.PageNumber}");
}
That is local RAG. PDF goes in, structured passages get extracted, embedded, indexed, retrieved, and used to ground the answer. The Document RAG page covers the full surface: query strategies, vision-grounded retrieval, lifecycle by ID, custom vector stores.
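The retrieval half of that pipeline reduces to nearest-neighbour search over embedding vectors. A toy sketch with hand-made vectors (illustrative only; a real store holds embeddings produced by the embedding model):

```csharp
using System;
using System.Linq;

// Toy model of the retrieval step inside a RAG query: score every stored
// chunk against the question vector by cosine similarity and keep the
// best match as grounding context.
double Cosine(double[] a, double[] b)
{
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    double na = Math.Sqrt(a.Sum(x => x * x));
    double nb = Math.Sqrt(b.Sum(x => x * x));
    return dot / (na * nb);
}

// Hand-made vectors standing in for real embeddings.
var chunks = new (string Text, double[] Vec)[]
{
    ("Invoices are due in 30 days.", new[] { 0.9, 0.1, 0.0 }),
    ("The warranty lasts two years.", new[] { 0.1, 0.9, 0.2 }),
};
double[] question = { 0.85, 0.15, 0.05 };   // "When are invoices due?"

var best = chunks.OrderByDescending(c => Cosine(question, c.Vec)).First();
Console.WriteLine(best.Text);   // the invoice chunk scores highest
```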
The first LoadFromModelID downloads the model. Subsequent loads are instant. To control where the cache lives, set Configuration.ModelStorageDirectory before any load call.
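For example (a sketch; the namespace that exposes Configuration is not shown in this guide, and the path is just an example):

```csharp
// Must run before the first LoadFromModelID call.
Configuration.ModelStorageDirectory = "/data/lmkit-models";  // example path
var model = LM.LoadFromModelID("qwen3.5:4b");
```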
If 4B is tight on your hardware, swap to qwen3:1.7b or gemma3:1b. Drop-in replacement; the rest of the code does not change.
Inspect model.Runtime.BackendName. To pin a specific backend (CUDA / Vulkan / Metal / AVX2), set Configuration.PreferredBackend before loading.
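A quick check, using only members named in this guide:

```csharp
var model = LM.LoadFromModelID("qwen3.5:4b");
Console.WriteLine(model.Runtime.BackendName);  // e.g. CUDA, Vulkan, Metal, or AVX2
```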
The first SubmitAsync warms the context. Subsequent calls reuse the KV-cache and are dramatically faster. Long sessions amortise the warmup.
Quantise the model (Q4 / Q5 variants in the catalogue), or use the multi-GPU and tensor-override path to put expensive weights on CPU. See the multi-GPU page.
Same NuGet, same code. Pre-bundle a small model and disable network model loading. See the edge deployment page.
Build agents
18 templates, multi-agent orchestration, graph workflows, MCP integration, observability. The full agent stack.
Explore agents
Documents
OCR, layout understanding, document-to-Markdown, classification, summarisation, structured extraction, RAG. End-to-end document AI.
Explore documents
Foundations
SingleTurn, MultiTurn, Stateless. The three classes every other capability sits on. Streaming, sampling, cancellation, hibernation.
Explore primitives
Runtime
Backends, model catalog, encrypted models, multi-GPU, context hibernation, sampling controls, quantization, fine-tuning, LoRA.
Explore runtime
Bridges
Microsoft.Extensions.AI bridge, Semantic Kernel bridge. Drop LM-Kit into existing pipelines without rewrites.
Explore integrations
Reference
Full API reference, how-to guides, glossary, samples overview, and changelog. The deep dive when the marketing pages run out.
Open documentation