LM-Kit.NET is a Local AI stack for .NET. Large language models are one part of it. The same SDK also runs vision-language models, embedding models, on-device speech-to-text, OCR engines, layout analysers, classifiers, rerankers, and adapters. Every model executes inside your process. No cloud roundtrip. No vendor between your data and your output.
LLMs & SLMs: Gemma, Qwen, Llama, Phi, GLM, GPT OSS. Open-weight, quantizable, swappable by ID.
Vision & OCR: VLMs (Qwen2-VL, Gemma3-VL, GLM-V), LMKit OCR, PaddleOCR-VL, layout engines.
Speech: a growing local STT stack with streaming, hallucination suppression, real-time translation, 100+ languages.
Embeddings & Classification: Qwen3-Embedding, EmbeddingGemma, BGE, plus LM-Kit-trained classifiers (sentiment, language ID, PII, NER).
The cloud was a convenience layer that turned into a dependency. Local inference reverses that. The benefits are not philosophical; they are measurable, contractual, and architectural.
01 · Latency
No DNS, no TLS handshake, no queue, no quota, no 3am scramble after a provider rate-limits you. Inference time is bounded by your hardware, period. First-token latency on a modern GPU is in the tens of milliseconds; even the slowest local path beats a healthy cloud round-trip.
02 · Security
Prompts, completions, embeddings, vector stores, training data, all stay inside the process boundary. No upload, no log buffer at a third-party. Air-gapped deployments are a configuration toggle, not a special tier. Sensitive workloads (medical, financial, legal, government) become tractable with no contractual gymnastics.
03 · Control
You pin the model version. You pin the runtime version. You pin the prompts. You decide when to upgrade. No silent model swap, no deprecation email, no policy change that breaks your evals overnight. Reproducibility becomes free: a snapshot of weights plus the SDK version reproduces the output bit-for-bit on the same hardware.
04 · Sovereignty
Data sovereignty means the bytes stay inside the jurisdiction you choose. Technology sovereignty means the capability does not depend on a vendor's permission to operate. Both are durable advantages when regulation, geopolitics, or pricing turn against you.
05 · Cost
A workstation GPU pays for itself in months at typical cloud token rates. Inference runs at the marginal cost of electricity: no per-call quota, no surprise bill when traffic spikes, no "unlimited tier" pricing that punishes success. The economics improve as your usage grows.
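To make the arithmetic concrete with deliberately round, assumed numbers: at $5 per million output tokens, a workload of 50 million tokens a month bills $250; a $2,500 workstation GPU amortises in about ten months, and every token after that costs only electricity.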
06 · Edge & offline
Field laptops, factory floors, ships, hospitals with strict egress, defence sites, customer machines after a cable cut. Same NuGet, same code path. The feature continues to work because there is no remote dependency to be unreachable.
The SDK abstracts the model class, the backend, and the runtime. The same call signature loads any model in the catalog. The same conversation primitives wrap any LLM. The same Attachment surface feeds any VLM or OCR engine. Swap a model ID, ship a different capability.
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;
using LMKit.Speech;
using LMKit.Embeddings;

// LLM, on the GPU if one is present.
var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));
string answer = await chat.SubmitAsync("Summarise this transcript in three bullets.");

// Vision-language model. Same load surface, different model ID.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var caption = await new SingleTurnConversation(vlm)
    .SubmitAsync("Describe this image.", Attachment.FromFile("photo.png"));

// On-device speech-to-text. Streaming, with timestamps.
var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
await foreach (var segment in stt.StreamAsync("interview.wav"))
{
    Console.WriteLine($"[{segment.Start}] {segment.Text}");
}

// Embeddings, for retrieval.
var embedder = new Embedder(LM.LoadFromModelID("embeddinggemma-300m"));
float[] vector = embedder.GetEmbedding("any text");
Four model classes, one SDK, one process. Each model is selected by ID from the catalog. The runtime auto-detects the best backend on the host and loads weights into the right device. You can pin a backend if your fleet policy requires it; details on the Backends page.
The runtime ships with the SDK. No companion service, no sidecar, no daemon. Each capability below has a dedicated page; this is the map.
Catalog
Discover, download, and run models by ID. Capability filtering, progress callbacks, configurable storage. One line from LoadFromModelID to first inference.
Backends
CUDA 12, CUDA 13, Vulkan, Metal, AVX2, AVX. Precompiled into one NuGet. Auto-detect on each host; pin a backend when fleet policy requires it.
Explore backends
Security
Stream-decrypted GGUF. No plaintext on disk, ever. Standards-based cipher, password-derived keys, drop-in replacement for the standard load path.
Explore encrypted models
Hardware
Split a model across multiple GPUs, pin specific tensors to specific devices, run a 70B parameter model on a workstation. Tensor offloading to CPU when VRAM is tight.
Explore multi-GPU
Sessions
Serialize an entire conversation context (KV-cache and all session state) to disk. Free RAM / VRAM on demand; rehydrate transparently on the next call. Long-lived chat without holding GPU memory hostage.
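The shape of the pattern, as a sketch. The session-save call and restore overload below are hypothetical placeholders, not the documented API; the hibernation page has the real surface.

using LMKit.Model;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));
await chat.SubmitAsync("Start a long-running design review...");

// Hypothetical calls: persist KV-cache and session state, then release RAM/VRAM.
chat.SaveSession("review.session");
chat.Dispose();

// Later: rehydrate from disk and continue where the conversation left off.
var resumed = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"), "review.session");
await resumed.SubmitAsync("Where were we?");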
Explore hibernation
Core tech
The adaptive inference engine under every LM-Kit call. Structural awareness, contextual perplexity, speculative grammar. Up to roughly 10x faster on structured tasks, no fine-tuning, always on.
Explore Dynamic Sampling
Sampling
Temperature, top-k, top-p, deterministic seeds, grammar-constrained generation, BNF and JSON-schema enforcement. Every per-turn knob a production team needs.
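A per-turn configuration sketch. The RandomSampling class and SamplingMode property are assumptions modelled on common .NET inference APIs; check the sampling page for the exact names, including the grammar and JSON-schema constraints.

using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling; // assumed namespace

var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));

// Assumed names: lower temperature for extraction-style tasks,
// or pin a deterministic seed when evals must be reproducible.
chat.SamplingMode = new RandomSampling
{
    Temperature = 0.2f,
    TopP = 0.9f,
    TopK = 40
};

string answer = await chat.SubmitAsync("Extract the invoice total as JSON.");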
Explore sampling controls
Once a model runs locally, the next question is how to make it smaller, faster, or more specialised. The optimization track is a separate hub for the tools that adapt off-the-shelf weights to your constraints.
Hub
Quantization, fine-tuning, LoRA. The complete toolkit for compressing models, adapting them to your domain, and hot-swapping behaviours at runtime without reloading the base model.
Open the hub
Quantize
Compress from FP32 to 8/6/5/4/3/2-bit. Up to 75% size reduction with controlled quality loss. Ship a 70B-class model on a 24 GB GPU; ship a 7B-class model on a 12 GB laptop.
Quantization
Adapt
Train task-specific adapters from your own data. LoRA-based parameter-efficient training. Runs on the same machine that hosts inference.
Fine-tuning
Hot-swap
Load and merge LoRA adapters dynamically. Switch a base model's behaviour at runtime without reloading the weights. One base model + N adapters = N specialised assistants.
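A sketch of the idea; ApplyLoraAdapter is a hypothetical method name standing in for the documented call on the LoRA page.

using LMKit.Model;
using LMKit.TextGeneration;

// One base model stays resident in memory.
var baseModel = LM.LoadFromModelID("qwen3.5:9b");

// Hypothetical call: merge a task-specific adapter without reloading weights.
baseModel.ApplyLoraAdapter("adapters/support-tone.lora");
var support = new MultiTurnConversation(baseModel);

// Swap the adapter, not the model, to change behaviour at runtime.
baseModel.ApplyLoraAdapter("adapters/contract-review.lora");
var legal = new MultiTurnConversation(baseModel);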
LoRA
Same NuGet, same code, every target. The runtime adapts to the host rather than the other way around.
Local inference is not a niche. It is the right answer for an expanding set of regulated, latency-sensitive, cost-sensitive, and offline-first workloads.
Regulated
HIPAA-, GDPR-, CCPA-, PCI-DSS-grade workloads where uploading prompts is a contractual non-starter. PII and PHI never cross the process boundary; audit trails stay inside your tenant.
Air-gapped
Networks with no egress. The SDK plus a bundled model is the entire deployment. No license-check call-home, no telemetry, no surprise dependency on a remote endpoint.
Edge
Intermittent or zero connectivity. The same code that powers the office assistant powers the offline kiosk and the ruggedised tablet.
Latency-sensitive
Anything where 200 ms of network round-trip is a deal-breaker. Local inference returns first tokens in the tens of milliseconds; streaming feels instant.
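Streaming is where that difference is felt. This sketch assumes tokens are surfaced through an AfterTextCompletion event, as in LM-Kit's published samples; verify the event name against the current API reference.

using LMKit.Model;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));

// Assumed event: print each token as it is generated, with no network in the loop.
chat.AfterTextCompletion += (sender, e) => Console.Write(e.Text);

await chat.SubmitAsync("Draft a two-sentence status update.");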
Volume
Batch document processing, log analysis, content moderation, large-scale RAG ingestion. The marginal cost is electricity, not per-token billing.
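A batch-ingestion loop built from the same calls shown earlier on this page; the folder layout and per-line chunking are illustrative only.

using System;
using System.Collections.Generic;
using System.IO;
using LMKit.Model;
using LMKit.Embeddings;

var embedder = new Embedder(LM.LoadFromModelID("embeddinggemma-300m"));
var index = new List<(string Chunk, float[] Vector)>();

// Embed an entire corpus locally; the marginal cost is electricity.
foreach (string path in Directory.EnumerateFiles("corpus", "*.txt"))
{
    foreach (string chunk in File.ReadLines(path)) // naive per-line chunking
    {
        index.Add((chunk, embedder.GetEmbedding(chunk)));
    }
}

Console.WriteLine($"Indexed {index.Count} chunks, zero API calls.");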
Distribution
If your customers run your software, your AI runs on their machine. No proxy API to maintain, no per-customer cloud account, no margin going to a third-party model provider.
A common shortcut is to equate "local AI" with "local LLM". They are not the same. Most real applications need a mix of model classes; an LLM alone is rarely the right answer.
LLMs and SLMs handle generation, reasoning, tool calls, agent orchestration. LM-Kit.NET runs the full catalog (Gemma, Qwen, Llama, Phi, GLM, GPT OSS) with streaming, KV-cache reuse, LoRA hot-swap, deterministic and grammar-constrained generation.
Vision-language models for image understanding, OCR engines for document parsing, embedding models for retrieval, classifiers for routing and moderation, on-device speech-to-text, layout analysers for PDFs. Each runs locally; all interoperate through one SDK.
A "local LLM" pipeline that calls a cloud embedding or a cloud OCR is not a local pipeline; one network dependency cancels every benefit listed above. LM-Kit.NET ensures every step of every workflow runs on-device, end to end.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: stream-decrypt GGUF without writing plaintext to disk.
Open on GitHub →
How-to guide
Pick CUDA, Vulkan, Metal, AVX2. Tune for throughput vs latency.
Read the guide →
How-to guide
Discover, filter, and load any model in the catalog.
Read the guide →
How-to guide
Practical patterns for switching from cloud APIs to LM-Kit.NET.
Read the guide →
The seven pillars of LM-Kit.NET, plus the local runtime they share.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
AI Agents
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
Document Intelligence
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
Vision & Multimodal
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
RAG & Knowledge
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
Text Analysis
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
Speech & Audio
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
Text Generation
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated via AVX/AVX2 on CPU and CUDA 12/13, Vulkan, or Metal on GPU. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Zero cloud calls. Five backends. One NuGet. The five-minute quickstart is the fastest way to feel the difference.