LM-Kit.NET is a Local AI stack for .NET. Large language models are one part of it. The same SDK also runs vision-language models, embedding models, on-device speech-to-text, OCR engines, layout analysers, classifiers, rerankers, and adapters. Every model executes inside your process. No cloud roundtrip. No vendor between your data and your output.
LLMs & SLMs: Gemma, Qwen, Llama, Phi, GLM, GPT OSS. Open-weight, quantizable, swappable by ID.
Vision & OCR: VLMs (Qwen2-VL, Gemma3-VL, GLM-V), LMKit OCR, PaddleOCR-VL, layout engines.
Speech: a growing local STT stack with streaming, hallucination suppression, real-time translation, 100+ languages.
Embeddings & Classification: Qwen3-Embedding, EmbeddingGemma, BGE, plus LM-Kit-trained classifiers (sentiment, language ID, PII, NER).
The cloud was a convenience layer that turned into a dependency. Local inference reverses that. The benefits are not philosophical; they are measurable, contractual, and architectural.
01 · Latency
No DNS, no TLS handshake, no queue, no quota, no 3am scramble after a provider rate-limits you. Inference time is bounded by your hardware, period. First-token latency on a modern GPU is in the tens of milliseconds; even the slowest local path beats a healthy cloud round-trip.
02 · Security
Prompts, completions, embeddings, vector stores, training data, all stay inside the process boundary. No upload, no log buffer at a third-party. Air-gapped deployments are a configuration toggle, not a special tier. Sensitive workloads (medical, financial, legal, government) become tractable with no contractual gymnastics.
03 · Control
You pin the model version. You pin the runtime version. You pin the prompts. You decide when to upgrade. No silent model swap, no deprecation email, no policy change that breaks your evals overnight. Reproducibility becomes free: a snapshot of weights plus the SDK version reproduces the output bit-for-bit on the same hardware.
04 · Sovereignty
Data sovereignty means the bytes stay inside the jurisdiction you choose. Technology sovereignty means the capability does not depend on a vendor's permission to operate. Both are durable advantages when regulation, geopolitics, or pricing turn against you.
05 · Cost
A workstation GPU pays for itself in months at typical cloud token rates. Inference runs at the marginal cost of electricity: no per-call quota, no surprise bill when traffic spikes, no "unlimited tier" pricing that punishes success. The economics improve as your usage grows.
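To make the arithmetic concrete with deliberately round, assumed numbers: at $5 per million output tokens, a workload of 50 million tokens a month bills $250; a $2,500 workstation GPU amortises in about ten months, and every token after that costs only electricity.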
06 · Edge & offline
Field laptops, factory floors, ships, hospitals with strict egress, defence sites, customer machines after a cable cut. Same NuGet, same code path. The feature continues to work because there is no remote dependency to be unreachable.
The SDK abstracts the model class, the backend, and the runtime. The same call signature loads any model in the catalog. The same conversation primitives wrap any LLM. The same Attachment surface feeds any VLM or OCR engine. Swap a model ID, ship a different capability.
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;
using LMKit.Speech;
using LMKit.Embeddings;

// LLM, on the GPU if one is present.
var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));
string answer = await chat.SubmitAsync("Summarise this transcript in three bullets.");

// Vision-language model. Same load surface, different model ID.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var caption = await new SingleTurnConversation(vlm)
    .SubmitAsync("Describe this image.", Attachment.FromFile("photo.png"));

// On-device speech-to-text. Streaming, with timestamps.
var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
await foreach (var segment in stt.StreamAsync("interview.wav"))
{
    Console.WriteLine($"[{segment.Start}] {segment.Text}");
}

// Embeddings, for retrieval.
var embedder = new Embedder(LM.LoadFromModelID("embeddinggemma-300m"));
float[] vector = embedder.GetEmbedding("any text");
Four model classes, one SDK, one process. Each model is selected by ID from the catalog. The runtime auto-detects the best backend on the host and loads weights into the right device. You can pin a backend if your fleet policy requires it; details on the Backends page.
The runtime ships with the SDK. No companion service, no sidecar, no daemon. Each capability below has a dedicated page; this is the map.
Catalog
Discover, download, and run models by ID. Capability filtering, progress callbacks, configurable storage. One line from LoadFromModelID to first inference.
Backends
CUDA 12, CUDA 13, Vulkan, Metal, AVX2, AVX. Precompiled into one NuGet. Auto-detect on each host; pin a backend when fleet policy requires it.
Explore backends
Security
Stream-decrypted GGUF. No plaintext on disk, ever. Standards-based cipher, password-derived keys, drop-in replacement for the standard load path.
Explore encrypted models
Hardware
Split a model across multiple GPUs, pin specific tensors to specific devices, run a 70B parameter model on a workstation. Tensor offloading to CPU when VRAM is tight.
Explore multi-GPU
Sessions
Serialize an entire conversation context (KV-cache and all session state) to disk. Free RAM / VRAM on demand; rehydrate transparently on the next call. Long-lived chat without holding GPU memory hostage.
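The shape of the pattern, as a sketch. The session-save call and restore overload below are hypothetical placeholders, not the documented API; the hibernation page has the real surface.

using LMKit.Model;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));
await chat.SubmitAsync("Start a long-running design review...");

// Hypothetical calls: persist KV-cache and session state, then release RAM/VRAM.
chat.SaveSession("review.session");
chat.Dispose();

// Later: rehydrate from disk and continue where the conversation left off.
var resumed = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"), "review.session");
await resumed.SubmitAsync("Where were we?");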
Explore hibernation
Core tech
The adaptive inference engine under every LM-Kit call. Structural awareness, contextual perplexity, speculative grammar. Up to roughly 10x faster on structured tasks, no fine-tuning, always on.
Explore Dynamic Sampling
Sampling
Temperature, top-k, top-p, deterministic seeds, grammar-constrained generation, BNF and JSON-schema enforcement. Every per-turn knob a production team needs.
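A per-turn configuration sketch. The RandomSampling class and SamplingMode property are assumptions modelled on common .NET inference APIs; check the sampling page for the exact names, including the grammar and JSON-schema constraints.

using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling; // assumed namespace

var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));

// Assumed names: lower temperature for extraction-style tasks,
// or pin a deterministic seed when evals must be reproducible.
chat.SamplingMode = new RandomSampling
{
    Temperature = 0.2f,
    TopP = 0.9f,
    TopK = 40
};

string answer = await chat.SubmitAsync("Extract the invoice total as JSON.");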
Explore sampling controls
Once a model runs locally, the next question is how to make it smaller, faster, or more specialised. The optimization track is a separate hub for the tools that adapt off-the-shelf weights to your constraints.
Hub
Quantization, fine-tuning, LoRA. The complete toolkit for compressing models, adapting them to your domain, and hot-swapping behaviours at runtime without reloading the base model.
Open the hub
Quantize
Compress from FP32 to 8/6/5/4/3/2-bit. Up to 75% size reduction with controlled quality loss. Ship a 70B-class model on a 24 GB GPU; ship a 7B-class model on a 12 GB laptop.
Quantization
Adapt
Train task-specific adapters from your own data. LoRA-based parameter-efficient training. Runs on the same machine that hosts inference.
Fine-tuning
Hot-swap
Load and merge LoRA adapters dynamically. Switch a base model's behaviour at runtime without reloading the weights. One base model + N adapters = N specialised assistants.
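A sketch of the idea; ApplyLoraAdapter is a hypothetical method name standing in for the documented call on the LoRA page.

using LMKit.Model;
using LMKit.TextGeneration;

// One base model stays resident in memory.
var baseModel = LM.LoadFromModelID("qwen3.5:9b");

// Hypothetical call: merge a task-specific adapter without reloading weights.
baseModel.ApplyLoraAdapter("adapters/support-tone.lora");
var support = new MultiTurnConversation(baseModel);

// Swap the adapter, not the model, to change behaviour at runtime.
baseModel.ApplyLoraAdapter("adapters/contract-review.lora");
var legal = new MultiTurnConversation(baseModel);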
LoRA
Same NuGet, same code, every target. The runtime adapts to the host rather than the other way around.
Local inference is not a niche. It is the right answer for an expanding set of regulated, latency-sensitive, cost-sensitive, and offline-first workloads.
Regulated
HIPAA-, GDPR-, CCPA-, PCI-DSS-grade workloads where uploading prompts is a contractual non-starter. PII and PHI never cross the process boundary; audit trails stay inside your tenant.
Air-gapped
Networks with no egress. The SDK plus a bundled model is the entire deployment. No license-check call-home, no telemetry, no surprise dependency on a remote endpoint.
Edge
Intermittent or zero connectivity. The same code that powers the office assistant powers the offline kiosk and the ruggedised tablet.
Latency-sensitive
Anything where 200 ms of network round-trip is a deal-breaker. Local inference returns first tokens in the tens of milliseconds; streaming feels instant.
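Streaming is where that difference is felt. This sketch assumes tokens are surfaced through an AfterTextCompletion event, as in LM-Kit's published samples; verify the event name against the current API reference.

using LMKit.Model;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));

// Assumed event: print each token as it is generated, with no network in the loop.
chat.AfterTextCompletion += (sender, e) => Console.Write(e.Text);

await chat.SubmitAsync("Draft a two-sentence status update.");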
Volume
Batch document processing, log analysis, content moderation, large-scale RAG ingestion. The marginal cost is electricity, not per-token billing.
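A batch-ingestion loop built from the same calls shown earlier on this page; the folder layout and per-line chunking are illustrative only.

using System;
using System.Collections.Generic;
using System.IO;
using LMKit.Model;
using LMKit.Embeddings;

var embedder = new Embedder(LM.LoadFromModelID("embeddinggemma-300m"));
var index = new List<(string Chunk, float[] Vector)>();

// Embed an entire corpus locally; the marginal cost is electricity.
foreach (string path in Directory.EnumerateFiles("corpus", "*.txt"))
{
    foreach (string chunk in File.ReadLines(path)) // naive per-line chunking
    {
        index.Add((chunk, embedder.GetEmbedding(chunk)));
    }
}

Console.WriteLine($"Indexed {index.Count} chunks, zero API calls.");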
Distribution
If your customers run your software, your AI runs on their machine. No proxy API to maintain, no per-customer cloud account, no margin going to a third-party model provider.
A common shortcut is to equate "local AI" with "local LLM". They are not the same. Most real applications need a mix of model classes; an LLM alone is rarely the right answer.
LLMs and SLMs handle generation, reasoning, tool calls, agent orchestration. LM-Kit.NET runs the full catalog (Gemma, Qwen, Llama, Phi, GLM, GPT OSS) with streaming, KV-cache reuse, LoRA hot-swap, deterministic and grammar-constrained generation.
Vision-language models for image understanding, OCR engines for document parsing, embedding models for retrieval, classifiers for routing and moderation, on-device speech-to-text, layout analysers for PDFs. Each runs locally; all interoperate through one SDK.
A "local LLM" pipeline that calls a cloud embedding or a cloud OCR is not a local pipeline; one network dependency cancels every benefit listed above. LM-Kit.NET ensures every step of every workflow runs on-device, end to end.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: stream-decrypt GGUF without writing plaintext to disk.
Open on GitHub →
How-to guide
Pick CUDA, Vulkan, Metal, AVX2. Tune for throughput vs latency.
Read the guide →
How-to guide
Discover, filter, and load any model in the catalog.
Read the guide →
How-to guide
Practical patterns for switching from cloud APIs to LM-Kit.NET.
Read the guide →
The seven pillars of LM-Kit.NET, plus the local runtime they share.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
AI Agents
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
Document Intelligence
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
Vision & Multimodal
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
RAG & Knowledge
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
Text Analysis
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
Speech & Audio
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
Text Generation
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated via AVX/AVX2 on CPU and CUDA 12/13, Vulkan, or Metal on GPU. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Zero cloud calls. Five backends. One NuGet. The five-minute quickstart is the fastest way to feel the difference.