The full AI stack, running on your hardware.

LM-Kit.NET is a Local AI stack for .NET. Large language models are one part of it. The same SDK also runs vision-language models, embedding models, on-device speech-to-text, OCR engines, layout analysers, classifiers, rerankers, and adapters. Every model executes inside your process. No cloud round-trip. No vendor between your data and your output.

0 cloud calls · 5 backends (CPU, AVX2, CUDA, Vulkan, Metal) · 1 NuGet package
Models that run locally

LLMs & SLMs

Gemma, Qwen, Llama, Phi, GLM, GPT OSS. Open-weight, quantizable, swappable by ID.

Vision & OCR

VLMs (Qwen2-VL, Gemma3-VL, GLM-V), LMKit OCR, PaddleOCR-VL, layout engines.

Speech

A growing local STT stack: streaming, hallucination suppression, real-time translation, 100+ languages.

Embeddings & classifiers

Qwen3-Embedding, EmbeddingGemma, BGE, plus LM-Kit-trained classifiers (sentiment, language ID, PII, NER).

Why local

Six reasons every AI workload should run locally.

The cloud was a convenience layer that turned into a dependency. Local inference reverses that. The benefits are not philosophical; they are measurable, contractual, and architectural.

01 · Latency

Predictable, single-digit milliseconds

No DNS lookup, no TLS handshake, no queue, no 3 a.m. incident because a provider rate-limited you. Inference time is bounded by your hardware, period. First-token latency on a modern GPU is in the tens of milliseconds; the slowest local path is still faster than a healthy cloud round-trip.
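
That claim is easy to verify on your own hardware. A minimal end-to-end timing sketch, reusing the MultiTurnConversation surface from the code section below (same model ID as that listing; Stopwatch is plain .NET):

using System.Diagnostics;
using LMKit.Model;
using LMKit.TextGeneration;

// Load once at startup; the weights stay resident in memory afterwards.
var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));

// Time a full round-trip. No network sits in this path, so the number
// you measure here is the number your users get.
var sw = Stopwatch.StartNew();
await chat.SubmitAsync("Reply with the single word: ready.");
sw.Stop();
Console.WriteLine($"End-to-end latency: {sw.ElapsedMilliseconds} ms");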

02 · Security

Data never leaves the box

Prompts, completions, embeddings, vector stores, and training data all stay inside the process boundary. No upload, no log buffer at a third party. Air-gapped deployments are a configuration toggle, not a special tier. Sensitive workloads (medical, financial, legal, government) become tractable without contractual gymnastics.

03 · Control

Total ownership of the runtime

You pin the model version. You pin the runtime version. You pin the prompts. You decide when to upgrade. No silent model swap, no deprecation email, no policy change that breaks your evals overnight. Reproducibility becomes free: pinned weights, a pinned SDK version, and a fixed seed reproduce the output bit-for-bit on the same hardware.

04 · Sovereignty

Data and technology sovereignty

Data sovereignty means the bytes stay inside the jurisdiction you choose. Technology sovereignty means the capability does not depend on a vendor's permission to operate. Both are durable advantages when regulation, geopolitics, or pricing turn against you.

05 · Cost

Capex once, opex near zero

A workstation GPU pays for itself in months at typical cloud token rates; after that, inference runs at the marginal cost of electricity. No per-call quota, no surprise bill when traffic spikes, no "unlimited tier" pricing that punishes success. The economics get better as your usage grows.
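
The breakeven arithmetic fits in a few lines. A back-of-envelope sketch; every figure below (GPU price, monthly token volume, cloud rate) is an illustrative assumption to replace with your own numbers:

// All figures are illustrative assumptions, not quotes.
const double gpuCostUsd      = 2_000;        // one-time workstation GPU
const double tokensPerMonth  = 500_000_000;  // 500M tokens of monthly traffic
const double cloudUsdPerMTok = 1.00;         // blended cloud price per 1M tokens

double cloudMonthlyUsd = tokensPerMonth / 1_000_000 * cloudUsdPerMTok; // $500/month
double breakevenMonths = gpuCostUsd / cloudMonthlyUsd;                 // 4 months

Console.WriteLine($"Cloud equivalent: ${cloudMonthlyUsd:F0}/month; GPU breakeven in {breakevenMonths:F1} months.");

Past breakeven, the per-token cost collapses to electricity, which is why the curve bends in your favour as volume grows.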

06 · Edge & offline

Works without a network

Field laptops, factory floors, ships, hospitals with strict egress, defence sites, customer machines after a cable cut. Same NuGet, same code path. The feature continues to work because there is no remote dependency to be unreachable.

How it runs

One line to load, one to infer.

The SDK abstracts the model class, the backend, and the runtime. The same call signature loads any model in the catalog. The same conversation primitives wrap any LLM. The same Attachment surface feeds any VLM or OCR engine. Swap a model ID, ship a different capability.

LocalAI.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;
using LMKit.Speech;
using LMKit.Embeddings;

// LLM, on the GPU if one is present.
var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));
var result = await chat.SubmitAsync("Summarise this transcript in three bullets.");
string answer = result.Completion; // SubmitAsync returns a TextGenerationResult

// Vision-language model. Same load surface, different model ID.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var caption = await new SingleTurnConversation(vlm)
    .SubmitAsync("Describe this image.", Attachment.FromFile("photo.png"));

// On-device speech-to-text. Streaming, with timestamps.
var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
await foreach (var segment in stt.StreamAsync("interview.wav"))
{
    Console.WriteLine($"[{segment.Start}] {segment.Text}");
}

// Embeddings, for retrieval.
var embedder = new Embedder(LM.LoadFromModelID("embeddinggemma-300m"));
float[] vector = embedder.GetEmbedding("any text");

Four model classes, one SDK, one process. Each model is selected by ID from the catalog. The runtime auto-detects the best backend on the host and loads weights onto the right device. You can pin a backend if your fleet policy requires it; see the Backends page.
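
The embedding call in the listing is already enough for local retrieval: embed the corpus, embed the query, rank by cosine similarity, all in-process. A minimal sketch reusing the Embedder surface above; the Cosine helper is plain C#, not an SDK API:

using System.Linq;
using LMKit.Embeddings;
using LMKit.Model;

var embedder = new Embedder(LM.LoadFromModelID("embeddinggemma-300m"));

string[] docs = { "Invoice from Acme Corp", "Q3 revenue report", "Team lunch menu" };
float[] query = embedder.GetEmbedding("find the financial report");

// Rank every document against the query, most similar first.
var ranked = docs
    .Select(d => (Doc: d, Score: Cosine(query, embedder.GetEmbedding(d))))
    .OrderByDescending(x => x.Score);

foreach (var (doc, score) in ranked)
    Console.WriteLine($"{score:F3}  {doc}");

// Plain cosine similarity over two vectors; not part of the SDK.
static float Cosine(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}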

Runtime surface

Everything you need to operate local AI.

The runtime ships with the SDK. No companion service, no sidecar, no daemon. Each capability below has a dedicated page; this is the map.

Catalog

Model catalog

Discover, download, and run models by ID. Capability filtering, progress callbacks, configurable storage. One line from LoadFromModelID to first inference.

Browse the catalog

Backends

Hardware backends

CUDA 12, CUDA 13, Vulkan, Metal, AVX2, AVX. Precompiled into one NuGet. Auto-detect on each host; pin a backend when fleet policy requires it.

Explore backends

Security

Encrypted model loading

Stream-decrypted GGUF. No plaintext on disk, ever. Standards-based cipher, password-derived keys, drop-in replacement for the standard load path (sketched below).

Explore encrypted models
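
What "drop-in" means in practice: the encrypted path keeps the shape of the standard one. The sketch below is illustrative only; both the path constructor and the LoadFromEncryptedFile name with its parameters are assumptions, not the confirmed SDK signature (the encrypted-models page has the real surface):

// HYPOTHETICAL sketch: the load calls below are assumptions for illustration;
// see the encrypted-models page for the real API.

// Standard path: plaintext GGUF on disk.
var model = new LM("model.gguf");

// Encrypted path: same call shape, password-derived key, stream decryption,
// so no plaintext copy of the weights ever touches disk.
var secureModel = LM.LoadFromEncryptedFile("model.gguf.enc", "your-passphrase");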

Hardware

Multi-GPU & tensor overrides

Split a model across multiple GPUs, pin specific tensors to specific devices, run a 70B parameter model on a workstation. Tensor offloading to CPU when VRAM is tight.

Explore multi-GPU

Sessions

Context hibernation

Serialize an entire conversation context (KV-cache and all session state) to disk. Free RAM / VRAM on demand; rehydrate transparently on the next call. Long-lived chat without holding GPU memory hostage (sketched below).

Explore hibernation
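
The lifecycle is save, release, rehydrate. A sketch under loud assumptions: SaveSession and LoadSession below are placeholder names for illustration, not the confirmed API (the hibernation page documents the real calls):

// HYPOTHETICAL sketch: SaveSession/LoadSession are placeholder names.
var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));
await chat.SubmitAsync("Let's plan the Q3 roadmap.");

// Persist the KV-cache and session state to disk, then free RAM/VRAM.
chat.SaveSession("roadmap.session");
chat.Dispose();

// Later, on the next call: rehydrate transparently and continue.
var resumed = MultiTurnConversation.LoadSession("roadmap.session");
await resumed.SubmitAsync("Where did we leave off?");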

Core tech

Dynamic Sampling, the symbolic layer

The adaptive inference engine under every LM-Kit call. Structural awareness, contextual perplexity, speculative grammar. Up to roughly 10x faster on structured tasks, no fine-tuning, always on.

Explore Dynamic Sampling

Sampling

Sampling & generation controls

Temperature, top-k, top-p, deterministic seeds, grammar-constrained generation, BNF and JSON-schema enforcement. Every per-turn knob a production team needs (sketched below).

Explore sampling controls
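
Per-turn control reads as property assignments on the conversation. A sketch assuming a RandomSampling mode with Temperature, TopK, and TopP knobs; the class and property names are assumptions drawn from the description above, so check the sampling page for the exact surface:

// Sketch; sampling class and property names are assumptions.
var chat = new MultiTurnConversation(LM.LoadFromModelID("qwen3.5:9b"));

chat.SamplingMode = new RandomSampling
{
    Temperature = 0.2f, // low temperature: tighter, more repeatable output
    TopK        = 40,   // consider only the 40 most likely tokens per step
    TopP        = 0.9f  // nucleus cutoff over the cumulative distribution
};

var result = await chat.SubmitAsync("Extract the invoice total as JSON.");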

Where it runs

Anywhere .NET runs. And then some.

Same NuGet, same code, every target. The runtime adapts to the host rather than the other way around.

Frameworks
.NET Standard 2.0 · .NET 8 / 9 / 10
Operating systems
Windows, Linux x64 & ARM64, macOS (Intel + Apple Silicon)
Mobile
iOS, Android (via .NET MAUI or Xamarin)
Acceleration
CPU, AVX, AVX2, CUDA 12, CUDA 13, Vulkan, Metal
Form factors
Servers, workstations, developer laptops, edge devices, single-board computers
Deployments
Containerized, sidecar-free, air-gapped, offline-first
Use cases

Workloads that need to stay local.

Local inference is not a niche. It is the right answer for an expanding set of regulated, latency-sensitive, cost-sensitive, and offline-first workloads.

Regulated

Healthcare, finance, legal, government

HIPAA-, GDPR-, CCPA-, PCI-DSS-grade workloads where uploading prompts is a contractual non-starter. PII and PHI never cross the process boundary; audit trails stay inside your tenant.

Air-gapped

Defence, critical infrastructure, secure manufacturing

Networks with no egress. The SDK plus a bundled model is the entire deployment. No license-check call-home, no telemetry, no surprise dependency on a remote endpoint.

Edge

Field, factory, ship, hospital floor

Intermittent or zero connectivity. The same code that powers the office assistant powers the offline kiosk and the ruggedized tablet.

Latency-sensitive

UX-critical autocomplete, voice, AR

Anything where 200 ms of network round-trip is a deal-breaker. Local inference returns first tokens in the tens of milliseconds; streaming feels instant.

Volume

High-throughput pipelines

Batch document processing, log analysis, content moderation, large-scale RAG ingestion. The marginal cost is electricity, not per-token billing.
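
The shape of such a pipeline is a plain loop: load once, iterate, and pay compute rather than a per-call bill. A sketch reusing the SingleTurnConversation and Attachment surface from the code section above (folder, file filter, and prompt are illustrative):

using LMKit.Model;
using LMKit.TextGeneration;

// Load once; each additional document costs compute, not a metered API call.
var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

foreach (string file in Directory.EnumerateFiles("inbox", "*.png"))
{
    // Same local call per document: no rate limit, no quota, no egress.
    var result = await chat.SubmitAsync(
        "Extract vendor, date, and total from this document.",
        Attachment.FromFile(file));

    Console.WriteLine($"{Path.GetFileName(file)}: {result.Completion}");
}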

Distribution

ISVs shipping AI inside a product

If your customers run your software, your AI runs on their machine. No proxy API to maintain, no per-customer cloud account, no margin going to a third-party model provider.

Positioning

Local AI, not just local LLM.

A common shortcut is to equate "local AI" with "local LLM". They are not the same. Most real applications need a mix of model classes; an LLM alone is rarely the whole answer.

Local LLM is one part of the stack

LLMs and SLMs handle generation, reasoning, tool calls, agent orchestration. LM-Kit.NET runs the full catalog (Gemma, Qwen, Llama, Phi, GLM, GPT OSS) with streaming, KV-cache reuse, LoRA hot-swap, deterministic and grammar-constrained generation.

Local AI is the full stack

Vision-language models for image understanding, OCR engines for document parsing, embedding models for retrieval, classifiers for routing and moderation, on-device speech-to-text, layout analysers for PDFs. Each runs locally; all interoperate through one SDK.

Why the distinction matters

A "local LLM" pipeline that calls a cloud embedding or a cloud OCR is not a local pipeline; one network dependency cancels every benefit listed above. LM-Kit.NET ensures every step of every workflow runs on-device, end to end.

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

You are here

Run AI on your hardware.

Zero cloud calls. Five backends. One NuGet. The five-minute quickstart is the fastest way to feel the difference.

Start in 5 minutes · Download LM-Kit.NET