Compress, fine-tune & deploy optimized AI models.
Full model optimization toolkit for edge deployment. Quantize models from FP32 to 2-bit precision, fine-tune with LoRA adapters, and dynamically switch adapters at runtime. Reduce model size by up to 75% while preserving quality. 100% on-device processing.
Optimize models for edge deployment.
LM-Kit.NET provides a complete model optimization toolkit for deploying AI on resource-constrained devices. Reduce model size through quantization, adapt models to specific tasks with LoRA fine-tuning, and dynamically switch between specialized adapters at runtime without reloading the base model.
Whether you're targeting mobile devices, IoT systems, or desktop applications with limited resources, LM-Kit.NET's optimization features let you balance model quality against computational constraints while keeping all processing 100% local and private.
Edge-first design: All optimization operations run entirely on-device. Quantize, fine-tune, and deploy models without any cloud dependencies or data transmission.
Compress
Quantization
Compress models from FP32 down to 2-8 bit precision. Reduce size by up to 75% with minimal quality loss.
Adapt
Fine-tuning
Train task-specific adapters using LoRA. Efficient parameter updates without full retraining.
Switch
Adapter management
Dynamic loading and merging of LoRA adapters. Switch specialized behaviors at runtime.
Models, memory, and decoding control.
Beyond quantization and fine-tuning sit the everyday levers of a production local-inference deployment: discovering and loading the right model, protecting proprietary weights, fitting big models onto small hardware, freeing idle context, and shaping every output token. Each capability has a dedicated page.
Catalog
Model catalog
Discover, download, and run models by ID. Capability filtering, progress callbacks, configurable storage. From LoadFromModelID to running inference in one line.
Backends
Hardware backends
CUDA 12, CUDA 13, Vulkan, Metal, AVX2, AVX precompiled into a single NuGet. Auto-detects the fastest path on each host; pin a backend explicitly when fleet policy requires it.
Explore backends
Security
Encrypted model loading
Stream-decrypted GGUF. No plaintext on disk, ever. Standards-based cipher, password-derived keys, drop-in replacement for the standard load path.
Explore encrypted models
Hardware
Multi-GPU & tensor overrides
Run large MoE and dense models on commodity hardware. Per-tensor device placement, MoE expert offloading, distributed inference across GPUs.
Explore multi-GPU
Memory
Context hibernation
Pause an agent. Free the GPU. Resume hours later. Serialize multi-gigabyte inference contexts to disk; rehydrate transparently.
Explore hibernation
Decoding
Sampling & generation controls
Dynamic sampling, logit biasing, repetition penalty, entropy-bounded sampling, speculative decoding. Same model, dramatically different output behavior.
Explore sampling
Reduce model size and accelerate inference.
Reduce model size and accelerate inference by converting weights to lower-precision formats.
30+ precision formats
LM-Kit.NET supports an extensive range of quantization formats, from 1-bit to 16-bit precision. Each format offers different tradeoffs between model size, inference speed, and output quality. K-means clustered formats (Q*_K_*) provide better quality retention at the same bit-width.
Quantization reduces memory footprint dramatically, enabling deployment of larger models on constrained hardware. A 7B parameter model at FP16 (~14GB) can be compressed to ~4GB at Q4_K_M while maintaining near-original quality for most tasks.
Quantization features
- Cluster-aware formats for quality retention
- Batch quantization to all formats
- Model validation before processing
- GGUF format output
- Preserves model metadata
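For orientation, here is a minimal sketch of that workflow. The Quantizer class is taken from the console demo linked at the end of this page, but the namespace, constructor, method signature, and LM.Precision member name shown here are assumptions modeled on the GGUF format names; verify them against the API reference before use.

```csharp
using LMKit.Model;
using LMKit.Quantization; // namespace assumed from the Quantizer console demo

// Assumed API shape: open a source GGUF, emit a lower-precision copy.
var quantizer = new Quantizer("path/to/model-fp16.gguf");

// Q4_K_M: roughly 4x smaller than FP16 with near-original quality for most tasks.
// Enum member name is an assumption; check LM.Precision for the shipped names.
quantizer.Quantize("path/to/model-q4_k_m.gguf", LM.Precision.MOSTLY_Q4_K_M);

Console.WriteLine("Quantized model written.");
```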
LLM fine-tuning with LoRA.
Train task-specific adapters efficiently without modifying the base model weights.
Low-Rank Adaptation (LoRA) enables efficient fine-tuning by training small adapter layers while keeping base model weights frozen, dramatically reducing compute and memory requirements.
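Concretely, in the standard LoRA formulation (not LM-Kit-specific notation), a frozen base weight W receives a learned low-rank update:

```latex
W' = W + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B are trained, storing r(d + k) parameters per layer instead of d·k; the rank r and scaling factor α correspond to the LoraRank and LoraAlpha settings used in the training configuration below.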
Training configuration
- Configurable rank (r) and alpha scaling
- Per-tensor rank customization
- AdamW optimizer with decay control
- Gradient accumulation support
Training progress
Monitor training progress with real-time loss and accuracy metrics. Set early stopping conditions based on loss thresholds or maximum iterations.
- Loss and accuracy monitoring
- Early stopping conditions
- Convergence detection
Checkpointing
Save and restore training checkpoints to resume interrupted sessions. Preserve optimizer state and training progress across sessions.
- Automatic checkpoint saving
- Resume from any checkpoint
- Optimizer state preservation
```csharp
using LMKit.Model;
using LMKit.Finetuning;

// Load base model for fine-tuning
var model = new LM("path/to/base-model.gguf");

// Configure LoRA training parameters
var trainingParams = new LoraTrainingParameters
{
    LoraRank = 16,
    LoraAlpha = 32,
    AdamAlpha = 1e-4f,
    AdamBeta1 = 0.9f,
    AdamBeta2 = 0.999f,
    AdamDecay = 0.01f,
    GradientAccumulation = 4,
    MaxNoImprovement = 100,

    // Per-tensor rank customization
    RankWQ = 16, // Query weight
    RankWK = 16, // Key weight
    RankWV = 16, // Value weight
    RankWO = 8   // Output weight
};

// Create trainer and subscribe to progress events
var trainer = new LoraTrainer(model, trainingParams);
trainer.Progress += (s, e) =>
{
    Console.WriteLine($"Iteration {e.Iteration}: Loss={e.Loss:F4}, Accuracy={e.Accuracy:P1}");
};

// Train on your dataset
await trainer.TrainAsync(trainingDataset);

// Save the trained adapter
trainer.SaveAdapter("sentiment-adapter.lora");
```
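The sample above trains in one uninterrupted run; the checkpointing features slot in around the same trainer. The SaveCheckpoint and LoadCheckpoint member names below are assumptions for illustration, so check the API reference for the shipped names.

```csharp
using System.IO;

// Extends the training sample above. SaveCheckpoint / LoadCheckpoint
// are hypothetical names; LM-Kit.NET's checkpoint API may differ.
var trainer = new LoraTrainer(model, trainingParams);

// Resume an interrupted session, restoring optimizer state and progress
if (File.Exists("training.checkpoint"))
{
    trainer.LoadCheckpoint("training.checkpoint");
}

// Persist periodically so an interruption loses little work
trainer.Progress += (s, e) =>
{
    if (e.Iteration % 50 == 0)
    {
        trainer.SaveCheckpoint("training.checkpoint");
    }
};

await trainer.TrainAsync(trainingDataset);
```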
LoRA adapter integration.
Dynamically load, swap, and merge LoRA adapters without reloading the base model.
Hot-swap adapters at runtime
LM-Kit.NET enables dynamic LoRA adapter management at inference time. Load multiple adapters into a single model instance and control their influence through scale parameters. Switch between specialized behaviors (sentiment analysis, code generation, domain expertise) without the overhead of reloading the base model.
For permanent deployment, merge adapter weights directly into the base model using LoraMerger to create a single optimized model file with no runtime overhead.
Adapter operations
- Load adapters dynamically via ApplyLoraAdapter
- Scale-based activation control (0.0 to 1.0)
- Multiple adapters on single model instance
- Remove adapters with RemoveLoraAdapter
- Permanent merge via LoraMerger.Merge
Load
ApplyLoraAdapter
Load LoRA adapter from file or LoraAdapterSource. Registers in model's Adapters collection.
Scale
Scale control
Adjust adapter influence with Scale property. Set to 0 to disable, 1 for full effect.
Unload
RemoveLoraAdapter
Unload adapter from model instance. Free memory and restore base model behavior.
Merge
LoraMerger.Merge
Permanently merge adapter weights into base model. Create single optimized model file.
```csharp
using LMKit.Model;
using LMKit.Finetuning;

// Load base model
var model = new LM("path/to/base-model.gguf");

// Dynamic adapter loading
model.ApplyLoraAdapter("sentiment-adapter.lora", scale: 1.0f);
model.ApplyLoraAdapter("code-adapter.lora", scale: 0.0f); // Loaded but inactive

// Access loaded adapters
foreach (var adapter in model.Adapters)
{
    Console.WriteLine($"Adapter: {adapter.Name}, Scale: {adapter.Scale}");
}

// Switch active adapter at runtime
model.Adapters[0].Scale = 0.0f; // Disable sentiment
model.Adapters[1].Scale = 1.0f; // Enable code

// Remove adapter when no longer needed
model.RemoveLoraAdapter(model.Adapters[0]);

// Permanent merge for deployment
var merger = new LoraMerger(model);
merger.Merge("merged-model.gguf");
```
Optimization capabilities.
Comprehensive toolkit for model compression, adaptation, and deployment.
Cluster-aware
Importance-aware quantization
Cluster-aware quantization preserves model quality at lower bit-widths. Q4_K_M and Q5_K_M formats offer excellent quality-size tradeoffs.
Per-tensor
Per-tensor LoRA ranks
Fine-grained control over LoRA rank for each tensor type: WQ, WK, WV, WO, feed-forward, and normalization layers. Optimize adapter size and effectiveness.
Gradient
Gradient accumulation
Train with larger effective batch sizes on memory-constrained hardware. Accumulate gradients across multiple forward passes before updating weights.
Cosine
Cosine learning rate
Cosine decay scheduling with warm restarts. Configure decay steps, minimum rate, and restart multipliers for optimal training dynamics.
Clip
Gradient clipping
Prevent exploding gradients during training with configurable gradient clipping. Stabilize training on challenging datasets.
Datasets
Training datasets
Built-in support for training data management with ChatTrainingSample, TrainingDataset, and ShareGptExporter for data preparation and export.
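As a sketch of how those dataset classes fit together, assuming ChatTrainingSample pairs a user message with the expected reply and TrainingDataset collects samples; the constructor shapes and the Export signature are assumptions, so verify against the API reference.

```csharp
using LMKit.Finetuning;

// Assumed shapes for the classes named above (ChatTrainingSample,
// TrainingDataset, ShareGptExporter); real members may differ.
var dataset = new TrainingDataset();

// One supervised pair: user prompt and the completion to learn
dataset.Add(new ChatTrainingSample(
    "This product exceeded my expectations!", // user message
    "positive"));                             // expected assistant reply
dataset.Add(new ChatTrainingSample(
    "Arrived broken and support never answered.",
    "negative"));

// Export to ShareGPT-format JSON for inspection or reuse elsewhere
new ShareGptExporter().Export(dataset, "sentiment-train.json");
```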
Optimization use cases.
Deploy optimized models across resource-constrained environments.
Mobile
Mobile applications
Deploy quantized models on iOS and Android devices. Q4_K_M provides excellent quality at 4x smaller size for on-device inference.
Edge
IoT and edge devices
Run AI on Raspberry Pi, NVIDIA Jetson, and other edge hardware. Aggressive quantization (Q2_K, Q3_K) enables deployment on minimal resources.
Domain
Domain-specific assistants
Fine-tune models for legal, medical, finance, or technical domains. LoRA adapters specialize generic models for industry-specific tasks.
Sentiment
Sentiment analysis
Fine-tune for sentiment classification with dramatically improved accuracy. Start from a ~46% baseline and reach 95%+ with targeted training.
SaaS
Multi-tenant applications
Deploy one base model with multiple LoRA adapters for different customers or use cases. Switch adapters per request without model reload; see the sketch after these cards.
Science
Scientific assistants
Create specialized chemistry, biology, or physics assistants. LoRA fine-tuning can improve domain accuracy from a 17% baseline to 40%+.
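To make the multi-tenant pattern above concrete, here is a sketch that uses only the ApplyLoraAdapter call, Adapters collection, and Scale property shown earlier; the ActivateTenant helper and its name-matching convention are hypothetical.

```csharp
using LMKit.Model;

// One base model stays resident; one adapter per tenant, loaded once
var model = new LM("path/to/base-model.gguf");
model.ApplyLoraAdapter("tenant-a.lora", scale: 0.0f);
model.ApplyLoraAdapter("tenant-b.lora", scale: 0.0f);

// Hypothetical per-request hook: switching tenants is just a scale
// change, never a model reload. Name matching here is illustrative.
void ActivateTenant(string tenantAdapterName)
{
    foreach (var adapter in model.Adapters)
    {
        adapter.Scale = adapter.Name == tenantAdapterName ? 1.0f : 0.0f;
    }
}

ActivateTenant("tenant-a.lora"); // serve a request for tenant A
```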
Key classes & methods.
Core components for model optimization workflows.
LM.Precision
Enumeration of 30+ quantization formats from 1-bit to FP32. Includes K-means variants (Q*_K_*) and importance quantization (IQ*) formats.
LoraTrainingParameters
Complete configuration for LoRA fine-tuning: ranks, alpha scaling, Adam optimizer settings, gradient control, and learning rate scheduling.
LM.ApplyLoraAdapter
Dynamically load LoRA adapters into a model instance. Supports file paths and LoraAdapterSource objects with scale-based activation.
LM.Adapters
Collection of currently loaded LoRA adapters on a model instance. Access individual adapters to adjust Scale or retrieve metadata.
LoraMerger
Permanently merge LoRA adapter weights into base model. Creates optimized single-file deployment with no runtime adapter overhead.
LoraAdapter
Represents a loaded LoRA adapter with Name, Scale, and source information. Control adapter influence through Scale property (0.0-1.0).
Build it. Read it. Try it.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Quantizer
Console demo: compress models from FP32 to 2-bit precision.
Open on GitHub →
How-to guide
Quantize a model
Choose precision, balance quality vs size, ship.
Read the guide →
How-to guide
Load and merge LoRA adapters
Hot-swap adapters at runtime without reloading the base model.
Read the guide →
How-to guide
Prepare training datasets for LoRA fine-tuning
Data prep, validation, and quality controls for LoRA training.
Read the guide →
Seven pillars, one foundation.
The seven pillars of LM-Kit.NET, plus the local runtime they share.
01 · AI Agents
Orchestration patterns
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
AI Agents
02 · Document Intelligence
Parse PDFs, images, EML
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
Document Intelligence
03 · Vision & Multimodal
VLMs, image classification, chat with image
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
Vision & Multimodal
04 · RAG & Knowledge
Vector search and retrieval
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
RAG & Knowledge
05 · Text Analysis
Classification, NER, PII, sentiment
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
Text Analysis
06 · Speech & Audio
Audio transcription, STT
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
Speech & Audio
07 · Text Generation
Conversations, rewriting, summaries
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarize. Prompt templates, streaming, grammar-constrained outputs.
Text Generation
The foundation
Every capability above runs on this runtime.
Foundation
Local Inference
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Ready to optimize your models?
Compress, fine-tune, and deploy AI models optimized for your hardware. 100% local, 100% private.