Compress, fine-tune & deploy optimized AI models.
Full model optimization toolkit for edge deployment. Quantize models from FP32 to 2-bit precision, fine-tune with LoRA adapters, and dynamically switch adapters at runtime. Reduce model size by up to 75% while preserving quality. 100% on-device processing.
Optimize models for edge deployment.
LM-Kit.NET provides a complete model optimization toolkit for deploying AI on resource-constrained devices. Reduce model size through quantization, adapt models to specific tasks with LoRA fine-tuning, and dynamically switch between specialized adapters at runtime without reloading the base model.
Whether you're targeting mobile devices, IoT systems, or desktop applications with limited resources, LM-Kit.NET's optimization features let you balance model quality against computational constraints while keeping all processing 100% local and private.
Edge-first design: All optimization operations run entirely on-device. Quantize, fine-tune, and deploy models without any cloud dependencies or data transmission.
Compress
Quantization
Compress models from FP32 down to 2-8 bit precision. Reduce size by up to 75% with minimal quality loss.
Adapt
Fine-tuning
Train task-specific adapters using LoRA. Efficient parameter updates without full retraining.
Switch
Adapter management
Dynamic loading and merging of LoRA adapters. Switch specialized behaviors at runtime.
Models, memory, and decoding control.
Beyond quantization and fine-tuning sit the everyday levers of a production local-inference deployment: discovering and loading the right model, protecting proprietary weights, fitting big models onto small hardware, freeing idle context, and shaping every output token. Each capability has a dedicated page.
Catalog
Model catalog
Discover, download, and run models by ID. Capability filtering, progress callbacks, configurable storage. From LoadFromModelID to running inference in one line.
Backends
Hardware backends
CUDA 12, CUDA 13, Vulkan, Metal, AVX2, AVX precompiled into a single NuGet. Auto-detects the fastest path on each host; pin a backend explicitly when fleet policy requires it.
Explore backends
Security
Encrypted model loading
Stream-decrypted GGUF. No plaintext on disk, ever. Standards-based cipher, password-derived keys, drop-in replacement for the standard load path.
Explore encrypted models
Hardware
Multi-GPU & tensor overrides
Run large MoE and dense models on commodity hardware. Per-tensor device placement, MoE expert offloading, distributed inference across GPUs.
Explore multi-GPU
Memory
Context hibernation
Pause an agent. Free the GPU. Resume hours later. Serialize multi-gigabyte inference contexts to disk; rehydrate transparently.
Explore hibernation
Decoding
Sampling & generation controls
Dynamic sampling, logit biasing, repetition penalty, entropy-bounded sampling, speculative decoding. Same model, dramatically different output behavior.
Explore sampling
Reduce model size and accelerate inference.
Reduce model size and accelerate inference by converting weights to lower-precision formats.
30+ precision formats
LM-Kit.NET supports an extensive range of quantization formats, from 1-bit to 16-bit precision. Each format offers different tradeoffs between model size, inference speed, and output quality. K-means clustered formats (Q*_K_*) provide better quality retention at the same bit-width.
Quantization reduces memory footprint dramatically, enabling deployment of larger models on constrained hardware. A 7B parameter model at FP16 (~14GB) can be compressed to ~4GB at Q4_K_M while maintaining near-original quality for most tasks.
Quantization features
- Cluster-aware formats for quality retention
- Batch quantization to all formats
- Model validation before processing
- GGUF format output
- Preserves model metadata
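For orientation, here is a minimal sketch of that workflow. The Quantizer class is taken from the console demo linked at the end of this page, but the namespace, constructor, method signature, and LM.Precision member name shown here are assumptions modeled on the GGUF format names; verify them against the API reference before use.

```csharp
using LMKit.Model;
using LMKit.Quantization; // namespace assumed from the Quantizer console demo

// Assumed API shape: open a source GGUF, emit a lower-precision copy.
var quantizer = new Quantizer("path/to/model-fp16.gguf");

// Q4_K_M: roughly 4x smaller than FP16 with near-original quality for most tasks.
// Enum member name is an assumption; check LM.Precision for the shipped names.
quantizer.Quantize("path/to/model-q4_k_m.gguf", LM.Precision.MOSTLY_Q4_K_M);

Console.WriteLine("Quantized model written.");
```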
LLM fine-tuning with LoRA.
Train task-specific adapters efficiently without modifying the base model weights.
Low-Rank Adaptation (LoRA) enables efficient fine-tuning by training small adapter layers while keeping base model weights frozen, dramatically reducing compute and memory requirements.
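Concretely, in the standard LoRA formulation (not LM-Kit-specific notation), a frozen base weight W receives a learned low-rank update:

```latex
W' = W + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B are trained, storing r(d + k) parameters per layer instead of d·k; the rank r and scaling factor α correspond to the LoraRank and LoraAlpha settings used in the training configuration below.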
Training configuration
- Configurable rank (r) and alpha scaling
- Per-tensor rank customization
- AdamW optimizer with decay control
- Gradient accumulation support
Training progress
Monitor training progress with real-time loss and accuracy metrics. Set early stopping conditions based on loss thresholds or maximum iterations.
- Loss and accuracy monitoring
- Early stopping conditions
- Convergence detection
Checkpointing
Save and restore training checkpoints to resume interrupted sessions. Preserve optimizer state and training progress across sessions.
- Automatic checkpoint saving
- Resume from any checkpoint
- Optimizer state preservation
```csharp
using LMKit.Model;
using LMKit.Finetuning;

// Load base model for fine-tuning
var model = new LM("path/to/base-model.gguf");

// Configure LoRA training parameters
var trainingParams = new LoraTrainingParameters
{
    LoraRank = 16,
    LoraAlpha = 32,
    AdamAlpha = 1e-4f,
    AdamBeta1 = 0.9f,
    AdamBeta2 = 0.999f,
    AdamDecay = 0.01f,
    GradientAccumulation = 4,
    MaxNoImprovement = 100,

    // Per-tensor rank customization
    RankWQ = 16, // Query weight
    RankWK = 16, // Key weight
    RankWV = 16, // Value weight
    RankWO = 8   // Output weight
};

// Create trainer and subscribe to progress events
var trainer = new LoraTrainer(model, trainingParams);
trainer.Progress += (s, e) =>
{
    Console.WriteLine($"Iteration {e.Iteration}: Loss={e.Loss:F4}, Accuracy={e.Accuracy:P1}");
};

// Train on your dataset
await trainer.TrainAsync(trainingDataset);

// Save the trained adapter
trainer.SaveAdapter("sentiment-adapter.lora");
```
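The sample above trains in one uninterrupted run; the checkpointing features slot in around the same trainer. The SaveCheckpoint and LoadCheckpoint member names below are assumptions for illustration, so check the API reference for the shipped names.

```csharp
using System.IO;

// Extends the training sample above. SaveCheckpoint / LoadCheckpoint
// are hypothetical names; LM-Kit.NET's checkpoint API may differ.
var trainer = new LoraTrainer(model, trainingParams);

// Resume an interrupted session, restoring optimizer state and progress
if (File.Exists("training.checkpoint"))
{
    trainer.LoadCheckpoint("training.checkpoint");
}

// Persist periodically so an interruption loses little work
trainer.Progress += (s, e) =>
{
    if (e.Iteration % 50 == 0)
    {
        trainer.SaveCheckpoint("training.checkpoint");
    }
};

await trainer.TrainAsync(trainingDataset);
```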
LoRA adapter integration.
Dynamically load, swap, and merge LoRA adapters without reloading the base model.
Hot-swap adapters at runtime
LM-Kit.NET enables dynamic LoRA adapter management at inference time. Load multiple adapters into a single model instance and control their influence through scale parameters. Switch between specialized behaviors (sentiment analysis, code generation, domain expertise) without the overhead of reloading the base model.
For permanent deployment, merge adapter weights directly into the base model using LoraMerger to create a single optimized model file with no runtime overhead.
Adapter operations
- Load adapters dynamically via ApplyLoraAdapter
- Scale-based activation control (0.0 to 1.0)
- Multiple adapters on single model instance
- Remove adapters with RemoveLoraAdapter
- Permanent merge via LoraMerger.Merge
Load
ApplyLoraAdapter
Load LoRA adapter from file or LoraAdapterSource. Registers in model's Adapters collection.
Scale
Scale control
Adjust adapter influence with Scale property. Set to 0 to disable, 1 for full effect.
Unload
RemoveLoraAdapter
Unload adapter from model instance. Free memory and restore base model behavior.
Merge
LoraMerger.Merge
Permanently merge adapter weights into base model. Create single optimized model file.
```csharp
using LMKit.Model;
using LMKit.Finetuning;

// Load base model
var model = new LM("path/to/base-model.gguf");

// Dynamic adapter loading
model.ApplyLoraAdapter("sentiment-adapter.lora", scale: 1.0f);
model.ApplyLoraAdapter("code-adapter.lora", scale: 0.0f); // Loaded but inactive

// Access loaded adapters
foreach (var adapter in model.Adapters)
{
    Console.WriteLine($"Adapter: {adapter.Name}, Scale: {adapter.Scale}");
}

// Switch active adapter at runtime
model.Adapters[0].Scale = 0.0f; // Disable sentiment
model.Adapters[1].Scale = 1.0f; // Enable code

// Remove adapter when no longer needed
model.RemoveLoraAdapter(model.Adapters[0]);

// Permanent merge for deployment
var merger = new LoraMerger(model);
merger.Merge("merged-model.gguf");
```
Optimization capabilities.
Comprehensive toolkit for model compression, adaptation, and deployment.
Cluster-aware
Importance-aware quantization
Cluster-aware quantization preserves model quality at lower bit-widths. Q4_K_M and Q5_K_M formats offer excellent quality-size tradeoffs.
Per-tensor
Per-tensor LoRA ranks
Fine-grained control over LoRA rank for each tensor type: WQ, WK, WV, WO, feed-forward, and normalization layers. Optimize adapter size and effectiveness.
Gradient
Gradient accumulation
Train with larger effective batch sizes on memory-constrained hardware. Accumulate gradients across multiple forward passes before updating weights.
Cosine
Cosine learning rate
Cosine decay scheduling with warm restarts. Configure decay steps, minimum rate, and restart multipliers for optimal training dynamics.
Clip
Gradient clipping
Prevent exploding gradients during training with configurable gradient clipping. Stabilize training on challenging datasets.
Datasets
Training datasets
Built-in support for training data management with ChatTrainingSample, TrainingDataset, and ShareGptExporter for data preparation and export.
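As a sketch of how those dataset classes fit together, assuming ChatTrainingSample pairs a user message with the expected reply and TrainingDataset collects samples; the constructor shapes and the Export signature are assumptions, so verify against the API reference.

```csharp
using LMKit.Finetuning;

// Assumed shapes for the classes named above (ChatTrainingSample,
// TrainingDataset, ShareGptExporter); real members may differ.
var dataset = new TrainingDataset();

// One supervised pair: user prompt and the completion to learn
dataset.Add(new ChatTrainingSample(
    "This product exceeded my expectations!", // user message
    "positive"));                             // expected assistant reply
dataset.Add(new ChatTrainingSample(
    "Arrived broken and support never answered.",
    "negative"));

// Export to ShareGPT-format JSON for inspection or reuse elsewhere
new ShareGptExporter().Export(dataset, "sentiment-train.json");
```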
Optimization use cases.
Deploy optimized models across resource-constrained environments.
Mobile
Mobile applications
Deploy quantized models on iOS and Android devices. Q4_K_M provides excellent quality at 4x smaller size for on-device inference.
Edge
IoT and edge devices
Run AI on Raspberry Pi, NVIDIA Jetson, and other edge hardware. Aggressive quantization (Q2_K, Q3_K) enables deployment on minimal resources.
Domain
Domain-specific assistants
Fine-tune models for legal, medical, finance, or technical domains. LoRA adapters specialize generic models for industry-specific tasks.
Sentiment
Sentiment analysis
Fine-tune for sentiment classification with dramatically improved accuracy. Start from a ~46% baseline and reach 95%+ with targeted training.
SaaS
Multi-tenant applications
Deploy one base model with multiple LoRA adapters for different customers or use cases. Switch adapters per request without model reload; see the sketch after these cards.
Science
Scientific assistants
Create specialized chemistry, biology, or physics assistants. LoRA fine-tuning can improve domain accuracy from a 17% baseline to 40%+.
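To make the multi-tenant pattern above concrete, here is a sketch that uses only the ApplyLoraAdapter call, Adapters collection, and Scale property shown earlier; the ActivateTenant helper and its name-matching convention are hypothetical.

```csharp
using LMKit.Model;

// One base model stays resident; one adapter per tenant, loaded once
var model = new LM("path/to/base-model.gguf");
model.ApplyLoraAdapter("tenant-a.lora", scale: 0.0f);
model.ApplyLoraAdapter("tenant-b.lora", scale: 0.0f);

// Hypothetical per-request hook: switching tenants is just a scale
// change, never a model reload. Name matching here is illustrative.
void ActivateTenant(string tenantAdapterName)
{
    foreach (var adapter in model.Adapters)
    {
        adapter.Scale = adapter.Name == tenantAdapterName ? 1.0f : 0.0f;
    }
}

ActivateTenant("tenant-a.lora"); // serve a request for tenant A
```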
Key classes & methods.
Core components for model optimization workflows.
LM.Precision
Enumeration of 30+ quantization formats from 1-bit to FP32. Includes K-means variants (Q*_K_*) and importance quantization (IQ*) formats.
LoraTrainingParameters
Complete configuration for LoRA fine-tuning: ranks, alpha scaling, Adam optimizer settings, gradient control, and learning rate scheduling.
LM.ApplyLoraAdapter
Dynamically load LoRA adapters into a model instance. Supports file paths and LoraAdapterSource objects with scale-based activation.
LM.Adapters
Collection of currently loaded LoRA adapters on a model instance. Access individual adapters to adjust Scale or retrieve metadata.
LoraMerger
Permanently merge LoRA adapter weights into base model. Creates optimized single-file deployment with no runtime adapter overhead.
LoraAdapter
Represents a loaded LoRA adapter with Name, Scale, and source information. Control adapter influence through Scale property (0.0-1.0).
Build it. Read it. Try it.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Quantizer
Console demo: compress models from FP32 to 2-bit precision.
Open on GitHub →
How-to guide
Quantize a model
Choose precision, balance quality vs size, ship.
Read the guide →
How-to guide
Load and merge LoRA adapters
Hot-swap adapters at runtime without reloading the base model.
Read the guide →
How-to guide
Prepare training datasets for LoRA fine-tuning
Data prep, validation, and quality controls for LoRA training.
Read the guide →
Seven pillars, one foundation.
The seven pillars of LM-Kit.NET, plus the local runtime they share.
01 · AI Agents
Orchestration patterns
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
AI Agents
02 · Document Intelligence
Parse PDFs, images, EML
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
Document Intelligence
03 · Vision & Multimodal
VLMs, image classification, chat with image
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
Vision & Multimodal
04 · RAG & Knowledge
Vector search and retrieval
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
RAG & Knowledge
05 · Text Analysis
Classification, NER, PII, sentiment
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
Text Analysis
06 · Speech & Audio
Audio transcription, STT
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
Speech & Audio
07 · Text Generation
Conversations, rewriting, summaries
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarize. Prompt templates, streaming, grammar-constrained outputs.
Text Generation
The foundation
Every capability above runs on this runtime.
Foundation
Local Inference
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Ready to optimize your models?
Compress, fine-tune, and deploy AI models optimized for your hardware. 100% local, 100% private.