Model Optimization

Compress, fine-tune & deploy optimized AI models.

Full model optimization toolkit for edge deployment. Quantize models from FP32 down to 1-bit precision, fine-tune with LoRA adapters, and dynamically switch adapters at runtime. Reduce model size by up to 75% while preserving quality. 100% on-device processing.

30+ precision formats · LoRA fine-tuning · Hot-swap adapters

Optimize models for edge deployment.

LM-Kit.NET provides a complete model optimization toolkit for deploying AI on resource-constrained devices. Reduce model size through quantization, adapt models to specific tasks with LoRA fine-tuning, and dynamically switch between specialized adapters at runtime without reloading the base model.

Whether you're targeting mobile devices, IoT systems, or desktop applications with limited resources, LM-Kit.NET's optimization features let you balance model quality against computational constraints while keeping all processing 100% local and private.

Edge-first design: All optimization operations run entirely on-device. Quantize, fine-tune, and deploy models without any cloud dependencies or data transmission.

Compress

Quantization

Compress models from FP32 down to 1-bit precision. Reduce size by up to 75% with minimal quality loss.

Adapt

Fine-tuning

Train task-specific adapters using LoRA. Efficient parameter updates without full retraining.

Switch

Adapter management

Dynamic loading and merging of LoRA adapters. Switch specialized behaviors at runtime.

Runtime essentials

Models, memory, and decoding control.

Beyond quantization and fine-tuning sit the everyday levers of a production local-inference deployment: discovering and loading the right model, protecting proprietary weights, fitting big models onto small hardware, freeing idle context, and shaping every output token. Each capability has a dedicated page.

Model quantization

Reduce model size and accelerate inference.

Convert weights to lower-precision formats to shrink models and speed up inference.

30+ precision formats

LM-Kit.NET supports an extensive range of quantization formats, from 1-bit to 16-bit precision. Each format offers different tradeoffs between model size, inference speed, and output quality. K-means clustered formats (Q*_K_*) provide better quality retention at the same bit-width.

Quantization reduces memory footprint dramatically, enabling deployment of larger models on constrained hardware. A 7B parameter model at FP16 (~14GB) can be compressed to ~4GB at Q4_K_M while maintaining near-original quality for most tasks.

Quantization features

  • Cluster-aware formats for quality retention
  • Batch quantization to all formats
  • Model validation before processing
  • GGUF format output
  • Preserves model metadata
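
As a sketch, a quantization pass could look like the following. The Quantizer type and Quantize method are assumptions for illustration; LM.Precision is the documented format enumeration (see Developer Resources below). Check the API reference for exact signatures.

QuantizeModel.cs
using LMKit.Model;
using LMKit.Quantization; // assumed namespace; verify against the API reference

// Hypothetical sketch: compress an FP16 GGUF model to Q4_K_M.
// A 7B model drops from ~14GB (FP16) to ~4GB with this format.
var quantizer = new Quantizer("path/to/model-f16.gguf"); // assumed type

// LM.Precision enumerates the 30+ supported formats; the exact
// member name for Q4_K_M is an assumption here.
quantizer.Quantize("path/to/model-q4_k_m.gguf", LM.Precision.Q4_K_M);
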
LLM fine-tuning

LLM fine-tuning with LoRA.

Train task-specific adapters efficiently without modifying the base model weights.

Low-Rank Adaptation (LoRA) enables efficient fine-tuning by training small adapter layers while keeping base model weights frozen, dramatically reducing compute and memory requirements compared to full retraining.

Training configuration

  • Configurable rank (r) and alpha scaling
  • Per-tensor rank customization
  • AdamW optimizer with decay control
  • Gradient accumulation support

Checkpointing

Save and restore training checkpoints to resume interrupted sessions. Preserve optimizer state and training progress across sessions.

  • Automatic checkpoint saving
  • Resume from any checkpoint
  • Optimizer state preservation

LoraTraining.cs
using LMKit.Model;
using LMKit.Finetuning;

// Load base model for fine-tuning
var model = new LM("path/to/base-model.gguf");

// Configure LoRA training parameters
var trainingParams = new LoraTrainingParameters
{
    LoraRank = 16,
    LoraAlpha = 32,
    AdamAlpha = 1e-4f,
    AdamBeta1 = 0.9f,
    AdamBeta2 = 0.999f,
    AdamDecay = 0.01f,
    GradientAccumulation = 4,
    MaxNoImprovement = 100,
    // Per-tensor rank customization
    RankWQ = 16, // Query weight
    RankWK = 16, // Key weight
    RankWV = 16, // Value weight
    RankWO = 8   // Output weight
};

// Create trainer and subscribe to progress events
var trainer = new LoraTrainer(model, trainingParams);
trainer.Progress += (s, e) =>
{
    Console.WriteLine($"Iteration {e.Iteration}: Loss={e.Loss:F4}, Accuracy={e.Accuracy:P1}");
};

// Train on your dataset (trainingDataset is assumed to be a
// TrainingDataset prepared beforehand; see the dataset sketch below)
await trainer.TrainAsync(trainingDataset);

// Save the trained adapter
trainer.SaveAdapter("sentiment-adapter.lora");
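
The checkpointing features described above might be wired up as follows. SaveCheckpoint and LoadCheckpoint are illustrative names, not confirmed API; the behavior they sketch (persisting and restoring optimizer state and progress) is as documented.

Checkpointing.cs
// Hypothetical sketch continuing from the trainer above; the
// SaveCheckpoint/LoadCheckpoint method names are assumptions.

// Persist training progress, including optimizer state.
trainer.SaveCheckpoint("training-checkpoint.bin");

// Later (e.g. after an interruption), restore the checkpoint and
// resume training from where the previous session stopped.
trainer.LoadCheckpoint("training-checkpoint.bin");
await trainer.TrainAsync(trainingDataset);
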
LoRA integration

LoRA adapter integration.

Dynamically load, swap, and merge LoRA adapters without reloading the base model.

Hot-swap adapters at runtime

LM-Kit.NET enables dynamic LoRA adapter management at inference time. Load multiple adapters into a single model instance and control their influence through scale parameters. Switch between specialized behaviors (sentiment analysis, code generation, domain expertise) without the overhead of reloading the base model.

For permanent deployment, merge adapter weights directly into the base model using LoraMerger to create a single optimized model file with no runtime overhead.

Adapter operations

  • Load adapters dynamically via ApplyLoraAdapter
  • Scale-based activation control (0.0 to 1.0)
  • Multiple adapters on single model instance
  • Remove adapters with RemoveLoraAdapter
  • Permanent merge via LoraMerger.Merge

Load

ApplyLoraAdapter

Load LoRA adapter from file or LoraAdapterSource. Registers in model's Adapters collection.

Scale

Scale control

Adjust adapter influence with Scale property. Set to 0 to disable, 1 for full effect.

Unload

RemoveLoraAdapter

Unload adapter from model instance. Free memory and restore base model behavior.

Merge

LoraMerger.Merge

Permanently merge adapter weights into base model. Create single optimized model file.

LoraIntegration.cs
using LMKit.Model;
using LMKit.Finetuning;

// Load base model
var model = new LM("path/to/base-model.gguf");

// Dynamic adapter loading
model.ApplyLoraAdapter("sentiment-adapter.lora", scale: 1.0f);
model.ApplyLoraAdapter("code-adapter.lora", scale: 0.0f); // Loaded but inactive

// Access loaded adapters
foreach (var adapter in model.Adapters)
{
    Console.WriteLine($"Adapter: {adapter.Name}, Scale: {adapter.Scale}");
}

// Switch active adapter at runtime
model.Adapters[0].Scale = 0.0f; // Disable sentiment
model.Adapters[1].Scale = 1.0f; // Enable code

// Remove adapter when no longer needed
model.RemoveLoraAdapter(model.Adapters[0]);

// Permanent merge for deployment
var merger = new LoraMerger(model);
merger.Merge("merged-model.gguf");

Capabilities

Optimization capabilities.

Comprehensive toolkit for model compression, adaptation, and deployment.

Cluster-aware

Importance-aware quantization

Cluster-aware quantization preserves model quality at lower bit-widths. Q4_K_M and Q5_K_M formats offer excellent quality-size tradeoffs.

Per-tensor

Per-tensor LoRA ranks

Fine-grained control over LoRA rank for each tensor type: WQ, WK, WV, WO, feed-forward, and normalization layers. Optimize adapter size and effectiveness.

Gradient

Gradient accumulation

Train with larger effective batch sizes on memory-constrained hardware. Accumulate gradients across multiple forward passes before updating weights.
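
The arithmetic is simple: effective batch size equals the per-step batch multiplied by the accumulation count. A minimal sketch using the documented GradientAccumulation parameter:

// A per-step batch of 2 with GradientAccumulation = 4 updates weights
// once every 4 forward/backward passes, behaving like a batch of 8
// while only holding 2 samples' activations in memory at a time.
var p = new LoraTrainingParameters { GradientAccumulation = 4 };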

Cosine

Cosine learning rate

Cosine decay scheduling with warm restarts. Configure decay steps, minimum rate, and restart multipliers for optimal training dynamics.

Clip

Gradient clipping

Prevent exploding gradients during training with configurable gradient clipping. Stabilize training on challenging datasets.

Datasets

Training datasets

Built-in support for training data management with ChatTrainingSample, TrainingDataset, and ShareGptExporter for data preparation and export.
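
A dataset preparation sketch using these documented class names (the constructor and method shapes shown are assumptions; check the API reference):

PrepareDataset.cs
using LMKit.Finetuning; // assumed namespace for the dataset types

// Hypothetical sketch: build a small sentiment dataset. The class
// names come from the documentation above; the member signatures
// are assumptions.
var trainingDataset = new TrainingDataset();
trainingDataset.Add(new ChatTrainingSample(
    input: "This product exceeded my expectations!",
    expectedOutput: "positive"));
trainingDataset.Add(new ChatTrainingSample(
    input: "Arrived broken and support never replied.",
    expectedOutput: "negative"));

// Export to ShareGPT format for inspection or reuse in other tools.
ShareGptExporter.Export(trainingDataset, "sentiment-train.json");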

Applications

Optimization use cases.

Deploy optimized models across resource-constrained environments.

Mobile

Mobile applications

Deploy quantized models on iOS and Android devices. Q4_K_M provides excellent quality at 4x smaller size for on-device inference.

Edge

IoT and edge devices

Run AI on Raspberry Pi, NVIDIA Jetson, and other edge hardware. Aggressive quantization (Q2_K, Q3_K) enables deployment on minimal resources.

Domain

Domain-specific assistants

Fine-tune models for legal, medical, finance, or technical domains. LoRA adapters specialize generic models for industry-specific tasks.

Sentiment

Sentiment analysis

Fine-tune for sentiment classification with dramatically improved accuracy: from a ~46% baseline to 95%+ with targeted training.

SaaS

Multi-tenant applications

Deploy one base model with multiple LoRA adapters for different customers or use cases. Switch adapters per-request without model reload.
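
A sketch of the routing pattern, using only the documented Adapters collection and Scale property (the tenant-to-adapter naming convention is an assumption):

TenantRouting.cs
using LMKit.Model;

// Activate exactly one tenant's adapter before serving a request.
// Assumes each tenant's adapter was loaded earlier via ApplyLoraAdapter
// and is identified by its Name.
static void ActivateTenantAdapter(LM model, string tenantAdapterName)
{
    foreach (var adapter in model.Adapters)
        adapter.Scale = adapter.Name == tenantAdapterName ? 1.0f : 0.0f;
}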

Science

Scientific assistants

Create specialized chemistry, biology, or physics assistants. LoRA fine-tuning can improve domain accuracy from a 17% baseline to 40%+.

Developer Resources

Key classes & methods.

Core components for model optimization workflows.

LM.Precision

Enumeration of 30+ quantization formats from 1-bit to FP32. Includes K-means variants (Q*_K_*) and importance quantization (IQ*) formats.

View documentation

LoraTrainingParameters

Complete configuration for LoRA fine-tuning: ranks, alpha scaling, Adam optimizer settings, gradient control, and learning rate scheduling.

View documentation

LM.ApplyLoraAdapter

Dynamically load LoRA adapters into a model instance. Supports file paths and LoraAdapterSource objects with scale-based activation.

View documentation

LM.Adapters

Collection of currently loaded LoRA adapters on a model instance. Access individual adapters to adjust Scale or retrieve metadata.

View documentation

LoraMerger

Permanently merge LoRA adapter weights into base model. Creates optimized single-file deployment with no runtime adapter overhead.

View documentation

LoraAdapter

Represents a loaded LoRA adapter with Name, Scale, and source information. Control adapter influence through Scale property (0.0-1.0).

View documentation

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Ready to optimize your models?

Compress, fine-tune, and deploy AI models optimized for your hardware. 100% local, 100% private.

Download free · API documentation