Compress, Fine-tune & Deploy Optimized AI Models.
Full model optimization toolkit for edge deployment. Quantize models from FP32 down to 2-bit precision, fine-tune with LoRA adapters, and dynamically switch adapters at runtime. Reduce model size by up to 75% while preserving quality. 100% on-device processing.
Optimize Models for Edge Deployment
LM-Kit.NET provides a complete model optimization toolkit for deploying AI on resource-constrained devices. Reduce model size through quantization, adapt models to specific tasks with LoRA fine-tuning, and dynamically switch between specialized adapters at runtime without reloading the base model.
Whether you're targeting mobile devices, IoT systems, or desktop applications with limited resources, LM-Kit.NET's optimization features let you balance model quality against computational constraints while keeping all processing 100% local and private.
Edge-first design: All optimization operations run entirely on-device. Quantize, fine-tune, and deploy models without any cloud dependencies or data transmission.
Quantization
LM.Precision
Compress models from FP32 to 2-8 bit precision. Reduce size up to 75% with minimal quality loss.
Fine-tuning
LoraTrainingParameters
Train task-specific adapters using LoRA technique. Efficient parameter updates without full retraining.
Adapter Management
LoraAdapter
Dynamic loading and merging of LoRA adapters. Switch specialized behaviors at runtime.
Model Quantization
Reduce model size and accelerate inference by converting weights to lower precision formats.
30+ Precision Formats
LM-Kit.NET supports an extensive range of quantization formats, from 1-bit to 16-bit precision. Each format offers different tradeoffs between model size, inference speed, and output quality. K-means clustered formats (Q*_K_*) provide better quality retention at the same bit-width.
Quantization reduces memory footprint dramatically, enabling deployment of larger models on constrained hardware. A 7B-parameter model at FP16 (~14 GB) can be compressed to ~4 GB at Q4_K_M while maintaining near-original quality for most tasks (see the sketch after this list).
- K-means clustering for quality retention
- Batch quantization to all formats
- Model validation before processing
- GGUF format output
- Preserves model metadata
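This page doesn't show the quantization call itself. As a minimal sketch, assuming a Quantizer type with a Quantize method (the type, namespace, and method are hypothetical names for illustration; only the LM.Precision enumeration and the Q4_K_M format come from the documented API):

using LMKit.Model;
using LMKit.Quantization; // namespace assumed for illustration

// Compress a full-precision GGUF model to Q4_K_M (K-means, 4-bit):
// a 7B model drops from ~14 GB at FP16 to ~4 GB.
var quantizer = new Quantizer("path/to/model-fp16.gguf");

// Exact enum member name may differ from LM.Precision.Q4_K_M.
quantizer.Quantize("path/to/model-q4_k_m.gguf", LM.Precision.Q4_K_M);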
LLM Fine-tuning with LoRA
Train task-specific adapters efficiently without modifying the base model weights.
Low-Rank Adaptation (LoRA) enables efficient fine-tuning by training small adapter layers while keeping the base model weights frozen, which dramatically reduces compute and memory requirements (see the arithmetic after the list below).
- Configurable rank (r) and alpha scaling
- Per-tensor rank customization
- AdamW optimizer with decay control
- Gradient accumulation support
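To see why this is efficient, note that LoRA replaces a full weight update with a low-rank product ΔW = (α/r)·B·A, where only B (d×r) and A (r×k) are trained. A quick calculation for a single 4096×4096 projection (dimensions chosen for illustration, typical of 7B-class models):

// Trainable parameters for one LoRA-adapted projection at rank 16:
int d = 4096, k = 4096, r = 16;
long loraParams = (long)r * (d + k); // B (d x r) + A (r x k) = 131,072
long fullParams = (long)d * k;       // full matrix           = 16,777,216

// LoRA trains ~0.8% of this matrix's weights.
Console.WriteLine($"{100.0 * loraParams / fullParams:F1}%");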
Monitor training progress with real-time loss and accuracy metrics. Set early stopping conditions based on loss thresholds or maximum iterations.
- Loss and accuracy monitoring
- Early stopping conditions
- Convergence detection
Save and restore training checkpoints to resume interrupted sessions. Preserve optimizer state and training progress across sessions (see the sketch after the example below).
- Automatic checkpoint saving
- Resume from any checkpoint
- Optimizer state preservation
using LMKit.Model;
using LMKit.Finetuning;

// Load base model for fine-tuning
var model = new LM("path/to/base-model.gguf");

// Configure LoRA training parameters
var trainingParams = new LoraTrainingParameters
{
    LoraRank = 16,
    LoraAlpha = 32,
    AdamAlpha = 1e-4f,
    AdamBeta1 = 0.9f,
    AdamBeta2 = 0.999f,
    AdamDecay = 0.01f,
    GradientAccumulation = 4,
    MaxNoImprovement = 100,

    // Per-tensor rank customization
    RankWQ = 16, // Query weight
    RankWK = 16, // Key weight
    RankWV = 16, // Value weight
    RankWO = 8   // Output weight
};

// Create trainer and subscribe to progress events
var trainer = new LoraTrainer(model, trainingParams);
trainer.Progress += (s, e) =>
{
    Console.WriteLine($"Iteration {e.Iteration}: Loss={e.Loss:F4}, Accuracy={e.Accuracy:P1}");
};

// Train on your dataset
await trainer.TrainAsync(trainingDataset);

// Save the trained adapter
trainer.SaveAdapter("sentiment-adapter.lora");
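The example above doesn't exercise the checkpointing features listed earlier. A minimal sketch, assuming SaveCheckpoint and LoadCheckpoint methods on LoraTrainer (hypothetical names, not confirmed API):

// Hypothetical checkpoint round-trip; actual method names may differ.
trainer.SaveCheckpoint("training-state.ckpt"); // adapter weights + optimizer state

// In a later session: rebuild the trainer, restore, and resume training.
var resumedTrainer = new LoraTrainer(model, trainingParams);
resumedTrainer.LoadCheckpoint("training-state.ckpt");
await resumedTrainer.TrainAsync(trainingDataset);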
LoRA Adapter Integration
Dynamically load, swap, and merge LoRA adapters without reloading the base model.
Hot-Swap Adapters at Runtime
LM-Kit.NET enables dynamic LoRA adapter management at inference time. Load multiple adapters into a single model instance and control their influence through scale parameters. Switch between specialized behaviors (sentiment analysis, code generation, domain expertise) without the overhead of reloading the base model.
For permanent deployment, merge adapter weights directly into the base model using LoraMerger to create a single optimized model file with no runtime overhead.
- Load adapters dynamically via ApplyLoraAdapter
- Scale-based activation control (0.0 to 1.0)
- Multiple adapters on a single model instance
- Remove adapters with RemoveLoraAdapter
- Permanent merge via LoraMerger.Merge
ApplyLoraAdapter
Load a LoRA adapter from a file or LoraAdapterSource. Registers it in the model's Adapters collection.
Scale Control
Adjust adapter influence with Scale property. Set to 0 to disable, 1 for full effect.
RemoveLoraAdapter
Unload adapter from model instance. Free memory and restore base model behavior.
LoraMerger.Merge
Permanently merge adapter weights into base model. Create single optimized model file.
using LMKit.Model;
using LMKit.Finetuning;

// Load base model
var model = new LM("path/to/base-model.gguf");

// Dynamic adapter loading
model.ApplyLoraAdapter("sentiment-adapter.lora", scale: 1.0f);
model.ApplyLoraAdapter("code-adapter.lora", scale: 0.0f); // Loaded but inactive

// Access loaded adapters
foreach (var adapter in model.Adapters)
{
    Console.WriteLine($"Adapter: {adapter.Name}, Scale: {adapter.Scale}");
}

// Switch active adapter at runtime
model.Adapters[0].Scale = 0.0f; // Disable sentiment
model.Adapters[1].Scale = 1.0f; // Enable code

// Remove adapter when no longer needed
model.RemoveLoraAdapter(model.Adapters[0]);

// Permanent merge for deployment
var merger = new LoraMerger(model);
merger.Merge("merged-model.gguf");
Optimization Capabilities
Comprehensive toolkit for model compression, adaptation, and deployment.
K-Means Quantization
Advanced quantization with K-means clustering preserves model quality at lower bit-widths. Q4_K_M and Q5_K_M formats offer excellent quality-size tradeoffs.
Per-Tensor LoRA Ranks
Fine-grained control over LoRA rank for each tensor type: WQ, WK, WV, WO, feed-forward, and normalization layers. Optimize adapter size and effectiveness.
Gradient Accumulation
Train with larger effective batch sizes on memory-constrained hardware. Accumulate gradients across multiple forward passes before updating weights.
Cosine Learning Rate
Cosine decay scheduling with warm restarts. Configure decay steps, minimum rate, and restart multipliers for optimal training dynamics.
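These scheduler parameters map onto the standard SGDR (cosine annealing with warm restarts) formula. The sketch below implements the schedule itself, independent of any LM-Kit API; the parameter names are illustrative:

// Cosine decay with warm restarts (SGDR). Within a cycle of length
// cycleLen, the rate anneals from lrMax down to lrMin; each restart
// multiplies the cycle length by restartMult (assumed >= 1).
static float CosineLr(long step, float lrMax, float lrMin,
                      long decaySteps, float restartMult)
{
    long cycleLen = decaySteps;
    while (step >= cycleLen) // locate position within the current cycle
    {
        step -= cycleLen;
        cycleLen = (long)(cycleLen * restartMult);
    }
    double progress = (double)step / cycleLen;
    return lrMin + 0.5f * (lrMax - lrMin) * (float)(1.0 + Math.Cos(Math.PI * progress));
}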
Gradient Clipping
Prevent exploding gradients during training with configurable gradient clipping. Stabilize training on challenging datasets.
Training Datasets
Built-in support for training data management with ChatTrainingSample, TrainingDataset, and ShareGptExporter for data preparation and export.
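Only the three type names above come from the library; the constructor and method shapes in this sketch are assumptions for illustration:

// Hypothetical shapes around the documented type names.
var dataset = new TrainingDataset();
dataset.Add(new ChatTrainingSample(
    userMessage: "How do I reset my password?",
    assistantMessage: "Open Settings > Account > Reset Password."));

// Export in the widely used ShareGPT JSON layout.
new ShareGptExporter().Export(dataset, "dataset.sharegpt.json");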
Optimization Use Cases
Deploy optimized models across resource-constrained environments.
Mobile Applications
Deploy quantized models on iOS and Android devices. Q4_K_M provides excellent quality at 4x smaller size for on-device inference.
IoT and Edge Devices
Run AI on Raspberry Pi, NVIDIA Jetson, and other edge hardware. Aggressive quantization (Q2_K, Q3_K) enables deployment on minimal resources.
Domain-Specific Assistants
Fine-tune models for legal, medical, finance, or technical domains. LoRA adapters specialize generic models for industry-specific tasks.
Sentiment Analysis
Fine-tune for sentiment classification with dramatically improved accuracy: from a ~46% baseline to 95%+ with targeted training.
Multi-Tenant Applications
Deploy one base model with multiple LoRA adapters for different customers or use cases. Switch adapters per-request without model reload.
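Using only the Adapters collection and Scale property shown earlier, per-request routing reduces to activating one adapter and zeroing the rest (the tenant-to-adapter lookup is an assumption of this sketch):

// Activate exactly one adapter; the others stay loaded but inert.
static void ActivateAdapter(LM model, string adapterName)
{
    foreach (var adapter in model.Adapters)
        adapter.Scale = adapter.Name == adapterName ? 1.0f : 0.0f;
}

// Per request: switch behavior without reloading the base model.
// tenantAdapters is a hypothetical tenant-to-adapter-name map.
ActivateAdapter(model, tenantAdapters[request.TenantId]);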
Scientific Assistants
Create specialized chemistry, biology, or physics assistants. LoRA fine-tuning can improve domain accuracy from a 17% baseline to 40% or more.
Key Classes & Methods
Core components for model optimization workflows.
LM.Precision
Enumeration of 30+ quantization formats from 1-bit to FP32. Includes K-means variants (Q*_K_*) and importance quantization (IQ*) formats.
LoraTrainingParameters
Complete configuration for LoRA fine-tuning: ranks, alpha scaling, Adam optimizer settings, gradient control, and learning rate scheduling.
LM.ApplyLoraAdapter
Dynamically load LoRA adapters into a model instance. Supports file paths and LoraAdapterSource objects with scale-based activation.
LM.Adapters
Collection of currently loaded LoRA adapters on a model instance. Access individual adapters to adjust Scale or retrieve metadata.
LoraMerger
Permanently merge LoRA adapter weights into base model. Creates optimized single-file deployment with no runtime adapter overhead.
LoraAdapter
Represents a loaded LoRA adapter with Name, Scale, and source information. Control adapter influence through Scale property (0.0-1.0).
Ready to Optimize Your Models?
Compress, fine-tune, and deploy AI models optimized for your hardware. 100% local, 100% private.