Model Optimization

Compress, Fine-tune & Deploy Optimized AI Models.

Full model optimization toolkit for edge deployment. Quantize models from FP32 to 2-bit precision, fine-tune with LoRA adapters, and dynamically switch adapters at runtime. Reduce model size by up to 75% while preserving quality. 100% on-device processing.

30+ Quantization Formats · LoRA Fine-tuning · Runtime Adapter Swap
Model Quantization: Compress FP32 models to 2-8 bit precision with K-means clustering (up to 75% smaller)
LoRA Fine-tuning: Train task-specific adapters with minimal compute
Dynamic Adapter Loading: Hot-swap LoRA adapters at runtime without reloading the base model
Adapter Merging: Permanently merge LoRA weights into the base model for deployment
30+ Precision Formats · 100% Local · 5x Faster Inference

Optimize Models for Edge Deployment

LM-Kit.NET provides a complete model optimization toolkit for deploying AI on resource-constrained devices. Reduce model size through quantization, adapt models to specific tasks with LoRA fine-tuning, and dynamically switch between specialized adapters at runtime without reloading the base model.

Whether you're targeting mobile devices, IoT systems, or desktop applications with limited resources, LM-Kit.NET's optimization features let you balance model quality against computational constraints while keeping all processing 100% local and private.

Edge-first design: All optimization operations run entirely on-device. Quantize, fine-tune, and deploy models without any cloud dependencies or data transmission.

Quantization

LM.Precision

Compress models from FP32 to 2-8 bit precision. Reduce size by up to 75% with minimal quality loss.

Fine-tuning

LoraTrainingParameters

Train task-specific adapters with the LoRA technique: efficient parameter updates without retraining the full model.

Adapter Management

LoraAdapter

Dynamic loading and merging of LoRA adapters. Switch specialized behaviors at runtime.

Model Quantization

Reduce model size and accelerate inference by converting weights to lower precision formats.

30+ Precision Formats

LM-Kit.NET supports an extensive range of quantization formats, from 1-bit to 16-bit precision. Each format offers different tradeoffs between model size, inference speed, and output quality. K-means clustered formats (Q*_K_*) provide better quality retention at the same bit-width.

Quantization reduces memory footprint dramatically, enabling deployment of larger models on constrained hardware. A 7B parameter model at FP16 (~14GB) can be compressed to ~4GB at Q4_K_M while maintaining near-original quality for most tasks.
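
A minimal quantization sketch is shown below. LM.Precision is the documented format enumeration; the ModelQuantizer class name and Quantize signature are assumptions for illustration, so check the API reference for exact usage.

using LMKit.Model;

// Hypothetical quantizer API, shown for illustration; LM.Precision is the
// documented format enumeration, but the ModelQuantizer class name and
// Quantize signature are assumptions.
var quantizer = new ModelQuantizer("path/to/model-f16.gguf");
quantizer.Quantize("model-q4_k_m.gguf", LM.Precision.Q4_K_M);  // ~14 GB to ~4 GB for a 7B model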

  • K-means clustering for quality retention
  • Batch quantization to all formats
  • Model validation before processing
  • GGUF format output
  • Preserves model metadata
View Quantization Demo
Format   Description            Size        Quality
Q2_K     2-bit K-means          Smallest    Lower
Q3_K_M   3-bit K-means Medium   Very Small  Moderate
Q4_K_S   4-bit K-means Small    Small       Good
Q4_K_M   4-bit K-means Medium   Medium      Recommended
Q5_K_S   5-bit K-means Small    Large       Recommended
Q5_K_M   5-bit K-means Medium   Large       Recommended
Q6_K     6-bit K-means          Very Large  Excellent
Q8_0     8-bit Integer          Very Large  Near Original
F16      16-bit Float           Largest     Original

LLM Fine-tuning with LoRA

Train task-specific adapters efficiently without modifying the base model weights.

Progress Tracking
Real-time Metrics

Monitor training progress with real-time loss and accuracy metrics. Set early stopping conditions based on loss thresholds or maximum iterations.

  • Loss and accuracy monitoring
  • Early stopping conditions
  • Convergence detection
Checkpointing
Resume Training

Save and restore training checkpoints to resume interrupted runs, preserving optimizer state and training progress across sessions (a checkpoint sketch follows the code sample below).

  • Automatic checkpoint saving
  • Resume from any checkpoint
  • Optimizer state preservation
LoraFinetuning.cs
using LMKit.Model;
using LMKit.Finetuning;

// Load base model for fine-tuning
var model = new LM("path/to/base-model.gguf");

// Configure LoRA training parameters
var trainingParams = new LoraTrainingParameters
{
    LoraRank = 16,
    LoraAlpha = 32,
    AdamAlpha = 1e-4f,
    AdamBeta1 = 0.9f,
    AdamBeta2 = 0.999f,
    AdamDecay = 0.01f,
    GradientAccumulation = 4,
    MaxNoImprovement = 100,
    // Per-tensor rank customization
    RankWQ = 16,  // Query weight
    RankWK = 16,  // Key weight
    RankWV = 16,  // Value weight
    RankWO = 8    // Output weight
};

// Create trainer and subscribe to progress events
var trainer = new LoraTrainer(model, trainingParams);
trainer.Progress += (s, e) =>
{
    Console.WriteLine($"Iteration {e.Iteration}: Loss={e.Loss:F4}, Accuracy={e.Accuracy:P1}");
};

// Train on your dataset (trainingDataset is a prepared TrainingDataset;
// see the Training Datasets capability below)
await trainer.TrainAsync(trainingDataset);

// Save the trained adapter
trainer.SaveAdapter("sentiment-adapter.lora");
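
The run above trains to completion in one session. For the checkpointing described earlier, a hedged sketch follows; the method names are assumptions, so see the reference documentation for the exact checkpoint API.

// Hypothetical checkpoint methods; names shown for illustration only.
trainer.SaveCheckpoint("training-state.ckpt");        // persist optimizer state and progress

// Later, in a new session: recreate the trainer and resume.
var resumed = new LoraTrainer(model, trainingParams);
resumed.LoadCheckpoint("training-state.ckpt");        // hypothetical
await resumed.TrainAsync(trainingDataset);            // continues where training stopped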

LoRA Adapter Integration

Dynamically load, swap, and merge LoRA adapters without reloading the base model.

Hot-Swap Adapters at Runtime

LM-Kit.NET enables dynamic LoRA adapter management at inference time. Load multiple adapters into a single model instance and control their influence through scale parameters. Switch between specialized behaviors (sentiment analysis, code generation, domain expertise) without the overhead of reloading the base model.

For permanent deployment, merge adapter weights directly into the base model using LoraMerger to create a single optimized model file with no runtime overhead.

  • Load adapters dynamically via ApplyLoraAdapter
  • Scale-based activation control (0.0 to 1.0)
  • Multiple adapters on single model instance
  • Remove adapters with RemoveLoraAdapter
  • Permanent merge via LoraMerger.Merge
View Fine-tuning Demo

ApplyLoraAdapter

Load a LoRA adapter from a file or LoraAdapterSource. The adapter registers in the model's Adapters collection.

Scale Control

Adjust adapter influence with Scale property. Set to 0 to disable, 1 for full effect.

RemoveLoraAdapter

Unload adapter from model instance. Free memory and restore base model behavior.

LoraMerger.Merge

Permanently merge adapter weights into base model. Create single optimized model file.

LoraIntegration.cs
using LMKit.Model;
using LMKit.Finetuning;

// Load base model
var model = new LM("path/to/base-model.gguf");

// Dynamic adapter loading
model.ApplyLoraAdapter("sentiment-adapter.lora", scale: 1.0f);
model.ApplyLoraAdapter("code-adapter.lora", scale: 0.0f);  // Loaded but inactive

// Access loaded adapters
foreach (var adapter in model.Adapters)
{
    Console.WriteLine($"Adapter: {adapter.Name}, Scale: {adapter.Scale}");
}

// Switch active adapter at runtime
model.Adapters[0].Scale = 0.0f;  // Disable sentiment
model.Adapters[1].Scale = 1.0f;  // Enable code

// Remove adapter when no longer needed
model.RemoveLoraAdapter(model.Adapters[0]);

// Permanent merge for deployment
var merger = new LoraMerger(model);
merger.Merge("merged-model.gguf");

Optimization Capabilities

Comprehensive toolkit for model compression, adaptation, and deployment.

K-Means Quantization

Advanced quantization with K-means clustering preserves model quality at lower bit-widths. Q4_K_M and Q5_K_M formats offer excellent quality-size tradeoffs.

Per-Tensor LoRA Ranks

Fine-grained control over LoRA rank for each tensor type: WQ, WK, WV, WO, feed-forward, and normalization layers. Optimize adapter size and effectiveness.

Gradient Accumulation

Train with larger effective batch sizes on memory-constrained hardware. Accumulate gradients across multiple forward passes before updating weights.
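
A conceptual sketch of the mechanics (generic C#, not LM-Kit internals): with GradientAccumulation = 4, gradients from four forward/backward passes are summed before a single optimizer update, giving a 4x larger effective batch.

using System;

// Conceptual illustration of gradient accumulation (not LM-Kit internals):
// gradients from several micro-batches are summed before one optimizer step.
int accumulationSteps = 4;                      // mirrors GradientAccumulation = 4
double accumulated = 0;

for (int micro = 1; micro <= accumulationSteps; micro++)
{
    double grad = 0.1 * micro;                  // stand-in for a backward pass
    accumulated += grad;
}

// One weight update for every 4 forward/backward passes: with a micro-batch
// of 8 samples, the effective batch size is 32.
Console.WriteLine($"optimizer step with mean grad {accumulated / accumulationSteps:F3}");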

Cosine Learning Rate

Cosine decay scheduling with warm restarts. Configure decay steps, minimum rate, and restart multipliers for optimal training dynamics.
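
The schedule follows the standard SGDR shape; below is a generic illustration of the math (the parameter names are not LM-Kit's exact properties).

using System;

// Cosine decay with warm restarts (SGDR-style): the rate falls from maxLr to
// minLr along a cosine curve, then resets at the start of each cycle.
double CosineLr(double maxLr, double minLr, int step, int decaySteps)
{
    double t = (step % decaySteps) / (double)decaySteps;  // position within cycle
    return minLr + 0.5 * (maxLr - minLr) * (1 + Math.Cos(Math.PI * t));
}

for (int step = 0; step <= 200; step += 50)
    Console.WriteLine($"step {step}: lr = {CosineLr(1e-4, 1e-6, step, 100):E2}");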

Gradient Clipping

Prevent exploding gradients during training with configurable gradient clipping. Stabilize training on challenging datasets.
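
Clipping by global norm is the standard technique; a generic sketch, not LM-Kit's internal implementation:

using System;
using System.Linq;

// Gradient clipping by global norm: if the combined gradient vector exceeds
// maxNorm, every component is scaled down uniformly.
void ClipByGlobalNorm(float[] grads, float maxNorm)
{
    double norm = Math.Sqrt(grads.Sum(g => (double)g * g));
    if (norm <= maxNorm) return;                // already within bounds
    float scale = (float)(maxNorm / norm);
    for (int i = 0; i < grads.Length; i++) grads[i] *= scale;
}

var grads = new[] { 3.0f, 4.0f };               // global norm = 5
ClipByGlobalNorm(grads, 1.0f);
Console.WriteLine(string.Join(", ", grads));    // 0.6, 0.8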

Training Datasets

Built-in support for training data management with ChatTrainingSample, TrainingDataset, and ShareGptExporter for data preparation and export.
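
A data-preparation sketch follows. ChatTrainingSample, TrainingDataset, and ShareGptExporter are the documented class names, but the constructor and method signatures shown are assumptions; consult the API reference for exact usage.

using LMKit.Finetuning;

// Assemble training samples; the ChatTrainingSample constructor shown here
// is a hypothetical signature for illustration.
var dataset = new TrainingDataset();
dataset.Add(new ChatTrainingSample(
    userMessage: "Review: battery life is great, camera is superb.",
    assistantMessage: "positive"));

// Export in ShareGPT format; the Export signature is likewise an assumption.
new ShareGptExporter().Export(dataset, "train.sharegpt.json");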

Optimization Use Cases

Deploy optimized models across resource-constrained environments.

Mobile Applications

Deploy quantized models on iOS and Android devices. Q4_K_M provides excellent quality at 4x smaller size for on-device inference.

IoT and Edge Devices

Run AI on Raspberry Pi, NVIDIA Jetson, and other edge hardware. Aggressive quantization (Q2_K, Q3_K) enables deployment on minimal resources.

Domain-Specific Assistants

Fine-tune models for legal, medical, finance, or technical domains. LoRA adapters specialize generic models for industry-specific tasks.

Sentiment Analysis

Fine-tune for sentiment classification with dramatically improved accuracy. Start from a ~46% baseline and reach 95%+ with targeted training.

Multi-Tenant Applications

Deploy one base model with multiple LoRA adapters for different customers or use cases. Switch adapters per-request without model reload.
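
A per-request routing sketch built on the documented Adapters collection and Scale property (the helper itself is illustrative):

using LMKit.Model;

// Activate exactly one tenant's adapter per request; all others are muted.
// The routing helper is illustrative; Adapters, Name, and Scale are documented.
void ActivateAdapterFor(LM model, string tenantAdapterName)
{
    foreach (var adapter in model.Adapters)
        adapter.Scale = adapter.Name == tenantAdapterName ? 1.0f : 0.0f;
}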

Scientific Assistants

Create specialized chemistry, biology, or physics assistants. LoRA fine-tuning can improve domain accuracy from a 17% baseline to 40%+.

Key Classes & Methods

Core components for model optimization workflows.

LM.Precision

Enumeration of 30+ quantization formats from 1-bit to FP32. Includes K-means variants (Q*_K_*) and importance quantization (IQ*) formats.

View Documentation
LoraTrainingParameters

Complete configuration for LoRA fine-tuning: ranks, alpha scaling, Adam optimizer settings, gradient control, and learning rate scheduling.

View Documentation
LM.ApplyLoraAdapter

Dynamically load LoRA adapters into a model instance. Supports file paths and LoraAdapterSource objects with scale-based activation.

View Documentation
LM.Adapters

Collection of currently loaded LoRA adapters on a model instance. Access individual adapters to adjust Scale or retrieve metadata.

View Documentation
LoraMerger

Permanently merge LoRA adapter weights into base model. Creates optimized single-file deployment with no runtime adapter overhead.

View Documentation
LoraAdapter

Represents a loaded LoRA adapter with Name, Scale, and source information. Control adapter influence through Scale property (0.0-1.0).

View Documentation

Ready to Optimize Your Models?

Compress, fine-tune, and deploy AI models optimized for your hardware. 100% local, 100% private.