Model quantization

Compress models without compromise.

The Quantizer class turns a full-precision GGUF model into a deployable artifact in any of 30+ precision formats: from 1-bit importance quantization for extreme edge deployment to BF16 for maximum fidelity. Cluster-aware formats, importance variants, and granular control over output-tensor and metadata handling.

30+ formats · Cluster-aware · Importance quantization

Quantizer.Quantize(...)

One-call quantization with destination path + Precision.

LM.Precision

Enum of every supported quantization level.

quantizeOutputTensor

Toggle output-tensor quantization for accuracy / size trade-off.

metadataOverrides

Inject or override GGUF metadata during quantization.

Precision landscape

Pick the right format for your hardware budget.

Quantization trades accuracy for memory savings and speed. The right format depends on the target device and how much quality drop you can tolerate. The table below shows the key tiers in LM.Precision.

Format | Bits / weight | Typical use

High fidelity
MOSTLY_BF16 | 16 | Reference / fine-tuning baselines, Apple Silicon training.
MOSTLY_F16 | 16 | CUDA inference at full precision.
MOSTLY_Q8_0 | ~8.5 | Near-lossless. Use when memory permits and accuracy matters.

Production sweet spot (K-means clustered)
MOSTLY_Q6_K | ~6.6 | Very high quality. Excellent recall on RAG and tool-call tasks.
MOSTLY_Q5_K_M | ~5.7 | Strong default for desktop GPU deployment.
MOSTLY_Q4_K_M | ~4.8 | Balanced default. ~75% size reduction vs FP16. Most LM-Kit production paths use this.
MOSTLY_Q4_K_S | ~4.6 | Smaller K-means variant. Slight quality drop vs Q4_K_M.
MOSTLY_Q3_K_M | ~3.9 | Mid 3-bit. Acceptable for chat; weaker on reasoning.
MOSTLY_Q2_K | ~3.0 | Aggressive. Use only when nothing else fits VRAM.

Importance quantization (IQ)
MOSTLY_IQ4_XS | ~4.3 | Smaller than Q4_K_M with similar quality on many models.
MOSTLY_IQ3_M | ~3.4 | Best 3-bit quality at the cost of some inference speed.
MOSTLY_IQ2_M | ~2.7 | Edge / mobile target. Verify quality on your eval set.
MOSTLY_IQ1_S | ~1.7 | Extreme compression. Mostly experimental.

Ternary (research)
MOSTLY_TQ1_0 / MOSTLY_TQ2_0 | 1 to 2 | Ternary representations. Deploy only with eval coverage.
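
The bits-per-weight column translates directly into file size. As a rough rule of thumb, on-disk footprint is parameter count times bits per weight; real GGUF files run slightly larger because metadata and a few tensors (embeddings, output) stay at higher precision. A quick sketch of that arithmetic for a 4B-parameter model:

// Rough on-disk size estimate: parameters × bits per weight, in decimal GB.
// Actual files are slightly larger due to metadata and mixed-precision tensors.
static double EstimateSizeGB(double parameterCount, double bitsPerWeight) =>
    parameterCount * bitsPerWeight / 8 / 1e9;

Console.WriteLine(EstimateSizeGB(4e9, 16));   // FP16 source -> ~8.0 GB
Console.WriteLine(EstimateSizeGB(4e9, 4.8));  // Q4_K_M      -> ~2.4 GB
Console.WriteLine(EstimateSizeGB(4e9, 2.7));  // IQ2_M       -> ~1.35 GB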
Code sample

A complete quantization pass.

QuantizeModel.cs
using LMKit.Model;
using LMKit.Quantization;

// Source GGUF, typically the FP16 release of the base model.
var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
{
    ThreadCount = Environment.ProcessorCount
};

// Pick the target precision. Q4_K_M is the production sweet spot.
quantizer.Quantize(
    dstFileName: "models/qwen3-4b-Q4_K_M.gguf",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M,
    quantizeOutputTensor: true,
    metadataOverrides: null);

Console.WriteLine("Done. Source ~7.5 GB, output ~2.4 GB.");
Deployment guidance

Match precision to target hardware.

Server (24+ GB VRAM)

Q5_K_M or Q6_K for highest quality with comfortable memory headroom. Use BF16 when retraining on the same hardware.

Desktop GPU (12 to 16 GB)

Q4_K_M for a 7B model, Q5_K_M for a 4B. Ships with CUDA, Vulkan, or Metal acceleration.

Laptop / Apple Silicon

Q4_K_M on Metal works smoothly for 4B-class models. Drop to IQ4_XS for tighter memory.

Edge / NVIDIA Jetson

Q3_K_M or IQ3_M for 4B on 8 GB Jetson hardware. Verify task accuracy before shipping.

Mobile / IoT

Q2_K or IQ2_M for 1B-3B parameter models. Combine with task-specific LoRA fine-tuning to recover accuracy.

CPU-only fallback

Q4_K_M with ThreadCount = physical cores. Modern 8-core x64 runs 4B at conversational latency.
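
These tiers map naturally onto a small install-time picker. The sketch below uses LM.Precision values from the table above; the memory thresholds are illustrative assumptions, not LM-Kit recommendations, so validate against your own models and eval set.

using LMKit.Model;

// Map available VRAM / unified memory (GB) to a precision tier,
// following the deployment guidance above. Thresholds are illustrative.
static LM.Precision PickPrecision(double memoryGB) => memoryGB switch
{
    >= 24 => LM.Precision.MOSTLY_Q6_K,    // server-class GPUs
    >= 12 => LM.Precision.MOSTLY_Q4_K_M,  // desktop GPUs, Apple Silicon laptops
    >= 8  => LM.Precision.MOSTLY_IQ3_M,   // Jetson-class edge devices
    _     => LM.Precision.MOSTLY_IQ2_M,   // mobile / IoT
};

Console.WriteLine(PickPrecision(16)); // MOSTLY_Q4_K_M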

Applications

Why teams quantize locally.

CI/CD model packaging

Quantize as part of your build pipeline. Version the resulting GGUF alongside your application binary.

Multi-target releases

Ship one base model in three precision tiers (Q4_K_M, Q5_K_M, Q6_K). Pick at install time based on detected hardware (see the loop sketched at the end of this section).

Edge / embedded distribution

Generate Q2 / IQ2 builds tailored to the exact NPU or microcontroller specs of your fleet. No cloud build farm required.

Internal model cards

Override GGUF metadata at quantization time to inject license, training-cutoff, and version info that your runtime can surface.

Custom fine-tuned distributions

After a LoRA merge with LoraMerger, requantize the resulting model for production deployment in one step.

A/B precision evaluation

Generate Q4_K_M, IQ4_XS, and Q3_K_M variants of the same model. Compare on your eval set before locking the production format.
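
The multi-target and A/B scenarios reduce to the same loop: one Quantize pass per target precision, reusing the call from the code sample above. Paths and tier list are illustrative.

using LMKit.Model;
using LMKit.Quantization;

var targets = new (string Suffix, LM.Precision Precision)[]
{
    ("Q4_K_M", LM.Precision.MOSTLY_Q4_K_M),
    ("Q5_K_M", LM.Precision.MOSTLY_Q5_K_M),
    ("Q6_K",   LM.Precision.MOSTLY_Q6_K),
};

foreach (var (suffix, precision) in targets)
{
    // One independent pass per tier; a fresh Quantizer per pass avoids
    // assuming the instance can be reused across calls.
    var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
    {
        ThreadCount = Environment.ProcessorCount
    };

    quantizer.Quantize(
        dstFileName: $"models/qwen3-4b-{suffix}.gguf",
        modelPrecision: precision,
        quantizeOutputTensor: true,
        metadataOverrides: null);
}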

Developer Resources

API reference.

Quantizer

Sealed class. Construct with a source GGUF path, set ThreadCount, call Quantize with destination path and LM.Precision.

View documentation

LM.Precision

Enumeration of all supported quantization formats. K-means variants (Q*_K_*), importance quantization (IQ*), ternary (TQ*), legacy (Q4_0/Q5_0/Q5_1).

View documentation

MetadataCollection

Override or inject GGUF metadata during quantization. Use to bake license terms, version tags, or custom identifiers into the artifact.

View documentation

LoraMerger

Companion class. Merge a LoRA adapter back into the base model, then quantize the merged result for deployment.

LoRA Integration page

Demos & docs

Build it. Read it. Try it.

Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.

Ship the right model on every device.

Get Community Edition · Download