Model quantization

Compress models without compromise.

The Quantizer class turns a full-precision GGUF model into a deployable artifact in any of 30+ precision formats: from 1-bit importance quantization for extreme edge deployment to BF16 for maximum fidelity. Cluster-aware formats, importance variants, and granular control over output-tensor and metadata handling.

30+ formats · Cluster-aware · Importance quantization

Quantizer.Quantize(...)

One-call quantization with destination path + Precision.

LM.Precision

Enum of every supported quantization level.

quantizeOutputTensor

Toggle output-tensor quantization for accuracy / size trade-off.

metadataOverrides

Inject or override GGUF metadata during quantization.

Precision landscape

Pick the right format for your hardware budget.

Quantization trades accuracy for memory savings and speed. The right format depends on the target device and how much quality drop you can tolerate. The table below shows the key tiers in LM.Precision.

Format | Bits / weight | Typical use

High fidelity
MOSTLY_BF16 | 16 | Reference / fine-tuning baselines, Apple Silicon training.
MOSTLY_F16 | 16 | CUDA inference at full precision.
MOSTLY_Q8_0 | ~8.5 | Near-lossless. Use when memory permits and accuracy matters.

Production sweet spot (K-means clustered)
MOSTLY_Q6_K | ~6.6 | Very high quality. Excellent recall on RAG and tool-call tasks.
MOSTLY_Q5_K_M | ~5.7 | Strong default for desktop GPU deployment.
MOSTLY_Q4_K_M | ~4.8 | Balanced default. ~75% size reduction vs FP16. Most LM-Kit production paths use this.
MOSTLY_Q4_K_S | ~4.6 | Smaller K-means variant. Slight quality drop vs Q4_K_M.
MOSTLY_Q3_K_M | ~3.9 | Mid 3-bit. Acceptable for chat; weaker on reasoning.
MOSTLY_Q2_K | ~3.0 | Aggressive. Use only when nothing else fits VRAM.

Importance quantization (IQ)
MOSTLY_IQ4_XS | ~4.3 | Smaller than Q4_K_M with similar quality on many models.
MOSTLY_IQ3_M | ~3.4 | Best 3-bit quality at the cost of some inference speed.
MOSTLY_IQ2_M | ~2.7 | Edge / mobile target. Verify quality on your eval set.
MOSTLY_IQ1_S | ~1.7 | Extreme compression. Mostly experimental.

Ternary (research)
MOSTLY_TQ1_0 / MOSTLY_TQ2_0 | 1 to 2 | Ternary representations. Deploy only with eval coverage.
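
The bits-per-weight column translates directly into file size. As a rough rule of thumb, on-disk footprint is parameter count times bits per weight; real GGUF files run slightly larger because metadata and a few tensors (embeddings, output) stay at higher precision. A quick sketch of that arithmetic for a 4B-parameter model:

// Rough on-disk size estimate: parameters × bits per weight, in decimal GB.
// Actual files are slightly larger due to metadata and mixed-precision tensors.
static double EstimateSizeGB(double parameterCount, double bitsPerWeight) =>
    parameterCount * bitsPerWeight / 8 / 1e9;

Console.WriteLine(EstimateSizeGB(4e9, 16));   // FP16 source -> ~8.0 GB
Console.WriteLine(EstimateSizeGB(4e9, 4.8));  // Q4_K_M      -> ~2.4 GB
Console.WriteLine(EstimateSizeGB(4e9, 2.7));  // IQ2_M       -> ~1.35 GB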
Code sample

A complete quantization pass.

QuantizeModel.cs
using LMKit.Model;
using LMKit.Quantization;

// Source GGUF, typically the FP16 release of the base model.
var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
{
    ThreadCount = Environment.ProcessorCount
};

// Pick the target precision. Q4_K_M is the production sweet spot.
quantizer.Quantize(
    dstFileName: "models/qwen3-4b-Q4_K_M.gguf",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M,
    quantizeOutputTensor: true,
    metadataOverrides: null);

Console.WriteLine("Done. Source ~7.5 GB, output ~2.4 GB.");
Deployment guidance

Match precision to target hardware.

Server (24+ GB VRAM)

Q5_K_M or Q6_K for highest quality with comfortable memory headroom. Use BF16 when retraining on the same hardware.

Desktop GPU (12 to 16 GB)

Q4_K_M for a 7B model, Q5_K_M for a 4B. Ships with CUDA, Vulkan, or Metal acceleration.

Laptop / Apple Silicon

Q4_K_M on Metal works smoothly for 4B-class models. Drop to IQ4_XS for tighter memory.

Edge / NVIDIA Jetson

Q3_K_M or IQ3_M for 4B on 8 GB Jetson hardware. Verify task accuracy before shipping.

Mobile / IoT

Q2_K or IQ2_M for 1B-3B parameter models. Combine with task-specific LoRA fine-tuning to recover accuracy.

CPU-only fallback

Q4_K_M with ThreadCount = physical cores. Modern 8-core x64 runs 4B at conversational latency.
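
These tiers map naturally onto a small install-time picker. The sketch below uses LM.Precision values from the table above; the memory thresholds are illustrative assumptions, not LM-Kit recommendations, so validate against your own models and eval set.

using LMKit.Model;

// Map available VRAM / unified memory (GB) to a precision tier,
// following the deployment guidance above. Thresholds are illustrative.
static LM.Precision PickPrecision(double memoryGB) => memoryGB switch
{
    >= 24 => LM.Precision.MOSTLY_Q6_K,    // server-class GPUs
    >= 12 => LM.Precision.MOSTLY_Q4_K_M,  // desktop GPUs, Apple Silicon laptops
    >= 8  => LM.Precision.MOSTLY_IQ3_M,   // Jetson-class edge devices
    _     => LM.Precision.MOSTLY_IQ2_M,   // mobile / IoT
};

Console.WriteLine(PickPrecision(16)); // MOSTLY_Q4_K_M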

Applications

Why teams quantize locally.

CI/CD model packaging

Quantize as part of your build pipeline. Version the resulting GGUF alongside your application binary.

Multi-target releases

Ship one base model in three precision tiers (Q4_K_M, Q5_K_M, Q6_K). Pick at install time based on detected hardware (see the loop sketched at the end of this section).

Edge / embedded distribution

Generate Q2 / IQ2 builds tailored to the exact NPU or microcontroller specs of your fleet. No cloud build farm required.

Internal model cards

Override GGUF metadata at quantization time to inject license, training-cutoff, and version info that your runtime can surface.

Custom fine-tuned distributions

After a LoRA merge with LoraMerger, requantize the resulting model for production deployment in one step.

A/B precision evaluation

Generate Q4_K_M, IQ4_XS, and Q3_K_M variants of the same model. Compare on your eval set before locking the production format.
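
The multi-target and A/B scenarios reduce to the same loop: one Quantize pass per target precision, reusing the call from the code sample above. Paths and tier list are illustrative.

using LMKit.Model;
using LMKit.Quantization;

var targets = new (string Suffix, LM.Precision Precision)[]
{
    ("Q4_K_M", LM.Precision.MOSTLY_Q4_K_M),
    ("Q5_K_M", LM.Precision.MOSTLY_Q5_K_M),
    ("Q6_K",   LM.Precision.MOSTLY_Q6_K),
};

foreach (var (suffix, precision) in targets)
{
    // One independent pass per tier; a fresh Quantizer per pass avoids
    // assuming the instance can be reused across calls.
    var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
    {
        ThreadCount = Environment.ProcessorCount
    };

    quantizer.Quantize(
        dstFileName: $"models/qwen3-4b-{suffix}.gguf",
        modelPrecision: precision,
        quantizeOutputTensor: true,
        metadataOverrides: null);
}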

Developer Resources

API reference.

Quantizer

Sealed class. Construct with a source GGUF path, set ThreadCount, call Quantize with destination path and LM.Precision.

View documentation

LM.Precision

Enumeration of all supported quantization formats. K-means variants (Q*_K_*), importance quantization (IQ*), ternary (TQ*), legacy (Q4_0/Q5_0/Q5_1).

View documentation

MetadataCollection

Override or inject GGUF metadata during quantization. Use to bake license terms, version tags, or custom identifiers into the artifact.

View documentation

LoraMerger

Companion class. Merge a LoRA adapter back into the base model, then quantize the merged result for deployment.

LoRA Integration page

Demos & docs

Build it. Read it. Try it.

Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.

Ship the right model on every device.

Get Community Edition · Download