The Quantizer class turns a full-precision GGUF model into a deployable artifact in any of 30+ precision formats, from 1-bit importance quantization for extreme edge deployment to BF16 for maximum fidelity. It supports cluster-aware formats, importance variants, and granular control over output-tensor and metadata handling.
- **Quantizer.Quantize(...)**: One-call quantization with destination path + Precision.
- **LM.Precision**: Enum of every supported quantization level.
- **quantizeOutputTensor**: Toggle output-tensor quantization for the accuracy / size trade-off.
- **metadataOverrides**: Inject or override GGUF metadata during quantization.
Quantization trades accuracy for memory and speed. The right format depends on the target device and how much quality loss you can tolerate. The table below shows the key tiers in LM.Precision.
| Format | Bits / weight | Typical use |
|---|---|---|
| **High fidelity** | | |
| MOSTLY_BF16 | 16 | Reference / fine-tuning baselines, Apple Silicon training. |
| MOSTLY_F16 | 16 | CUDA inference at full precision. |
| MOSTLY_Q8_0 | ~8.5 | Near-lossless. Use when memory permits and accuracy matters. |
| **Production sweet spot (K-means clustered)** | | |
| MOSTLY_Q6_K | ~6.6 | Very high quality. Excellent recall on RAG and tool-call tasks. |
| MOSTLY_Q5_K_M | ~5.7 | Strong default for desktop GPU deployment. |
| MOSTLY_Q4_K_M | ~4.8 | Balanced default. ~75% size reduction vs FP16. Most LM-Kit production paths use this. |
| MOSTLY_Q4_K_S | ~4.6 | Smaller K-means variant. Slight quality drop vs Q4_K_M. |
| MOSTLY_Q3_K_M | ~3.9 | Mid 3-bit. Acceptable for chat; weaker on reasoning. |
| MOSTLY_Q2_K | ~3.0 | Aggressive. Use only when nothing else fits VRAM. |
| **Importance quantization (IQ)** | | |
| MOSTLY_IQ4_XS | ~4.3 | Smaller than Q4_K_M with similar quality on many models. |
| MOSTLY_IQ3_M | ~3.4 | Best 3-bit quality at the cost of some inference speed. |
| MOSTLY_IQ2_M | ~2.7 | Edge / mobile target. Verify quality on your eval set. |
| MOSTLY_IQ1_S | ~1.7 | Extreme compression. Mostly experimental. |
| **Ternary (research)** | | |
| MOSTLY_TQ1_0 / MOSTLY_TQ2_0 | 1 to 2 | Ternary representations. Deploy only with eval coverage. |
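Where that choice needs to live in code, a small mapping keeps it explicit. A minimal sketch: the tier names and the helper are illustrative, only the LM.Precision members come from the table above.

```csharp
using LMKit.Model;

// Illustrative mapping from a deployment tier to a precision from the table.
// Tier names are hypothetical; the LM.Precision values are real enum members.
static LM.Precision PickPrecision(string tier) => tier switch
{
    "reference" => LM.Precision.MOSTLY_BF16,   // fine-tuning baseline
    "server"    => LM.Precision.MOSTLY_Q6_K,   // very high quality
    "desktop"   => LM.Precision.MOSTLY_Q5_K_M, // desktop GPU default
    "balanced"  => LM.Precision.MOSTLY_Q4_K_M, // production sweet spot
    "edge"      => LM.Precision.MOSTLY_IQ3_M,  // best 3-bit quality
    "mobile"    => LM.Precision.MOSTLY_IQ2_M,  // verify on your eval set
    _           => LM.Precision.MOSTLY_Q4_K_M,
};
```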
```csharp
using System;
using LMKit.Model;
using LMKit.Quantization;

// Source GGUF, typically the FP16 release of the base model.
var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
{
    ThreadCount = Environment.ProcessorCount
};

// Pick the target precision. Q4_K_M is the production sweet spot.
quantizer.Quantize(
    dstFileName: "models/qwen3-4b-Q4_K_M.gguf",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M,
    quantizeOutputTensor: true,
    metadataOverrides: null);

Console.WriteLine("Done. Source ~7.5 GB, output ~2.4 GB.");
```
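quantizeOutputTensor is the main dial besides the precision itself: true quantizes the output tensor for a smaller file, false keeps it at higher precision. A variant of the call above; the destination file name is illustrative.

```csharp
// Keep the output tensor unquantized: a slightly larger file, typically a bit
// more accuracy on generation-heavy tasks. File name is illustrative.
quantizer.Quantize(
    dstFileName: "models/qwen3-4b-Q4_K_M-hq-out.gguf",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M,
    quantizeOutputTensor: false,
    metadataOverrides: null);
```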
Recommended starting points by deployment target:

- Q5_K_M or Q6_K for the highest quality with comfortable memory headroom; use BF16 when retraining on the same hardware.
- Q4_K_M for a 7B model, Q5_K_M for a 4B; LM-Kit ships with CUDA, Vulkan, and Metal acceleration.
- Q4_K_M on Metal works smoothly for 4B-class models; drop to IQ4_XS for tighter memory.
- Q3_K_M or IQ3_M for a 4B model on 8 GB Jetson hardware; verify task accuracy before shipping.
- Q2_K or IQ2_M for 1B-3B parameter models; combine with task-specific LoRA fine-tuning to recover accuracy.
- Q4_K_M with ThreadCount set to the physical core count; a modern 8-core x64 CPU runs a 4B model at conversational latency (see the sketch after this list).
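For the CPU-only case, note that Environment.ProcessorCount reports logical cores. A sketch of pinning ThreadCount near the physical core count; the halve-on-SMT heuristic is an assumption, not an LM-Kit rule.

```csharp
using System;
using LMKit.Quantization;

var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
{
    // Environment.ProcessorCount counts logical cores; on SMT/Hyper-Threading
    // CPUs, halving it approximates the physical core count recommended above.
    ThreadCount = Math.Max(1, Environment.ProcessorCount / 2)
};
```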
Typical production workflows:

- Quantize as part of your build pipeline and version the resulting GGUF alongside your application binary.
- Ship one base model in three precision tiers (Q4_K_M, Q5_K_M, Q6_K) and pick at install time based on detected hardware.
- Generate Q2 / IQ2 builds tailored to the exact NPU or microcontroller specs of your fleet; no cloud build farm required.
- Override GGUF metadata at quantization time to inject license, training-cutoff, and version info that your runtime can surface.
- After a LoRA merge with LoraMerger, requantize the resulting model for production deployment in one step.
- Generate Q4_K_M, IQ4_XS, and Q3_K_M variants of the same model and compare them on your eval set before locking the production format (see the sketch after this list).
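The last workflow is a loop over candidate precisions with one Quantize call per output. A minimal sketch: the shortlist and paths are illustrative, and a fresh Quantizer is constructed per target rather than assuming instance reuse across calls.

```csharp
using System;
using LMKit.Model;
using LMKit.Quantization;

// Candidate formats to compare on your eval set (shortlist is illustrative).
var candidates = new (string Suffix, LM.Precision Precision)[]
{
    ("Q4_K_M", LM.Precision.MOSTLY_Q4_K_M),
    ("IQ4_XS", LM.Precision.MOSTLY_IQ4_XS),
    ("Q3_K_M", LM.Precision.MOSTLY_Q3_K_M),
};

foreach (var (suffix, precision) in candidates)
{
    // One Quantizer per output; reuse of a single instance is not assumed here.
    var quantizer = new Quantizer("models/qwen3-4b-fp16.gguf")
    {
        ThreadCount = Environment.ProcessorCount
    };

    quantizer.Quantize(
        dstFileName: $"models/qwen3-4b-{suffix}.gguf",
        modelPrecision: precision,
        quantizeOutputTensor: true,
        metadataOverrides: null);
}
```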
- **Quantizer**: Sealed class. Construct with a source GGUF path, set ThreadCount, and call Quantize with a destination path and an LM.Precision.
- **LM.Precision**: Enumeration of all supported quantization formats: K-means variants (Q*_K_*), importance quantization (IQ*), ternary (TQ*), and legacy (Q4_0 / Q5_0 / Q5_1).
- **MetadataCollection**: Overrides or injects GGUF metadata during quantization. Use it to bake license terms, version tags, or custom identifiers into the artifact.
- **LoraMerger**: Companion class. Merges a LoRA adapter back into the base model; the merged result can then be quantized for deployment.
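For the metadata workflow, the sketch below shows the shape of the call. This page does not document MetadataCollection's members, so the Add calls are an assumption to verify against the MetadataCollection reference; the key names follow standard GGUF naming conventions.

```csharp
using LMKit.Model;
using LMKit.Quantization;

// ASSUMPTION: dictionary-style population; verify the real MetadataCollection API.
var metadata = new MetadataCollection();
metadata.Add("general.license", "apache-2.0");
metadata.Add("general.version", "1.4.0");

// Same Quantize signature as the main example, with the overrides passed in.
new Quantizer("models/qwen3-4b-fp16.gguf").Quantize(
    dstFileName: "models/qwen3-4b-Q4_K_M.gguf",
    modelPrecision: LM.Precision.MOSTLY_Q4_K_M,
    quantizeOutputTensor: true,
    metadataOverrides: metadata);
```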
Working console demos are available on GitHub, along with step-by-step how-to guides on the docs site and the API reference for the classes used on this page.