Dynamic Sampling
Adaptive multi-strategy sampling that adjusts per token. Default-on for most workloads.
Default sampling is a one-size-fits-nobody compromise. The LM-Kit sampling stack exposes the full lever set: Dynamic Sampling for adaptive multi-strategy decoding, logit biasing for vocabulary control and grammar enforcement, Mirostat 2 for entropy-bounded quality, speculative decoding for latency wins, and full repetition-penalty tuning. Same model, radically different output behaviour.
Adaptive multi-strategy sampling that adjusts per token. Default-on for most workloads.
Boost or suppress specific tokens, phrases, or vocabularies at decode time. The fastest path to format compliance.
Draft-model acceleration for latency-critical paths. 2-3x token throughput on favourable workloads.
Two systems running the same model can produce wildly different outputs depending on how they sample. The classic levers (temperature, top-k, top-p) cover the basics. Real production needs more: bounded entropy, vocabulary restriction, repetition control, latency acceleration. The sampling stack is where quality and cost both live.
Mirostat 2 holds output entropy in a target range, producing consistent quality across long generations. Avoids both the sterility of low-temperature output and the chaos of high-temperature drift.
Logit biasing strongly boosts allowed tokens and suppresses banned ones at decode time. Pair with grammar-constrained generation for hard guarantees.
Speculative decoding generates candidate tokens with a draft model and verifies them in parallel with the full model. Configurable speculation depth and rejection strategy.
Token-penalty policy applies graduated penalties to recently emitted tokens. Choose static, decaying, or context-aware modes per workload.
Fix the seed, fix the strategy, get byte-identical output across reruns. Critical for diff-based testing and regression detection.
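A minimal sketch of that diff-based check, under stated assumptions: Submit returning the completion text and LM living in LMKit.Model are assumed, not confirmed against the API reference; the sampling properties are the ones used in the recipes further down.

using System.Diagnostics;
using LMKit.Model; // assumed namespace for LM
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling;

var model = LM.LoadFromModelID("qwen3.5:9b");

// Two fresh sessions, same seed, same strategy: output should match byte for byte.
string RunOnce(LM lm)
{
    var session = new MultiTurnConversation(lm)
    {
        SamplingMode = new DefaultSamplingParameters
        {
            Temperature = 0.0f,
            TopK = 1,
            Seed = 42,
            DynamicSampling = false,
        }
    };
    // Submit(...) returning the completion is assumed; adapt to your API version.
    return session.Submit("Summarize the changelog.").ToString();
}

Debug.Assert(RunOnce(model) == RunOnce(model)); // any diff means behaviour changed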
Sampling parameters layer on top of grammar constraints, prompt templates, and conversation state. Each lever independent, each lever tunable per call.
DefaultSamplingParameters
Temperature, top-k, top-p, min-p, seed. Default-tuned for natural text. Override per call when you need something specific.
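A sketch of a per-call override, assuming the property names track the levers listed above (Temperature, TopK, and Seed appear in the recipes below; TopP and MinP are inferred from this blurb, not checked against the API reference):

// Override the tuned defaults for this call only.
chat.SamplingMode = new DefaultSamplingParameters
{
    Temperature = 0.7f,
    TopK = 40,
    TopP = 0.9f,  // assumed property name
    MinP = 0.05f, // assumed property name
    Seed = 1337,
};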
Dynamic Sampling
Combines several sampling strategies and adjusts behaviour per token based on local context. Default-on for chat workloads. One flag to disable for full manual control.
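The off switch, as used in the deterministic recipe further down:

// Disable the adaptive layer; the remaining levers then apply as-is.
chat.SamplingMode = new DefaultSamplingParameters
{
    DynamicSampling = false,
};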
LogitBias
Boost or suppress individual token IDs or text fragments. Hard suppression effectively bans a token from the output. Useful for vocabulary restriction, brand guidelines, format compliance.
Mirostat 2
Targets a specific output entropy ("surprise level") and adapts on the fly. Stays in a quality band that pure temperature control cannot guarantee.
TokenPenaltyPolicy
Penalise tokens already emitted in the recent window. Choose static, decaying, or context-aware modes. Stops loops without flattening output.
Speculative
A smaller draft model proposes a few tokens; the full model verifies in parallel. Accepted tokens pass through; rejected ones fall back to the full model's own prediction. Latency wins on aligned model pairs.
Token Healing
Corrects tokeniser-edge cases at the prompt boundary so the model picks up exactly where the prompt left off. Quietly fixes a class of subtle generation bugs.
Reasoning Level
For reasoning-capable models, set the depth of internal thinking with a typed enum (None, Low, Medium, High). One knob for the speed-quality tradeoff.
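A sketch of the knob in use; the property name and its placement on the conversation object are assumed from the enum described above, so check the API reference for the exact surface:

// One line trades speed for depth on reasoning-capable models.
chat.ReasoningLevel = ReasoningLevel.Medium; // None | Low | Medium | High (assumed property)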
Deterministic output for structured workloads. Temperature 0, top-k 1, a fixed seed, and a logit-bias profile that pushes JSON-friendly tokens up and suppresses conversational fillers.
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling;

// Deterministic JSON output: zero temperature, fixed seed, no surprises.
chat.SamplingMode = new DefaultSamplingParameters
{
    Temperature = 0.0f,
    TopK = 1,
    Seed = 42,
    DynamicSampling = false,
};

// Boost JSON-friendly tokens, suppress conversational fillers.
chat.LogitBias = new LogitBias()
    .Boost("{", 2.5f)
    .Boost("\"", 1.5f)
    .Suppress("Sure")
    .Suppress("Here")
    .Suppress("I'd");
Long-form narrative. Mirostat 2 holds entropy at a target level (smoother prose than top-p sampling); decaying repetition penalty stops loops without flattening the voice.
// Creative narrative: Mirostat 2 holds entropy, repetition penalty stops loops.
chat.SamplingMode = new Mirostat2Sampling
{
    Tau = 5.0f, // target entropy
    Eta = 0.1f, // learning rate
};

chat.TokenPenaltyPolicy = new TokenPenaltyPolicy
{
    RepeatPenalty = 1.15f,
    PenaltyLastN = 256,
    Mode = PenaltyMode.Decaying
};
Speculative decoding for latency-critical UIs. A small draft model proposes tokens, the full model verifies them. Verified runs ship in a single forward pass, so first-token latency drops without changing output quality.
// Latency-critical: small draft model proposes, full model verifies.
var draft = LM.LoadFromModelID("qwen3.5:0.8b");
var full = LM.LoadFromModelID("qwen3.5:9b");

chat = new MultiTurnConversation(full)
{
    SpeculativeDecoding = new SpeculativeOptions
    {
        DraftModel = draft,
        SpeculationDepth = 5,
    }
};
Temperature 0, fixed seed, logit bias on schema tokens. Pair with grammar-constrained generation for hard guarantees.
Mirostat 2 for consistent voice, repetition penalty in decaying mode, dynamic sampling on. Stable quality across long sessions.
Speculative decoding with an aligned draft model. Tokens stream visibly faster.
Logit-bias suppression of banned terms and competitor names. Boost on preferred phrasing.
Token Healing for clean continuation, decaying repetition penalty, Mirostat 2 to bound entropy drift.
ReasoningLevel.High for hard problems, Medium for default tasks, Low for high-throughput queries. One knob, three regimes.
Grammar-constrained decoding for hard schema guarantees. Pair with logit biasing for soft preferences on top of hard constraints.
A clean prompt deserves clean sampling. Templates compose with sampling parameters per call.
Speculative decoding pairs naturally with hardware-aware model placement. Draft on CPU, full on GPU; or both on GPU with layer split.
ReasoningLevel propagates to every agent in an orchestration. One enum, whole-pipeline thinking budget.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: tune temperature, top-k, top-p per use case.
Open on GitHub →
How-to guide: switch sampling strategies per turn for deterministic vs creative output.
Read the guide →
How-to guide: JSON schema + BNF for deterministic shape.
Read the guide →