Solutions · Local inference · Sampling controls

Shape every token.

Default sampling is a one-size-fits-nobody compromise. The LM-Kit sampling stack exposes the full lever set: Dynamic Sampling for adaptive multi-strategy decoding, logit biasing for vocabulary control and grammar enforcement, Mirostat 2 for entropy-bounded quality, speculative decoding for latency wins, and full repetition-penalty tuning. Same model, radically different output behaviour.

Dynamic multi-strategy · Mirostat 2 entropy control · Speculative decoding

Dynamic Sampling

Adaptive multi-strategy sampling that adjusts per token. Default-on for most workloads.

Logit Biasing

Boost or suppress specific tokens, phrases, or vocabularies at decode time. The fastest path to format compliance.

Speculative Decoding

Draft-model acceleration for latency-critical paths. 2-3x token throughput on favourable workloads.

Why sampling matters

The model is the easy part. Decoding is the hard part.

Two systems running the same model can produce wildly different outputs depending on how they sample. The classic levers (temperature, top-k, top-p) cover the basics. Real production needs more: bounded entropy, vocabulary restriction, repetition control, latency acceleration. The sampling stack is where quality and cost both live.

Output quality

Mirostat 2 holds output entropy in a target range, producing consistent quality across long generations. It avoids both the sterility of low-temperature settings and the chaos of high-temperature drift.

Format compliance

Logit biasing strongly boosts allowed tokens and suppresses banned ones at decode time. Pair with grammar-constrained generation for hard guarantees.

Latency reduction

Speculative decoding generates candidate tokens with a draft model and verifies them in parallel with the full model. Configurable speculation depth and rejection strategy.

Repetition control

Token-penalty policy applies graduated penalties to recently emitted tokens. Choose static, decaying, or context-aware modes per workload.

Determinism

Fix the seed, fix the strategy, get byte-identical output across reruns. Critical for diff-based testing and regression detection.

Composability

Sampling parameters layer on top of grammar constraints, prompt templates, and conversation state. Each lever independent, each lever tunable per call.

The control surface

Every lever, named.

DefaultSamplingParameters

The classic levers

Temperature, top-k, top-p, min-p, seed. Default-tuned for natural text. Override per call when you need something specific.
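A minimal sketch of a per-call override. Temperature, TopK, Seed, and DynamicSampling match the deterministic example further down; TopP and MinP are assumed property names inferred from the lever names above, not confirmed API.

chat.SamplingMode = new DefaultSamplingParameters
{
    Temperature     = 0.7f,   // softer than greedy, tighter than default
    TopK            = 40,     // keep only the 40 most likely candidates
    TopP            = 0.9f,   // nucleus cutoff (assumed property name)
    MinP            = 0.05f,  // probability floor vs. the top token (assumed)
    Seed            = 7,      // reproducible reruns
    DynamicSampling = false,  // full manual control
};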

Dynamic Sampling

Adaptive multi-strategy

Combines several sampling strategies and adjusts behaviour per token based on local context. Default-on for chat workloads. One flag to disable for full manual control.

LogitBias

Per-token bias

Boost or suppress individual token IDs or text fragments. Hard suppression effectively bans a token from the output. Useful for vocabulary restriction, brand guidelines, format compliance.

Mirostat 2

Entropy-bounded sampling

Targets a specific output entropy ("surprise level") and adapts on the fly. Stays in a quality band that pure temperature control cannot guarantee.
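A sketch of entropy-bounded configuration. The Mirostat2SamplingParameters type and its property names are illustrative stand-ins for Mirostat 2's tau and eta, not confirmed LM-Kit API.

// Illustrative type and property names, mapped to Mirostat 2's
// tau (target surprise) and eta (controller learning rate).
chat.SamplingMode = new Mirostat2SamplingParameters
{
    TargetEntropy = 5.0f,  // tau: the "surprise level" to hold
    LearningRate  = 0.1f,  // eta: how fast the controller adapts
};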

TokenPenaltyPolicy

Repetition control

Penalise tokens already emitted in the recent window. Choose static, decaying, or context-aware modes. Stops loops without flattening output.
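A sketch assuming TokenPenaltyPolicy exposes a mode, a penalty strength, and a window size; the member names below are derived from the description above, not confirmed API.

// Member names are assumptions, not confirmed API.
chat.TokenPenaltyPolicy = new TokenPenaltyPolicy
{
    Mode              = TokenPenaltyMode.Decaying, // assumed enum
    RepetitionPenalty = 1.15f, // >1.0 discourages recently seen tokens
    WindowSize        = 256,   // how far back the penalty reaches
};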

Speculative

Draft-model acceleration

A smaller draft model proposes a few tokens; the full model verifies them in parallel. Accepted tokens pass through; rejected ones fall back to the full model's own pick. Latency wins on aligned model pairs.
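A sketch of the draft-and-verify setup. The SpeculativeDecoding type, its constructor, and MaxDraftTokens are hypothetical names chosen to match the configurable speculation depth described above; model loading is shown with an assumed constructor.

// Hypothetical configuration shape for speculative decoding.
// A small aligned model drafts ahead; the full model verifies.
var draftModel = new LM("path/to/small-draft-model.gguf"); // assumed loader

chat.Speculative = new SpeculativeDecoding(draftModel)
{
    MaxDraftTokens = 4, // speculation depth: tokens drafted per verify step
};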

Token Healing

Tokenisation repair

Corrects tokeniser edge cases at the prompt boundary so the model picks up exactly where the prompt left off. Quietly fixes a class of subtle generation bugs.
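If exposed as a simple switch, enabling it might look like the line below; the property name is an assumption.

// Assumed property name. Token healing backs up over the final
// partial token of the prompt and lets the model re-emit it,
// avoiding seams such as a URL split mid-token at "http".
chat.TokenHealing = true;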

Reasoning Level

Thinking budget

For reasoning-capable models, set the depth of internal thinking with a typed enum (None, Low, Medium, High). One knob for the speed-quality tradeoff.
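The enum values below match this page; where the property lives on the conversation object is an assumption.

chat.ReasoningLevel = ReasoningLevel.High;    // hard problems
// chat.ReasoningLevel = ReasoningLevel.Low;  // high-throughput queries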

Real configurations

Tune for the workload.

Deterministic output for structured workloads. Temperature 0, top-k 1, a fixed seed, and a logit-bias profile that pushes JSON-friendly tokens up and suppresses conversational fillers.

DeterministicJson.cs
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling;

// Deterministic JSON output: zero temperature, fixed seed, no surprises.
chat.SamplingMode = new DefaultSamplingParameters
{
    Temperature      = 0.0f,
    TopK             = 1,
    Seed             = 42,
    DynamicSampling  = false,
};

// Boost JSON-friendly tokens, suppress conversational fillers.
chat.LogitBias = new LogitBias()
    .Boost("{", 2.5f)
    .Boost("\"", 1.5f)
    .Suppress("Sure")
    .Suppress("Here")
    .Suppress("I'd");

Where sampling pays off

Six common tunings.

Strict structured output

Temperature 0, fixed seed, logit bias on schema tokens. Pair with grammar-constrained generation for hard guarantees.

Customer-facing chat

Mirostat 2 for consistent voice, repetition penalty in decaying mode, dynamic sampling on. Stable quality across long sessions; see the sketch after this list.

Real-time voice assistants

Speculative decoding with an aligned draft model. Tokens stream visibly faster.

Brand-aligned content

Logit-bias suppression of banned terms and competitor names. Boost on preferred phrasing.

Long-form generation

Token Healing for clean continuation, decaying repetition penalty, Mirostat 2 to bound entropy drift.

Reasoning workloads

ReasoningLevel.High for hard problems, Medium for default tasks, Low for high-throughput queries. One knob, three regimes.
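A combined sketch of the customer-facing chat tuning above, reusing the illustrative types from the control-surface section; treat the type and member names as assumptions, not confirmed API.

ChatTuning.cs
// Customer-facing chat: consistent voice, no loops, stable quality.
// Type and member names follow the sketches above; illustrative only.
chat.SamplingMode = new Mirostat2SamplingParameters
{
    TargetEntropy = 5.0f,  // hold a steady "surprise level"
    LearningRate  = 0.1f,
};

chat.TokenPenaltyPolicy = new TokenPenaltyPolicy
{
    Mode              = TokenPenaltyMode.Decaying, // fade old penalties
    RepetitionPenalty = 1.1f,
    WindowSize        = 256,
};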

Related capabilities

Sampling plus the rest.

Structured content creation

Grammar-constrained decoding for hard schema guarantees. Pair with logit biasing for soft preferences on top of hard constraints.

Structured generation

Prompt templates

A clean prompt deserves clean sampling. Templates compose with sampling parameters per call.

Prompt templates

Multi-GPU & tensor overrides

Speculative decoding pairs naturally with hardware-aware model placement. Draft on CPU, full on GPU; or both on GPU with layer split.

Multi-GPU

Agent reasoning

ReasoningLevel propagates to every agent in an orchestration. One enum, whole-pipeline thinking budget.

Agent reasoning

Same model. Better output.

Get Community Edition · Download