Solutions · Local inference · Sampling controls

Shape every token.

Default sampling is a one-size-fits-nobody compromise. The LM-Kit sampling stack exposes the full lever set: Dynamic Sampling for adaptive multi-strategy decoding, logit biasing for vocabulary control and grammar enforcement, Mirostat 2 for entropy-bounded quality, speculative decoding for latency wins, and full repetition-penalty tuning. Same model, radically different output behaviour.

Dynamic multi-strategy · Mirostat 2 entropy control · Speculative decoding

Dynamic Sampling

Adaptive multi-strategy sampling that adjusts per token. Default-on for most workloads.

Logit Biasing

Boost or suppress specific tokens, phrases, or vocabularies at decode time. The fastest path to format compliance.

Speculative Decoding

Draft-model acceleration for latency-critical paths. 2-3x token throughput on favourable workloads.

Why sampling matters

The model is the easy part. Decoding is the hard part.

Two systems running the same model can produce wildly different outputs depending on how they sample. The classic levers (temperature, top-k, top-p) cover the basics. Real production needs more: bounded entropy, vocabulary restriction, repetition control, latency acceleration. The sampling stack is where quality and cost both live.

Output quality

Mirostat 2 holds output entropy in a target range, producing consistent quality across long generations. It avoids both the sterility of low-temperature settings and the chaos of high-temperature drift.

Format compliance

Logit biasing strongly boosts allowed tokens and suppresses banned ones at decode time. Pair with grammar-constrained generation for hard guarantees.

Latency reduction

Speculative decoding generates candidate tokens with a draft model and verifies them in parallel with the full model. Configurable speculation depth and rejection strategy.

Repetition control

Token-penalty policy applies graduated penalties to recently emitted tokens. Choose static, decaying, or context-aware modes per workload.

Determinism

Fix the seed, fix the strategy, get byte-identical output across reruns. Critical for diff-based testing and regression detection.

Composability

Sampling parameters layer on top of grammar constraints, prompt templates, and conversation state. Each lever independent, each lever tunable per call.

The control surface

Every lever, named.

DefaultSamplingParameters

The classic levers

Temperature, top-k, top-p, min-p, seed. Default-tuned for natural text. Override per call when you need something specific.
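A minimal sketch of a per-call override. Temperature, TopK, Seed, and DynamicSampling match the deterministic example further down; TopP and MinP are assumed property names inferred from the lever names above, not confirmed API.

chat.SamplingMode = new DefaultSamplingParameters
{
    Temperature     = 0.7f,   // softer than greedy, tighter than default
    TopK            = 40,     // keep only the 40 most likely candidates
    TopP            = 0.9f,   // nucleus cutoff (assumed property name)
    MinP            = 0.05f,  // probability floor vs. the top token (assumed)
    Seed            = 7,      // reproducible reruns
    DynamicSampling = false,  // full manual control
};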

Dynamic Sampling

Adaptive multi-strategy

Combines several sampling strategies and adjusts behaviour per token based on local context. Default-on for chat workloads. One flag to disable for full manual control.

LogitBias

Per-token bias

Boost or suppress individual token IDs or text fragments. Hard suppression effectively bans a token from the output. Useful for vocabulary restriction, brand guidelines, format compliance.

Mirostat 2

Entropy-bounded sampling

Targets a specific output entropy ("surprise level") and adapts on the fly. Stays in a quality band that pure temperature control cannot guarantee.
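A sketch of entropy-bounded configuration. The Mirostat2SamplingParameters type and its property names are illustrative stand-ins for Mirostat 2's tau and eta, not confirmed LM-Kit API.

// Illustrative type and property names, mapped to Mirostat 2's
// tau (target surprise) and eta (controller learning rate).
chat.SamplingMode = new Mirostat2SamplingParameters
{
    TargetEntropy = 5.0f,  // tau: the "surprise level" to hold
    LearningRate  = 0.1f,  // eta: how fast the controller adapts
};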

TokenPenaltyPolicy

Repetition control

Penalise tokens already emitted in the recent window. Choose static, decaying, or context-aware modes. Stops loops without flattening output.
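A sketch assuming TokenPenaltyPolicy exposes a mode, a penalty strength, and a window size; the member names below are derived from the description above, not confirmed API.

// Member names are assumptions, not confirmed API.
chat.TokenPenaltyPolicy = new TokenPenaltyPolicy
{
    Mode              = TokenPenaltyMode.Decaying, // assumed enum
    RepetitionPenalty = 1.15f, // >1.0 discourages recently seen tokens
    WindowSize        = 256,   // how far back the penalty reaches
};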

Speculative

Draft-model acceleration

A smaller draft model proposes a few tokens; the full model verifies them in parallel. Accepted tokens pass through; rejected ones fall back to the full model's own pick. Latency wins on aligned model pairs.
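A sketch of the draft-and-verify setup. The SpeculativeDecoding type, its constructor, and MaxDraftTokens are hypothetical names chosen to match the configurable speculation depth described above; model loading is shown with an assumed constructor.

// Hypothetical configuration shape for speculative decoding.
// A small aligned model drafts ahead; the full model verifies.
var draftModel = new LM("path/to/small-draft-model.gguf"); // assumed loader

chat.Speculative = new SpeculativeDecoding(draftModel)
{
    MaxDraftTokens = 4, // speculation depth: tokens drafted per verify step
};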

Token Healing

Tokenisation repair

Corrects tokeniser edge cases at the prompt boundary so the model picks up exactly where the prompt left off. Quietly fixes a class of subtle generation bugs.
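If exposed as a simple switch, enabling it might look like the line below; the property name is an assumption.

// Assumed property name. Token healing backs up over the final
// partial token of the prompt and lets the model re-emit it,
// avoiding seams such as a URL split mid-token at "http".
chat.TokenHealing = true;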

Reasoning Level

Thinking budget

For reasoning-capable models, set the depth of internal thinking with a typed enum (None, Low, Medium, High). One knob for the speed-quality tradeoff.
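The enum values below match this page; where the property lives on the conversation object is an assumption.

chat.ReasoningLevel = ReasoningLevel.High;    // hard problems
// chat.ReasoningLevel = ReasoningLevel.Low;  // high-throughput queries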

Real configurations

Tune for the workload.

Deterministic output for structured workloads. Temperature 0, top-k 1, a fixed seed, and a logit-bias profile that pushes JSON-friendly tokens up and suppresses conversational fillers.

DeterministicJson.cs
using LMKit.TextGeneration;
using LMKit.TextGeneration.Sampling;

// Deterministic JSON output: zero temperature, fixed seed, no surprises.
chat.SamplingMode = new DefaultSamplingParameters
{
    Temperature      = 0.0f,
    TopK             = 1,
    Seed             = 42,
    DynamicSampling  = false,
};

// Boost JSON-friendly tokens, suppress conversational fillers.
chat.LogitBias = new LogitBias()
    .Boost("{", 2.5f)
    .Boost("\"", 1.5f)
    .Suppress("Sure")
    .Suppress("Here")
    .Suppress("I'd");

Where sampling pays off

Six common tunings.

Strict structured output

Temperature 0, fixed seed, logit bias on schema tokens. Pair with grammar-constrained generation for hard guarantees.

Customer-facing chat

Mirostat 2 for consistent voice, repetition penalty in decaying mode, dynamic sampling on. Stable quality across long sessions; see the sketch after this list.

Real-time voice assistants

Speculative decoding with an aligned draft model. Tokens stream visibly faster.

Brand-aligned content

Logit-bias suppression of banned terms and competitor names. Boost on preferred phrasing.

Long-form generation

Token Healing for clean continuation, decaying repetition penalty, Mirostat 2 to bound entropy drift.

Reasoning workloads

ReasoningLevel.High for hard problems, Medium for default tasks, Low for high-throughput queries. One knob, three regimes.
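A combined sketch of the customer-facing chat tuning above, reusing the illustrative types from the control-surface section; treat the type and member names as assumptions, not confirmed API.

ChatTuning.cs
// Customer-facing chat: consistent voice, no loops, stable quality.
// Type and member names follow the sketches above; illustrative only.
chat.SamplingMode = new Mirostat2SamplingParameters
{
    TargetEntropy = 5.0f,  // hold a steady "surprise level"
    LearningRate  = 0.1f,
};

chat.TokenPenaltyPolicy = new TokenPenaltyPolicy
{
    Mode              = TokenPenaltyMode.Decaying, // fade old penalties
    RepetitionPenalty = 1.1f,
    WindowSize        = 256,
};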

Related capabilities

Sampling plus the rest.

Structured content creation

Grammar-constrained decoding for hard schema guarantees. Pair with logit biasing for soft preferences on top of hard constraints.

Structured generation

Prompt templates

A clean prompt deserves clean sampling. Templates compose with sampling parameters per call.

Prompt templates

Multi-GPU & tensor overrides

Speculative decoding pairs naturally with hardware-aware model placement. Draft on CPU, full on GPU; or both on GPU with layer split.

Multi-GPU

Agent reasoning

ReasoningLevel propagates to every agent in an orchestration. One enum, whole-pipeline thinking budget.

Agent reasoning

Same model. Better output.

Get Community Edition · Download