Pause an agent. Free the GPU.

A long-running conversation can hold a multi-gigabyte inference context: KV-cache, attention state, full session history. Most of the time the user is idle. The cache sits in RAM or VRAM, blocking other workloads. IKVCache.HibernateAsync serialises that entire state to disk and frees the native handle in seconds. The next call rehydrates it transparently. The conversation never notices.

One line to hibernate · Transparent rehydration · Concurrency-safe

NotCreated

No context allocated yet. Lazy creation on first call.

InMemory

Active in RAM or VRAM. Hot path. Inference runs at full speed.

Hibernated

Serialised to disk. Native handle freed. Next call rehydrates.
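
The residency is queryable at runtime. A minimal sketch of branching on the three states above, assuming ContextResidency is an enum exposing exactly these values (the quick-start example below confirms InMemory):

// Minimal sketch: branch on the residency states described above.
// Assumes ContextResidency exposes exactly these three values.
static string Describe(IKVCache cache) => cache.Residency switch
{
    ContextResidency.NotCreated => "no native context yet; created lazily on first call",
    ContextResidency.InMemory   => "hot: KV-cache resident in RAM or VRAM",
    ContextResidency.Hibernated => "cold: state on disk; next call rehydrates",
    _ => "unknown"
};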

Why hibernation matters

Idle conversations are not free.

Multi-tenant chat applications, long-running document review sessions, always-on desktop assistants, and per-user agent personas all share the same problem: the model context grows over time and rarely shrinks. Without hibernation the only options are "keep it loaded forever" or "drop it and rebuild from scratch on the next message". Hibernation gives you a third path: drop the bytes, keep the meaning.

Memory pressure relief

Long sessions can occupy gigabytes of native memory. Hibernating idle ones frees the GPU for active workloads. RAM stays available for other processes.


No context loss

Full KV-cache plus session history serialised. Rehydration restores byte-identical state. The conversation continues exactly where it left off.

Fire-and-forget API

HibernateAsync returns a Task. Coalesces concurrent requests. Defers until active usage locks release. Safe to call while the session is busy.
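
Both calling styles work; a short sketch, where cache is an IKVCache obtained as in the quick-start below:

// Fire-and-forget: returns immediately; a concurrent in-flight
// hibernation request is coalesced rather than duplicated.
_ = cache.HibernateAsync();

// Or await it when the caller needs the state on disk before continuing.
await cache.HibernateAsync();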

Transparent rehydration

The next SubmitAsync call restores the cache automatically. No special-casing in caller code. The hibernation file is deleted on success.

Configurable storage

Default location is Configuration.ContextHibernationDirectory. Pass an explicit path for per-tenant separation, fast SSD targeting, or encrypted volumes.
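
A hedged sketch of per-tenant separation; it assumes HibernateAsync accepts the explicit path mentioned above, and the directory layout is illustrative:

// Illustrative per-tenant layout on a fast SSD; assumes an overload
// of HibernateAsync that takes the explicit target path.
var target = Path.Combine("/mnt/nvme/hibernation", tenantId);
_ = cache.HibernateAsync(target);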

Production hygiene

Files are cleaned up automatically after rehydration. Cleaned up on dispose if the context never reactivates. No leaked artefacts.
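
A two-line sketch of that guarantee; it assumes the conversation implements IDisposable, as the cleanup-on-dispose behaviour implies:

// Hibernated, then never reactivated: disposing the conversation
// removes the hibernation file, per the guarantee above.
await cache.HibernateAsync();
chat.Dispose();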

One line to hibernate

Cast and call.

Cast the conversation to IKVCache, call HibernateAsync, free the GPU. The next SubmitAsync rehydrates the session transparently and deletes the hibernation file on success. No special handling in caller code.

QuickHibernate.cs
using LMKit.Inference;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(model);   // model: your previously loaded LMKit model instance
await chat.SubmitAsync("Walk me through the Q3 financials.");
await chat.SubmitAsync("What changed in operating expenses?");

// User goes idle. Free the GPU; keep the conversation.
if (chat is IKVCache cache && cache.Residency == ContextResidency.InMemory)
{
    _ = cache.HibernateAsync();   // fire-and-forget
}

// 90 minutes later. Same chat object. No special handling.
await chat.SubmitAsync("And the gross margin?");
// Cache rehydrates transparently. Hibernation file is deleted on success.

Where hibernation pays off

Long sessions at scale.

Multi-tenant SaaS chat

Hundreds of concurrent users; only a fraction active at any moment. Hibernate the inactive ones; serve the active ones at full speed.
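
One shape this takes in practice; the session registry, LastActivity stamp, and idle threshold are hypothetical application code, not LMKit API:

// Hypothetical idle sweeper: hibernate sessions idle past a threshold.
foreach (var session in sessions.Values)   // sessions: your own registry
{
    if (DateTime.UtcNow - session.LastActivity > TimeSpan.FromMinutes(15)
        && session.Chat is IKVCache cache
        && cache.Residency == ContextResidency.InMemory)
    {
        _ = cache.HibernateAsync();   // fire-and-forget; safe while busy
    }
}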

Per-user agent personas

A productivity assistant per employee. Idle during meetings, active in inboxes. Hibernation lets one box host many personas without resident-memory blowup.

Long document review

A reviewer reads, asks, leaves the chat open for hours. Hibernation between questions. Every "and what about clause 17?" rehydrates instantly.

Always-on desktop

The user opens the assistant once and never closes it. Idle hibernation means the assistant does not block other GPU workloads on the same machine.

Scheduled batch agents

Background agents wake on a cron, process a batch, hibernate. Memory footprint stays flat across daily cycles.
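
A sketch of that cycle; the cadence and nextBatchPrompt are illustrative, and agent is a MultiTurnConversation as in the quick-start above:

// Illustrative daily cycle: wake, process one batch, hibernate again.
using var timer = new PeriodicTimer(TimeSpan.FromHours(24));
while (await timer.WaitForNextTickAsync())
{
    await agent.SubmitAsync(nextBatchPrompt);   // rehydrates transparently
    if (agent is IKVCache cache)
        await cache.HibernateAsync();           // flat footprint between runs
}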

Conversation archival

End-of-day hibernation archives every conversation. The next morning's first message rehydrates whatever the user was working on.

Versus the alternatives

Two bad choices, or this one.

Keep it loaded forever

Each long-lived session occupies its full context indefinitely. Memory grows linearly with users. Hardware cost grows with it.

Drop and rebuild

Discard the context after N minutes idle. Next message has to replay the entire history token-by-token. First response after a pause is painfully slow.

Hibernate

Serialise the cache to disk in seconds. Free the native handle immediately. Rehydrate transparently on the next message at near-instant speed. Best of both worlds.

Related capabilities

Hibernation plus the rest.

Agent memory

Hibernation preserves session state. Agent memory preserves long-term knowledge. Use both: hot conversations hibernate, durable facts persist in memory.

Resilience

Bulkheads cap concurrent live sessions; hibernation reclaims idle ones. Together they keep the GPU schedulable under bursty load.

Multi-GPU & tensor overrides

Pair with hibernation to run more sessions per node. Hibernated sessions release their layer slice; new sessions claim it.

Observability

Trace hibernate / rehydrate events alongside agent calls. Spot regressions in idle-recovery latency before users do.

Idle should not cost memory.
