Pause an agent. Free the GPU.

A long-running conversation can hold a multi-gigabyte inference context: KV-cache, attention state, full session history. Most of the time the user is idle. The cache sits in RAM or VRAM, blocking other workloads. IKVCache.HibernateAsync serialises that entire state to disk and frees the native handle in seconds. The next call rehydrates it transparently. The conversation never notices.

One line to hibernate · Transparent rehydration · Concurrency-safe

NotCreated

No context allocated yet. Lazy creation on first call.

InMemory

Active in RAM or VRAM. Hot path. Inference runs at full speed.

Hibernated

Serialised to disk. Native handle freed. Next call rehydrates.
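
The residency is queryable at runtime. A minimal sketch of branching on the three states above, assuming ContextResidency is an enum exposing exactly these values (the quick-start example below confirms InMemory):

// Minimal sketch: branch on the residency states described above.
// Assumes ContextResidency exposes exactly these three values.
static string Describe(IKVCache cache) => cache.Residency switch
{
    ContextResidency.NotCreated => "no native context yet; created lazily on first call",
    ContextResidency.InMemory   => "hot: KV-cache resident in RAM or VRAM",
    ContextResidency.Hibernated => "cold: state on disk; next call rehydrates",
    _ => "unknown"
};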

Why hibernation matters

Idle conversations are not free.

Multi-tenant chat applications, long-running document review sessions, always-on desktop assistants, and per-user agent personas all share the same problem: the model context grows over time and rarely shrinks. Without hibernation the only options are "keep it loaded forever" or "drop it and rebuild from scratch on the next message". Hibernation gives you a third path: drop the bytes, keep the meaning.

Memory pressure relief

Long sessions can occupy gigabytes of native memory. Hibernating idle ones frees the GPU for active workloads. RAM stays available for other processes.


No context loss

Full KV-cache plus session history serialised. Rehydration restores byte-identical state. The conversation continues exactly where it left off.

Fire-and-forget API

HibernateAsync returns a Task. Coalesces concurrent requests. Defers until active usage locks release. Safe to call while the session is busy.
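
Both calling styles work; a short sketch, where cache is an IKVCache obtained as in the quick-start below:

// Fire-and-forget: returns immediately; a concurrent in-flight
// hibernation request is coalesced rather than duplicated.
_ = cache.HibernateAsync();

// Or await it when the caller needs the state on disk before continuing.
await cache.HibernateAsync();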

Transparent rehydration

The next SubmitAsync call restores the cache automatically. No special-casing in caller code. The hibernation file is deleted on success.

Configurable storage

Default location is Configuration.ContextHibernationDirectory. Pass an explicit path for per-tenant separation, fast SSD targeting, or encrypted volumes.
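
A hedged sketch of per-tenant separation; it assumes HibernateAsync accepts the explicit path mentioned above, and the directory layout is illustrative:

// Illustrative per-tenant layout on a fast SSD; assumes an overload
// of HibernateAsync that takes the explicit target path.
var target = Path.Combine("/mnt/nvme/hibernation", tenantId);
_ = cache.HibernateAsync(target);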

Production hygiene

Files are cleaned up automatically after rehydration. Cleaned up on dispose if the context never reactivates. No leaked artefacts.
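
A two-line sketch of that guarantee; it assumes the conversation implements IDisposable, as the cleanup-on-dispose behaviour implies:

// Hibernated, then never reactivated: disposing the conversation
// removes the hibernation file, per the guarantee above.
await cache.HibernateAsync();
chat.Dispose();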

One line to hibernate

Cast and call.

Cast the conversation to IKVCache, call HibernateAsync, free the GPU. The next SubmitAsync rehydrates the session transparently and deletes the hibernation file on success. No special handling in caller code.

QuickHibernate.cs
using LMKit.Inference;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(model);   // model: your previously loaded LMKit model instance
await chat.SubmitAsync("Walk me through the Q3 financials.");
await chat.SubmitAsync("What changed in operating expenses?");

// User goes idle. Free the GPU; keep the conversation.
if (chat is IKVCache cache && cache.Residency == ContextResidency.InMemory)
{
    _ = cache.HibernateAsync();   // fire-and-forget
}

// 90 minutes later. Same chat object. No special handling.
await chat.SubmitAsync("And the gross margin?");
// Cache rehydrates transparently. Hibernation file is deleted on success.

Where hibernation pays off

Long sessions at scale.

Multi-tenant SaaS chat

Hundreds of concurrent users; only a fraction active at any moment. Hibernate the inactive ones; serve the active ones at full speed.
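
One shape this takes in practice; the session registry, LastActivity stamp, and idle threshold are hypothetical application code, not LMKit API:

// Hypothetical idle sweeper: hibernate sessions idle past a threshold.
foreach (var session in sessions.Values)   // sessions: your own registry
{
    if (DateTime.UtcNow - session.LastActivity > TimeSpan.FromMinutes(15)
        && session.Chat is IKVCache cache
        && cache.Residency == ContextResidency.InMemory)
    {
        _ = cache.HibernateAsync();   // fire-and-forget; safe while busy
    }
}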

Per-user agent personas

A productivity assistant per employee. Idle during meetings, active in inboxes. Hibernation lets one box host many personas without resident-memory blowup.

Long document review

A reviewer reads, asks, leaves the chat open for hours. Hibernation between questions. Every "and what about clause 17?" rehydrates instantly.

Always-on desktop

The user opens the assistant once and never closes it. Idle hibernation means the assistant does not block other GPU workloads on the same machine.

Scheduled batch agents

Background agents wake on a cron, process a batch, hibernate. Memory footprint stays flat across daily cycles.
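
A sketch of that cycle; the cadence and nextBatchPrompt are illustrative, and agent is a MultiTurnConversation as in the quick-start above:

// Illustrative daily cycle: wake, process one batch, hibernate again.
using var timer = new PeriodicTimer(TimeSpan.FromHours(24));
while (await timer.WaitForNextTickAsync())
{
    await agent.SubmitAsync(nextBatchPrompt);   // rehydrates transparently
    if (agent is IKVCache cache)
        await cache.HibernateAsync();           // flat footprint between runs
}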

Conversation archival

End-of-day hibernation archives every conversation. The next morning's first message rehydrates whatever the user was working on.

Versus the alternatives

Two bad choices, or this one.

Keep it loaded forever

Each long-lived session occupies its full context indefinitely. Memory grows linearly with users. Hardware cost grows with it.

Drop and rebuild

Discard the context after N minutes idle. Next message has to replay the entire history token-by-token. First response after a pause is painfully slow.

Hibernate

Serialise the cache to disk in seconds. Free the native handle immediately. Rehydrate transparently on the next message at near-instant speed. Best of both worlds.

Related capabilities

Hibernation plus the rest.

Agent memory

Hibernation preserves session state. Agent memory preserves long-term knowledge. Use both: hot conversations hibernate, durable facts persist in memory.

Resilience

Bulkheads cap concurrent live sessions; hibernation reclaims idle ones. Together they keep the GPU schedulable under bursty load.

Multi-GPU & tensor overrides

Pair with hibernation to run more sessions per node. Hibernated sessions release their layer slice; new sessions claim it.

Observability

Trace hibernate / rehydrate events alongside agent calls. Spot regressions in idle-recovery latency before users do.

Idle should not cost memory.
