A long-running conversation can hold a multi-gigabyte inference
context: KV-cache, attention state, full session history. Most of
the time the user is idle. The cache sits in RAM or VRAM, blocking
other workloads. IKVCache.HibernateAsync serialises
that entire state to disk and frees the native handle in seconds.
The next call rehydrates it transparently. The conversation never
notices.
NotCreated: No context allocated yet. Lazy creation on first call.
InMemory: Active in RAM or VRAM. Hot path. Inference runs at full speed.
Hibernated: Serialised to disk. Native handle freed. Next call rehydrates.
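A minimal sketch of that lifecycle, assuming a loaded model handle named model and using only the members shown on this page (SubmitAsync, Residency, the ContextResidency values, HibernateAsync); the console output comments are illustrative:

using System;
using LMKit.Inference;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(model);
var cache = (IKVCache)chat;                      // the conversation exposes its cache via IKVCache

Console.WriteLine(cache.Residency);              // NotCreated: nothing allocated yet

await chat.SubmitAsync("Hello.");                // lazy creation on the first call
Console.WriteLine(cache.Residency);              // InMemory: hot path, full speed

await cache.HibernateAsync();                    // serialise to disk, free the native handle
Console.WriteLine(cache.Residency);              // Hibernated

await chat.SubmitAsync("Still with me?");        // rehydrates transparently
Console.WriteLine(cache.Residency);              // InMemory again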
Multi-tenant chat applications, long-running document review sessions, always-on desktop assistants, and per-user agent personas all share the same problem: the model context grows over time and rarely shrinks. Without hibernation the only options are "keep it loaded forever" or "drop it and rebuild from scratch on the next message". Hibernation gives you a third path: drop the bytes, keep the meaning.
Long sessions can occupy gigabytes of native memory. Hibernating idle ones lets the GPU schedule active workloads. RAM stays available for other processes.
Full KV-cache plus session history serialised. Rehydration restores byte-identical state. The conversation continues exactly where it left off.
HibernateAsync returns a Task. Coalesces concurrent requests. Defers until active usage locks release. Safe to call while the session is busy.
The next SubmitAsync call restores the cache automatically. No special-casing in caller code. The hibernation file is deleted on success.
Default location is Configuration.ContextHibernationDirectory. Pass an explicit path for per-tenant separation, fast SSD targeting, or encrypted volumes.
Files are cleaned up automatically after rehydration. Cleaned up on dispose if the context never reactivates. No leaked artefacts.
Cast the conversation to IKVCache, call HibernateAsync, free the GPU. The next SubmitAsync rehydrates the session transparently and deletes the hibernation file on success. No special handling in caller code.
using LMKit.Inference;
using LMKit.TextGeneration;

var chat = new MultiTurnConversation(model);
await chat.SubmitAsync("Walk me through the Q3 financials.");
await chat.SubmitAsync("What changed in operating expenses?");

// User goes idle. Free the GPU; keep the conversation.
if (chat is IKVCache cache && cache.Residency == ContextResidency.InMemory)
{
    _ = cache.HibernateAsync(); // fire-and-forget
}

// 90 minutes later. Same chat object. No special handling.
await chat.SubmitAsync("And the gross margin?");
// Cache rehydrates transparently. Hibernation file is deleted on success.
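Because HibernateAsync coalesces concurrent requests and defers until active usage locks release, it is also safe to issue while a submit is still in flight. A hedged sketch of that behaviour, reusing the chat object from the example above (the prompt text is arbitrary):

var inFlight = chat.SubmitAsync("Summarise the appendix.");

if (chat is IKVCache kv)
{
    var hibernation = kv.HibernateAsync();       // deferred until the in-flight generation releases its lock
    await inFlight;                              // generation completes first
    await hibernation;                           // then the context is serialised and freed
}
else
{
    await inFlight;
}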
Redirect hibernation files to encrypted per-tenant storage instead of the system temp folder. Use a global Configuration default or override the destination per call.
using LMKit.Global;

// Redirect hibernation files to per-tenant encrypted storage.
Configuration.ContextHibernationDirectory = @"D:\encrypted\hibernate";

// Or override per-call with an explicit path.
if (chat is IKVCache cache)
{
    await cache.HibernateAsync($@"D:\tenants\{tenantId}\sessions\{sessionId}.lmk-state");
}
Production reaper. Walk every active session and hibernate any that have been idle past the threshold. Reclaims memory in the background; live sessions are untouched.
// Production pattern: hibernate sessions idle for more than N minutes.
var sessions = _sessionRegistry.GetAll();
var threshold = TimeSpan.FromMinutes(15);

foreach (var session in sessions)
{
    if (session.IdleFor < threshold) continue;
    if (session.Chat is not IKVCache cache) continue;
    if (cache.Residency != ContextResidency.InMemory) continue;

    _ = cache.HibernateAsync(); // reclaim memory in the background
}
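One way to schedule that reaper is a hosted background service on a timer. A sketch, assuming the namespaces from the earlier examples; ISessionRegistry and SessionEntry are hypothetical application types mirroring the _sessionRegistry in the snippet above, not LMKit APIs:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using LMKit.Inference;
using LMKit.TextGeneration;
using Microsoft.Extensions.Hosting;

// Placeholder registry contract; your application supplies the real one.
public interface ISessionRegistry
{
    IEnumerable<SessionEntry> GetAll();
}

public sealed record SessionEntry(object Chat, TimeSpan IdleFor);

public sealed class IdleSessionReaper : BackgroundService
{
    private static readonly TimeSpan Threshold = TimeSpan.FromMinutes(15);
    private readonly ISessionRegistry _sessionRegistry;

    public IdleSessionReaper(ISessionRegistry sessionRegistry) => _sessionRegistry = sessionRegistry;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Sweep every five minutes; live sessions are never touched.
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(5));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            foreach (var session in _sessionRegistry.GetAll())
            {
                if (session.IdleFor < Threshold) continue;
                if (session.Chat is not IKVCache cache) continue;
                if (cache.Residency != ContextResidency.InMemory) continue;

                _ = cache.HibernateAsync(); // fire-and-forget; requests are coalesced
            }
        }
    }
}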
Hundreds of concurrent users; only a fraction active at any moment. Hibernate the inactive ones; serve the active ones at full speed.
A productivity assistant per employee. Idle during meetings, active in inboxes. Hibernation lets one box host many personas without resident-memory blowup.
A reviewer reads, asks, leaves the chat open for hours. Hibernation between questions. Every "and what about clause 17?" rehydrates instantly.
The user opens the assistant once and never closes it. Idle hibernation means the assistant does not block other GPU workloads on the same machine.
Background agents wake on a cron, process a batch, hibernate. Memory footprint stays flat across daily cycles.
End-of-day hibernation archives every conversation. The next morning's first message rehydrates whatever the user was working on.
Keep it loaded forever: Each long-lived session occupies its full context indefinitely. Memory grows linearly with users. Hardware cost grows with it.
Drop and rebuild: Discard the context after N minutes idle. The next message has to replay the entire history token-by-token. The first response after a pause is painfully slow.
Hibernate: Serialise the cache to disk in seconds. Free the native handle immediately. Rehydrate transparently on the next message at near-instant speed. Best of both worlds.
Hibernation preserves session state. Agent memory preserves long-term knowledge. Use both: hot conversations hibernate, durable facts persist in memory.
Bulkheads cap concurrent live sessions; hibernation reclaims idle ones. Together they keep the GPU schedulable under bursty load.
Pair with hibernation to run more sessions per node. Hibernated sessions release their layer slice; new sessions claim it.
Trace hibernate / rehydrate events alongside agent calls. Spot regressions in idle-recovery latency before users do.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: hibernate context, restart process, resume conversation. Open on GitHub →
How-to guide: Hibernate KV-cache + history to disk; rehydrate on next call. Read the guide →
How-to guide: Reuse KV-cache and context allocations across calls. Read the guide →