Demo agents work fine. Production agents face flaky tools, throttled APIs, malformed model output, and the occasional out-of-memory event. LM-Kit ships a Polly-style resilience namespace built specifically for agent execution: retries, circuit breakers, timeouts, fallbacks, bulkheads, rate limits, and composite policies.
RetryPolicy: Exponential backoff retries with maximum attempts and jitter.
CircuitBreakerPolicy: Trip the circuit after N failures, half-open after a cooldown.
FallbackPolicy: Switch to a smaller model or a deterministic responder on failure.
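The retry behaviour listed above can be illustrated with a small, library-independent C# sketch. It is not LM-Kit's RetryPolicy. The names `BackoffSketch`, `Ceiling`, `ComputeDelay`, and `RetryAsync` are hypothetical, and the sketch assumes the widely used "full jitter" variant, where the delay before retry n is drawn uniformly from [0, min(baseDelay × 2^n, maxDelay)).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: exponential backoff with "full jitter".
// Not the LM-Kit RetryPolicy implementation.
public static class BackoffSketch
{
    // Upper bound for attempt n (0-based): baseDelay * 2^n, capped at maxDelay.
    public static TimeSpan Ceiling(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Full jitter: a random delay in [0, ceiling), so many callers retrying
    // at the same moment do not hit the tool in lockstep.
    public static TimeSpan ComputeDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
    {
        double ceiling = Ceiling(attempt, baseDelay, maxDelay).TotalMilliseconds;
        return TimeSpan.FromMilliseconds(rng.NextDouble() * ceiling);
    }

    // Runs `operation` up to maxAttempts times, waiting a jittered delay after
    // each failure. The last failure (or a cancellation) propagates to the caller.
    public static async Task<T> RetryAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay,
        CancellationToken ct = default)
    {
        var rng = new Random();
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation(ct);
            }
            catch (Exception) when (attempt + 1 < maxAttempts && !ct.IsCancellationRequested)
            {
                await Task.Delay(ComputeDelay(attempt, baseDelay, maxDelay, rng), ct);
            }
        }
    }
}
```

Full jitter spreads simultaneous retries across the whole window, which keeps a throttled API from receiving synchronised waves of retries. Which jitter variant LM-Kit uses is not stated on this page.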
Agent execution has failure modes you do not see in regular HTTP code. A model can produce malformed output. A tool can return an unexpected shape. A planning loop can stall. A delegated worker can time out. Generic resilience libraries miss these because they treat every operation as a plain request/response. Agent-specific resilience is designed around these failure modes.
A flaky HTTP tool call is retried with exponential backoff, so the agent never sees the transient failure.
AgentExecutionOptions.MaxIterations stops a stalled planning loop. Combined with TimeoutPolicy, every run is bounded in both iterations and wall-clock time, so a runaway agent cannot run indefinitely.
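To make the two bounds concrete, here is a hedged sketch of a loop capped by both an iteration count and a wall-clock timeout. It only plays the roles of AgentExecutionOptions.MaxIterations and TimeoutPolicy; `BoundedLoop`, `LoopOutcome`, and the `step` delegate are hypothetical. The timeout relies on CancellationTokenSource cancelling its token after the given delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch, not LM-Kit code: a loop bounded by an iteration cap
// and by a wall-clock timeout.
public enum LoopOutcome { Completed, IterationCapReached, TimedOut }

public static class BoundedLoop
{
    // `step` receives the iteration index and returns true once it has a final answer.
    public static async Task<(LoopOutcome Outcome, int Iterations)> RunAsync(
        Func<int, CancellationToken, Task<bool>> step,
        int maxIterations,
        TimeSpan timeout)
    {
        // The token is cancelled automatically once `timeout` has elapsed.
        using var cts = new CancellationTokenSource(timeout);
        int i = 0;
        try
        {
            while (i < maxIterations)
            {
                bool done = await step(i, cts.Token);
                i++;
                if (done) return (LoopOutcome.Completed, i);
                // Steps that ignore the token are still stopped at the next boundary.
                cts.Token.ThrowIfCancellationRequested();
            }
            return (LoopOutcome.IterationCapReached, i);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            return (LoopOutcome.TimedOut, i);
        }
    }
}
```

The iteration cap catches loops that spin quickly without converging; the timeout catches loops where a single step hangs. Neither bound alone covers both cases.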
FallbackPolicy swaps in a smaller model or a deterministic responder when the primary fails. Quality degrades gracefully.
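The fallback idea fits in a few lines. In this sketch, which is an assumption rather than the FallbackPolicy API, `primary` and `fallback` are hypothetical delegates standing in for the main agent and the smaller model or deterministic responder. Any exception except cancellation routes the prompt to the fallback, and the result records which path answered.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of fallback semantics, not the LM-Kit FallbackPolicy API.
public static class FallbackSketch
{
    public static async Task<(string Answer, bool UsedFallback)> RunAsync(
        Func<string, Task<string>> primary,
        Func<string, Task<string>> fallback,
        string prompt)
    {
        try
        {
            return (await primary(prompt), false);
        }
        // A cancelled request is not "failed", so it is not answered by the fallback.
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Degrade gracefully: a smaller model or a deterministic responder answers instead.
            return (await fallback(prompt), true);
        }
    }
}
```

Recording `UsedFallback` makes degraded answers visible to callers and to tracing, rather than silently mixing them with primary answers.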
BulkheadPolicy caps concurrent agent runs per pool. One runaway tenant does not exhaust the GPU.
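A bulkhead can be approximated with SemaphoreSlim from the .NET base library. The sketch below is hypothetical (`BulkheadSketch` is not the BulkheadPolicy API): it lets `maxConcurrent` runs execute, lets up to `maxQueued` more wait for a slot, and rejects anything beyond that with an exception.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical bulkhead sketch, not the LM-Kit BulkheadPolicy implementation.
public sealed class BulkheadSketch
{
    private readonly SemaphoreSlim _slots; // limits how many runs execute at once
    private readonly int _maxTotal;        // running + waiting
    private int _inFlight;                 // running + waiting right now

    public BulkheadSketch(int maxConcurrent, int maxQueued)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        _maxTotal = maxConcurrent + maxQueued;
    }

    public int InFlight => Volatile.Read(ref _inFlight);

    public async Task<T> RunAsync<T>(Func<Task<T>> work, CancellationToken ct = default)
    {
        // Reserve a place in the bulkhead, or reject immediately if it is full.
        if (Interlocked.Increment(ref _inFlight) > _maxTotal)
        {
            Interlocked.Decrement(ref _inFlight);
            throw new InvalidOperationException("Bulkhead full: request rejected.");
        }
        try
        {
            await _slots.WaitAsync(ct); // queue until a slot frees up
            try { return await work(); }
            finally { _slots.Release(); }
        }
        finally
        {
            Interlocked.Decrement(ref _inFlight);
        }
    }
}
```

Rejecting early when the queue is full keeps latency bounded for the requests that are accepted, instead of letting an unbounded backlog build up behind a busy GPU.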
RateLimitPolicy enforces token-bucket caps per agent or per user. Useful when downstream tools have quotas.
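A token bucket holds up to `capacity` tokens, refills at a fixed rate, and lets a call through only if a token is available. This hypothetical `TokenBucket` is not the RateLimitPolicy API; it takes the current time as a parameter so its behaviour is deterministic.

```csharp
using System;

// Hypothetical token-bucket sketch, not the LM-Kit RateLimitPolicy API.
public sealed class TokenBucket
{
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly object _gate = new object();
    private double _tokens;
    private DateTime _last;

    public TokenBucket(double capacity, double refillPerSecond, DateTime now)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _tokens = capacity; // start full, allowing an initial burst
        _last = now;
    }

    // Spends `cost` tokens and returns true, or returns false if not enough are left.
    public bool TryTake(DateTime now, double cost = 1)
    {
        lock (_gate)
        {
            // Credit tokens for the time elapsed since the last refill, up to capacity.
            if (now > _last)
            {
                _tokens = Math.Min(_capacity, _tokens + (now - _last).TotalSeconds * _refillPerSecond);
                _last = now;
            }
            if (_tokens < cost) return false;
            _tokens -= cost;
            return true;
        }
    }
}
```

For per-user caps, one bucket per user key (for example in a ConcurrentDictionary<string, TokenBucket>) is a common arrangement; how RateLimitPolicy scopes its buckets is not described on this page.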
AgentHealthCheck reports Healthy, Degraded, or Unhealthy. Wires straight into ASP.NET Core health endpoints.
Policies compose with CompositePolicy. Wrap the ResilientAgentExecutor around any agent and run it as normal. Failures are caught, retried, traced, or routed to a fallback, depending on the policy.
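Composition is easiest to picture as each policy wrapping the next, like middleware. The sketch below illustrates that model only; `Policy<T>` and `Compose.Wrap` are hypothetical, and because the page does not say how CompositePolicy orders its members, the first-is-outermost rule here is a choice made for the example.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of policy composition, not the LM-Kit CompositePolicy.
// A policy receives the next step in the chain and decides how to invoke it.
public delegate Task<T> Policy<T>(Func<Task<T>> next);

public static class Compose
{
    // The first policy in the list becomes the outermost layer.
    public static Func<Task<T>> Wrap<T>(Func<Task<T>> action, params Policy<T>[] policies)
    {
        Func<Task<T>> current = action;
        // Build from the innermost (last) policy outwards.
        for (int i = policies.Length - 1; i >= 0; i--)
        {
            var policy = policies[i];
            var inner = current;
            current = () => policy(inner);
        }
        return current;
    }
}
```

Order matters under this model: a timeout placed outside a retry bounds the total time of all attempts, whereas a timeout placed inside it bounds each attempt separately.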
Stack timeout, retry, circuit breaker, and fallback policies around a primary agent, then run as normal.
using System;
using LMKit.Agents;
using LMKit.Agents.Resilience;

// `model` and `smallerModel` are assumed to have been loaded earlier.
var primary = Agent.CreateBuilder(model).Build();
var fallback = Agent.CreateBuilder(smallerModel).Build();

var policy = new CompositePolicy(
    new TimeoutPolicy(TimeSpan.FromSeconds(30)),
    new RetryPolicy(maxAttempts: 3, baseDelay: TimeSpan.FromMilliseconds(500)),
    new CircuitBreakerPolicy(failureThreshold: 5, cooldown: TimeSpan.FromMinutes(1)),
    new FallbackPolicy(fallback)
);

var executor = new ResilientAgentExecutor(primary, policy);

// Same call site as a regular agent.
var result = await executor.RunAsync("Summarise the day's incidents");
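The circuit breaker's three states can be sketched as a small state machine. This `CircuitBreakerSketch` is hypothetical and not CircuitBreakerPolicy: it assumes the threshold counts consecutive failures, takes the current time as a parameter, and is not thread-safe. In the half-open state it lets calls through, and the next recorded outcome either closes the circuit or reopens it for another cooldown.

```csharp
using System;

// Hypothetical circuit-breaker sketch, not the LM-Kit implementation.
public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitBreakerSketch
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _cooldown;
    private int _consecutiveFailures;
    private DateTime _openedAt;
    private CircuitState _state = CircuitState.Closed;

    public CircuitBreakerSketch(int failureThreshold, TimeSpan cooldown)
    {
        _failureThreshold = failureThreshold;
        _cooldown = cooldown;
    }

    // State at time `now`: an Open circuit becomes HalfOpen once the cooldown has elapsed.
    public CircuitState GetState(DateTime now)
    {
        if (_state == CircuitState.Open && now - _openedAt >= _cooldown)
            _state = CircuitState.HalfOpen;
        return _state;
    }

    // Calls are rejected only while the circuit is Open.
    public bool AllowRequest(DateTime now) => GetState(now) != CircuitState.Open;

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        _state = CircuitState.Closed;
    }

    public void RecordFailure(DateTime now)
    {
        _consecutiveFailures++;
        // A failed trial while HalfOpen, or too many failures while Closed, trips the breaker.
        if (_state == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            _state = CircuitState.Open;
            _openedAt = now;
        }
    }
}
```

While the circuit is open, calls fail fast instead of waiting on a dependency that is known to be down, which is what lets a fallback answer quickly.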
Cap concurrent agent runs to protect the host, with built-in health checks for liveness probes.
using System;
using LMKit.Agents.Resilience;

// `agent` is assumed to be an agent built earlier, e.g. with Agent.CreateBuilder(model).Build().

// Cap concurrency to 4 agent runs at a time. Excess requests queue, up to 16.
var bulkhead = new BulkheadPolicy(maxConcurrent: 4, maxQueued: 16);
var executor = new ResilientAgentExecutor(agent, bulkhead);

// Health check exposes Healthy / Degraded / Unhealthy.
var health = await executor.HealthCheck.CheckAsync();
Console.WriteLine(health.State); // Degraded if half the queue is full
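The inline comment above ties the Degraded state to queue occupancy. A hedged sketch of such a rule follows; the thresholds (Degraded from a half-full queue, Unhealthy when the queue is full or a circuit is open) and the names `HealthSketch` and `PoolHealth` are assumptions, not LM-Kit's actual logic.

```csharp
using System;

// Hypothetical health rule for a bulkhead-style pool. The thresholds are
// assumptions for illustration, not LM-Kit's AgentHealthCheck logic.
public enum PoolHealth { Healthy, Degraded, Unhealthy }

public static class HealthSketch
{
    public static PoolHealth Classify(int queued, int maxQueued, bool circuitOpen)
    {
        if (maxQueued <= 0) throw new ArgumentOutOfRangeException(nameof(maxQueued));
        // An open circuit or a full queue means new work cannot be served.
        if (circuitOpen || queued >= maxQueued) return PoolHealth.Unhealthy;
        // At least half the queue in use: still serving, but under pressure.
        if (queued * 2 >= maxQueued) return PoolHealth.Degraded;
        return PoolHealth.Healthy;
    }
}
```

In ASP.NET Core, a result like this would typically be surfaced by implementing IHealthCheck and registering it with services.AddHealthChecks(); the page states that AgentHealthCheck wires into those endpoints, but the exact integration is not shown here.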
Excellent for HTTP. Generic for AI. You write the integration with the agent execution loop, the iteration counters, and the model fallback semantics yourself.
Tool-call retries exist but are basic. Circuit breaking, bulkheads, and health checks are bring-your-own.
Built specifically for agent execution: retries, circuit breakers, timeouts, fallbacks, bulkheads, rate limits, and health checks, all wired into ResilientAgentExecutor.
Each retry, breaker trip, and fallback emits a span. Find regressions before customers do.
Layer permission policies on top of resilience policies. Failure semantics and security semantics are both first-class.
Use middleware to add domain-specific failure handling: redact, normalise, salvage.
Step-by-step guide combining timeouts, retries, fallbacks, and tracing.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.