
Agents that survive Tuesday.

Demo agents work fine. Production agents face flaky tools, throttled APIs, malformed model output, and the occasional out-of-memory event. LM-Kit ships a Polly-style resilience namespace built specifically for agent execution: retries, circuit breakers, timeouts, fallbacks, bulkheads, rate limits, and composite policies.

7 policy types · Composable · Health checks

`RetryPolicy`

Exponential backoff retries with maximum attempts and jitter.

`CircuitBreakerPolicy`

Trip the circuit after N failures, half-open after a cooldown.

`FallbackPolicy`

Switch to a smaller model or a deterministic responder on failure.
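
The circuit-breaker behaviour described above (trip after N failures, half-open after a cooldown) can be pictured as a small state machine. The sketch below is a minimal illustration, not LM-Kit's CircuitBreakerPolicy: `SimpleBreaker`, `BreakerState`, and the injected clock are hypothetical names chosen for this example, and the class is not thread-safe.

```csharp
using System;

// Fake clock so the cooldown can be exercised without waiting.
var clock = new DateTime(2024, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var breaker = new SimpleBreaker(failureThreshold: 2, cooldown: TimeSpan.FromSeconds(30), now: () => clock);

breaker.RecordFailure();
breaker.RecordFailure();                      // threshold reached: circuit opens
Console.WriteLine(breaker.AllowRequest());    // False
clock = clock.AddSeconds(31);                 // cooldown elapses
Console.WriteLine(breaker.AllowRequest());    // True
Console.WriteLine(breaker.State);             // HalfOpen

// Hypothetical types for illustration only; not part of LM-Kit.
enum BreakerState { Closed, Open, HalfOpen }

sealed class SimpleBreaker
{
    private readonly int _threshold;
    private readonly TimeSpan _cooldown;
    private readonly Func<DateTime> _now;
    private int _failures;
    private DateTime _openedAt;

    public BreakerState State { get; private set; } = BreakerState.Closed;

    public SimpleBreaker(int failureThreshold, TimeSpan cooldown, Func<DateTime> now)
    {
        _threshold = failureThreshold;
        _cooldown = cooldown;
        _now = now;
    }

    // True if a call may proceed. An open circuit becomes half-open
    // once the cooldown has elapsed, letting a trial call through.
    public bool AllowRequest()
    {
        if (State == BreakerState.Open && _now() - _openedAt >= _cooldown)
            State = BreakerState.HalfOpen;
        return State != BreakerState.Open;
    }

    // Any success closes the circuit and clears the failure count.
    public void RecordSuccess()
    {
        _failures = 0;
        State = BreakerState.Closed;
    }

    // A failed trial call in half-open reopens the circuit immediately;
    // otherwise the circuit opens when the failure count hits the threshold.
    public void RecordFailure()
    {
        _failures++;
        if (State == BreakerState.HalfOpen || _failures >= _threshold)
        {
            State = BreakerState.Open;
            _openedAt = _now();
        }
    }
}
```

Injecting the clock keeps the example deterministic; a production breaker would also need locking or atomic state updates.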

Why bake resilience in

The failure modes are not the same.

Agent execution has failure modes you do not see in regular HTTP code. A model can produce malformed output. A tool can return an unexpected shape. A planning loop can stall. A delegated worker can time out. Generic resilience libraries treat every operation as a request/response pair, so they have no notion of iterations, tool calls, or model output. Agent-specific policies are applied at those points in the execution loop.

Tool-level retry

A flaky HTTP tool retries with exponential backoff. The agent never sees the transient failure.
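
As a rough illustration of what "exponential backoff with jitter" means for a tool call, here is a minimal standalone sketch. It does not use or reproduce LM-Kit's RetryPolicy; `Backoff`, `Delay`, and `RetryAsync` are hypothetical names, and the jitter range (50% to 100% of the capped delay) is an assumption chosen for the example.

```csharp
using System;
using System.Threading.Tasks;

// A stand-in for a flaky tool: fails twice, then succeeds.
int calls = 0;
string result = await Backoff.RetryAsync(
    () => ++calls < 3
        ? Task.FromException<string>(new TimeoutException("transient"))
        : Task.FromResult("ok"),
    maxAttempts: 5,
    baseDelay: TimeSpan.FromMilliseconds(10),
    maxDelay: TimeSpan.FromSeconds(1));

Console.WriteLine($"{result} after {calls} calls");   // ok after 3 calls

// Hypothetical helper for illustration only; not part of LM-Kit.
static class Backoff
{
    // Delay before retry number `attempt` (1-based):
    // baseDelay * 2^(attempt - 1), capped at maxDelay, scaled by a jitter factor in [0.5, 1.0).
    public static TimeSpan Delay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
    {
        double exponential = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        double capped = Math.Min(exponential, maxDelay.TotalMilliseconds);
        double jitter = 0.5 + rng.NextDouble() * 0.5;
        return TimeSpan.FromMilliseconds(capped * jitter);
    }

    // Runs `action`, retrying on any exception until maxAttempts tries have been made.
    // The exception from the final attempt propagates to the caller.
    public static async Task<T> RetryAsync<T>(
        Func<Task<T>> action, int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        var rng = new Random();
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                await Task.Delay(Delay(attempt, baseDelay, maxDelay, rng));
            }
        }
    }
}
```

The sketch retries on every exception for brevity. Real tool wrappers usually retry only on failures known to be transient, such as timeouts or HTTP 429 and 503 responses.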

Iteration limits

AgentExecutionOptions.MaxIterations stops a stalled planning loop. Combined with TimeoutPolicy, every run is bounded in both iteration count and wall-clock time.
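
To make the two bounds concrete, the sketch below runs a generic step loop under an iteration cap and a cancellation-based deadline. It is an illustration of the idea only: `BoundedLoop` and `LoopOutcome` are hypothetical names, and nothing here reflects how AgentExecutionOptions or TimeoutPolicy are implemented.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Wall-clock bound: the token is cancelled after 5 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));

// A step that never reports completion stands in for a stalled planner.
LoopOutcome outcome = await BoundedLoop.RunAsync(
    step: _ => Task.FromResult(false),
    maxIterations: 10,
    token: cts.Token);

Console.WriteLine(outcome);   // IterationLimit

// Hypothetical types for illustration only; not part of LM-Kit.
enum LoopOutcome { Completed, IterationLimit, TimedOut }

static class BoundedLoop
{
    // Calls `step` until it returns true, the iteration cap is reached,
    // or the token is cancelled, whichever comes first.
    public static async Task<LoopOutcome> RunAsync(
        Func<CancellationToken, Task<bool>> step, int maxIterations, CancellationToken token)
    {
        for (int i = 0; i < maxIterations; i++)
        {
            if (token.IsCancellationRequested)
                return LoopOutcome.TimedOut;
            try
            {
                if (await step(token))
                    return LoopOutcome.Completed;
            }
            catch (OperationCanceledException)
            {
                // The step observed the cancelled token mid-flight.
                return LoopOutcome.TimedOut;
            }
        }
        return LoopOutcome.IterationLimit;
    }
}
```

The iteration cap catches loops that spin quickly without progress; the deadline catches a single step that hangs, provided the step honours the token.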

Model fallback

FallbackPolicy swaps in a smaller model or a deterministic responder when the primary fails. Quality degrades gracefully.

Resource isolation

BulkheadPolicy caps concurrent agent runs per pool. One runaway tenant does not exhaust the GPU.

Rate limiting

RateLimitPolicy enforces token-bucket caps per agent or per user. Useful when downstream tools have quotas.
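
For readers unfamiliar with token buckets, the following is a minimal sketch of the algorithm itself. It is not RateLimitPolicy: `TokenBucket`, `TryAcquire`, and the injected clock are hypothetical names for this example, and the class is not thread-safe.

```csharp
using System;

// Fake clock so refill can be shown without sleeping.
var clock = new DateTime(2024, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var bucket = new TokenBucket(capacity: 2, refillPerSecond: 1, now: () => clock);

Console.WriteLine(bucket.TryAcquire());   // True  (2 tokens -> 1)
Console.WriteLine(bucket.TryAcquire());   // True  (1 token  -> 0)
Console.WriteLine(bucket.TryAcquire());   // False (bucket empty)
clock = clock.AddSeconds(1);              // one token refills
Console.WriteLine(bucket.TryAcquire());   // True

// Hypothetical type for illustration only; not part of LM-Kit.
sealed class TokenBucket
{
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTime> _now;
    private double _tokens;
    private DateTime _last;

    public TokenBucket(double capacity, double refillPerSecond, Func<DateTime> now)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _now = now;
        _tokens = capacity;   // start full, which allows an initial burst
        _last = now();
    }

    // Refills according to elapsed time (never above capacity),
    // then spends `cost` tokens if enough are available.
    public bool TryAcquire(double cost = 1)
    {
        DateTime t = _now();
        _tokens = Math.Min(_capacity, _tokens + (t - _last).TotalSeconds * _refillPerSecond);
        _last = t;
        if (_tokens < cost)
            return false;
        _tokens -= cost;
        return true;
    }
}
```

Capacity sets the burst size and the refill rate sets the sustained throughput, which is why the pattern maps well onto downstream quotas. Keeping one bucket per agent or per user gives the per-key caps described above.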

Health checks

AgentHealthCheck reports Healthy, Degraded, or Unhealthy. Wires straight into ASP.NET Core health endpoints.

Composing policies

Wrap an agent in production armour.

Policies compose with CompositePolicy. Wrap any agent in a ResilientAgentExecutor and call it as you normally would. Each failure is caught, retried, traced, or routed to a fallback, depending on the policy that handles it.

Stack timeout, retry, circuit breaker, and fallback policies around a primary agent, then run as normal.

ResilientAgent.cs

using System;
using LMKit.Agents;
using LMKit.Agents.Resilience;

// `model` and `smallerModel` are assumed to be model instances loaded earlier.
var primary  = Agent.CreateBuilder(model).Build();
var fallback = Agent.CreateBuilder(smallerModel).Build();

var policy = new CompositePolicy(
    new TimeoutPolicy(TimeSpan.FromSeconds(30)),
    new RetryPolicy(maxAttempts: 3, baseDelay: TimeSpan.FromMilliseconds(500)),
    new CircuitBreakerPolicy(failureThreshold: 5, cooldown: TimeSpan.FromMinutes(1)),
    new FallbackPolicy(fallback)
);

var executor = new ResilientAgentExecutor(primary, policy);

// Same call site as a regular agent.
var result = await executor.RunAsync("Summarise the day's incidents");
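
One way to think about policy composition in general is as nested wrappers around a call, where the order of wrapping decides which policy sees a failure first. The sketch below illustrates that idea with plain delegates. It says nothing about how CompositePolicy orders or evaluates its policies, which this page does not specify; `Run`, `Policy`, `Compose`, and `Policies` are hypothetical names, and "first listed is outermost" is an assumption of the sketch.

```csharp
using System;
using System.Threading.Tasks;

int calls = 0;
Run primary = _ =>
{
    calls++;
    return Task.FromException<string>(new InvalidOperationException("primary down"));
};
Run backup = input => Task.FromResult($"fallback answer to '{input}'");

// Fallback is outermost, so it only fires after Retry has given up.
Run guarded = Compose.Wrap(primary, Policies.Fallback(backup), Policies.Retry(3));

Console.WriteLine(await guarded("ping"));   // fallback answer to 'ping'
Console.WriteLine(calls);                   // 3

// Hypothetical types for illustration only; not part of LM-Kit.
delegate Task<string> Run(string input);
delegate Run Policy(Run next);   // a policy wraps one runner and returns another

static class Compose
{
    // Wraps from the last policy inward, so the first policy listed ends up outermost.
    public static Run Wrap(Run inner, params Policy[] policies)
    {
        Run current = inner;
        for (int i = policies.Length - 1; i >= 0; i--)
            current = policies[i](current);
        return current;
    }
}

static class Policies
{
    // Re-invokes the wrapped runner until it succeeds or maxAttempts tries have been made.
    public static Policy Retry(int maxAttempts) => next => async input =>
    {
        for (int attempt = 1; ; attempt++)
        {
            try { return await next(input); }
            catch (Exception) when (attempt < maxAttempts) { /* try again */ }
        }
    };

    // Routes to `fallback` when the wrapped runner throws.
    public static Policy Fallback(Run fallback) => next => async input =>
    {
        try { return await next(input); }
        catch (Exception) { return await fallback(input); }
    };
}
```

Swapping the two policies changes the behaviour: with Retry outermost, the fallback would answer on the first failure and the retry would never trigger.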

Cap concurrent agent runs to protect the host, with built-in health checks for liveness probes.

Bulkheaded.cs

using System;
using LMKit.Agents.Resilience;

// Cap concurrency at 4 agent runs at a time. Up to 16 excess requests queue.
var bulkhead = new BulkheadPolicy(maxConcurrent: 4, maxQueued: 16);

// `agent` is assumed to be an existing agent, built as in ResilientAgent.cs.
var executor = new ResilientAgentExecutor(agent, bulkhead);

// Health check exposes Healthy / Degraded / Unhealthy.
var health = await executor.HealthCheck.CheckAsync();
Console.WriteLine(health.State);  // HealthState.Degraded if half the queue is full
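
For intuition about what a bulkhead does underneath, here is a minimal sketch built on SemaphoreSlim. It is not BulkheadPolicy: `SimpleBulkhead` is a hypothetical name, and rejecting with an exception when the queue is full is an assumption of this example, not documented LM-Kit behaviour.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var bulkhead = new SimpleBulkhead(maxConcurrent: 1, maxQueued: 0);
var gate = new TaskCompletionSource<int>();

// The first call takes the only slot and stays there until the gate opens.
Task<int> first = bulkhead.ExecuteAsync(() => gate.Task);

try
{
    await bulkhead.ExecuteAsync(() => Task.FromResult(2));
}
catch (InvalidOperationException e)
{
    Console.WriteLine(e.Message);   // Bulkhead queue is full.
}

gate.SetResult(1);
Console.WriteLine(await first);     // 1

// Hypothetical type for illustration only; not part of LM-Kit.
sealed class SimpleBulkhead
{
    private readonly SemaphoreSlim _slots;
    private readonly int _maxQueued;
    private int _queued;

    public SimpleBulkhead(int maxConcurrent, int maxQueued)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        _maxQueued = maxQueued;
    }

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> work)
    {
        // Fast path: take a free slot without queueing.
        if (!_slots.Wait(0))
        {
            // No free slot: join the queue, or reject if it is already full.
            if (Interlocked.Increment(ref _queued) > _maxQueued)
            {
                Interlocked.Decrement(ref _queued);
                throw new InvalidOperationException("Bulkhead queue is full.");
            }
            try { await _slots.WaitAsync(); }
            finally { Interlocked.Decrement(ref _queued); }
        }

        // A slot is held from here until the work finishes or throws.
        try { return await work(); }
        finally { _slots.Release(); }
    }
}
```

Using one instance per pool or tenant means a burst in one pool can only fill its own slots and queue.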

Versus the alternatives

Generic resilience does not know about agents.

Polly (raw)

Excellent for HTTP, generic for AI. You write the integration with the agent execution loop, the iteration counters, and the model fallback semantics yourself.

LangChain retries

Tool-call retries exist but are simple. Circuit breaking, bulkheads, and health checks are bring-your-own.

LM-Kit Resilience

Built specifically for agent execution. Retries, circuit breakers, timeouts, fallbacks, bulkheads, rate limits, and health checks are all wired into ResilientAgentExecutor.

Related capabilities

Pair with observability, permissions, and filters.

Observability

Each retry, breaker trip, and fallback emits a span. Find regressions before customers do.

Permissions

Layer permission policies on top of resilience policies. Failure semantics and security semantics are both first-class.

Filter pipeline

Use middleware to add domain-specific failure handling: redact, normalise, salvage.

Build a resilient production agent

Step-by-step guide combining timeouts, retries, fallbacks, and tracing.

Demos & docs

Build it. Read it. Try it.

Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.

Demo

Demo agents versus production agents.
