Tools: 4 Fault Tolerance Patterns Every AI Agent Needs In Production

Tools: 4 Fault Tolerance Patterns Every AI Agent Needs In Production

Your AI agent handles 50 requests in development. Every one succeeds. You deploy to production and within 72 hours, the provider rate-limits you, a tool returns malformed JSON, and your agent enters an infinite retry loop that burns $200 before anyone notices.

This is not a hypothetical. We built a multi-agent system that ran 14 teams of autonomous agents. Three of those teams crashed in the first week — not because the logic was wrong, but because nothing in the system knew how to fail gracefully.

Here are 4 fault tolerance patterns we implemented to fix it. Each one uses production-tested code with LangGraph and LangChain.

The simplest failure mode: a transient error. The API returns a 503, a database connection drops, a model provider hiccups. Most developers handle this with a bare try/except and a fixed retry. That creates thundering herds during outages.

LangGraph has built-in retry policies that handle this correctly — exponential backoff with jitter, configurable per node.

The default retry_on function is smart: it retries on most exceptions but skips ValueError, TypeError, and ImportError — errors that won't resolve on retry. For HTTP requests, it specifically retries on 5xx status codes.

Why this matters: Fixed-interval retries during a provider outage create a spike of simultaneous requests when the service recovers. Exponential backoff with jitter spreads those retries across time, reducing the chance of cascading failure.

Your primary model goes down. Without a fallback, your entire agent stops. With LangChain's middleware system, you can define a fallback chain that switches models automatically.

The middleware tries each model in order. If gpt-4.1 throws an error, it falls to gpt-4.1-mini. If that fails too, it tries Claude. The agent's tool calls, system prompt, and conversation history all carry over — only the model changes.

You can combine this with the retry middleware for defense in depth:

Source: Dev.to