Tools: 4 Fault Tolerance Patterns Every AI Agent Needs In Production

2026-03-02, admin

Your AI agent handles 50 requests in development. Every one succeeds. You deploy to production and within 72 hours, the provider rate-limits you, a tool returns malformed JSON, and your agent enters an infinite retry loop that burns $200 before anyone notices.

This is not a hypothetical. We built a multi-agent system that ran 14 teams of autonomous agents. Three of those teams crashed in the first week — not because the logic was wrong, but because nothing in the system knew how to fail gracefully.

Here are 4 fault tolerance patterns we implemented to fix it. Each one uses production-tested code with LangGraph and LangChain.

The simplest failure mode: a transient error. The API returns a 503, a database connection drops, a model provider hiccups. Most developers handle this with a bare try/except and a fixed retry. That creates thundering herds during outages.

LangGraph has built-in retry policies that handle this correctly — exponential backoff with jitter, configurable per node.

The default retry_on function retries on most exceptions but skips those that will not resolve on retry, such as ValueError, TypeError, and ImportError. For HTTP errors, it retries only on 5xx status codes.
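To make that rule concrete, here is a minimal, library-independent sketch of a retry predicate with the behavior described above. It is not LangGraph's actual implementation: `HTTPStatusError`, `NON_RETRYABLE`, and `should_retry` are hypothetical names invented for this illustration.

```python
class HTTPStatusError(Exception):
    """Hypothetical HTTP error carrying a status code (illustration only)."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


# Errors that point to a bug or bad input; running the same call again cannot fix them.
NON_RETRYABLE = (ValueError, TypeError, ImportError)


def should_retry(exc: Exception) -> bool:
    """Return True if the failure looks transient and is worth retrying."""
    if isinstance(exc, HTTPStatusError):
        # Server-side errors (5xx) may clear up; client errors (4xx) will not.
        return 500 <= exc.status_code < 600
    if isinstance(exc, NON_RETRYABLE):
        return False
    # Everything else (dropped connections, timeouts, ...) is treated as transient.
    return True
```

With this predicate, a 503 or a dropped connection is retried, while a 404 or a `ValueError` from malformed input fails immediately instead of burning attempts.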

Why this matters: Fixed-interval retries during a provider outage create a spike of simultaneous requests when the service recovers. Exponential backoff with jitter spreads those retries across time, reducing the chance of cascading failure.
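The idea can be sketched in plain Python. This is an illustrative sketch, not the code LangGraph runs: `backoff_delay` and `retry_with_backoff` are hypothetical helpers, the default values are assumptions, and the jitter variant shown is "full jitter" (each delay is drawn uniformly between zero and the exponential ceiling), which is one common choice among several.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, initial: float = 0.5, factor: float = 2.0,
                  max_delay: float = 30.0) -> float:
    """Delay before retry number `attempt` (1-based), using full jitter."""
    # The ceiling grows exponentially: 0.5s, 1s, 2s, ... capped at max_delay.
    ceiling = min(max_delay, initial * factor ** (attempt - 1))
    # Drawing uniformly from [0, ceiling] spreads clients across the window.
    return random.uniform(0, ceiling)


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on any exception with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Because every client draws a different random delay, a hundred agents that failed at the same moment do not all come back at the same moment. The `sleep` parameter is injectable so the wrapper can be tested without actually waiting.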

Your primary model goes down. Without a fallback, your entire agent stops. With LangChain's middleware system, you can define a fallback chain that switches models automatically.

The middleware tries each model in order. If gpt-4.1 throws an error, it falls back to gpt-4.1-mini. If that fails too, it tries Claude. The agent's tool calls, system prompt, and conversation history all carry over; only the model changes.
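The try-in-order behavior can be sketched without any framework. The code below is not LangChain's middleware API: `call_with_fallback`, `AllModelsFailed`, and the stand-in model functions are hypothetical, and a real model client would replace the plain callables.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is any callable that takes a prompt and returns text.
ModelFn = Callable[[str], str]


class AllModelsFailed(Exception):
    """Raised when every model in the chain has thrown an error."""


def call_with_fallback(models: Sequence[Tuple[str, ModelFn]],
                       prompt: str) -> Tuple[str, str]:
    """Try each (name, model) pair in order; return the first success."""
    errors: List[str] = []
    for name, model in models:
        try:
            # The same prompt goes to every model; only the model changes.
            return name, model(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
    raise AllModelsFailed("; ".join(errors))


# Stand-in models for demonstration: the primary is "down", the backup works.
def primary_down(prompt: str) -> str:
    raise TimeoutError("provider unavailable")


def backup_ok(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback(
    [("gpt-4.1", primary_down), ("gpt-4.1-mini", backup_ok)],
    "summarize the incident",
)
print(used, "->", reply)
```

Returning the name of the model that answered is a deliberate choice in this sketch: logging it makes a silent drop to a weaker model visible in your metrics.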

You can combine this with the retry middleware for defense in depth: retries absorb transient errors, and the fallback chain takes over when a model keeps failing.
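As a rough, framework-free sketch of that layering, the function below retries each model with jittered backoff before moving to the next one. It does not reproduce LangChain's middleware classes or their ordering semantics: `call_with_retry_then_fallback` and its parameters are hypothetical, chosen only to show how the two patterns nest.

```python
import random
import time
from typing import Callable, List, Sequence, Tuple

ModelFn = Callable[[str], str]


def call_with_retry_then_fallback(
    models: Sequence[Tuple[str, ModelFn]],
    prompt: str,
    max_attempts: int = 3,
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Retry each model up to `max_attempts` times, then fall back to the next."""
    errors: List[str] = []
    for name, model in models:  # outer layer: fallback chain
        for attempt in range(1, max_attempts + 1):  # inner layer: retries
            try:
                return name, model(prompt)
            except Exception as exc:
                errors.append(f"{name} attempt {attempt}: {exc!r}")
                if attempt < max_attempts:
                    # Jittered exponential backoff before the next attempt.
                    ceiling = initial_delay * 2 ** (attempt - 1)
                    sleep(random.uniform(0, ceiling))
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The worst-case number of model calls is `len(models) * max_attempts`, which puts a hard ceiling on cost: the opposite of the unbounded retry loop described at the top of this article.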

Source: Dev.to
