LiteLLM Proxy: The Open-Source Alternative for Multi-Provider LLM Failover and Load Balancing


Introduction: What If You Could Use ANY LLM Provider?

What is LiteLLM Proxy?

Architecture: LiteLLM Proxy vs Azure APIM

Azure APIM Architecture (Previous Article)

LiteLLM Proxy Architecture

Getting Started: 5-Minute Setup

Option 1: Docker (Recommended for Production)

Option 2: Python (Quick Testing)

The Configuration File

The Magic: How Failover Actually Works

Automatic 429 Handling

Load Balancing Strategies

Streaming Support: It Just Works

Production Configuration: Enterprise-Ready Setup

High Availability Deployment

Nginx Load Balancer Configuration

Advanced Features

1. Budget & Rate Limiting

2. Request Caching

3. Custom Callbacks & Logging

4. Guardrails & Content Moderation

Comparing Results: LiteLLM vs Azure APIM

When to Use Which?

Choose Azure APIM + Front Door When:

Choose LiteLLM Proxy When:

Production Checklist

Conclusion: The Right Tool for the Job

In my previous article, I walked through building a multi-region failover architecture for Azure OpenAI using Azure Front Door and APIM. It works brilliantly, but it is also Azure-specific, requires significant infrastructure, and locks you into a single provider ecosystem. Enter LiteLLM Proxy: an open-source unified gateway that gives you multi-provider failover and load balancing out of the box.

LiteLLM is an open-source Python library and proxy server that exposes one OpenAI-compatible endpoint for 100+ LLM providers, with built-in load balancing, automatic failover, rate-limit handling, cost tracking, and streaming support (the full feature list appears later in this article). The beauty? Your application code doesn't change. You point your OpenAI SDK at LiteLLM Proxy, and it handles the rest.

Here's how LiteLLM Proxy compares to the Azure-native approach.

Azure APIM + Front Door
Pros: native Azure integration, enterprise compliance, WAF protection.
Cons: Azure-only, complex policies, expensive at scale.

LiteLLM Proxy
Supported providers: Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ more.
Pros: provider-agnostic, simple configuration, open-source, runs anywhere.
Cons: self-managed infrastructure, requires containerization.

To get started, create litellm_config.yaml; the full configuration is shown in the code samples later in this article.

When Azure OpenAI returns a 429 (rate limit), LiteLLM automatically reads the Retry-After header, marks that deployment as cooling down, and routes the request to the next available deployment until one succeeds or all are exhausted.

LiteLLM supports multiple routing strategies, configured in your YAML.

Unlike the Azure APIM approach, where streaming requires special handling, LiteLLM Proxy handles SSE (Server-Sent Events) natively. If the primary provider fails mid-stream, LiteLLM detects the connection failure, retries on the next provider, and returns an error only if all providers fail.

For production, deploy multiple LiteLLM instances behind a load balancer.

Among the advanced features: budgets and rate limits let you control spending and prevent runaway costs, and you can create users with specific limits. Semantic caching with Redis reduces costs and latency, and custom callbacks track every request for observability. Guardrails add content moderation before requests reach the LLM.

I ran the same load test from my Azure article against both architectures; the results are summarized later in this article, along with a production checklist if you're deploying LiteLLM Proxy.

Both Azure APIM and LiteLLM Proxy solve the same fundamental problem: making LLM services reliable at scale. The choice depends on your constraints. Azure APIM is the enterprise choice when you're committed to Azure and need the full power of the platform's security and compliance features. LiteLLM Proxy is the pragmatic choice when you need flexibility, multi-provider support, or a simpler operational model.

The best part? These aren't mutually exclusive. You can run LiteLLM Proxy behind Azure Front Door to get the best of both worlds: enterprise edge security with flexible provider routing.

📦 LiteLLM GitHub: github.com/BerriAI/litellm
📄 LiteLLM Docs: docs.litellm.ai

The days of single-provider dependency are over. Whether you choose managed Azure services or open-source flexibility, the key is building resilience into your AI infrastructure from day one. Your 3 AM self will thank you.

Reader comment from Ali Muwwakkil: "A fascinating aspect of implementing multi-provider LLM setups is how often teams overlook agents' roles in managing load and failover strategies. In practice, we found leveraging custom agents for task-specific routing can dramatically enhance the efficiency of your LiteLLM Proxy setup. These agents aren't just about distributing load; they're about dynamically adapting to each provider's strengths, optimizing performance in real-time."
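The automatic 429 failover sequence described earlier (read Retry-After, cool down the deployment, route to the next one) can be sketched in plain Python. This is an illustrative simulation, not LiteLLM's actual implementation; the `RateLimitError` class and `route_with_failover` function are invented for the example, and the deployment ids mirror the ones used in the configuration later in this article.

```python
import time

class RateLimitError(Exception):
    """Simulated 429 response carrying a Retry-After value (seconds)."""
    def __init__(self, retry_after):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after

def route_with_failover(deployments, cooldowns, call, now=time.monotonic):
    """Try each deployment in order, skipping any still cooling down.

    `deployments` is an ordered list of deployment ids, `cooldowns` maps
    a deployment id to the timestamp when it becomes usable again, and
    `call` performs the actual request against one deployment.
    """
    for dep in deployments:
        if cooldowns.get(dep, 0) > now():
            continue  # still cooling down from an earlier 429
        try:
            return call(dep)
        except RateLimitError as err:
            # Honor Retry-After: mark this deployment as cooling down
            cooldowns[dep] = now() + err.retry_after
    raise RuntimeError("all deployments exhausted")

# Example: the primary is rate limited, so traffic moves to the secondary
cooldowns = {}
def fake_call(dep):
    if dep == "azure-westus-gpt4o":
        raise RateLimitError(retry_after=60)
    return f"response from {dep}"

result = route_with_failover(
    ["azure-westus-gpt4o", "azure-eastus-gpt4o"], cooldowns, fake_call)
print(result)  # response from azure-eastus-gpt4o
```

The point of the cooldown map is that subsequent requests skip the rate-limited deployment immediately instead of paying for another failed round trip.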

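To build intuition for the weighted traffic split shown in the routing configuration later in this article (70% Azure, 30% OpenAI), here is a toy weighted picker. It is not LiteLLM's router, just a sketch of the idea using the standard library.

```python
import random

def pick_deployment(weighted, rng):
    """Pick one deployment according to its traffic weight."""
    names = [name for name, _ in weighted]
    weights = [w for _, w in weighted]
    return rng.choices(names, weights=weights, k=1)[0]

# 70/30 split matching the model_group_alias example in this article
weighted = [("azure/gpt-4o", 0.7), ("openai/gpt-4o", 0.3)]
rng = random.Random(42)  # seeded so the demo is reproducible

counts = {"azure/gpt-4o": 0, "openai/gpt-4o": 0}
for _ in range(10_000):
    counts[pick_deployment(weighted, rng)] += 1

print(counts)  # roughly a 70/30 split over 10,000 picks
```

A real router layers health and rate-limit state on top of this, but the weight math is the same.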
Azure APIM request flow (previous article):

```
Client -> Azure Front Door -> Regional APIM -> Azure OpenAI (Primary)
                                            -> Azure OpenAI (Secondary)
```

LiteLLM Proxy request flow:

```
Client -> Load Balancer -> LiteLLM Proxy -> Azure OpenAI
                                         -> OpenAI Direct
                                         -> Anthropic Claude
                                         -> Google Gemini
                                         -> AWS Bedrock
                                         -> Any LLM Provider
```

Docker setup:

```bash
# Pull the official image
docker pull ghcr.io/berriai/litellm:main-latest

# Run with your config
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e AZURE_API_KEY="your-azure-key" \
  -e OPENAI_API_KEY="your-openai-key" \
  -e ANTHROPIC_API_KEY="your-anthropic-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```

Python setup:

```bash
pip install 'litellm[proxy]'
litellm --config litellm_config.yaml
```

litellm_config.yaml:

```yaml
model_list:
  # Primary: Azure OpenAI GPT-4o (West US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://westus-primary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-westus-gpt4o

  # Failover 1: Azure OpenAI GPT-4o (East US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus-secondary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_SECONDARY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-eastus-gpt4o

  # Failover 2: OpenAI Direct
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: openai-direct-gpt4o

  # Failover 3: Anthropic Claude (ultimate backup)
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: anthropic-claude-sonnet

litellm_settings:
  # Enable automatic failover
  num_retries: 3
  retry_after: 5

  # Fallback configuration
  fallbacks:
    - gpt-4o: [gpt-4o]  # Retry across all gpt-4o deployments

  # Request timeout
  request_timeout: 120

  # Enable streaming
  stream: true

router_settings:
  # Load balancing strategy
  routing_strategy: least-busy

  # Enable rate limit awareness
  enable_pre_call_checks: true

  # Cooldown failed deployments
  cooldown_time: 60

  # Number of retries per deployment
  num_retries: 2
  retry_after: 5
  allowed_fails: 3

general_settings:
  # Master key for proxy authentication
  master_key: os.environ/LITELLM_MASTER_KEY

  # Database for tracking (optional)
  database_url: os.environ/DATABASE_URL
```

Client usage:

```python
# Your code stays simple - LiteLLM handles everything
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"  # Point to LiteLLM Proxy
)

# This request automatically fails over if needed
response = client.chat.completions.create(
    model="gpt-4o",  # LiteLLM routes to the best available deployment
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Routing strategy configuration:

```yaml
router_settings:
  routing_strategy: latency-based-routing

  # Weighted traffic split across deployments
  model_group_alias:
    gpt-4o:
      - model: azure/gpt-4o
        weight: 0.7  # 70% of traffic
      - model: openai/gpt-4o
        weight: 0.3  # 30% of traffic
```

Streaming client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"
)

# Streaming works exactly like direct OpenAI
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about resilience"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

High-availability deployment:

```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm-1:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4001:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  litellm-2:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4002:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "4000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - litellm-1
      - litellm-2
    restart: always
```

Nginx load balancer configuration:

```nginx
# nginx.conf
events {
    worker_connections 1024;
}

http {
    upstream litellm {
        least_conn;
        server litellm-1:4000 weight=1;
        server litellm-2:4000 weight=1;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://litellm;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;
            proxy_buffering off;  # Important for streaming
        }

        location /health {
            proxy_pass http://litellm;
            proxy_connect_timeout 5s;
            proxy_read_timeout 5s;
        }
    }
}
```

Budget configuration:

```yaml
general_settings:
  master_key: sk-your-master-key

# User-level budgets
litellm_settings:
  max_budget: 100.00  # $100 max per user
  budget_duration: monthly
```

Creating a user with specific limits:

```bash
curl -X POST 'http://localhost:4000/user/new' \
  -H 'Authorization: Bearer sk-your-master-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'
```

Request caching with Redis:

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour cache
```

Callbacks and logging:

```yaml
litellm_settings:
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "slack"]

  # Langfuse integration
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY
```

Guardrails and content moderation:

```yaml
litellm_settings:
  guardrails:
    - guardrail_name: "content-filter"
      litellm_params:
        guardrail: openai_moderation
        mode: pre_call  # Check before sending to LLM
```

What you get with LiteLLM Proxy:
- Multi-provider failover (Azure OpenAI -> OpenAI -> Anthropic -> Gemini)
- A simpler deployment without managing APIM policies
- Provider-agnostic architecture that works anywhere
- Open-source flexibility with no vendor lock-in

Core features:
- Unified API: one OpenAI-compatible endpoint for 100+ LLM providers
- Built-in load balancing: distribute requests across multiple deployments
- Automatic failover: seamlessly retry on different models/providers when one fails
- Rate limit handling: intelligent retry with exponential backoff for 429 errors
- Cost tracking: monitor spend across all providers in one place
- Streaming support: full SSE (Server-Sent Events) support with proper failover

On a 429, LiteLLM:
- Reads the Retry-After header
- Marks that deployment as "cooling down"
- Routes the request to the next available deployment
- Continues until a successful response or all deployments are exhausted

On a mid-stream failure, LiteLLM will:
- Detect the connection failure
- Automatically retry on the next provider
- Return an error only if all providers fail

Load test results:
- LiteLLM showed slightly better latency due to its simpler request pipeline
- Both achieved similar reliability with proper configuration
- LiteLLM's multi-provider fallback provided an extra safety net
- The cost difference is significant for smaller teams

Choose Azure APIM + Front Door when:
- You're all-in on Azure and need native integration
- Enterprise compliance requirements mandate Azure services
- You need WAF/DDoS protection at the edge
- Your organization has existing APIM expertise
- Audit logging must stay within the Azure ecosystem

Choose LiteLLM Proxy when:
- You need multi-provider failover (not just multi-region)
- Cost optimization is a priority
- You want provider flexibility to switch easily
- Your team prefers simple YAML configuration over XML policies
- You're running on Kubernetes, AWS, GCP, or on-prem
- You need rapid prototyping and iteration

Production checklist:
- [ ] Deploy multiple instances: at least 2 behind a load balancer
- [ ] Enable health checks: configure /health endpoint monitoring
- [ ] Set up a database: PostgreSQL for persistence and analytics
- [ ] Configure caching: Redis for semantic caching
- [ ] Add monitoring: Prometheus + Grafana or Langfuse
- [ ] Set budget limits: prevent runaway costs
- [ ] Secure the proxy: use master key authentication
- [ ] Enable TLS: HTTPS in production (via nginx or cloud LB)
- [ ] Configure alerts: Slack/PagerDuty for failures
- [ ] Test failover: deliberately fail providers to verify behavior

About the author: Managing Director at Colaberry, Wylie, TX, focused on AI training and enterprise deployment.
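The Redis cache configured above is exact-match or semantic matching handled by LiteLLM itself, but the core mechanic is easy to see with an in-memory stand-in. This sketch is illustrative only: the `TTLCache` class is invented for the example and does exact-match caching, not semantic similarity, and a real deployment would use Redis as configured earlier.

```python
import hashlib
import time

class TTLCache:
    """Exact-match response cache with per-entry expiry.

    A minimal in-memory stand-in for the Redis cache shown in the
    configuration above (ttl: 3600 means entries live for one hour).
    """
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, model, messages):
        # Hash the model + message payload to form a stable cache key
        raw = repr((model, messages)).encode()
        return hashlib.sha256(raw).hexdigest()

    def get(self, model, messages, now=time.monotonic):
        entry = self.store.get(self._key(model, messages))
        if entry and entry[0] > now():
            return entry[1]
        return None  # missing or expired

    def put(self, model, messages, response, now=time.monotonic):
        self.store[self._key(model, messages)] = (now() + self.ttl, response)

cache = TTLCache(ttl_seconds=3600)
msgs = [{"role": "user", "content": "Hello!"}]

assert cache.get("gpt-4o", msgs) is None   # first request: cache miss
cache.put("gpt-4o", msgs, "Hi there!")     # store the provider's response
print(cache.get("gpt-4o", msgs))           # repeated request: served from cache
```

On a cache hit, the request never reaches a provider at all, which is where both the latency and cost savings come from.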