```
Client -> Azure Front Door -> Regional APIM -> Azure OpenAI (Primary) -> Azure OpenAI (Secondary)
```
```
Client -> Load Balancer -> LiteLLM Proxy -> Azure OpenAI -> OpenAI Direct -> Anthropic Claude -> Google Gemini -> AWS Bedrock -> Any LLM Provider
```
```bash
# Pull the official image
docker pull ghcr.io/berriai/litellm:main-latest

# Run with your config
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e AZURE_API_KEY="your-azure-key" \
  -e OPENAI_API_KEY="your-openai-key" \
  -e ANTHROPIC_API_KEY="your-anthropic-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```
```bash
pip install 'litellm[proxy]'
litellm --config litellm_config.yaml
```
```yaml
model_list:
  # Primary: Azure OpenAI GPT-4o (West US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://westus-primary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-westus-gpt4o

  # Failover 1: Azure OpenAI GPT-4o (East US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus-secondary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_SECONDARY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-eastus-gpt4o

  # Failover 2: OpenAI Direct
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: openai-direct-gpt4o

  # Failover 3: Anthropic Claude (ultimate backup)
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: anthropic-claude-sonnet

litellm_settings:
  # Enable automatic failover
  num_retries: 3
  retry_after: 5
  # Fallback configuration
  fallbacks:
    - gpt-4o: [gpt-4o]  # Retry across all gpt-4o deployments
  # Request timeout
  request_timeout: 120
  # Enable streaming
  stream: true

router_settings:
  # Load balancing strategy
  routing_strategy: least-busy
  # Enable rate limit awareness
  enable_pre_call_checks: true
  # Cooldown failed deployments
  cooldown_time: 60
  # Number of retries per deployment
  num_retries: 2
  # Retry behavior on failed status codes
  retry_after: 5
  allowed_fails: 3

general_settings:
  # Master key for proxy authentication
  master_key: os.environ/LITELLM_MASTER_KEY
  # Database for tracking (optional)
  database_url: os.environ/DATABASE_URL
```
```python
# Your code stays simple - LiteLLM handles everything
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"  # Point to LiteLLM Proxy
)

# This request automatically fails over if needed
response = client.chat.completions.create(
    model="gpt-4o",  # LiteLLM routes to best available
    messages=[{"role": "user", "content": "Hello!"}]
)
```
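Under the hood, the failover behavior amounts to a loop like the following. This is a standalone sketch with made-up deployment names and a fake `send` function, not LiteLLM's actual router code:

```python
# Hypothetical deployments in priority order (stand-ins for real ones)
DEPLOYMENTS = [
    "azure-westus-gpt4o",
    "azure-eastus-gpt4o",
    "openai-direct-gpt4o",
    "anthropic-claude-sonnet",
]

class AllDeploymentsFailed(Exception):
    pass

def call_with_failover(send, cooldowns: dict[str, float],
                       cooldown_time: float = 60.0, now: float = 0.0) -> str:
    """Try each deployment in order, skipping any still cooling down.

    `send(deployment)` returns a response string or raises on failure.
    """
    for deployment in DEPLOYMENTS:
        if cooldowns.get(deployment, -1.0) > now:
            continue  # still cooling down from an earlier failure
        try:
            return send(deployment)
        except Exception:
            # Mark the deployment as cooling down, then try the next one
            cooldowns[deployment] = now + cooldown_time
    raise AllDeploymentsFailed("all deployments exhausted")

# Simulate an outage: the primary fails, the secondary answers
def fake_send(deployment: str) -> str:
    if deployment == "azure-westus-gpt4o":
        raise ConnectionError("region outage")
    return f"response from {deployment}"

cooldowns: dict[str, float] = {}
print(call_with_failover(fake_send, cooldowns))  # response from azure-eastus-gpt4o
assert "azure-westus-gpt4o" in cooldowns  # primary is now cooling down
```

The real router also weighs latency, rate limits, and per-deployment retry counts, but the shape is the same: try, cool down on failure, move on.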
```yaml
router_settings:
  routing_strategy: latency-based-routing
  # For latency-based routing, set expected latencies
  model_group_alias:
    gpt-4o:
      - model: azure/gpt-4o
        weight: 0.7  # 70% of traffic
      - model: openai/gpt-4o
        weight: 0.3  # 30% of traffic
```
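To see what a 70/30 weighted split means in practice, here is a small standalone sketch (plain Python, not LiteLLM's internal router) that picks a deployment by weight:

```python
import random

# Hypothetical deployment weights mirroring the config above
deployments = [
    ("azure/gpt-4o", 0.7),   # 70% of traffic
    ("openai/gpt-4o", 0.3),  # 30% of traffic
]

def pick_deployment(rng: random.Random) -> str:
    """Choose one deployment, biased by its traffic weight."""
    names = [name for name, _ in deployments]
    weights = [w for _, w in deployments]
    return rng.choices(names, weights=weights, k=1)[0]

# Over many requests the observed split approaches the configured ratio
rng = random.Random(42)
counts = {"azure/gpt-4o": 0, "openai/gpt-4o": 0}
for _ in range(10_000):
    counts[pick_deployment(rng)] += 1
azure_share = counts["azure/gpt-4o"] / 10_000
print(f"azure share: {azure_share:.2f}")
```

Any single request still goes to exactly one deployment; the weights only shape the long-run distribution.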
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"
)

# Streaming works exactly like direct OpenAI
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about resilience"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm-1:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4001:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  litellm-2:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4002:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "4000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - litellm-1
      - litellm-2
    restart: always
```
```nginx
# nginx.conf
events {
    worker_connections 1024;
}

http {
    upstream litellm {
        least_conn;
        server litellm-1:4000 weight=1;
        server litellm-2:4000 weight=1;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://litellm;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;
            proxy_buffering off;  # Important for streaming
        }

        location /health {
            proxy_pass http://litellm;
            proxy_connect_timeout 5s;
            proxy_read_timeout 5s;
        }
    }
}
```
```yaml
general_settings:
  master_key: sk-your-master-key

# User-level budgets
litellm_settings:
  max_budget: 100.00  # $100 max per user
  budget_duration: monthly
```
```bash
curl -X POST 'http://localhost:4000/user/new' \
  -H 'Authorization: Bearer sk-your-master-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'
```
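If you provision users programmatically, the same call looks like this from Python. The endpoint and fields are taken from the curl example; the helper names here are our own:

```python
import json
import urllib.request

def new_user_payload(user_id: str, max_budget: float) -> dict:
    """Build the /user/new request body (fields from the curl example)."""
    return {
        "user_id": user_id,
        "max_budget": max_budget,
        "budget_duration": "monthly",
        "models": ["gpt-4o", "gpt-3.5-turbo"],
    }

def create_user(base_url: str, master_key: str,
                user_id: str, max_budget: float) -> dict:
    """POST to the proxy's /user/new endpoint and return its JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/user/new",
        data=json.dumps(new_user_payload(user_id, max_budget)).encode(),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```

Usage would be `create_user("http://localhost:4000", "sk-your-master-key", "user-123", 50.00)`.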
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour cache
```
```yaml
litellm_settings:
  success_callback: ["langfuse", "prometheus"]  # Langfuse & Prometheus integrations
  failure_callback: ["langfuse", "slack"]
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY
```
```yaml
litellm_settings:
  guardrails:
    - guardrail_name: "content-filter"
      litellm_params:
        guardrail: openai_moderation
        mode: pre_call  # Check before sending to LLM
```

- Multi-provider failover (Azure OpenAI -> OpenAI -> Anthropic -> Gemini)
- A simpler deployment without managing APIM policies
- Provider-agnostic architecture that works anywhere
- Open-source flexibility with no vendor lock-in

- Unified API: One OpenAI-compatible endpoint for 100+ LLM providers
- Built-in Load Balancing: Distribute requests across multiple deployments
- Automatic Failover: Seamlessly retry on different models/providers when one fails
- Rate Limit Handling: Intelligent retry with exponential backoff for 429 errors
- Cost Tracking: Monitor spend across all providers in one place
- Streaming Support: Full SSE (Server-Sent Events) support with proper failover

- Reads the Retry-After header
- Marks that deployment as "cooling down"
- Routes the request to the next available deployment
- Continues until a successful response or all deployments are exhausted

- Detect the connection failure
- Automatically retry on the next provider
- Return an error only if all providers fail

- LiteLLM showed slightly better latency due to its simpler request pipeline
- Both achieved similar reliability with proper configuration
- LiteLLM's multi-provider fallback provided an extra safety net
- The cost difference is significant for smaller teams

- You're all-in on Azure and need native integration
- Enterprise compliance requirements mandate Azure services
- You need WAF/DDoS protection at the edge
- Your organization has existing APIM expertise
- Audit logging must stay within the Azure ecosystem

- You need multi-provider failover (not just multi-region)
- Cost optimization is a priority
- You want provider flexibility to switch easily
- Your team prefers simple YAML configuration over XML policies
- You're running on Kubernetes, AWS, GCP, or on-prem
- You need rapid prototyping and iteration

- [ ] Deploy Multiple Instances: At least 2 behind a load balancer
- [ ] Enable Health Checks: Configure /health endpoint monitoring
- [ ] Set Up Database: PostgreSQL for persistence and analytics
- [ ] Configure Caching: Redis for semantic caching
- [ ] Add Monitoring: Prometheus + Grafana or Langfuse
- [ ] Set Budget Limits: Prevent runaway costs
- [ ] Secure the Proxy: Use master key authentication
- [ ] Enable TLS: HTTPS in production (via nginx or cloud LB)
- [ ] Configure Alerts: Slack/PagerDuty for failures
- [ ] Test Failover: Deliberately fail providers to verify behavior

- Location: Wylie, TX
- Work: Managing Director at Colaberry, focused on AI training and enterprise deployment
- Joined: Mar 22, 2026