# Redis Sentinel + Celery Failover: What Actually Happens in Production

Most tutorials on Redis Sentinel stop at “it elects a new master”. Very few show what happens to a real system under failover pressure. I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring. Here’s what actually happened.

## Table of Contents

- Architecture Overview
  - Stack Components
- Sentinel Integration (Django + Celery)
- Observability with Prometheus
- Failover Drill Walkthrough
  - Initial State
  - Induced Failure
  - Sentinel Election
- Celery Behavior During Failover
  - Timeline
  - Observed Task
- Performance Impact
- Production Readiness Assessment
  - What Works
  - What Needs Attention
- When This Architecture Is Production-Ready
- When This Is Not Enough
- How to Reduce Failover Latency
- Key Takeaway
- Final Thoughts

## Architecture Overview
```mermaid
flowchart LR
    Client --> Django
    Django -->|Cache| Sentinel
    Django -->|Tasks| Celery
    Celery -->|Broker| Sentinel
    Celery -->|Result Backend| Sentinel
    Sentinel --> RedisMaster
    Sentinel --> RedisReplica1
    Sentinel --> RedisReplica2
    Prometheus --> RedisExporter
    RedisExporter --> Sentinel
```

### Stack Components

- Django → Redis cache via Sentinel
- Celery → Broker + result backend via Sentinel
- Redis Sentinel → High availability + failover
- Prometheus + redis_exporter → Monitoring

## Sentinel Integration (Django + Celery)

All services were switched to Sentinel using environment configuration:
```
REDIS_ADDR=redis://host.docker.internal:26379
```
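That one address can feed both Django and Celery. As a minimal sketch (the helper function and the `mymaster` service name are my assumptions, not from the original setup), Celery's documented `sentinel://` broker scheme plus the `master_name` transport option could be derived like this:

```python
import os
from urllib.parse import urlparse

# Placeholder Sentinel service name -- must match what your sentinel.conf monitors.
SENTINEL_SERVICE = "mymaster"

def celery_sentinel_settings(redis_addr: str, service: str = SENTINEL_SERVICE) -> dict:
    """Translate a redis://host:26379 address into Celery Sentinel settings."""
    parsed = urlparse(redis_addr)
    broker_url = f"sentinel://{parsed.hostname}:{parsed.port}"
    return {
        # Celery asks Sentinel for the current master of this service name.
        "broker_url": broker_url,
        "broker_transport_options": {"master_name": service},
        "result_backend": broker_url,
        "result_backend_transport_options": {"master_name": service},
    }

settings = celery_sentinel_settings(
    os.environ.get("REDIS_ADDR", "redis://host.docker.internal:26379")
)
# Applied with: app.conf.update(settings)
```

Pointing both the broker and the result backend at Sentinel matters: the drill below exercises both paths.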
- Django cache → successful round-trip
- Celery broker → connected via Sentinel
- Celery result backend → SentinelBackend initialized
- Test suite passed:

```
pytest tests/test_settings_redis_sentinel.py
```

At this stage, the system is fully Sentinel-aware.

## Observability with Prometheus

After pointing redis_exporter to Sentinel:
```
redis_instance_info{redis_mode="sentinel", tcp_port="26379"}
```

This confirms monitoring is tracking cluster state, not a single node. Sentinel-specific metrics become available:

- redis_sentinel_master_status
- redis_sentinel_master_ok_sentinels
- redis_sentinel_master_ok_slaves
- redis_sentinel_masters
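These metrics are plain Prometheus gauges, so quorum health can also be checked in a script against the exporter's text output. A sketch (the sample payload and the quorum of 2 are assumptions, not drill output):

```python
# Check Sentinel quorum from redis_exporter's text exposition format.
def sentinel_quorum_ok(metrics_text: str, quorum: int = 2) -> bool:
    """Return True if every monitored master has >= quorum healthy sentinels."""
    for line in metrics_text.splitlines():
        if line.startswith("redis_sentinel_master_ok_sentinels"):
            # e.g. redis_sentinel_master_ok_sentinels{master_name="mymaster"} 3
            value = float(line.rsplit(" ", 1)[1])
            if value < quorum:
                return False
    return True

sample = 'redis_sentinel_master_ok_sentinels{master_name="mymaster"} 3'
print(sentinel_quorum_ok(sample))  # True
```

The same pattern works for `redis_sentinel_master_status`, which is what flips during the drill below.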
## Failover Drill Walkthrough

### Initial State

```mermaid
flowchart LR
    Sentinel -->|Master| Redis1["172.20.0.3:6379"]
    Sentinel --> Redis2["Replica"]
    Sentinel --> Redis3["Replica"]
```
```
master_address="172.20.0.3:6379"
```

### Induced Failure

- Current master was stopped manually

### Sentinel Election

- New master elected on first poll
- Prometheus updated on next scrape

Failover was immediate and correct.
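The `master_address` values come straight from Sentinel. As a stdlib-only sketch of the wire exchange behind `SENTINEL get-master-addr-by-name` (the `mymaster` service name is an assumption), the command and the reply parsing look like this:

```python
# Build the RESP command Sentinel understands and parse its two-element reply.
def encode_resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

def parse_master_addr(reply: bytes) -> str:
    """Parse Sentinel's (host, port) array reply into host:port."""
    parts = reply.split(b"\r\n")
    # parts: [b'*2', b'$10', b'172.20.0.3', b'$4', b'6379', b'']
    return f"{parts[2].decode()}:{parts[4].decode()}"

cmd = encode_resp_command("SENTINEL", "get-master-addr-by-name", "mymaster")
# Sent over TCP to port 26379; a reply matching the drill's initial state:
reply = b"*2\r\n$10\r\n172.20.0.3\r\n$4\r\n6379\r\n"
print(parse_master_addr(reply))  # 172.20.0.3:6379
```

After the election, the same query returns the new master's address.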
```mermaid
flowchart LR
    Sentinel -->|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --> Redis3["Replica"]
    Sentinel --> Redis1["Down"]
```

## Celery Behavior During Failover

### Timeline
```mermaid
sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis
    App->>Celery: Submit Task
    Celery->>Redis: Send to Master
    Redis-->>Celery: Connection Lost
    Sentinel->>Sentinel: Elect New Master
    Celery->>Sentinel: Retry Connection
    Note over Celery: ~54.7s delay
    Celery->>Redis: Reconnect to New Master
    Redis-->>Celery: OK
    Celery-->>App: Task SUCCESS
```
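The ~54.7 s figure below can be captured with a simple poll-until-success loop; the post doesn't show its actual instrumentation, so this is a sketch where the state callback is abstract (with Celery it would be `AsyncResult(task_id).state`):

```python
import time

def time_until_success(get_state, poll_interval: float = 0.5, timeout: float = 120.0) -> float:
    """Poll get_state() until it returns 'SUCCESS'; return elapsed seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if get_state() == "SUCCESS":
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError("task never reached SUCCESS")

# Simulated task that stays PENDING for a couple of polls, as during failover:
states = iter(["PENDING", "PENDING", "SUCCESS"])
delay = time_until_success(lambda: next(states), poll_interval=0.01)
print(f"recovered after {delay:.2f}s")
```

Running this against a real task submitted just before killing the master is what turns "failover worked" into a number.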
### Observed Task

- Task ID: 9b57ba3b-a707-4c13-9255-d74de411b64b
- Status during failover: PENDING
- Delay: ~54.7 seconds
- Final state: SUCCESS

## Performance Impact

- Sentinel recovery: instant
- Application recovery: ~55 seconds

## Production Readiness Assessment

### What Works

- Redis Sentinel failover is reliable
- Prometheus reflects cluster changes correctly
- Django cache survives failover
- No task loss in Celery

### What Needs Attention

- Celery introduces significant delay during failover
- Reconnection is not instantaneous

## When This Architecture Is Production-Ready

- Tasks are asynchronous/background
- Eventual completion is acceptable
- Temporary latency spikes are tolerable

## When This Is Not Enough

Avoid this setup (as-is) if you need:

- Real-time task execution
- Sub-10s failover recovery
- User-facing async operations

## How to Reduce Failover Latency

To push recovery closer to 10–15 seconds:

- Tune Celery broker retry settings
- Reduce reconnect backoff intervals
- Optimize worker heartbeat and visibility timeout
- Re-run failover drills with timing instrumentation
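The first three bullets map onto documented Celery settings. As a starting point (values are illustrative, `mymaster` is an assumed service name, and nothing here is a tuned recommendation):

```python
# Illustrative Celery knobs for faster recovery after a Sentinel failover.
failover_tuning = {
    "broker_connection_retry": True,
    "broker_connection_max_retries": None,  # None = keep retrying until a master is back
    "broker_connection_timeout": 2.0,       # seconds; fail fast on the dead master
    "broker_transport_options": {
        "master_name": "mymaster",          # assumed Sentinel service name
        # Unacked tasks are redelivered after this many seconds if a worker dies.
        "visibility_timeout": 300,
    },
}
# Applied with: app.conf.update(failover_tuning)
```

Each change should be validated by re-running the drill and re-measuring the PENDING window, not assumed.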
## Key Takeaway

Redis Sentinel ensures infrastructure recovery. Celery determines how fast your system actually resumes work. That gap is the real engineering challenge.

## Final Thoughts

If you’re using Redis Sentinel with Celery, don’t just ask whether failover worked. Ask: “How long until my system behaves normally again?” Because that’s what production users experience.