The Mystery of the Redis Read-Only Error in a Single-Node Setup

If you manage a realtime application, you know that Redis is often the beating heart of your infrastructure. Recently, our production application, which relies heavily on Redis for both backend caching and realtime collaboration (via Hocuspocus/Yjs), experienced a bizarre and catastrophic outage. Every few months, out of nowhere, Redis would randomly crash our system. The logs were flooded with a single, confusing error:

```
READONLY You can't write against a read only replica
```

The symptoms were severe: writes failed entirely, reads stopped working, and the entire realtime system came to a grinding halt. Restarting the Docker container fixed the issue immediately, but without a root cause, it was only a matter of time before it happened again. Here is a step-by-step breakdown of how I investigated, debugged, and ultimately solved this elusive Redis bug.

Step 1: Evaluating the Infrastructure

Before diving into logs, I needed to confirm exactly what our architecture looked like:

- Hosting: A single Google Cloud Platform (GCP) VM (t2d-standard-1, Debian 12, 1 vCPU, 4 GB RAM).
- Deployment: Redis running inside a Docker container.
- Topology: A single Redis node. No Redis Cluster. No Sentinel. No intentional replicas.

This is where the mystery deepened. If there was only one Redis node, how could it possibly think it was a "read-only replica"?

Step 2: Checking the Current Redis State

My first move was to check the current role of the Redis instance. I connected to the server and ran:

```
redis-cli INFO replication
```

The output was telling:

```
role:master
connected_slaves:0
master_failover_state:no-failover
```

Redis was clearly functioning as a master with no connected replicas. Whatever had caused the READONLY error wasn't a permanent state change.

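Since the bad state heals itself on a restart, a snapshot taken after the fact proves very little. A watcher like the following can log the reported role continuously so a transient demotion gets caught in the act. This is a minimal sketch in TypeScript; it assumes the ioredis client (the post doesn't actually name its Node.js client library), and the host/port are placeholders:

```ts
import Redis from "ioredis";

const redis = new Redis({ host: "127.0.0.1", port: 6379 }); // placeholder host/port

// Poll the replication section every 10 seconds and flag any role change,
// so even a demotion that heals itself leaves a trace in the logs.
setInterval(async () => {
  const info = await redis.info("replication");
  const role = /^role:(\w+)/m.exec(info)?.[1];
  if (role !== "master") {
    console.error(`[redis-watch] ${new Date().toISOString()} unexpected role: ${role}`);
  }
}, 10_000);
```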
Step 3: Ruling Out the Red Herrings

When debugging distributed systems, it's easy to go down the wrong rabbit hole. Here is what I evaluated and quickly ruled out:

- Redis Cluster & Sentinel Failovers: I wondered if an automated failover had demoted our primary node. However, since we weren't running Cluster or Sentinel mode, there was no orchestration tool present to trigger a failover or slot migration.
- Redlock / Distributed Lock Split-Brain: While distributed locks can cause chaos, they don't change a server's replication role.
- The "Read" Clue: If Redis had truly become a standard replica, reads should still have worked. The fact that both reads and writes failed suggested this wasn't simply a node operating as a healthy replica.

Step 4: Investigating Memory Pressure

Could the server be buckling under memory pressure? I checked the system and Redis memory stats:

```
redis-cli INFO memory
```

The results were eye-opening, but not in the way I expected:

- used_memory_human: 1.60M
- used_memory_rss_human: 15.85M
- total_system_memory_human: 3.83G

Our actual dataset was only about 672 KB! Redis was using a fraction of a percent of the VM's RAM, so this wasn't an Out-Of-Memory (OOM) crash.

However, I discovered a massive production risk in our configuration:

```
maxmemory:0
maxmemory_policy:noeviction
```

With no memory limit and noeviction set, if Redis ever did fill up, it would refuse all writes. While this wasn't the root cause of the current bug, it was a ticking time bomb that needed immediate fixing.

Step 5: Piecing Together the Root Cause

With OOM and Cluster failovers ruled out, the evidence pointed toward a few highly probable culprits for a single-node setup:

- Accidental REPLICAOF Execution: A rogue script, automation, or network blip might have sent a REPLICAOF host port command, temporarily turning the node into a replica.
- Stale Node.js Client Connections: Our Node.js backend and Hocuspocus websocket server maintain long-lived TCP connections. If the network dropped or the Docker container glitched, the client connection pool could have ended up in a stale state, misinterpreting the connection status.
- Docker/Network Instability: Temporary network partitions or disk IO blocks (during AOF/RDB saves) might have forced Redis into a protective mode that the application clients misinterpreted.

The temporary nature of the issue, combined with both reads and writes failing, strongly pointed to stale client connections compounded by a transient Docker or network interruption. Restarting the container severed those dead connections and forced a clean reconnect.

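If stale connections are the culprit, the client can also defend itself. Assuming ioredis (again, the post doesn't name the client), its reconnectOnError hook exists for exactly this scenario: treat a READONLY reply as proof that the connection is stale and force a reconnect, rather than restarting the container. A sketch:

```ts
import Redis from "ioredis";

const redis = new Redis({
  host: "127.0.0.1", // placeholder
  port: 6379,
  // A READONLY reply on a single-node setup means this connection is stale:
  // force a reconnect, and resend the command that failed (returning 2).
  reconnectOnError(err) {
    return err.message.includes("READONLY") ? 2 : false;
  },
  // Back off between reconnect attempts instead of hammering the server.
  retryStrategy(times) {
    return Math.min(times * 200, 5_000); // delay in milliseconds
  },
});

redis.on("error", (err) => {
  console.error("[redis] connection error:", err.message);
});
```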
Step 6: The Fix and Future-Proofing

To stabilize the system and ensure this doesn't happen again, I implemented a multi-layered fix.

1. Hardening the Memory Config

First, I patched the memory risk by adding proper limits to /etc/redis/redis.conf:

```
maxmemory 2gb
maxmemory-policy allkeys-lru
```

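To keep that fix from silently regressing (say, after a container rebuild with a default config), the application can assert the limit at boot. A hypothetical startup guard, again assuming ioredis; the CONFIG GET maxmemory check itself is standard Redis:

```ts
import Redis from "ioredis";

const redis = new Redis(); // defaults to 127.0.0.1:6379

// Fail fast at startup if the instance is running without a memory limit,
// so the maxmemory:0 + noeviction time bomb can't quietly come back.
export async function assertMaxmemoryConfigured(): Promise<void> {
  const [, maxmemory] = (await redis.config("GET", "maxmemory")) as string[];
  if (maxmemory === "0") {
    throw new Error("Redis is running without a maxmemory limit");
  }
}
```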
2. Disabling Dangerous Commands

To prevent any accidental role changes in our single-node setup, I locked down the replication commands in redis.conf:

```
rename-command REPLICAOF ""
rename-command SLAVEOF ""
```

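Renaming a command to an empty string removes it entirely, so a quick audit from the Node.js side should see the server reject the old name. A small hypothetical check (the function name is mine; REPLICAOF NO ONE is harmless on a master even if the rename didn't take):

```ts
import Redis from "ioredis";

const redis = new Redis();

// If the rename took effect, the server rejects the original command name
// with an "unknown command" error instead of touching replication roles.
export async function verifyReplicaofDisabled(): Promise<void> {
  try {
    await redis.call("REPLICAOF", "NO", "ONE");
    console.error("[audit] REPLICAOF is still callable!");
  } catch {
    console.log("[audit] REPLICAOF is disabled, as expected");
  }
}
```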
3. Creating a Debug Playbook

I established a strict rule: next time it fails, do not restart immediately. Instead, run these diagnostics to capture the exact failure state:

```
redis-cli INFO replication
redis-cli INFO stats
redis-cli CONFIG GET replica-read-only
docker logs <redis-container-name> --tail 200
```

4. Rethinking the Architecture

While a single Redis node is fine for basic caching, heavy realtime workloads (like Hocuspocus Pub/Sub) demand high availability. Our long-term fix isn't to overcomplicate things with Redis Cluster, but rather to migrate to a standard Primary + Replica + Sentinel setup. This will give us automatic failover and separate the realtime collaboration load from the standard cache.

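A nice property of that setup is that the Node.js side barely changes: ioredis can discover the current primary through Sentinel, so failovers become invisible to the application. A sketch with placeholder hostnames and the conventional master name:

```ts
import Redis from "ioredis";

// Connect through Sentinel instead of a fixed host: ioredis asks the
// sentinels who the current primary is and follows automatic failovers.
const redis = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 }, // placeholder hostnames
    { host: "sentinel-2", port: 26379 },
    { host: "sentinel-3", port: 26379 },
  ],
  name: "mymaster", // the master name registered in sentinel.conf
});
```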
Conclusion

Sometimes the most intimidating errors, like an impossible READONLY replica state on a single node, are symptoms of deeper infrastructural quirks rather than actual state changes.
By methodically checking the actual Redis state, analyzing memory limits, and ruling out red herrings, we not only diagnosed the immediate issue but also uncovered hidden risks, leaving our production environment far more resilient.
