5 Production Incidents Every DevOps Engineer Should Know How to Debug

In this post:

1. "No Space Left on Device"
2. Database Connection Pool Exhaustion
3. Kubernetes CrashLoopBackOff - The Missing Secret
4. The Node.js Memory Leak (WebSocket Edition)
5. Cache Stampede After Redis Restart
The Pattern Across All Five

Each incident covers the story, why it happens, how to debug it, the fix, and a key takeaway.

It's 2 AM. Your phone is screaming. The dashboard is red. Users are tweeting. You have been on call long enough to know that the gap between "I think I know what's wrong" and "I know exactly what's wrong" can cost your company thousands of dollars per minute.

The engineers who close that gap fast are not smarter than everyone else. They have just seen these patterns before. Here are five production incidents that every DevOps engineer will encounter at some point - what they look like, why they happen, and how to debug them.

1. "No Space Left on Device"

The story

A developer was chasing a gnarly bug in production. To get more visibility, they temporarily cranked the application log level to DEBUG. They fixed the bug, merged the PR, and completely forgot to revert the log level.

Three weeks later, at 3 AM on a Tuesday, your monitoring fires. Every service on that host is returning 500s. The database is refusing writes. Nothing makes sense until you SSH in and run the one command that tells you everything: df -h. Full. The disk is completely full. /var/log has grown to 64GB of verbose debug output nobody was watching.

Why it happens

Debug logging is chatty by design. It logs every function call, every query parameter, every header. In a high-traffic service, debug logs can generate gigabytes per hour. Combine that with a missing or misconfigured log rotation policy and you have a slow-motion disaster playing out in the background while everyone is focused on feature work.

The fix

Set the log level back to INFO or WARN. Fix your logrotate config to enforce retention limits. Then add a disk space alert at 80% - not 95%. By the time you hit 95%, you probably have minutes, not hours.

Key takeaway

Your logging infrastructure needs to be monitored too. The tool you use to diagnose outages can itself cause outages if you ignore it.

2. Database Connection Pool Exhaustion

The story

Traffic is normal. CPU is normal. The database server is completely idle. But your application is throwing "Cannot acquire connection from pool" timeout errors, and users are getting 503s.
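Before digging into why, it helps to see the mechanics in miniature. Here is a toy sketch in Python - a `queue.Queue` stands in for a real driver's connection pool, and the names and sizes are purely illustrative:

```python
import queue

# Toy pool: a bounded queue of connection IDs. Real pools live in your
# database driver; this just demonstrates the drain.
POOL_SIZE = 3
pool = queue.Queue(maxsize=POOL_SIZE)
for conn_id in range(POOL_SIZE):
    pool.put(conn_id)

def handle_request(fail=False):
    conn = pool.get(timeout=0.1)  # raises queue.Empty once the pool is drained
    if fail:
        raise RuntimeError("query failed")  # bug: conn is never returned
    pool.put(conn)  # happy path returns the connection to the pool
    return "ok"

# A burst of errors leaks every connection...
for _ in range(POOL_SIZE):
    try:
        handle_request(fail=True)
    except RuntimeError:
        pass

# ...and now even a perfectly healthy request cannot get a connection,
# while the database itself sits idle.
try:
    handle_request()
except queue.Empty:
    print("Cannot acquire connection from pool")
```

Three failed requests are enough to empty a pool of three - the database never saw anything unusual.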
Why it happens

This one is maddening the first time you see it because every instinct tells you to look at the database. The database is fine. The problem is how your application is managing its connections to the database.

Most database drivers give you a connection pool - a fixed set of reusable connections shared across your application's threads or async workers. When a request needs to run a query, it borrows a connection from the pool. When it's done, it returns the connection. The failure mode that is easy to miss: what happens when a request throws an exception before it returns the connection?

Under normal traffic, the leak is slow enough that the pool replenishes. Under higher load, or when errors spike, the pool drains faster than it fills. Then everything queues up waiting for a connection that never comes back.

How to debug it

Look for connections stuck in the idle in transaction state. That is almost always a leak: the connection was borrowed, a transaction started, and it was never committed or rolled back.

The fix

Audit your error handling paths. Every connection acquire must have a matching release in a finally block or equivalent. Set an acquisition timeout on your pool (connectionTimeoutMillis in node-postgres) so callers waiting on a drained pool fail fast instead of hanging forever. Add an alert when the active connection count exceeds 80% of the pool size.

Key takeaway

Low database CPU during an "outage" is a red flag pointing to connection management, not query performance. Always check pg_stat_activity before assuming the database is healthy.

3. Kubernetes CrashLoopBackOff - The Missing Secret

The story

You deploy a new version of your application to Kubernetes. Instead of the pods coming up healthy, kubectl get pods shows them stuck in CrashLoopBackOff. The pod starts, crashes almost immediately, Kubernetes restarts it, it crashes again, and the backoff timer grows exponentially. Within 10 minutes the pod is waiting 5 minutes between restart attempts.

This specific variant is one of the more frustrating ones: the app crashes on startup because it cannot find a required configuration value.
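At least the app in this incident failed with a clear message, which makes the diagnosis fast - a practice worth copying. Validating configuration up front and exiting loudly is language-agnostic; a minimal Python sketch (the variable names are illustrative, though DATABASE_PASSWORD matches the error in this incident):

```python
import os
import sys

# Illustrative list of required configuration values.
REQUIRED_VARS = ["DATABASE_PASSWORD", "API_KEY"]

def missing_vars(environ=None):
    """Return the required variables that are absent or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]

def validate_or_exit():
    missing = missing_vars()
    if missing:
        # Fail fast with an explicit message - this is exactly the kind of
        # line you want to find in `kubectl logs` at 3 AM.
        print(f"Error: required environment variables not set: {', '.join(missing)}",
              file=sys.stderr)
        sys.exit(1)
```

With a check like this, the crash log names the missing variable instead of burying it in a stack trace.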
It is looking for a secret - maybe a database password, maybe an API key - via an environment variable mounted from a Kubernetes Secret. But the Secret does not exist in this namespace.

Why it happens

The deployment referenced a Secret that was never created in the target namespace. It exists in staging. It does not exist in production. The deployment YAML was copy-pasted and nobody noticed.

The fix

Create the missing Secret in the correct namespace. For the longer-term fix, use a tool like helm diff, kubectl diff, or a GitOps pipeline that validates that all referenced resources exist before allowing a deployment to proceed.

Key takeaway

CrashLoopBackOff means the pod keeps dying. kubectl logs --previous shows why it died. kubectl describe pod shows what Kubernetes tried and failed to do. Always check both.

4. The Node.js Memory Leak (WebSocket Edition)

The story

Your Node.js service is running fine after deployment. Memory usage is at 200MB, which is normal. Over the next 18 hours, you watch it creep up. 300MB. 400MB. 600MB. Then the process gets OOMKilled by the container runtime and restarts. The whole cycle starts again. You check your code for obvious leaks - giant arrays, global caches growing unbounded. Nothing jumps out. This one hides.

Why it happens

The classic Node.js memory leak pattern that catches even experienced engineers: adding event listeners inside a function that gets called repeatedly, without removing them. Every time a new WebSocket connection comes in, a new listener is added to process. When the connection closes, the listener is not removed. The listener holds a reference to the socket object. The socket object cannot be garbage collected. After thousands of connections, you have thousands of dead socket references sitting in memory.

Node.js will even warn you about this with a MaxListenersExceededWarning - but the warning often gets lost in log noise. That warning is not something to suppress. It is a canary telling you there is a leak.

How to debug it

Take heap snapshots over time and look for object types with counts growing over time.
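The "which type keeps growing" check works in any runtime. As a cross-language illustration, here is a Python sketch using the stdlib gc module - the LeakedSocket class is invented purely to simulate the accumulation:

```python
import gc
from collections import Counter

def objects_by_type(top=5):
    """Count live objects by type name - the same signal a heap snapshot gives you."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top)

class LeakedSocket:
    """Stand-in for the dead socket references described above."""
    def __init__(self):
        self.buffer = bytearray(1024)

# Simulate the leak: retain 10,000 "sockets" long after they should be gone.
retained = [LeakedSocket() for _ in range(10_000)]

top_types = objects_by_type(top=10)
# A type you recognize sitting near the top with an ever-growing count
# is your leak candidate - here, LeakedSocket.
```

Compare two of these snapshots taken minutes apart: built-in types fluctuate, but a leaked application type only ever climbs.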
In this case you would see Socket instances accumulating far beyond the number of active connections. The clinic.js heap profiler is also excellent for this.

The fix

Always clean up listeners when the associated resource goes away: register the cleanup handler when the connection opens, and remove it when the connection closes.

Key takeaway

Memory leaks in Node.js are almost always about retaining references longer than necessary. Event listeners are the most common culprit. Take MaxListenersExceededWarning seriously - it is not noise.

5. Cache Stampede After Redis Restart

The story

Your Redis cache went down for planned maintenance. You brought it back up. Simple, right? Sixty seconds later your database server is on fire. CPU is pegged at 100%. Query latency went from 5ms to 8 seconds. The database is drowning.

What happened? Every single cache key expired at the same moment - because they all had the same TTL set from the last cache warming cycle - and every single application server tried to rebuild the cache simultaneously by hitting the database. This is a cache stampede, also called a thundering herd.

Why it happens

Consider what happens when your cache is empty after a restart and you have 50 application servers:

- A request comes in for /api/products
- All 50 servers check the cache - cache miss
- All 50 servers query the database for product data
- All 50 servers write the result back to cache
- 49 of those database queries were wasted
- Under high traffic, "50 servers" becomes "50,000 requests per second"

The database - which normally handles 200 queries per second because the cache absorbs the rest - suddenly receives 20,000 queries per second. It collapses.

How to debug it

The diagnosis is usually visible in the metrics:

- Cache hit rate: dropped from 95% to 0%
- Database connections: spiked from 50 to 800 in 30 seconds
- Database CPU: 100%
- API P99 latency: 50ms -> 12,000ms

Correlate the timeline with the Redis restart event. If the metrics cliff happened right when Redis came back up, you have your answer.

The fix

Several strategies exist, and production systems often use multiple in combination:

- Cache locking (mutex pattern): only one process populates a cache key. Others wait.
- TTL jitter: add random variance to cache expiration times so keys do not all expire simultaneously.
- Probabilistic early expiration: proactively refresh cache entries before they expire, based on how expensive the recomputation is.

Key takeaway

A cache restart is not a safe non-event. Treat cache warming as part of your maintenance procedure. Use TTL jitter by default - it costs nothing and prevents a whole class of stampede failures.
The Pattern Across All Five

Look at what these incidents have in common:

- The symptoms lied. Disk full causing database errors. Connection pool causing "database" problems. A cache issue causing what looks like a database overload.
- The actual cause was one layer removed from where the pain was visible.
- All five are preventable with the right monitoring thresholds, code patterns, and configuration choices.
- All five are faster to debug if you have seen them before.

That last point is the crux of it. Incident response speed is largely pattern recognition. The engineer who has seen a connection pool exhaustion before spots the idle database CPU and goes straight to pg_stat_activity. The one who has not seen it spends an hour tuning query indexes that are not the problem.

Practice Before the Pager Goes Off

If you want to build that pattern recognition without waiting for production to teach you the hard way, I built youbrokeprod.com - a free browser game where you investigate production outages step by step. Each scenario drops you into a live incident: you run commands, read logs, check metrics, and work toward a diagnosis. No signup required to try it.

The game currently has 10 scenarios across beginner, intermediate, and advanced difficulty - including all five incidents described in this post. The goal is simple: make the muscle memory of incident debugging feel familiar before it is your on-call rotation on the line.

What production incidents have scarred you the most? Drop them in the comments - there are 44 scenarios in the backlog and the most painful real-world ones make the best levels.

Appendix: Commands and Code

The commands, logs, and snippets referenced in the incidents above, grouped by incident.

Incident 1 - "No Space Left on Device"

The df -h output that tells the whole story:

```
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        80G   80G     0 100% /
```

The debugging sequence:

```
# Step 1: Confirm the problem
df -h

# Step 2: Find the culprit
du -sh /var/log/*
du -sh /var/log/nginx/*

# Step 3: Immediate relief - clear old compressed logs
find /var/log -name "*.gz" -mtime +7 -delete

# Step 4: Truncate (don't delete) the active log file
truncate -s 0 /var/log/myapp/app.log

# Step 5: Check logrotate config
cat /etc/logrotate.d/myapp
```

Incident 2 - Database Connection Pool Exhaustion

The application errors:

```
Error: Cannot acquire connection from pool
TimeoutError: timeout of 5000ms exceeded
```

The leak and the fix:

```javascript
// This is a leak
async function getUser(id) {
  const conn = await pool.acquire();
  const result = await db.query('SELECT * FROM users WHERE id = $1', [id]);
  // If the query throws, conn is never released
  conn.release();
  return result;
}

// This is correct
async function getUser(id) {
  const conn = await pool.acquire();
  try {
    return await db.query('SELECT * FROM users WHERE id = $1', [id]);
  } finally {
    conn.release(); // Always runs, even on error
  }
}
```

Inspecting connections:

```sql
-- On PostgreSQL: see all active connections
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- See who is holding connections longest
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```

Incident 3 - Kubernetes CrashLoopBackOff

```
$ kubectl get pods
NAME                 READY   STATUS             RESTARTS   AGE
myapp-7d9f8b-xkj2p   0/1     CrashLoopBackOff   4          3m

$ kubectl logs myapp-7d9f8b-xkj2p
Error: Required environment variable DATABASE_PASSWORD is not set
Process exited with code 1

$ kubectl describe pod myapp-7d9f8b-xkj2p
Events:
  Warning  Failed  2m  kubelet  Error: secret "myapp-credentials" not found
```

The debugging sequence:

```
# Step 1: Get the actual error
kubectl logs <pod-name>
kubectl logs <pod-name> --previous   # Logs from the crashed instance

# Step 2: Describe the pod for Kubernetes-level events
kubectl describe pod <pod-name>

# Step 3: Check if the secret exists
kubectl get secrets -n <namespace>

# Step 4: Verify the secret has the expected keys
kubectl describe secret myapp-credentials
```

Incident 4 - The Node.js Memory Leak

The leak:

```javascript
// This leaks memory on every new WebSocket connection
function setupWebSocket(socket) {
  // This listener is added fresh on every call, and its closure
  // keeps the socket alive even after the connection closes
  process.on('SIGTERM', () => {
    socket.close();
  });

  socket.on('message', handleMessage);
}
```

The warning Node.js emits:

```
MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
11 SIGTERM listeners added to [process]. Use emitter.setMaxListeners() to increase limit
```

Taking a heap snapshot:

```
# Start Node.js so a signal triggers a heap snapshot
node --heapsnapshot-signal=SIGUSR2 app.js
kill -USR2 <pid>

# Or via the Node.js inspector
node --inspect app.js
# Then open chrome://inspect and take a heap snapshot

# Or profile the heap with clinic.js
npx clinic heapprofiler -- node app.js
```

The fix:

```javascript
function setupWebSocket(socket) {
  const cleanup = () => socket.close();
  process.on('SIGTERM', cleanup);

  socket.on('close', () => {
    // Remove the listener when the connection closes
    process.removeListener('SIGTERM', cleanup);
  });
}
```

Incident 5 - Cache Stampede

Cache locking (mutex pattern):

```python
import redis
import time

def get_with_lock(key, fetch_fn, ttl=300):
    r = redis.Redis()
    value = r.get(key)
    if value:
        return value

    # Try to acquire a lock
    lock_key = f"lock:{key}"
    if r.set(lock_key, "1", nx=True, ex=10):
        # We got the lock - populate the cache
        try:
            value = fetch_fn()
            r.setex(key, ttl, value)
            return value
        finally:
            r.delete(lock_key)
    else:
        # Someone else has the lock - wait briefly and retry
        time.sleep(0.1)
        return r.get(key)
```

TTL jitter:

```python
import random

base_ttl = 300
jitter = random.randint(-30, 30)
r.setex(key, base_ttl + jitter, value)
```
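As a closing sketch, the 80% disk alert recommended in incident 1 can be expressed with Python's stdlib. In production you would use node_exporter or your monitoring agent instead; the threshold and mount point here are illustrative:

```python
import shutil

def disk_usage_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path="/", warn_at=80.0):
    """Return an alert string at the given threshold, or None if healthy."""
    pct = disk_usage_percent(path)
    if pct >= warn_at:
        return f"ALERT: {path} is {pct:.0f}% full"
    return None
```

Run it from cron or a health endpoint; the point is that the alert fires at 80%, while there are still hours of runway, not at 95%.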