Tools: I ran incident response on my own homelab. Here's the postmortem. - 2025 Update

The incident

The postmortem

The fixes

Why homelab IR matters

I run a 3-node Proxmox cluster at home with 11 LXC containers. Last week one of them turned into an incident. Not a dramatic one. No data loss. No outage that affected anyone else. But it hit the same failure modes I see documented in enterprise postmortems — and handling it the same way taught me more than any homelab YouTube video has. Here's what happened and what I changed.

The incident

00:47 — My homelab control panel stops responding. The web UI that ties together monitoring, service status, and agent health is down.

00:47–01:09 — PM2 restarts the service. Then restarts it again. 32 times total, with exponential backoff, over about 22 minutes.

01:09 — Prometheus alert fires. Wazuh catches the anomaly in PM2 process metrics. I get paged.

01:11 — I SSH in. pm2 logs sjvik-control-panel shows the immediate cause: Cannot find module tsx. The package is gone from node_modules.

01:13 — npm install && pm2 restart sjvik-control-panel. Service is back.

Total downtime from first failure to recovery: 26 minutes. Time from alert to recovery: 4 minutes.

The postmortem

If you've done incident response at work, this format will look familiar. I use it at home too — not because it's bureaucratic overhead, but because it forces honest thinking.

What failed: tsx missing from node_modules on LXC 101 (nx-web-01). Root cause unknown — most likely a stale node_modules state after a system update or a partial npm ci run that didn't complete cleanly.

What detected it: External monitoring (Prometheus + pm2-prometheus-exporter tracking restart counts per process). NOT the service itself.

What slowed detection: The 22-minute gap between first failure and alert. PM2's default backoff delays mean a fast-dying service doesn't trip alerts immediately. I had no alert threshold on restart rate — only on sustained high restart counts.

What enabled fast recovery: The service is stateless. No data to recover. npm install is idempotent. The recovery procedure existed in muscle memory.
What didn't exist: A startup health check that validates dependencies before PM2 marks the service alive. The service was failing fast, not failing safe.

The fixes

1. Prestart dependency validation

Added to package.json:

```json
"prestart": "npm ls --depth=0 --silent || (echo 'Dependency validation failed — run npm install' && exit 1)"
```

Now on restart, PM2 gets a clean exit with an actionable message. Restart 1 tells me what's wrong. Not restart 32.

2. Alert on restart rate, not just count

Updated the Prometheus alert rule:

```yaml
- alert: PM2ServiceRestartRateSpiking
  expr: rate(pm2_restarts_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.name }} restarting frequently"
```

This fires within 2 minutes of a sustained restart loop instead of waiting for a count threshold.

3. A one-page runbook

I added a one-page runbook to my Obsidian vault:

```text
Service: sjvik-control-panel
Recovery: cd /root/projects/sjvik-control-panel && npm install && pm2 restart sjvik-control-panel
Verify: curl -s http://localhost:3456/health | jq .status
Escalate if: health endpoint returns non-200 after npm install
```

Runbooks feel like overkill for a homelab until 2am when you're tired and need the answer fast.

The enterprise version of this incident would involve a runbook, a Slack war room, a timeline in PagerDuty, and a written postmortem shared with the team. The homelab version is you fixing something alone at 1am with nobody watching. But the thinking is the same. I have 11 containers. At least a few of them will have incidents. Treating each one as a real IR exercise is how I actually get better at this — not just at homelab ops, but at the SOC analyst work I do professionally. The next time I'm in an enterprise war room following an incident response process, I'll have done it 15 times already on my own infrastructure.

Takeaways

- Detection time matters. Alert on behavior, not just state.
- Recovery time improves with runbooks. Write them while you still remember what you did.
- Postmortems find gaps that didn't feel like gaps until something broke.

Running a homelab security stack? My Proxmox Homelab guide covers the full setup — Proxmox cluster, PBS backups, Wazuh SIEM, and the PM2 service patterns that kept this recoverable in under 5 minutes.