Post-Mortem: Why My Ubuntu Docker Homelab Failed (And Why I Killed It)

For a year, I ran a monolithic microservices host on a single Ubuntu 24.04 LTS virtual machine. The goal was simple: centralize my data and route around my ISP's Carrier-Grade NAT (CGNAT). It started as a proof of concept: 4 vCPUs, 4 GB of RAM, and the quiet confidence of someone who has never been paged at 3 a.m. It ended up running 10+ containers via Docker Compose: Nextcloud, a full media stack, Prometheus, Netdata, and Grafana. (More on Grafana later. Spoiler: it did not survive.)

It worked. And then, slowly, it started to break. Here is the post-mortem of Ghar Labs v1: the bottlenecks I hit, the failures I missed, and why I ultimately put the server down.

The Architecture & The CGNAT Problem

My ISP uses CGNAT, which means port forwarding to the public internet is not possible; my server shares a public IP with potentially hundreds of other subscribers. No A record is going to help you there. To route around this, I engineered a split-tunneling setup:

- Public routing: Cloudflare Tunnels handled inbound HTTP traffic (the Nextcloud web interface, dashboards) without ever exposing my origin IP. No open ports required.
- Private routing: Tailscale handled everything that didn't need to be public: SMB shares, SSH, internal dashboards.

All services were containerized. To prevent I/O errors and duplicate media files, the downloader and media player were pointed at the exact same physical path using strict PUID/PGID permissions:

```
/mnt/data/
├── media/        # The unified directory
│   ├── movies/
│   └── shows/
└── nextcloud/    # Sovereign cloud data
```

Clean architecture on paper. Production, as usual, had other plans.

Failure Point 1: The Zombie Process Leak

Over weeks of uptime, RAM usage would creep upward with no corresponding spike in CPU. The server wasn't doing more work; it just wasn't cleaning up after itself. Logging into the terminal eventually surfaced a warning: 2 zombie processes. A subsequent htop audit confirmed the diagnosis: Docker containers were not reaping child processes correctly.

When you run an application inside a container without an init system (dumb-init or tini), PID 1 inside the container doesn't know how to adopt orphaned processes. They linger in the process table, unreaped, until the host reboots.

The fix is straightforward: add --init to your docker run call, or in Compose:

```yaml
services:
  your-app:
    init: true
```

I learned this after the fact. The server did not.

Failure Point 2: Silent OOM Kills

Core services like Nextcloud held up reasonably well. Heavier JVM and Go-based monitoring tools did not; they were fighting over the same 4 GB ceiling. During my final audit before decommissioning, a routine docker ps -a revealed what I had missed for months:

```
CONTAINER ID   IMAGE     COMMAND   STATUS
a3f1b2c9d4e5   grafana   ...       Exited (255) 87 days ago
```

Grafana had silently crashed (exit code 255, OOM killed) and never came back. Docker's restart policy tried, the kernel said no, and the container just quietly stopped existing. No alert. No notification. The dashboard I thought was watching my stack had itself gone dark.

The lesson: docker ps -a is not optional. Automate the check, or instrument a watchdog. A monitoring tool that nobody monitors is just a pretty corpse.

Failure Point 3: The Single Point of Failure

The zombie leak and the OOM kills were annoying. This one was existential. The entire lab lived on one virtual disk (.vdi). One volume, no redundancy:

- ❌ No ZFS bit-rot protection
- ❌ No RAID parity
- ❌ No snapshots
- ❌ No off-host backups of the database

A single bad sector, or a host crash mid-write, could corrupt the Nextcloud database and take years of personal data with it. I had built a fairly sophisticated networking layer on top of a foundation that was, architecturally, one fsck error away from disaster. This is the part where I stopped calling it a "proof of concept" and started calling it "a liability."

The Resolution

Ghar Labs v1 was a successful learning environment. In twelve months, it taught me:

- Docker networking and Compose service dependencies
- Reverse proxying through CGNAT without opening a single port
- Linux process management (the hard way)
- Why storage architecture is not an afterthought

But a single-node VM with no storage redundancy, no init system, and a 4 GB RAM ceiling is not where you keep data you care about. I decommissioned the Ubuntu host, wiped the drives, and migrated the entire stack to a dedicated bare-metal machine running TrueNAS Scale with proper ZFS redundancy. The services are the same. The foundation is not.

Sometimes the best thing you can do with a legacy server is document what it taught you, shut it down gracefully, and build the next one right.

The full configuration archive (Compose files, Cloudflare tunnel configs, Tailscale ACLs) is preserved here for reference: → devpratyushh/homelab-v1-archive

It's retired. But it earned its README.
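A postscript on the zombie mechanics, reproducible without Docker. This Linux-only Python sketch (it reads /proc, so it won't run elsewhere) forks a child, lets it exit unreaped, and observes the same defunct 'Z' state htop was showing me; the final waitpid is exactly the chore that --init delegates to tini inside a container:

```python
import os
import time

def proc_state(pid: int) -> str:
    # Field 3 of /proc/<pid>/stat is the state letter; 'Z' means zombie.
    # The comm field is wrapped in parentheses, hence the rsplit on ")".
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

pid = os.fork()
if pid == 0:
    os._exit(0)            # child: exit immediately, leaving an unreaped entry

time.sleep(0.2)            # give the kernel a moment to mark the child defunct
before = proc_state(pid)   # 'Z': dead, but still occupying a process-table slot
os.waitpid(pid, 0)         # reaping -- the job an init system does for PID 1
print(before)              # Z (on Linux)
```

Until that waitpid runs, the child is gone but its process-table entry is not, which is precisely what accumulates in a container whose PID 1 never waits on orphans.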
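With hindsight, every service in the stack would get all three guardrails in one Compose block. This is a sketch, not my original config: the service and the limit values are illustrative, and the healthcheck assumes the image ships wget (Grafana does expose /api/health):

```yaml
services:
  grafana:
    image: grafana/grafana
    init: true               # tini as PID 1 reaps orphans (Failure Point 1)
    restart: unless-stopped
    mem_limit: 512m          # fail this one container instead of inviting the host OOM killer
    healthcheck:             # a dead dashboard should be loud, not invisible for 87 days
      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health || exit 1"]
      interval: 1m
      retries: 3
```

With a healthcheck defined, docker ps shows (healthy)/(unhealthy) in the STATUS column, so even a manual glance catches the failure, and an autoheal-style companion container can act on the unhealthy state automatically.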
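And a postscript on "automate the check": the watchdog can be a cron-able script of a dozen lines. A sketch (the notifier is deliberately left out; wire in ntfy, email, whatever actually pages you) that parses the machine-readable output of docker ps -a --format '{{.Names}}\t{{.Status}}' and flags anything Exited:

```python
def dead_containers(ps_output: str) -> list[str]:
    # Parse lines of "name<TAB>status", the shape produced by:
    #   docker ps -a --format '{{.Names}}\t{{.Status}}'
    dead = []
    for line in ps_output.strip().splitlines():
        if "\t" not in line:
            continue
        name, status = line.split("\t", 1)
        if status.startswith("Exited"):
            dead.append(name)
    return dead

# Example against the kind of output that bit me:
sample = "grafana\tExited (255) 87 days ago\nnextcloud\tUp 3 days"
print(dead_containers(sample))   # ['grafana']
```

Feed it the real thing via subprocess.run(["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"], capture_output=True, text=True).stdout, alert on a non-empty result, and run it from cron every few minutes; that alone would have surfaced the Grafana corpse 87 days sooner.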