Tools: Why I Run 22 Docker Services at Home - Complete Guide



Somewhere in my living room, a 2018 gaming PC is running 22 Docker containers, processing 15,000 emails through a local LLM, and managing the finances of a real business. It was never supposed to do any of this.

I run a one-person software consultancy in the Netherlands: web development, 3D printing, and consulting. Last year, I started building an AI system to help me manage it all. Eight specialized agents handle email triage, financial tracking, infrastructure monitoring, and scheduling. Every piece of inference runs locally; no cloud APIs touch my private data. This post covers the hardware, what it actually costs, and what I'd do differently if I started over.

The Setup: Three Machines, One Mesh Network

The entire system runs on three machines connected via Tailscale mesh VPN.

docker-host: A PC I assembled from leftover parts. Over the years, as I upgraded my main gaming machine, the old CPUs, RAM sticks, and motherboards piled up. Eventually, I had enough to build a second computer. It now runs 22+ Docker containers 24/7.

inference: A Mac mini M4, bought specifically for local LLM inference.

edge-vps: A Hostinger VPS, ~€5/month. It runs Nginx Proxy Manager and Uptime Kuma, and exists for one reason: if my home network dies, this is the canary that tells me about it. You can't monitor your own availability from inside the thing that's failing.

Why Local-First: It Started With the Subscriptions

Before I built any of this, I was paying for Claude Pro, GPT Pro, Perplexity Pro, and Google AI. Four separate subscriptions. Each gave me partial access to models through its own interface, each with its own limitations on what I could integrate, and each getting a copy of whatever I fed into it.

My system handles emails, bank transactions, client contracts, delivery tracking, and tax preparation: the complete operational picture of my business, in one database. That's the kind of data I don't want leaving my network. It's not that I think cloud providers are malicious.
It's that I don't want to be in a position where I have to trust their data handling with everything my business runs on. So the guardrails are explicit: every piece of production inference runs on Ollama on the Mac mini, and zero tokens leave the house for private data processing.

The Cost Math

This is the part that convinced me the approach was sustainable. The subscriptions are gone, but electricity is real. The Ryzen 5 2600X idles around 65W, the Mac mini M4 around 5-7W (rising to ~30W during inference). Call it 100-150W average for the whole setup once load is included. At Dutch electricity prices (~€0.25/kWh), that's roughly €20-30/month. Real total: ~€35/month versus a minimum of €300/month in cloud services. And I went from four AI subscriptions down to one.

What's Actually Running: 22 Containers

The VPS runs separately with just two containers: Nginx Proxy Manager for webhook ingress and Uptime Kuma for external monitoring. Everything else is on docker-host.

The RAM Reality

Here's the part nobody shows you in tutorials: how 32GB actually gets divided. The n8n allocation deserves explanation. It's configured with NODE_OPTIONS=--max-old-space-size=16384, a 16GB ceiling. That sounds aggressive, but without it, Node.js defaults to a much lower heap limit. When a workflow processes a batch of large email bodies through an LLM and the responses come back as multi-kilobyte JSON objects, memory spikes fast. If the heap limit is too low, Node's garbage collector starts running constantly, trying to free memory instead of doing actual work; eventually, the process crashes with an out-of-memory error. The high ceiling gives it room to breathe. In practice, n8n uses 4-6GB.

The real constraint isn't peak usage; it's that everything competes for the same memory bus. When Elasticsearch is indexing, n8n is running 16 workflows, and PostgreSQL is handling a complex CTE query simultaneously, things slow down. Nothing crashes; it just slows down.

Ollama on the Mac Mini: The Inference Layer

The M4's unified memory architecture is genuinely excellent for LLM inference.
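To make that concrete, here's a back-of-envelope check of whether a quantized model fits in a given memory budget. The quantization figure and overhead below are my assumptions, not numbers from the post:

```python
# Back-of-envelope: does a quantized model fit in a given memory budget?
# Assumptions (mine): ~0.56 bytes/parameter at 4-bit quantization,
# plus ~2 GB headroom for KV cache and runtime overhead.
def fits(params_billions: float, mem_gb: float,
         bytes_per_param: float = 0.56, overhead_gb: float = 2.0) -> bool:
    weights_gb = params_billions * bytes_per_param  # 14B -> ~7.8 GB of weights
    return weights_gb + overhead_gb <= mem_gb

print(fits(14, 17))  # 14B model vs ~17 GB of unified memory -> True
print(fits(14, 3))   # same model vs a 3 GB GTX 1060 -> False
```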
Unlike discrete GPUs, where you're limited by VRAM (my GTX 1060's 3GB is useless for anything beyond tiny models), the M4 can use its full 24GB for model weights. The memory bandwidth (120 GB/s) is lower than a high-end GPU's, but for a 14B-parameter model, it's more than enough.

I run a tiered model strategy: qwen2.5:14b for classification and qwen3:14b for reasoning, with two more tiers (generation and vision) planned but not yet running locally. With ~17GB available for models, I can only run one at a time. The keep-alive is set to 10 seconds, so models unload quickly to free RAM for the next one. The flow is simple: the classifier loads for a scheduled batch (~4-second cold start), stays warm while processing, and unloads after 10 seconds idle; the reasoning model loads on demand and unloads when it's done. This works because classification and reasoning don't overlap much. The classifier runs on a schedule; the brain runs on events. The 4-second cold start is acceptable. If I had 48GB of unified memory, I'd keep both warm permanently, but the M4 with 24GB was the sweet spot for price/performance.

The Logging Proxy

One of the more useful things I built is an HTTP proxy that sits between all consumers and Ollama. Every inference request gets logged with full token counts, latency, and caller info. The logging happens in a daemon thread, so it doesn't block the response. This means I can query the usage table to see exactly which service is consuming the most tokens, what the average latency is, and which workflows are the heaviest users. All containers talk to the proxy; they never hit Ollama directly. This gives me a single point of observability for all LLM traffic across the system.

How the Machines Find Each Other

Tailscale gives each machine a stable IP that works regardless of the physical network. No port forwarding, no dynamic DNS, no opening ports on the home router. Docker containers on docker-host reach the inference server's Ollama through the Tailscale IP. Services on the same Docker host use Docker service names (e.g., http://postgres:5432); cross-machine communication goes through Tailscale IPs. I also run CoreDNS inside Docker for internal subdomain routing: friendly names like dashboard.internal and api.internal, all resolving to Tailscale IPs within the mesh only.
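For illustration, a minimal sketch of what such a CoreDNS setup could look like. The zone layout, file path, and Tailscale IPs here are my assumptions, not the actual config:

```
# Corefile: serve .internal authoritatively inside the mesh
internal {
    file /etc/coredns/db.internal
    log
}

# db.internal: one A record per friendly name
$ORIGIN internal.
@          3600 IN SOA ns.internal. admin.internal. (1 7200 3600 86400 300)
@          3600 IN NS  ns.internal.
ns         3600 IN A   100.64.0.1
dashboard  3600 IN A   100.64.0.10
api        3600 IN A   100.64.0.10
```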
One thing worth knowing if you set this up: CoreDNS in authoritative mode doesn't fall through to external DNS for missing records; it returns NXDOMAIN. So every new internal subdomain needs to be added to the zone file, or it simply won't resolve.

The Memory Mystery

The 32GB of DDR4 in docker-host is two 16GB kits of Corsair Vengeance RGB Pro, rated at 3200MHz. Same model number, same batch number; one kit bought in 2018, one in 2022. They should be as compatible as two kits can physically be. They aren't.

I've set XMP to 3200MHz multiple times. With the original single kit, I even ran a stable overclock at 3600MHz. But since adding the second kit, the profile either fails to apply or reverts to the JEDEC default of 2133MHz after some time. No error, no BSOD; it just silently drops back. So right now, 32GB of 3200MHz-rated memory is running at 2133MHz. That's roughly a third of the rated memory bandwidth sitting unused: every container, every query, every Docker layer pull runs at two-thirds speed on the memory bus. I haven't fully diagnosed whether it's a subtle timing incompatibility between the kits, a motherboard limitation with four DIMMs populated, or something else entirely. It's on the list, but it's the kind of issue that requires dedicated downtime to troubleshoot properly, and downtime means taking 22 containers offline.

What I'd Change If I Started Over

Linux instead of Windows on docker-host. Docker on Windows works, but it adds friction everywhere. My deploy script runs PowerShell commands over SSH (Remove-Item -Recurse -Force instead of rm -rf). I once corrupted a CoreDNS zone file because PowerShell's -replace treats \n in the replacement string as literal text instead of a newline. Linux would eliminate an entire category of issues.

A dedicated, purpose-built server. The current machine has three problems: it's not built for this job, it's not efficient at this job, and it has competing use cases. The docker-host is also my occasional Windows machine (I still use it for things that need Windows).
That means I can't wipe it to Linux, and it means the machine is pulling double duty when it should be dedicated infrastructure. In an ideal setup, Docker lives on its own box that I never touch except to SSH into.

The hardware itself is wasteful for a container host. The Ryzen 5 2600X pulls 95W TDP. Those 12 threads are genuinely useful when n8n, PostgreSQL, and Elasticsearch all spike at once, but most of the time, containers are waiting on I/O, not burning CPU. An Intel i5-12500T at 35W would handle the same workload. Then there's the GTX 1060 drawing 120W under load for absolutely nothing; it's only installed because the Ryzen has no integrated graphics. And the 650W PSU is running at maybe 20% load, the least efficient part of its power curve. The whole machine is optimized for gaming, not for sitting in a corner running Docker.

My ideal replacement: something like a Dell OptiPlex 3080 Micro. Small form factor, Intel with integrated graphics (no discrete GPU needed), 16GB RAM (expandable), designed for 24/7 operation, near-silent. These go for reasonable prices secondhand, though RAM pricing makes anything above 16GB expensive. It wouldn't match the Ryzen's raw multi-threaded output, but a Docker host that's mostly waiting on I/O and network doesn't need to.

48GB on the Mac mini. The 24GB M4 is good, but being limited to one model at a time creates a scheduling bottleneck. With 48GB, I could keep the classifier and the reasoning model warm simultaneously and cut out the cold-start latency entirely.

Start with Elasticsearch earlier. I started with ChromaDB for vector search because it's lighter. But once I needed hybrid search (keyword + semantic in the same query), I had to migrate anyway. If your data has both structured metadata and unstructured text, and you know you'll need to search both, start with something that handles both natively.
That said, if you only need vector similarity for a smaller dataset, ChromaDB or pgvector will save you 2GB of RAM and a lot of query DSL.

The Control Argument

Beyond cost and privacy, there's a third reason I run local-first: I own the upgrade timeline. I decide when to update Postgres. When Elasticsearch changes licensing, it doesn't affect my running instance. When n8n raises cloud pricing, it doesn't matter. When a model provider deprecates an API version, my workflows keep running.

I've been bitten by the alternative. I originally planned to use a specific open banking provider for transaction imports. They closed to new signups months after I started planning around them. Because my architecture is local-first, switching to a different provider was a contained change: one API integration, not a full re-architecture.

Is This For You?

Honest answer: probably not, if you're building a side project or a startup MVP. The setup cost in time is real. Docker Compose files don't write themselves, Tailscale needs configuring, and you'll spend a weekend debugging why a Python service can't reach Elasticsearch through Docker's bridge network.

If your data is genuinely sensitive, you have ongoing infrastructure needs, and you don't mind being your own sysadmin, it's worth considering. If you need to scale past what consumer hardware handles, or you have a team that needs managed infrastructure, or you'd rather write application code than debug Docker networking at midnight, stick with cloud services. There's no shame in that; it's a legitimate trade-off.

For me, €35/month, zero data leaving the house, and full control over every component is worth being my own sysadmin, DBA, and on-call engineer. For a solo operation, that math works.

This is Part 1 of "One Developer, 22 Containers", a series about building an AI office management system on consumer hardware. Next up: the technology decisions behind every major component, what I considered, and what I'd pick differently today.
If you're building something similar or have questions about any of the stack, I'd love to hear about it in the comments. You can also find me on GitHub.

Appendix: Configs and Specs
The cloud-LLM boundary, as enforced in agent configuration:

```json
{
  "cloud_llm_boundary": {
    "hard_rule": "NO cloud LLM usage by any agent without explicit human permission.",
    "prohibited_data": [
      "Email content — body, subject, sender, recipient, attachments",
      "Financial data — transactions, invoices, account numbers, balances",
      "Client information — names, contacts, project details, contracts",
      "Personal data — addresses, phone numbers, government identifiers",
      "Infrastructure — credentials, API keys, internal hostnames, IPs"
    ],
    "exceptions": "Development and debugging only, never with production data."
  }
}
```

How the 32GB on docker-host gets divided:

```text
Windows 11 OS overhead:      ~4 GB
Elasticsearch (Java heap):    4 GB (-Xms4g -Xmx4g)
n8n (Node.js):               ~4-6 GB typical usage
PostgreSQL:                  ~1 GB
Mattermost:                  ~0.5 GB
7x Python services:          ~2 GB total
Other containers:            ~1 GB
Docker engine overhead:      ~1 GB
─────────────────────────────────────────
Total:                       ~18-20 GB typical, ~30 GB under load
```

The logging proxy's endpoint classification:

```python
# Proxy sits between all services and Ollama
# Every inference call gets logged to PostgreSQL
INFERENCE_ENDPOINTS = {"/api/generate", "/api/chat", "/api/embed"}
POLL_ENDPOINTS = {"/api/tags", "/api/ps", "/api/version"}
```

Cross-machine wiring:

```yaml
# docker-compose.yml (simplified, IPs redacted)
api:
  environment:
    OLLAMA_URL: http://<inference-tailscale-ip>:11433
    DATABASE_URL: postgresql://user:pass@postgres:5432/db
```

docker-host:
- CPU: AMD Ryzen 5 2600X (6 cores, 12 threads)
- RAM: 32GB DDR4 (two 16GB kits)
- GPU: NVIDIA GTX 1060 3GB — useless for inference (3GB VRAM), but the Ryzen 5 2600X has no integrated graphics. Without this card, there's no display output; it exists purely to give the machine a screen.
- OS: Windows 11 with Docker Desktop — I still use this machine as a Windows PC occasionally, which is the honest reason it hasn't been wiped to Linux yet

inference (Mac mini):
- Chip: Apple M4, 10-core CPU, 10-core GPU
- RAM: 24GB unified memory (~17GB available for models after OS and services)
- Role: Ollama model serving, plus Proton Mail Bridge (which requires a GUI; no headless mode exists)

Planned model tiers (not yet running locally):
- Generation (qwen3:32b) — for client-facing content where quality matters. Needs a GPU with more VRAM than what I currently have.
- Vision (llama3.2-vision:11b) — for screenshot comparison and 3D print quality inspection. Planned for when the system matures enough to need it.

Model scheduling flow:
- Classification batch starts → qwen2.5:14b loads (~4-second cold start)
- Processes 10-50 emails → model stays warm
- Batch finishes → 10 seconds idle → model unloads
- Brain needs to reason → qwen3:14b loads
- Brain finishes → unloads
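The 10-second keep-alive in the flow above is set per request through Ollama's `keep_alive` option. A sketch; the proxy hostname is illustrative, and the request is built but not actually sent:

```python
import json
import urllib.request

# Ask for a classification and have Ollama unload the model 10s after idle.
# "keep_alive" is a standard Ollama request option; the URL is illustrative.
payload = {
    "model": "qwen2.5:14b",
    "prompt": "Classify this email: ...",
    "stream": False,
    "keep_alive": "10s",  # unload 10 seconds after the last request
}
req = urllib.request.Request(
    "http://ollama-proxy:11433/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # needs a live Ollama instance
```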
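The daemon-thread logging pattern the proxy uses can be sketched like this. The PostgreSQL write is stubbed with an in-memory list, and every name except INFERENCE_ENDPOINTS is mine:

```python
import queue
import threading

INFERENCE_ENDPOINTS = {"/api/generate", "/api/chat", "/api/embed"}

log_queue: "queue.Queue[dict]" = queue.Queue()
usage_rows = []  # stand-in for the PostgreSQL usage table


def log_writer():
    # Runs forever in a daemon thread; DB writes never block the response path.
    while True:
        row = log_queue.get()
        usage_rows.append(row)  # real code: INSERT INTO usage ...
        log_queue.task_done()


threading.Thread(target=log_writer, daemon=True).start()


def record_usage(path: str, caller: str, tokens: int, latency_ms: float):
    # The request handler only enqueues; the write happens off-thread.
    if path in INFERENCE_ENDPOINTS:
        log_queue.put({"path": path, "caller": caller,
                       "tokens": tokens, "latency_ms": latency_ms})


record_usage("/api/generate", "n8n", 1234, 850.0)
log_queue.join()  # wait for the writer in this demo only
print(usage_rows[0]["caller"])  # -> n8n
```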