# Docker Compose vs Kubernetes: What I Actually Learned Running Both in Production


Eighteen months ago I inherited a mess. A four-person team had built a reasonably capable ML inference service — three Python microservices, a Redis queue, a Postgres instance, an Nginx reverse proxy — all wired together with a `docker-compose.yml` that had clearly been written in a hurry and never revisited. The team lead had left a sticky note in the README that said, verbatim: "we should probably move this to Kubernetes at some point."

That sticky note started a long argument with myself. I ended up running both. Not as an experiment — as an actual business decision I had to defend, twice, to different stakeholders. What follows is what I learned, what I got wrong, and where I landed.

## Docker Compose in 2026 Is Not What You Used Five Years Ago

The version of Compose I inherited was using some 3.x syntax with deprecated options. The first thing I did was migrate to Compose v2.32 (which ships bundled with Docker Desktop and the Docker CLI now — no separate install needed). That alone fixed several subtle networking headaches.

The thing is, Compose has gotten genuinely good at what it was always meant to do. `compose watch` has been stable for a while now, and it changed how I think about local development. The `sync+restart` action for the worker (see the compose file at the end of this post) is something I use constantly — it syncs files and then restarts the process without a full image rebuild. That saves probably 40 seconds per iteration cycle when you're deep in debugging.

For a team our size (four engineers, two of whom are ML researchers who don't want to think about infrastructure), Compose has a near-zero learning curve. I can write a `docker-compose.yml`, push it to the repo, and anyone can `docker compose up` without reading a manual. That matters more than people admit.
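As one concrete example of the kind of cleanup that migration involves — this is an illustrative sketch, not the actual inherited file. The top-level `version:` key from the 3.x era is obsolete under the Compose Specification (current `docker compose` releases ignore it and warn), and healthcheck-gated `depends_on` replaces the older shorthand:

```yaml
# version: "3.8"   # obsolete — modern Compose ignores this top-level key
services:
  api:
    build: ./api
    # the long-form depends_on waits for redis to actually be healthy,
    # not merely started — the old short syntax couldn't express this
    depends_on:
      redis:
        condition: service_healthy
  redis:
    image: redis:7.4-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
```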
On a single host — even a beefy one like an EC2 m7i.4xlarge — Compose handles more than you'd think. I've run services doing 400 req/s on a single host with Compose and it was fine. The constraint is the host, not Compose. If your service fits on one host and your team is small, defaulting to Compose isn't laziness — it's a reasonable engineering decision with real payoff in operational simplicity.

## Where Kubernetes Actually Earns Back Its Complexity Tax

I did eventually move part of the system to Kubernetes. Not all of it — more on that in a moment — but the inference serving component specifically, because we started getting requests for GPU-backed endpoints, and that's where Compose genuinely hits a wall. Running GPU workloads across multiple nodes is one of those things K8s is legitimately built for. The NVIDIA GPU Operator on K8s 1.35 has become much more stable than it was back in the 1.28 era — I remember hitting a specific issue where device plugin pods would crash on node drain (somewhere around kubernetes/kubernetes#118506, I'd have to dig). By 1.33 that class of issue was mostly sorted. GPU scheduling on multi-node K8s is now a solved problem in a way it genuinely wasn't two years ago.

The second payoff: HorizontalPodAutoscaler against custom metrics. We pipe inference latency from Prometheus into KEDA, and the autoscaler responds to queue depth and p95 latency — not just CPU (the `ScaledObject` at the end of this post shows the triggers). That's not something you replicate with Compose without building significant custom tooling.

Rolling deployments are the other thing worth mentioning. With Compose, `docker compose up --force-recreate` on a single host means downtime — or you're writing your own health-check loop. K8s rolling updates with a proper `readinessProbe` mean zero-downtime deploys without having to think about it. I pushed a model update on a Friday afternoon once (yes, I know) and the rollout was fine because the cluster waited for new pods to be healthy before draining the old ones. I would not have taken that risk with Compose on a single host.
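For the rolling-update point, here is a minimal sketch of the K8s side — the deployment name, image tag, and `/healthz` endpoint are illustrative, not our actual manifests. The parts doing the work are `maxUnavailable: 0` and the readiness probe, which together force the rollout to wait for a healthy new pod before an old one is removed:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment      # illustrative name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # bring up one extra pod at a time
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
        - name: inference
          image: registry.example.com/inference:v2   # illustrative image
          readinessProbe:
            httpGet:
              path: /healthz      # illustrative health endpoint
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
```

A pod only starts receiving traffic once the probe passes, so a deploy that ships a broken image stalls instead of taking the service down.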
That said — and I want to be direct about this — the K8s cluster costs us roughly $340/month more than a comparable Compose deployment on a single large instance would. That's real money for a side project or an early-stage product. The break-even only works if you're at a scale where the autoscaling savings outweigh the base cluster cost, or if you genuinely need multi-node availability.

## The ML Workload Angle I Didn't Anticipate

I thought I'd have a clear answer here. I didn't. I assumed moving ML inference to K8s would also mean moving training jobs there. Same cluster, same GPU nodes, everything in one place — it seemed logical.

What I actually found was that training jobs are weird. They're batch, they're stateful in an awkward way, they need specific environment setup that changes frequently, and the feedback loop when something goes wrong is slow. I ran training jobs as K8s Jobs with `ttlSecondsAfterFinished` for a few months. Fine in theory. In practice, every time an ML researcher wanted to tweak the data pipeline or swap a tokenizer, they were waiting on me to update a ConfigMap or rebuild an image. I had become a gatekeeper for changes that had nothing to do with infrastructure — which is a bad sign.

So I moved training back to Compose — on a dedicated GPU box, not the K8s cluster. Training runs as `docker compose -f compose.train.yml up` with the model checkpoint directory mounted as a volume. Researchers can modify it directly. Inference serving stays on K8s, where the availability and scaling story matters.

I genuinely didn't see that split coming. I thought "K8s for ML" was the obvious move. The reality: K8s is great for serving (stateless, latency-sensitive, scaling matters) and overkill for training (stateful, batch, where iteration speed matters more than orchestration).

## The Signals I Now Actually Use to Decide

After 18 months of this, the heuristic I've landed on is less about features and more about team and workload shape.
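The training setup described above can be sketched roughly like this — the service name, build path, and mount points are illustrative, not our actual file. The important parts are the bind-mounted checkpoint and config directories researchers can edit directly, and the GPU reservation using the Compose Specification's `deploy.resources.reservations.devices` syntax:

```yaml
# compose.train.yml — illustrative sketch of a dedicated-GPU-box training stack
services:
  train:
    build: ./training
    volumes:
      - ./checkpoints:/checkpoints    # model checkpoints live on the host
      - ./training/configs:/configs   # researchers edit these directly, no rebuild
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Launched with `docker compose -f compose.train.yml up`, there is no image rebuild or ConfigMap edit between iterations — which was the whole point of moving training off the cluster.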
Compose is the right call when your service runs on one host without strain, your team has fewer than six or seven engineers touching infrastructure, and you're iterating fast enough that deployment simplicity directly affects development speed. Also — and I feel strongly about this — if the people running the service are primarily not infrastructure engineers, Compose's operational model is far more forgiving. A `docker compose logs -f worker` is something anyone can run. A `kubectl logs -n production -l app=worker --since=1h` is a command you need to look up, at least at first.

Kubernetes makes sense when you need to schedule across multiple nodes (GPUs, memory isolation, availability zones), when you have autoscaling requirements that respond to custom signals, when your team has dedicated platform or SRE capacity to own the cluster, or when your availability requirements are strict enough that single-host failure isn't acceptable.

One thing I want to push back on: the idea that Kubernetes is automatically "more production-ready." I've seen Compose deployments that were stable and well-monitored, and K8s clusters that were a disaster of misconfigured RBAC, stale CRDs, and nobody who actually understood the control plane. The tool doesn't make you production-ready. The operational discipline does.

## What I'd Actually Tell You to Do

Start with Compose. Not because K8s is bad — it isn't — but because you'll hit the limits of Compose in very specific, recognizable ways. You'll know when you need multi-node scheduling because you'll be staring at a GPU allocation problem that Compose can't solve. You'll know when you need cluster-level autoscaling because you'll have just manually scaled your single host twice in a week and you're annoyed about it. When you hit those specific walls, migrate that specific component. Not everything at once.
The worst outcome I've seen is teams migrating entirely to K8s before they have the scale to justify it, then spending their first six months of product development fighting cluster configuration instead of shipping features. Kubernetes is powerful and I use it every day, but complexity has a real cost, and that cost lands on your team's velocity.

Anyway. The sticky note in the README — I never did "move everything to Kubernetes." I moved the inference serving layer and kept the rest on Compose. The system is faster, more reliable, and cheaper to operate than a full K8s migration would have been. Sometimes the boring answer is the right one.

```yaml
# docker-compose.yml — inference service, 2026
services:
  api:
    build: ./api
    ports:
      - "8000:8000"
    develop:
      watch:
        - action: sync
          path: ./api/src
          target: /app/src
        # Rebuild only when dependencies change, not on every save
        - action: rebuild
          path: ./api/requirements.txt
    environment:
      - MODEL_PATH=/models/bert-base
    volumes:
      - ./models:/models:ro  # mount model weights read-only, not baked into image
  worker:
    build: ./worker
    depends_on:
      redis:
        condition: service_healthy
    develop:
      watch:
        - action: sync+restart
          path: ./worker/src
          target: /app/src
  redis:
    image: redis:7.4-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      retries: 5
```
```yaml
# hpa.yaml — scales inference pods on queue depth + latency
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_queue_depth
        threshold: "15"  # scale up when >15 items queued per pod
        query: sum(inference_queue_depth) / count(kube_pod_info{pod=~"inference.*"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_p95_latency_ms
        threshold: "800"
        query: histogram_quantile(0.95, rate(inference_duration_bucket[2m])) * 1000
```