Tools: Why Your Docker Container Works Locally But Fails in Kubernetes - Guide

Failure 1: Environment Variables and Secrets That Exist Locally But Not in the Cluster

Failure 2: Resource Limits, OOMKills, and CPU Throttling

Failure 3: Networking and Service Discovery Failures

Failure 4: Readiness and Liveness Probes Misconfigured

Failure 5: File Permissions and Volume Mount Issues

The Underlying Pattern

Quick Reference: The Local-to-Kubernetes Readiness Checklist

It's not Kubernetes being difficult. It's the assumptions your container was making that Docker quietly satisfied — and Kubernetes doesn't.

You've been here before. The container runs perfectly on your laptop. docker run works. The app responds. Logs look clean. You push it to your managed Kubernetes cluster — EKS, GKE, AKS, take your pick — and something breaks. The pod crashes with no useful logs. Or it starts, passes health checks, and returns wrong responses. Or it worked fine in staging and silently fails in production despite identical manifests.

This isn't bad luck. It's a specific and repeatable class of problem: your container was built with implicit assumptions about its runtime environment, and Docker satisfies those assumptions automatically while Kubernetes does not.

Docker on your laptop is a generous host. It passes through whatever environment you hand it, runs containers as whatever user the image specifies with no external enforcement, publishes ports straight to localhost, and gives containers as much memory and CPU as they ask for. Kubernetes is a strict host. It enforces isolation, applies resource constraints, manages networking through its own abstraction layer, and runs containers in a security context that may differ significantly from what you tested locally.

Every mismatch between those two environments is a potential failure. Here are the ones I've personally hit — and exactly how to close each gap.

Failure 1: Environment Variables and Secrets That Exist Locally But Not in the Cluster

This is the most common failure and the hardest to diagnose, because the error it produces is almost never "environment variable missing." It's usually a downstream failure — a database connection refused, an API call returning 401, a feature that behaves as if it's in the wrong mode.

Locally, your container inherits environment variables from your shell, your .env file, your docker-compose.yml. You've set these up once and forgotten about them. In Kubernetes, none of that exists.
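One way to surface this class of failure immediately, instead of as a mysterious downstream 401, is to validate required variables at startup and crash with an explicit message. A minimal sketch; the variable names are placeholders, not from any particular app:

```python
import os
import sys

# Variables the app needs at runtime; hypothetical names for illustration.
REQUIRED_ENV = ["DB_HOST", "DB_PASSWORD", "APP_ENV"]

def require_env(names, environ=os.environ):
    """Return the subset of names that are unset or empty."""
    return [name for name in names if not environ.get(name)]

def fail_fast(names, environ=os.environ):
    """Exit immediately with a clear message instead of failing downstream."""
    missing = require_env(names, environ)
    if missing:
        sys.exit("missing required environment variables: " + ", ".join(missing))
```

Called at the top of the entrypoint, fail_fast(REQUIRED_ENV) turns a vague downstream error into an obvious CrashLoopBackOff with an explicit message in kubectl logs.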
The pod gets exactly what you put in the manifest — nothing more.

The failure pattern I've seen most in EKS environments: an application that uses the AWS SDK works locally because the developer's machine has IAM credentials in ~/.aws/credentials. In EKS, those credentials don't exist — the pod needs an IAM role attached via a service account. The app starts, the pod is Running, health checks pass, and every AWS API call silently fails or returns permission errors that look like application bugs.

Always run an environment audit before moving to Kubernetes. Start the container locally with a completely clean environment (docker run --env-file /dev/null, no .env file, no inherited shell variables). If it breaks locally with a clean environment, it will break in Kubernetes. Fix it before it gets there.

For secrets in managed clusters, use the platform's native secret injection — AWS Secrets Manager with External Secrets Operator on EKS, GCP Secret Manager on GKE — rather than baking secrets into ConfigMaps or manifests. For IAM authentication specifically on EKS, use IRSA (IAM Roles for Service Accounts) — not instance profiles, not hardcoded credentials.

Failure 2: Resource Limits, OOMKills, and CPU Throttling

This one presents as the most confusing failure because the symptoms look like application bugs, not infrastructure problems.

OOMKill: the pod runs for a few minutes, then disappears. No error in application logs, because the process was killed before it could write one. kubectl describe pod shows OOMKilled in the last state — but only if you look at the right time, because that state rotates out of describe output after the pod restarts. Miss the window and you're debugging a ghost.

CPU throttling: the pod runs, the application responds, but it's slow. Intermittently slow, in ways that don't correlate with traffic. This is the cgroup CPU quota applying — your container is being throttled because it requested 200m CPU, hit a burst, and the kernel is enforcing the limit.
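When you suspect throttling, the cgroup counters can confirm it from inside the container. On a cgroup v2 node the kernel exposes nr_periods, nr_throttled, and throttled_usec in /sys/fs/cgroup/cpu.stat; a sketch of reading them (the path is the cgroup v2 default and differs under cgroup v1):

```python
def parse_cpu_stat(text):
    """Parse the space-separated key/value pairs in a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

def throttling_summary(path="/sys/fs/cgroup/cpu.stat"):
    """Report how often this container's CPU quota was actually enforced."""
    with open(path) as f:
        stats = parse_cpu_stat(f.read())
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    pct = 100.0 * throttled / periods if periods else 0.0
    return {
        "throttled_periods": throttled,
        "pct_of_periods": pct,
        "throttled_usec": stats.get("throttled_usec", 0),
    }
```

A consistently nonzero throttled percentage under normal load is the signature of a CPU limit set below what the container actually needs.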
Locally, docker run with no resource flags gives the container your full machine's CPU. In Kubernetes with limits set, the container gets exactly what you asked for — which may be far less than it needs under load.

Never set resource limits in Kubernetes without first understanding your container's actual consumption profile. Run it under realistic load, measure with kubectl top, and set requests and limits based on observed data, not guesses.

A pattern worth adopting in production: set memory limits (OOMKill is preferable to a node going down) but be conservative with CPU limits. CPU throttling degrades performance silently; it doesn't crash the pod, so it's far harder to detect. Use CPU requests for scheduling, and monitor actual CPU usage separately. For OOMKill diagnosis, always check the pod's last state (kubectl describe pod) immediately after a crash.

Failure 3: Networking and Service Discovery Failures

Locally, your microservices talk to each other via localhost or hostnames defined in docker-compose. In Kubernetes, localhost refers to the pod itself — not other services. Service discovery works through DNS, and that DNS only resolves correctly if your service names, namespaces, and selectors are configured precisely.

The failure I've hit most: an application configured to connect to localhost:5432 for its database — perfectly valid in a Docker Compose setup where the database is a sidecar. In Kubernetes, that connection attempt hits the pod's own loopback interface and fails immediately. The error looks like a database connection failure, not a networking misconfiguration.

The staging-to-production variant: services work in staging because everything is in the default namespace and short DNS names resolve. In production with multiple namespaces, myservice doesn't resolve — myservice.production.svc.cluster.local does. The same manifest, different namespace, different DNS behavior.

Replace all localhost service references with Kubernetes DNS names before deploying.
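One low-tech way to keep those references consistent is to build them from parts instead of hard-coding hostnames. A sketch using the standard in-cluster DNS format (the service and namespace names here are examples):

```python
def service_dns(service, namespace=None, cluster_domain="cluster.local"):
    """Build an in-cluster DNS name: <service>.<namespace>.svc.<cluster-domain>.

    With no namespace given, return the short name, which only resolves
    from pods in the same namespace.
    """
    if namespace is None:
        return service
    return f"{service}.{namespace}.svc.{cluster_domain}"

# Same namespace: short name is fine.
db_host = service_dns("postgres-service")

# Cross-namespace: always use the fully qualified form.
auth_url = "http://" + service_dns("auth-service", "auth-namespace")
# auth_url == "http://auth-service.auth-namespace.svc.cluster.local"
```

Centralizing the format in one helper also makes the staging-to-production namespace change a configuration value rather than a scattered find-and-replace.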
The full DNS format is <service-name>.<namespace>.svc.cluster.local. For services in the same namespace, the short name works on its own. Debug DNS resolution from inside the pod, not from your laptop: kubectl exec into the pod and nslookup the service name; if nslookup fails, check the CoreDNS logs in kube-system.

Network policies are the other common gotcha in production managed clusters. EKS and GKE often ship with default-deny network policies in hardened configurations. A service that communicates freely in staging can be silently blocked in production, so declare an explicit ingress NetworkPolicy rather than relying on default-allow.

Failure 4: Readiness and Liveness Probes Misconfigured

This failure is subtle because it's the Kubernetes layer doing exactly what you told it to do — you just told it the wrong thing.

A liveness probe that's too aggressive will kill a pod that's healthy but slow to start — especially JVM applications, Python apps loading large models, or anything with a meaningful initialization phase. The pod starts, Kubernetes probes it at second 10, gets no response because the app isn't ready yet, and kills it. CrashLoopBackOff. The app never had a chance to run.

A readiness probe that's too lenient — or missing entirely — sends traffic to pods that aren't ready. The service shows endpoints, requests route to the new pod, and users get errors during the rollout window.

Locally, neither of these exists. Docker runs your container and leaves it alone.

Configure initialDelaySeconds generously on liveness probes — always longer than your slowest observed startup time. Use separate endpoints for liveness and readiness: /healthz for liveness should return 200 as long as the process is alive and not deadlocked; /ready for readiness should verify the application can actually serve traffic — database connected, cache warm, dependencies reachable.

Failure 5: File Permissions and Volume Mount Issues

Locally, your Docker container typically runs as root or as your user — whichever the Dockerfile specifies, with no external enforcement. In managed Kubernetes clusters, particularly on GKE Autopilot and hardened EKS configurations, pods run with runAsNonRoot: true enforced at the namespace or cluster level.
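Under that kind of enforcement, the first symptom is often a write failure deep inside a request path. A startup check that proves every expected path is actually writable fails fast instead. A sketch, with placeholder paths:

```python
import tempfile

# Paths the app writes to at runtime; placeholder values for illustration.
WRITABLE_PATHS = ["/app/logs", "/tmp/cache"]

def unwritable(paths):
    """Return the paths where the current user cannot actually create a file."""
    failed = []
    for path in paths:
        try:
            # A real write attempt, cleaned up immediately; checking mode
            # bits alone can mislead under some mount configurations.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
        except OSError:
            failed.append(path)
    return failed
```

Run at startup, a nonempty result should abort with a message naming the paths, which points straight at the fsGroup or Dockerfile USER fix rather than at a phantom application bug.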
If your container expects to write to /app/logs or /tmp/cache as root, it silently fails or crashes with a permission error that's easy to misread.

Volume mounts compound this. A hostPath volume that works in a local Docker setup doesn't exist in a managed cluster. An emptyDir volume mounted at /app/data will be owned by root unless you explicitly set fsGroup — meaning a container running as a non-root user can't write to it.

Always set an explicit security context (runAsNonRoot, runAsUser, fsGroup, readOnlyRootFilesystem) and test against it locally. In your Dockerfile, match the user: create a dedicated non-root account and switch to it with USER. Test the same constraints locally (docker run --user 1000:1000 --read-only) before pushing to the cluster. If it fails locally with these constraints, it will fail in Kubernetes. Fix the permissions at the image level, not with cluster-level workarounds.

The Underlying Pattern

Every failure above follows the same structure: Docker locally is permissive by default, Kubernetes in production is restrictive by design.

This isn't a Kubernetes flaw. Isolation, resource enforcement, and security contexts exist for good reasons in multi-tenant managed clusters. The problem is that the permissive local environment creates invisible dependencies — on inherited environment variables, on unrestricted resources, on root file access — that your container never had to explicitly declare.

The fix isn't to make Kubernetes more permissive. It's to make your container honest about what it needs. Build containers that declare their requirements explicitly: environment variables, resource requests, security context, health check endpoints, DNS-based service addressing. Test them under production-like constraints before they reach the cluster.

When a container works locally and fails in Kubernetes, the question isn't "what's wrong with Kubernetes" — it's "what assumption was my container making that I didn't know about." Kubernetes just makes those assumptions visible. Usually at the worst possible time.
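As a concrete example of declaring requirements explicitly, the liveness/readiness split from Failure 4 can be sketched with nothing but the standard library. The endpoint paths match the article; the dependency check is a placeholder:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # Placeholder: replace with a real connection check against your database.
    return True

READINESS_CHECKS = {"database": check_database}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and serving requests; nothing else.
            self._reply(200, "alive")
        elif self.path == "/ready":
            # Readiness: every dependency must actually be usable.
            failed = [name for name, check in READINESS_CHECKS.items() if not check()]
            if failed:
                self._reply(503, "not ready: " + ", ".join(failed))
            else:
                self._reply(200, "ready")
        else:
            self._reply(404, "not found")

    def _reply(self, code, body):
        data = body.encode()
        self.send_response(code)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        # Keep probe traffic out of the application logs.
        pass
```

Served with HTTPServer(("", 8080), HealthHandler).serve_forever(), this keeps the liveness answer cheap and unconditional while readiness honestly reflects whether the pod should receive traffic.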
Quick Reference: The Local-to-Kubernetes Readiness Checklist

Before promoting any container from local Docker to a managed Kubernetes cluster, run through the checklist at the end of this post.

What's the most confusing Docker-to-Kubernetes failure you've debugged? Drop it in the comments — the weirder the better.

Code Examples

Environment audit with a clean environment:

```shell
# Strip your local environment entirely
docker run --env-file /dev/null myapp:latest

# Or explicitly pass only what Kubernetes will provide
docker run \
  -e DB_HOST=localhost \
  -e APP_ENV=production \
  myapp:latest
```

External Secrets Operator pattern for EKS:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-secrets
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/myapp/db
        property: password
```

IRSA service account for EKS:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/myapp-role
```

Measuring actual resource consumption:

```shell
# Watch resource consumption in real time
kubectl top pod myapp-pod --containers

# Get historical metrics if you have metrics-server
kubectl top pods -l app=myapp --sort-by=memory
```

Requests and limits from observed data:

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    # Consider not setting CPU limits — only requests.
    # CPU limits cause throttling; CPU requests cause scheduling.
```

OOMKill diagnosis:

```shell
kubectl describe pod myapp-pod | grep -A 10 "Last State"
# Look for: Reason: OOMKilled
```

The full in-cluster DNS format:

```
<service-name>.<namespace>.svc.cluster.local
```

DNS-based service references:

```yaml
env:
  - name: DB_HOST
    value: "postgres-service"  # same namespace
  - name: AUTH_SERVICE_URL
    value: "http://auth-service.auth-namespace.svc.cluster.local"  # cross-namespace
```

Debugging DNS from inside the pod:

```shell
# Exec into the pod and test DNS directly
kubectl exec -it myapp-pod -- nslookup postgres-service
kubectl exec -it myapp-pod -- curl -v http://postgres-service:5432

# If nslookup fails, check CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns
```

Explicit NetworkPolicy ingress:

```yaml
# Explicit ingress policy — don't rely on default-allow
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-myapp-ingress
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```

Liveness and readiness probes:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30  # give the app time to start
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready  # separate endpoint from liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```

Explicit security context:

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000                 # ensures volume mounts are group-writable
  readOnlyRootFilesystem: true  # force explicit volume declarations
```

Dockerfile non-root user:

```dockerfile
RUN addgroup --system appgroup && adduser --system --ingroup appgroup appuser
RUN chown -R appuser:appgroup /app
USER appuser
```

Testing cluster constraints locally:

```shell
docker run --user 1000:1000 --read-only myapp:latest
```

The Local-to-Kubernetes readiness checklist:

- Environment audit — run locally with clean environment, no inherited shell variables
- IAM/credentials — no local credential files; use IRSA or Workload Identity
- Resource profiling — measure actual CPU and memory under load before setting limits
- DNS references — replace all localhost with Kubernetes service DNS names
- Probe configuration — separate liveness/readiness endpoints, generous initialDelaySeconds
- Security context — test with runAsNonRoot: true and readOnlyRootFilesystem: true locally
- Volume permissions — set fsGroup on all writable volume mounts