6 Real Debugging Failures I Hit in My Homelab (And What They Taught Me)


Errors You'll Hit in Your Home Lab And What to Do When You Do

Why Building in a Home Lab Gives You Real Experience

What You Need to Set This Up

How to Set Up a DevOps Lab on Your Laptop (Without Spending a Dime)

What Turns Errors Into Experience

6 Errors You'll Hit as You Build, and How to Debug Each One

1. Database Connection Refused on Startup

2. Pods Stuck in Pending: Nothing Is Scheduling

3. ImagePullBackOff: The Cluster Can't Find Your Local Image

4. Ingress Returns 404: Service Selector Mismatch

5. Grafana Shows "No Data": Metrics Not Reaching Prometheus

6. ERR_TOO_MANY_REDIRECTS: When Two Systems Both Try to Enforce HTTPS

Keep Going: Free Labs to Build On

Quick Reference: When Something Breaks, Start Here

What You Have After All Six

The first time a pod crashed in production, I ran kubectl logs and got nothing. It was empty, clean, no errors. I didn't know the container had already restarted, and I didn't know about --previous. I was staring at a blank screen while the app was down, with no idea why. The command existed; I just didn't know to reach for it when it mattered.

So I kept digging in the wrong place. I restarted pods and re-ran requests, but nothing changed. Meanwhile, the actual error had already disappeared with the previous container.

That's the part no tutorial really prepares you for. Not the command, but knowing when to use it. That only comes from things breaking in your own lab. From hitting errors, misreading them, and going back until it clicks. The more you build, the more it breaks. The more it breaks, the more you learn, if you slow down to understand what actually happened.

This article walks through six of those failures: what they look like, how to debug them, and how to document them so the lesson sticks. Each one comes with the exact commands, what you're looking for, and a five-line incident report template, so every error becomes an experience you can talk about.

Why Building in a Home Lab Gives You Real Experience

Tutorials show you what to do when everything works. Your home lab shows you what to do when it doesn't. When you build a real app in your lab, deploy it, wire up monitoring, and connect it to a database, things break. Containers crash, pipelines fail, services stop talking to each other, and Terraform and your actual infra fall out of sync.

Most people hit these errors, Google the fix, copy-paste it, and move on. The error is gone. Nothing was learned. The engineers who get hired and trusted are the ones who stop when something breaks, read what the system was telling them, fix it based on what they found, and write down what happened. That's the difference between using a home lab and learning from one.
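The flag from that opening story, for reference. A minimal sketch; the pod and container names are placeholders you'd replace with your own:

```shell
# Logs of the current container - empty or clean if it already restarted
kubectl logs <pod-name>

# Logs of the PREVIOUS container instance - where the crash error actually lives
kubectl logs <pod-name> --previous

# Multi-container pod? Name the container explicitly
kubectl logs <pod-name> --previous -c <container-name>
```

If --previous returns nothing either, the pod may have been recreated (not just restarted), in which case the old logs are gone and you need cluster-level log collection.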
What You Need to Set This Up

A laptop with 8GB RAM (16GB recommended) and 50GB of free disk space. No cloud account. No Raspberry Pi.

Never built a DevOps lab before? Start here first; it walks you through Docker, Kubernetes, Vagrant, Terraform, and Ansible from scratch on your laptop: → How to Set Up a DevOps Lab on Your Laptop at Zero Cost. Come back when your lab is running.

How to Set Up a DevOps Lab on Your Laptop (Without Spending a Dime)

Step 1: Install local Kubernetes. Pick one.

Step 2: Deploy a real app. You need something with actual moving parts: an API, a database, services talking to each other. That's where real errors come from. A single hello-world container won't teach you anything. Clone this free repo and follow its setup steps to get it running in your cluster: → 🔗 DevOps Home-Lab 2026. It builds a full multi-service app (API, Postgres, Redis) from Docker Compose through to Kubernetes. Don't move to Step 3 until your pods are healthy.

Step 3: Wire up observability. Install Prometheus and Grafana now that your app is deployed and there's something real to monitor. Open Grafana → "Kubernetes / Compute Resources / Pod" dashboard → confirm your app's pods are visible. That's your baseline. When something breaks as you build, this is where the evidence shows up.

What Turns Errors Into Experience

Most people who build homelabs end up doing the same thing: installing tools, following a tutorial, and deleting the cluster, with no experience gained. The ones who come out with real skills do something different every time something breaks. They follow this loop, especially the last step: the incident report template. Copy it. Use it every time something breaks.

Every error you hit and document becomes an experience you can speak about. Not "I've read about CrashLoopBackOff," but "I hit this at 11 pm building my lab; here's what the logs showed, and here's what fixed it." That's what interviewers are actually asking for.

6 Errors You'll Hit as You Build, and How to Debug Each One

These are real errors from the DevOps Home-Lab 2026 repo, pulled from actual lab sessions, not invented for a tutorial.
You will hit most of them. Throughout this section, replace <your-deployment> and <pod-name> with your actual names. Run kubectl get pods -A to see what's running.

1. Database Connection Refused on Startup

Stack: Docker Compose. What you'll learn: depends_on doesn't mean "wait until ready." It means "start after." Those are not the same thing.

You run docker compose up. Both containers start. But the backend throws an error immediately and dies.

What happened: your backend container started before Postgres finished its initialization sequence. Docker Compose started them roughly in parallel; depends_on only waits for the container to exist, not for the database inside it to be ready to accept connections.

The quick fix is to restart the backend once Postgres is up. The permanent fix: add a healthcheck block to your Postgres service that runs pg_isready, and make the backend depend on it with condition: service_healthy. Docker Compose will then wait for the health check to pass before starting dependent containers. Write your incident report.

2. Pods Stuck in Pending: Nothing Is Scheduling

Stack: Kubernetes. What you'll learn: Pending means the scheduler wants to place the pod but can't. kubectl describe tells you exactly why.

You deploy your app to Kubernetes. The pods sit in Pending forever. No errors in the logs, because the container never started.

This happens when your cluster doesn't have enough memory to satisfy the pod's resource requests. Running a full stack (Postgres, Redis, backend, frontend, Prometheus, Grafana) on a single-node cluster with Docker Desktop memory limited to 4GB will hit this fast.

The Events block in kubectl describe is the most useful thing in Kubernetes for this error. It tells you exactly what the scheduler is thinking. Write your incident report.

3. ImagePullBackOff: The Cluster Can't Find Your Local Image

Stack: Kubernetes / k3d. What you'll learn: building an image locally and deploying it to a cluster are two separate steps. The cluster has its own image context.

You build your Docker image locally, write a deployment YAML that references it, and apply it to your k3d cluster. The pod fails immediately.

What happened: k3d runs inside Docker but has its own separate image registry.
When you run docker build -t my-backend:latest ., that image lives in Docker's local daemon; k3d nodes can't see it. The cluster tries to pull from Docker Hub, fails, and gives up. Every time you rebuild the image, you need to re-import it. Write your incident report.

4. Ingress Returns 404: Service Selector Mismatch

Stack: Kubernetes / Ingress. What you'll learn: a 404 from an Ingress rarely means the Ingress is broken. It means the service behind it has no endpoints.

You set up an Ingress. You visit the URL. NGINX returns a clean 404. The Ingress looks fine. The service exists. But something in the chain is broken: usually a label mismatch between the service selector and the pod labels, or a wrong port number.

Zero endpoints means Kubernetes never connected the service to any pod. That's always a label mismatch or port mismatch, not an Ingress problem. The service selector must exactly match the pod labels. Also verify that targetPort in the service matches the containerPort in the deployment. Write your incident report.

5. Grafana Shows "No Data": Metrics Not Reaching Prometheus

Stack: Observability / Prometheus / Grafana. What you'll learn: "No data" in Grafana means something in the chain between your app and Grafana is broken. Walk it backwards.

You open Grafana. Your dashboards show "No data" on every panel. You didn't change anything recently.

This happens when Prometheus isn't scraping your app: either the app never exposed a /metrics endpoint, or the ServiceMonitor is missing or misconfigured, or the label selectors don't match what Prometheus is watching for.

Walk the chain: Grafana → Prometheus data source → Prometheus targets → app /metrics endpoint. The failure is always at one of those four links. Write your incident report.

6. ERR_TOO_MANY_REDIRECTS: When Two Systems Both Try to Enforce HTTPS

Stack: Ingress / Cloudflare / ArgoCD. What you'll learn: redirect loops happen when two layers both try to handle HTTPS. You fix it by deciding which layer owns TLS termination and disabling it everywhere else.

You expose your app through Cloudflare Tunnel. It works. Then you add ArgoCD or enable ssl-redirect on your NGINX Ingress.
Suddenly, your browser returns ERR_TOO_MANY_REDIRECTS on every request.

What's happening: Cloudflare Tunnel terminates HTTPS at the edge and forwards plain HTTP into your cluster. NGINX Ingress sees HTTP traffic and immediately 301-redirects to HTTPS. Cloudflare sends that HTTPS back through the tunnel, where it gets redirected again. Infinite loop.

Rule: terminate TLS at ONE layer only. If Cloudflare handles TLS at the edge, everything inside the cluster runs plain HTTP. Write your incident report.

Keep Going: Free Labs to Build On

The six errors above are the ones you'll hit most often when you're starting, but there's a lot more to encounter. These free resources give you more environments to build in, and more things to break, debug, and learn from. The full structured path from foundations to cloud: 🔗 List of DevOps Projects - five phases, all free. Pick a specific problem or follow the whole thing.

What You Have After All Six

Most people will read this and move on, but please: build something in your lab. Hit one of these errors. Come back to the right section, read what the system is telling you, fix it, and write the incident report. Do that enough times, and the next time someone asks you to walk through a debugging incident, you won't be narrating something you read. You'll be recalling something you fixed. That's the difference.

The six errors above are a starting point. If you want a structured system for working through failures you haven't seen before (STOP framework, intentional break scenarios, and production checklists), that's what The Kubernetes Detective is built around. For building the full lab environment from scratch: Build Your Own DevOps Lab (V3.5).

Let's connect on LinkedIn. Every week, I share what I learned in my Newsletter: case studies from real companies, the tactics that saved money, and the honest moments where everything broke. Subscribe if that sounds useful. Job hunting? Grab my free DevOps resume template that's helped 300+ people land interviews.

Quick Reference: When Something Breaks, Start Here

Setup commands (Steps 1-3):

```shell
# kind - fastest start
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64
chmod +x ./kind && sudo mv ./kind /usr/local/bin/kind
kind create cluster --name devops-lab
kubectl get nodes
# NAME                       STATUS   ROLES           AGE
# devops-lab-control-plane   Ready    control-plane   30s

kubectl get pods -A
# Your app's pods should show STATUS: Running

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Access Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Login: admin / prom-operator
```
The incident report template:

```
INCIDENT: [error type] - [service name]
ROOT CAUSE: [one sentence - what actually caused it]
DETECTION: [which command or metric showed you the problem]
FIX: [exactly what you did to resolve it]
LESSON: [what you now know that you didn't before]
```

Error 1 - what you see:

```
Error: connect ECONNREFUSED 127.0.0.1:5432
```

Error 1 - diagnose:

```shell
# What did the backend actually see?
docker logs <your-backend-container>

# Did Postgres finish starting?
docker logs postgres
# Look for: "database system is ready to accept connections"

# Are both containers up?
docker ps
```
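The permanent fix for Error 1, sketched as a docker-compose fragment. This assumes the official postgres image; service names, image tags, and credentials here are illustrative, not from the repo:

```yaml
services:
  postgres:
    image: postgres:16          # illustrative tag
    environment:
      POSTGRES_PASSWORD: example  # illustrative only - use a secret in practice
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10

  backend:
    build: .
    depends_on:
      postgres:
        condition: service_healthy  # waits for pg_isready, not just container start
```

With condition: service_healthy, Compose holds the backend until the healthcheck passes, which is exactly the readiness gate that plain depends_on lacks.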
Error 1 - fix:

```shell
# Quick fix: restart the backend after Postgres is ready
docker restart <your-backend-container>

# Permanent fix: add a healthcheck-based dependency in docker-compose.yml
```

```yaml
depends_on:
  postgres:
    condition: service_healthy
```

Error 1 - incident report:

```
INCIDENT: Backend container crashed on startup - ECONNREFUSED
ROOT CAUSE: Backend started before Postgres finished initializing. depends_on controls start order, not service readiness.
DETECTION: docker logs backend showed ECONNREFUSED to port 5432. docker logs postgres confirmed it hadn't finished booting yet.
FIX: Added healthcheck to postgres service. Set depends_on condition to service_healthy. Backend now waits until Postgres is ready.
LESSON: Always use healthchecks for stateful services. This same pattern applies in Kubernetes via readiness probes.
```
Error 2 - what you see:

```shell
kubectl get pods -A
# NAME                READY   STATUS    RESTARTS
# backend-7d9f-xk4m   0/1     Pending   0
```

Error 2 - diagnose:

```shell
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section at the bottom:
# "0/1 nodes are available: 1 Insufficient memory."

# What's the cluster actually using right now?
kubectl top nodes

# Do nodes exist at all?
kubectl get nodes -o wide
```
Error 2 - fix:

```shell
# Option A: recreate the cluster with more capacity
k3d cluster delete devops-lab
k3d cluster create devops-lab --agents 2

# Option B: lower resource requests in your deployment YAML
```

```yaml
resources:
  requests:
    memory: "128Mi"   # reduce from whatever you had
    cpu: "50m"
```

Error 2 - incident report:

```
INCIDENT: Pods stuck in Pending - backend and frontend
ROOT CAUSE: Cluster had insufficient memory to schedule pods. Resource requests exceeded available node capacity.
DETECTION: kubectl describe pod showed "0/1 nodes available: Insufficient memory" in Events. kubectl top nodes confirmed node was at capacity.
FIX: Reduced memory requests in deployment YAML. Pods scheduled immediately.
LESSON: Kubernetes won't silently lower your resource requests. Pending forever = scheduler can't fit the pod. Describe it first.
```
Error 3 - what you see:

```shell
kubectl get pods -n <namespace>
# NAME             READY   STATUS             RESTARTS
# backend-abc123   0/1     ImagePullBackOff   0
```

Error 3 - diagnose:

```shell
kubectl describe pod <pod-name> -n <namespace>
# Events will say:
# "Failed to pull image: rpc error...
#  repository does not exist or may require authentication"

# The image IS here:
docker images | grep my-backend

# But NOT here:
k3d image list devops-lab
```
Error 3 - fix:

```shell
# Import your local image into the k3d cluster
k3d image import my-backend:latest -c devops-lab

# Also set imagePullPolicy: Never in your deployment YAML
# so Kubernetes doesn't attempt to pull from Docker Hub
```

```yaml
spec:
  containers:
    - name: backend
      image: my-backend:latest
      imagePullPolicy: Never
```

Error 3 - incident report:

```
INCIDENT: ImagePullBackOff - backend deployment
ROOT CAUSE: Local Docker image not imported into k3d cluster. k3d has its own image context - it can't see Docker daemon images.
DETECTION: kubectl describe pod showed "repository does not exist" in Events. docker images confirmed image existed locally. k3d image list confirmed it was absent from the cluster.
FIX: Ran k3d image import. Set imagePullPolicy: Never in deployment YAML.
LESSON: Build → import → deploy. Every rebuild needs a re-import. Or set up a local registry to automate this.
```
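The "set up a local registry" option from the lesson, as a sketch. This assumes k3d v5; the cluster and registry names are placeholders, and you should check the k3d registry docs for the exact in-cluster image name:

```shell
# Create a cluster with a built-in local registry (k3d publishes it on a host port)
k3d cluster create devops-lab --registry-create devops-lab-registry
docker ps -f name=devops-lab-registry   # note the published port, e.g. 0.0.0.0:5111

# Tag and push instead of re-importing after every rebuild
docker build -t localhost:<port>/my-backend:latest .
docker push localhost:<port>/my-backend:latest

# In your deployment YAML, reference the image through the registry,
# e.g. image: devops-lab-registry:<port>/my-backend:latest
```

Once the registry exists, a rebuild is just build + push; the cluster pulls the fresh image on the next rollout, and you drop the import step entirely.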
Error 4 - what you see:

```shell
curl http://yourapp.local
# <html>404 Not Found</html>
```

Error 4 - diagnose:

```shell
# Check if the service actually has any endpoints
kubectl get endpoints -n <namespace>
# NAME       ENDPOINTS   AGE
# frontend   <none>      5m    ← this is the problem

# What labels are your pods actually using?
kubectl get pods -n <namespace> --show-labels

# What is the service selecting for?
kubectl describe svc <service-name> -n <namespace>
# Look at the Selector field
```
Error 4 - fix:

```yaml
# In your service:
selector:
  app: my-frontend   # this must match exactly

# In your deployment pod template:
labels:
  app: my-frontend   # must be identical - case, spelling, everything
```

Error 4 - incident report:

```
INCIDENT: Ingress returning 404 - frontend unreachable
ROOT CAUSE: Service selector label didn't match pod labels. Service had zero endpoints - never connected to any pod.
DETECTION: kubectl get endpoints showed <none> for frontend service. kubectl get pods --show-labels revealed the label mismatch.
FIX: Updated service selector to match actual pod labels. Endpoints populated immediately. Ingress routed correctly.
LESSON: Check endpoints first, not the Ingress YAML. Zero endpoints = label selector or port mismatch, not an Ingress problem.
```
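The port side of the same Error 4 check, as a sketch; the names and port numbers here are illustrative, not from the repo:

```yaml
# Service
spec:
  selector:
    app: my-frontend
  ports:
    - port: 80          # what the Ingress talks to
      targetPort: 3000  # must match the containerPort below

# Deployment pod template
spec:
  containers:
    - name: frontend
      ports:
        - containerPort: 3000  # what the app actually listens on
```

If targetPort and containerPort disagree, the service can still show endpoints, but every request to it gets connection refused, so check both the labels and the ports.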
Error 5 - diagnose:

```shell
# Step 1: Is Prometheus even trying to scrape your app?
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Open: http://localhost:9090/targets
# Check the Status column - is your app listed? Is it UP or DOWN?

# Step 2: Does your app actually expose metrics?
kubectl port-forward svc/<your-backend> -n <namespace> 3001:3001
curl http://localhost:3001/metrics
# Should return Prometheus text format. If 404 - the endpoint isn't wired up.

# Step 3: Does the ServiceMonitor exist?
kubectl get servicemonitor -n monitoring
```
Error 5 - fixes:

```javascript
// Fix A: expose a /metrics endpoint in your backend (Node.js example)
const client = require('prom-client');
client.collectDefaultMetrics();

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```

```yaml
# Fix B: ensure ServiceMonitor label selector matches your app
spec:
  selector:
    matchLabels:
      app: my-backend   # must match your pod labels
  namespaceSelector:
    matchNames: [your-namespace]
```
selector: matchLabels: app: my-backend # must match your pod labels namespaceSelector: matchNames: [your-namespace] # Fix B: ensure ServiceMonitor label selector matches your app spec: selector: matchLabels: app: my-backend # must match your pod labels namespaceSelector: matchNames: [your-namespace] INCIDENT: Grafana showing "No data" - all panels blank ROOT CAUSE: Backend had no /metrics endpoint. Prometheus had nothing to scrape. DETECTION: Prometheus targets page showed app as absent. -weight: 500;">curl to /metrics returned 404 - endpoint was never implemented. FIX: Added prom-client middleware to Express app. Exposed /metrics route. Prometheus began scraping within one scrape interval. Grafana populated. LESSON: Grafana "No data" is not a Grafana problem. Walk the chain backwards. Prometheus targets page tells you exactly where the chain breaks. INCIDENT: Grafana showing "No data" - all panels blank ROOT CAUSE: Backend had no /metrics endpoint. Prometheus had nothing to scrape. DETECTION: Prometheus targets page showed app as absent. -weight: 500;">curl to /metrics returned 404 - endpoint was never implemented. FIX: Added prom-client middleware to Express app. Exposed /metrics route. Prometheus began scraping within one scrape interval. Grafana populated. LESSON: Grafana "No data" is not a Grafana problem. Walk the chain backwards. Prometheus targets page tells you exactly where the chain breaks. INCIDENT: Grafana showing "No data" - all panels blank ROOT CAUSE: Backend had no /metrics endpoint. Prometheus had nothing to scrape. DETECTION: Prometheus targets page showed app as absent. -weight: 500;">curl to /metrics returned 404 - endpoint was never implemented. FIX: Added prom-client middleware to Express app. Exposed /metrics route. Prometheus began scraping within one scrape interval. Grafana populated. LESSON: Grafana "No data" is not a Grafana problem. Walk the chain backwards. Prometheus targets page tells you exactly where the chain breaks. 
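Step 2 is easier to call decisively if you check that the response actually parses as Prometheus text format instead of eyeballing it. Here's a small sketch; `is_prom_format` is my own helper name, not a standard tool, and the commented `curl` line assumes the port-forward from Step 2 is already running:

```shell
#!/bin/sh
# Returns 0 if stdin looks like Prometheus text format:
# at least one "# TYPE" declaration and one sample line.
is_prom_format() {
  body=$(cat)
  echo "$body" | grep -q '^# TYPE ' || return 1
  # A sample line: metric name, optional {labels}, then a numeric value
  echo "$body" | grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})? [0-9.eE+-]+' || return 1
  return 0
}

# Usage (with the port-forward active):
# curl -s --max-time 5 http://localhost:3001/metrics | is_prom_format \
#   && echo "looks like Prometheus text format" \
#   || echo "not metrics - check the endpoint wiring"
```

If this fails while the endpoint returns 200, you're probably serving HTML (an SPA catch-all route is a common culprit) rather than metrics.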
```
# Is ssl-redirect enabled on the Ingress?
kubectl get ingress -n <namespace> -o yaml | grep ssl

# Is ArgoCD running with HTTPS enforcement?
kubectl get deployment argocd-server -n argocd -o yaml | grep insecure

# Watch the redirect chain
curl -I http://yourapp.domain.com
# You'll see 301 → 301 → 301 repeating
```

```
# Fix A: disable ssl-redirect on the Ingress
annotations:
  nginx.ingress.kubernetes.io/ssl-redirect: "false"
  nginx.ingress.kubernetes.io/force-ssl-redirect: "false"
```

```
# Fix B: run ArgoCD in insecure mode (behind Cloudflare, it's fine)
# In the argocd-server deployment args:
args: ["--insecure"]
```

INCIDENT: ERR_TOO_MANY_REDIRECTS - app unreachable after adding Cloudflare Tunnel
ROOT CAUSE: Cloudflare terminates HTTPS at the edge and forwards HTTP to the cluster. NGINX Ingress had ssl-redirect enabled, so it redirected HTTP back to HTTPS. Cloudflare re-sent HTTPS, and NGINX redirected again. Infinite loop.
DETECTION: curl -I showed a repeating 301 → 301 chain. kubectl get ingress -o yaml showed ssl-redirect: true. Cloudflare logs showed HTTP being forwarded.
FIX: Set ssl-redirect: false on the Ingress. Cloudflare owns TLS; the cluster runs HTTP internally.
LESSON: Two systems both enforcing HTTPS termination = a redirect war. Decide which layer owns TLS and disable it everywhere else.

- Container keeps crashing: CrashLoopBackOff, OOMKilled, exit codes that tell you exactly what happened
- Your CI/CD pipeline reports success, but the app is down
- Terraform says "no changes," but your infra is out of sync
- Two services can't talk to each other, and you don't know why
- A Prometheus alert fires, and you need to trace it to a root cause

- kind: the lightest, runs on any OS
- MicroK8s: closer to production behavior, best on Ubuntu
- K3s: good inside VMs

- Spot it: something isn't working - say a pod won't start, a deploy failed, or two services can't reach each other. This is the beginning of a lesson.
- Read it: before you Google anything, read what the system is telling you. Logs, events, metrics. The answer is almost always there; your job is to learn how to see it.
- Fix it: solve it from what you observed.
- Document it: write the incident report immediately, while it's fresh.

- Backend crashes on startup → docker logs → look for ECONNREFUSED: a dependency isn't ready yet
- Pod stuck in Pending → kubectl describe pod → Events: "Insufficient memory" or "Insufficient CPU"
- ImagePullBackOff → kubectl describe pod → Events: "repository does not exist": the image was never imported into the cluster
- Ingress returning 404 → kubectl get endpoints -n <namespace> → <none> = label selector mismatch
- Grafana showing "No data" → curl http://localhost:<port>/metrics → 404 = the app never exposed a /metrics endpoint
- ERR_TOO_MANY_REDIRECTS → kubectl get ingress -o yaml | grep ssl → ssl-redirect: true + Cloudflare = redirect loop
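The document-it step only works if the report gets written while the details are fresh, so it's worth removing the friction. A small sketch that scaffolds the five-line template; the `new_incident` name and the timestamped filename are my own convention, not something standard:

```shell
#!/bin/sh
# Scaffold a five-line incident report (INCIDENT / ROOT CAUSE / DETECTION /
# FIX / LESSON) so filling it in takes seconds, not willpower.
new_incident() {
  title="$1"
  file="incident-$(date +%Y%m%d-%H%M%S).md"
  cat > "$file" <<EOF
INCIDENT: $title
ROOT CAUSE:
DETECTION:
FIX:
LESSON:
EOF
  echo "$file"   # print the path so you can open it immediately
}

# Usage:
# new_incident "Grafana showing No data - all panels blank"
# then fill in the remaining four lines before moving on.
```

Committing these files to the same repo as your lab keeps the evidence next to the code that broke.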