# kind - fastest start
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64
chmod +x ./kind && sudo mv ./kind /usr/local/bin/kind
kind create cluster --name devops-lab
kubectl get nodes
# NAME                       STATUS   ROLES           AGE
# devops-lab-control-plane   Ready    control-plane   30s
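If the single-node default feels cramped, kind can also build a multi-node cluster from a config file. A minimal sketch - the node roles follow kind's documented config format; the file name is arbitrary:

```yaml
# kind-config.yaml - one control plane, two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

Create the cluster with `kind create cluster --name devops-lab --config kind-config.yaml`.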
kubectl get pods -A
# Your app's pods should show STATUS: Running
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Access Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Login: admin / prom-operator
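The default admin / prom-operator login comes from the chart's values. To override it and keep the install reproducible, you can pass a values file - a sketch, assuming the chart's `grafana.adminPassword` value:

```yaml
# values.yaml - override the bundled Grafana's default login
grafana:
  adminPassword: "change-me"   # replaces the default prom-operator
```

Apply it with `helm upgrade --install ... -f values.yaml`.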
INCIDENT: [error type] - [service name]
ROOT CAUSE: [one sentence - what actually caused it]
DETECTION: [which command or metric showed you the problem]
FIX: [exactly what you did to resolve it]
LESSON: [what you now know that you didn't before]
Error: connect ECONNREFUSED 127.0.0.1:5432
# What did the backend actually see?
docker logs <your-backend-container>

# Did Postgres finish starting?
docker logs postgres
# Look for: "database system is ready to accept connections"

# Are both containers up?
docker ps
# Quick fix: restart the backend after Postgres is ready
docker restart <your-backend-container>

# Permanent fix: add a healthcheck-based dependency in docker-compose.yml
depends_on:
  postgres:
    condition: service_healthy
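Wired into a compose file, the healthcheck-based dependency looks roughly like this - service names, image tag, and credentials are illustrative:

```yaml
# docker-compose.yml (sketch)
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  backend:
    build: .
    depends_on:
      postgres:
        condition: service_healthy   # wait for the healthcheck, not just process start
```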
INCIDENT: Backend container crashed on startup - ECONNREFUSED
ROOT CAUSE: Backend started before Postgres finished initializing. depends_on controls start order, not service readiness.
DETECTION: docker logs backend showed ECONNREFUSED to port 5432. docker logs postgres confirmed it hadn't finished booting yet.
FIX: Added healthcheck to postgres service. Set depends_on condition to service_healthy. Backend now waits until Postgres is ready.
LESSON: Always use healthchecks for stateful services. This same pattern applies in Kubernetes via readiness probes.
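The Kubernetes equivalent of that lesson is a readiness probe on the database-dependent container - a minimal sketch, where the health path and port are hypothetical:

```yaml
# deployment.yaml (fragment)
containers:
  - name: backend
    image: my-backend:latest
    readinessProbe:
      httpGet:
        path: /healthz     # hypothetical health endpoint in your app
        port: 3001
      initialDelaySeconds: 5
      periodSeconds: 10
```

Until the probe passes, the pod receives no Service traffic - the same "wait until ready" semantics as service_healthy in Compose.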
kubectl get pods -A
# NAME                READY   STATUS    RESTARTS
# backend-7d9f-xk4m   0/1     Pending   0
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section at the bottom:
# "0/1 nodes are available: 1 Insufficient memory."

# What's the cluster actually using right now?
kubectl top nodes

# Do nodes exist at all?
kubectl get nodes -o wide
# Option A: recreate the cluster with more capacity
k3d cluster delete devops-lab
k3d cluster create devops-lab --agents 2

# Option B: lower resource requests in your deployment YAML
resources:
  requests:
    memory: "128Mi"   # reduce from whatever you had
    cpu: "50m"
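For completeness: requests and limits live side by side, and they do different jobs - requests drive scheduling, limits cap runtime usage. A sketch with illustrative numbers:

```yaml
resources:
  requests:          # what the scheduler reserves on a node
    memory: "128Mi"
    cpu: "50m"
  limits:            # what the kubelet enforces at runtime
    memory: "256Mi"  # exceeding this gets the container OOMKilled
    cpu: "250m"
```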
INCIDENT: Pods stuck in Pending - backend and frontend
ROOT CAUSE: Cluster had insufficient memory to schedule pods. Resource requests exceeded available node capacity.
DETECTION: kubectl describe pod showed "0/1 nodes available: Insufficient memory" in Events. kubectl top nodes confirmed node was at capacity.
FIX: Reduced memory requests in deployment YAML. Pods scheduled immediately.
LESSON: Kubernetes won't silently lower your resource requests. Pending forever = scheduler can't fit the pod. Describe it first.
kubectl get pods -n <namespace>
# NAME             READY   STATUS             RESTARTS
# backend-abc123   0/1     ImagePullBackOff   0
kubectl describe pod <pod-name> -n <namespace>
# Events will say:
# "Failed to pull image: rpc error...
# repository does not exist or may require authentication"

# The image IS here:
docker images | grep my-backend

# But NOT here:
k3d image list devops-lab
# Import your local image into the k3d cluster
k3d image import my-backend:latest -c devops-lab

# Also set imagePullPolicy: Never in your deployment YAML
# so Kubernetes doesn't attempt to pull from Docker Hub
spec:
  containers:
    - name: backend
      image: my-backend:latest
      imagePullPolicy: Never
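Since every rebuild needs a re-import, it helps to capture the loop in a small script so the import step is never forgotten - a sketch, assuming the image is named my-backend, the cluster is devops-lab, and the deployment is called backend:

```shell
#!/usr/bin/env sh
# redeploy.sh - rebuild, re-import into k3d, restart the deployment
set -e
NS=your-namespace                                  # adjust to your namespace
docker build -t my-backend:latest .
k3d image import my-backend:latest -c devops-lab
kubectl rollout restart deployment/backend -n "$NS"
```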
INCIDENT: ImagePullBackOff - backend deployment
ROOT CAUSE: Local Docker image not imported into k3d cluster. k3d has its own image context - it can't see Docker daemon images.
DETECTION: kubectl describe pod showed "repository does not exist" in Events. docker images confirmed image existed locally. k3d image list confirmed it was absent from the cluster.
FIX: Ran k3d image import. Set imagePullPolicy: Never in deployment YAML.
LESSON: Build → import → deploy. Every rebuild needs a re-import. Or set up a local registry to automate this.
curl http://yourapp.local
# <html>404 Not Found</html>
# Check if the service actually has any endpoints
kubectl get endpoints -n <namespace>
# NAME       ENDPOINTS   AGE
# frontend   <none>      5m   ← this is the problem

# What labels are your pods actually using?
kubectl get pods -n <namespace> --show-labels

# What is the service selecting for?
kubectl describe svc <service-name> -n <namespace>
# Look at the Selector field
# In your service:
selector:
  app: my-frontend   # this must match exactly

# In your deployment pod template:
labels:
  app: my-frontend   # must be identical - case, spelling, everything
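Seen in full manifests, the label has to line up in three places: the Service's selector, the Deployment's own selector, and the pod template's labels. A sketch with hypothetical names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: my-frontend        # 1. what the Service looks for
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: my-frontend      # 2. what the Deployment manages
  template:
    metadata:
      labels:
        app: my-frontend    # 3. what the pods actually carry
    spec:
      containers:
        - name: frontend
          image: my-frontend:latest
```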
INCIDENT: Ingress returning 404 - frontend unreachable
ROOT CAUSE: Service selector label didn't match pod labels. Service had zero endpoints - never connected to any pod.
DETECTION: kubectl get endpoints showed <none> for frontend service. kubectl get pods --show-labels revealed the label mismatch.
FIX: Updated service selector to match actual pod labels. Endpoints populated immediately. Ingress routed correctly.
LESSON: Check endpoints first, not the Ingress YAML. Zero endpoints = label selector or port mismatch, not an Ingress problem.
# Step 1: Is Prometheus even trying to scrape your app?
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Open: http://localhost:9090/targets
# Check the Status column - is your app listed? Is it UP or DOWN?

# Step 2: Does your app actually expose metrics?
kubectl port-forward svc/<your-backend> -n <namespace> 3001:3001
curl http://localhost:3001/metrics
# Should return Prometheus text format. If 404 - the endpoint isn't wired up.

# Step 3: Does the ServiceMonitor exist?
kubectl get servicemonitor -n monitoring
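Once the target shows UP, a quick sanity query in the Prometheus UI confirms data is flowing - the built-in `up` series, here with a hypothetical job label:

```promql
# 1 = last scrape succeeded, 0 = scrape failed
up{job="my-backend"}
```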
// Fix A: expose a /metrics endpoint in your backend (Node.js example)
const client = require('prom-client');
client.collectDefaultMetrics();

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
# Fix B: ensure the ServiceMonitor's label selector matches your app's Service
spec:
  selector:
    matchLabels:
      app: my-backend   # must match the labels on your Service object
  namespaceSelector:
    matchNames: [your-namespace]
INCIDENT: Grafana showing "No data" - all panels blank
ROOT CAUSE: Backend had no /metrics endpoint. Prometheus had nothing to scrape.
DETECTION: Prometheus targets page showed app as absent. curl to /metrics returned 404 - endpoint was never implemented.
FIX: Added prom-client middleware to Express app. Exposed /metrics route. Prometheus began scraping within one scrape interval. Grafana populated.
LESSON: Grafana "No data" is not a Grafana problem. Walk the chain backwards. Prometheus targets page tells you exactly where the chain breaks.
# Is ssl-redirect enabled on the Ingress?
kubectl get ingress -n <namespace> -o yaml | grep ssl

# Is ArgoCD running with HTTPS enforcement?
kubectl get deployment argocd-server -n argocd -o yaml | grep insecure

# Watch the redirect chain
curl -I http://yourapp.domain.com
# You'll see 301 → 301 → 301 repeating
# Fix A: disable ssl-redirect on the Ingress
annotations:
  nginx.ingress.kubernetes.io/ssl-redirect: "false"
  nginx.ingress.kubernetes.io/force-ssl-redirect: "false"
# Fix B: run ArgoCD in insecure mode (behind Cloudflare, it's fine)
# In argocd-server deployment args:
args: ["--insecure"]
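For context, the tunnel side of this setup is a cloudflared config that forwards the public hostname to the cluster's plain-HTTP entry point - a sketch with placeholder IDs and hostname:

```yaml
# ~/.cloudflared/config.yml (sketch)
tunnel: <tunnel-id>
credentials-file: /root/.cloudflared/<tunnel-id>.json
ingress:
  - hostname: yourapp.domain.com
    service: http://localhost:80   # plain HTTP - TLS terminates at Cloudflare's edge
  - service: http_status:404       # required catch-all rule
```

This is exactly why the cluster must stop redirecting HTTP to HTTPS: everything behind the tunnel is intentionally plain HTTP.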
INCIDENT: ERR_TOO_MANY_REDIRECTS - app unreachable after adding Cloudflare Tunnel
ROOT CAUSE: Cloudflare terminates HTTPS at edge, forwards HTTP to cluster. NGINX Ingress had ssl-redirect enabled - redirected HTTP back to HTTPS. Cloudflare re-sent HTTPS, NGINX redirected again. Infinite loop.
DETECTION: curl -I showed 301 → 301 chain. kubectl get ingress yaml showed ssl-redirect: true. Cloudflare logs showed HTTP being forwarded.
FIX: Set ssl-redirect: false on Ingress. Cloudflare owns TLS. Cluster runs HTTP internally.
LESSON: Two systems enforcing HTTPS termination = redirect war. Decide one layer owns TLS. Disable it everywhere else.
- Container keeps crashing, CrashLoopBackOff, OOMKilled, exit codes that tell you exactly what happened
- Your CI/CD pipeline reports success, but the app is down
- Terraform says "no changes," but your infra is out of sync
- Two services can't talk to each other, and you don't know why
- A Prometheus alert fires, and you need to trace it to a root cause

- kind: lightest, runs on any OS
- MicroK8s: closer to production behavior, best on Ubuntu
- K3s: good inside VMs

- Spot it: something isn't working - say a pod won't start, a deploy failed, or two services can't reach each other. This is the beginning of a lesson.
- Read it: before you Google anything, read what the system is telling you. Logs, events, metrics. The answer is almost always there. Your job is to learn how to see it.
- Fix it: solve it from what you observed.
- Document it: write the incident report immediately, while it's fresh.

- Backend crashes on startup → docker logs → look for ECONNREFUSED → a dependency isn't ready yet
- Pod stuck in Pending → kubectl describe pod → look for Events: "Insufficient memory" or "Insufficient CPU"
- ImagePullBackOff → kubectl describe pod → look for Events: "repository does not exist" → image not imported
- Ingress returning 404 → kubectl get endpoints -n <namespace> → <none> = label selector mismatch
- Grafana showing "No data" → curl http://localhost:<port>/metrics → 404 = app never exposed a /metrics endpoint
- ERR_TOO_MANY_REDIRECTS → kubectl get ingress -o yaml | grep ssl → ssl-redirect: true + Cloudflare = redirect loop