Tools: Kubernetes Troubleshooting - Full Analysis

Contents

1. CrashLoopBackOff

2. ImagePullBackOff or ErrImagePull

3. Pod stuck Pending

4. OOMKilled

5. Service unreachable

6. DNS resolution failing

7. Ingress 502 Bad Gateway

8. PVC stuck Pending

9. Node Not Ready

10. HPA not scaling

How to use this playbook

I've been running K8s troubleshooting workshops for two years. We have a 200-student program at IT Defined where we throw broken clusters at people, and patterns emerged: most failures aren't novel. The same 25-30 failure modes account for 90% of real-world K8s incidents. If you can confidently debug these, you'll handle most production incidents. Here are the 10 most critical scenarios; the full 26 are in the linked post. When you hit a real incident, search this playbook for keywords from the symptom.

1. CrashLoopBackOff

Symptom: Pod restart count climbing.
Likely causes: App crashes on startup (config error, missing env var, can't connect to the DB), liveness probe too aggressive, command/args misconfigured.
Fix: Read the previous container's logs. The reason is usually right there. If the logs are empty, the container died before logging; check the entrypoint, command, and args.

2. ImagePullBackOff or ErrImagePull

Diagnosis: kubectl describe pod; look at the events at the bottom.
Likely causes: Image name typo, image doesn't exist, registry credentials missing, wrong region (ECR is regional), node IAM role can't pull from ECR.
Fix: Run docker pull manually from a workstation. If it works, it's a node permission issue.

3. Pod stuck Pending

Diagnosis: kubectl describe pod. Look for "0/3 nodes available: insufficient cpu" or "didn't match node selector."
Likely causes: Insufficient capacity, resource requests too high, taints/tolerations mismatch, PVC not bound.
Fix: Check kubectl describe nodes for available resources. If the nodes are maxed out, autoscale.

4. OOMKilled

Diagnosis: kubectl describe pod shows "Last State: Terminated, Reason: OOMKilled."
Likely causes: Container exceeded its memory limit, JVM not configured for container limits, memory leak.
Fix: Increase the limit if the workload genuinely needs more. For Java apps, set -XX:MaxRAMPercentage properly.

5. Service unreachable

Likely causes: No endpoints (selector doesn't match pod labels), pod not listening on the expected port, NetworkPolicy blocking traffic.
Fix: 99% of the time it's a label selector mismatch.

6. DNS resolution failing

Diagnosis: kubectl exec into a pod and run nslookup. Check the CoreDNS pods.
Likely causes: CoreDNS pods crashed, NetworkPolicy blocking DNS, /etc/resolv.conf misconfigured.
Fix: Restart CoreDNS if it's misbehaving. On EKS, the defaults are sometimes too low for busy clusters.

7. Ingress 502 Bad Gateway

Likely causes: Backend pod down, target group health check failing, port mismatch, slow startup so the ALB marks the target unhealthy.
Fix: Check target group health in the AWS console. Fix the readiness probe if the pods are unhealthy.

8. PVC stuck Pending

Likely causes: No StorageClass set, EBS CSI driver not installed, missing IAM permissions for the driver.
Fix on EKS: Install the EBS CSI driver as an EKS add-on. The service account needs the right IAM role via IRSA.

9. Node Not Ready

Likely causes: Kubelet crashed, container runtime issue, disk pressure, network plugin failure.
Fix: SSH to the node (or use SSM Session Manager) and check journalctl -u kubelet. Often it's a full disk from log accumulation.

10. HPA not scaling

Likely causes: Metrics-server not installed, HPA targeting CPU but the pod has no CPU requests, max replicas reached.
Fix: Run kubectl get hpa. If <unknown> appears under the metrics column, metrics-server is broken.

Most day-to-day stuff is covered above. If you want to practice these in a safe environment, our K8s troubleshooting labs at IT Defined are exactly this: broken clusters with planted issues that you fix under time pressure. The full 26 scenarios, including ConfigMap updates, Secret rotation, NetworkPolicy issues, PDB blocks, autoscaler problems, kube-proxy/CNI issues, Job failures, IRSA problems, webhook admission controllers, liveness probes, PV cleanup, and cluster upgrades, are on itdefined.org.
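To make scenario 5 (Service unreachable) concrete, here is a minimal sketch of the classic selector/label mismatch. All names and images here are hypothetical, not from the original post; the point is only that the Service selector must match the pod template labels exactly.

```yaml
# Hypothetical example: the Service selects app=web-app, but the pods
# carry app=web, so the Service resolves to zero endpoints and every
# request to it fails.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web-app        # must match the pod template labels exactly
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web        # "web" != "web-app": mismatch, zero endpoints
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 8080
```

With this manifest applied, kubectl get endpoints web would show no addresses; fixing the Service selector to app: web restores traffic.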

Command

$ kubectl describe pod POD_NAME
$ kubectl logs POD_NAME --previous
$ kubectl get endpoints SVC_NAME
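For scenarios 4 and 10 (OOMKilled and HPA not scaling), the fixes meet in the container's resources block. This is a hedged sketch, not taken from the post; the image, name, and exact values are illustrative assumptions.

```yaml
# Hypothetical pod spec: a realistic memory limit plus a JVM told to
# size its heap from the container limit rather than node memory,
# and CPU requests so an HPA targeting CPU utilization has a baseline.
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
    - name: app
      image: eclipse-temurin:21-jre   # illustrative image
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:MaxRAMPercentage=75.0"  # heap = 75% of the 1Gi limit
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"     # without CPU requests, a CPU-based HPA shows <unknown>
        limits:
          memory: "1Gi"   # exceeding this gets the container OOMKilled
```

Raising the memory limit only helps if the workload genuinely needs it; for a leak, the limit just delays the next OOMKill.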