Tools: Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill

A real-world debugging guide: from mysterious pod terminations to discovering a hidden kernel memory leak consuming 55% of a node's RAM.

The Incident

It was a regular morning when we noticed something off. One of our production services, running on an EKS cluster, had been terminated and a new pod had spun up in its place. No deployment had been triggered. No config changes. The pod just... died.

The Grafana dashboard for the old pod told a strange story:

- Memory usage had climbed to 832 MiB, then abruptly dropped to zero
- CPU dropped to zero at the same time
- After a ~45 minute gap, a new pod appeared and started running normally

The new pod was already using 757 MiB of memory and running just fine. So what killed the old one? This is the story of how we debugged it, and what we found.

Step 1: The Obvious Suspect — OOMKill

When a Kubernetes pod dies unexpectedly, the first thing most engineers check is whether it was killed for using too much memory (an OOMKill). We looked at the deployment spec:

resources:
  requests:
    cpu: 100m
    memory: 500Mi
  # No limits set

No memory limit was configured. In Kubernetes, if you don't set a limit, the container can use as much memory as the node has available. So this wasn't a container-level OOMKill: the pod had no ceiling to hit. But wait, if the new pod was happily running at 757 MiB, why would 832 MiB on the old pod be a problem? Something else was going on.

Step 2: Checking Kubernetes Events

We tried to pull events for the terminated pod:

kubectl get events -n live --field-selector involvedObject.name=<pod-name>

Nothing. Kubernetes only retains events for about an hour, and the pod had died over 4 hours ago. The events had expired. But when we checked broader events in the namespace, we found something interesting:

TaintManagerEviction   pod/<new-pod>   Cancelling deletion of Pod

And the node had recent events:

NodeNotReady   node/<node-name>   Node status is now: NodeNotReady
NodeReady      node/<node-name>   Node status is now: NodeReady

The node itself had gone NotReady. When a Kubernetes node stops responding to the API server, all pods on that node get evicted. This explained the pod termination. But why did the node go NotReady?

What Does NodeNotReady Mean?

Every node in a Kubernetes cluster runs a process called the kubelet. The kubelet sends periodic heartbeats to the API server (the control plane) saying "I'm alive and healthy." If the API server doesn't receive a heartbeat within a grace period (default 40 seconds), it marks the node as NotReady and begins evicting pods to reschedule them elsewhere.
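The heartbeat rule can be sketched as a tiny predicate. This is a hypothetical `is_node_ready` helper for illustration, not the real controller-manager code; it only assumes the default 40-second grace period described above.

```python
from datetime import datetime, timedelta

# Default node-monitor grace period described above (illustrative constant,
# not pulled from a real cluster's configuration).
GRACE_PERIOD = timedelta(seconds=40)

def is_node_ready(last_heartbeat: datetime, now: datetime,
                  grace: timedelta = GRACE_PERIOD) -> bool:
    """The API server's side of the check: a node stays Ready only while
    the kubelet's most recent heartbeat falls within the grace period."""
    return (now - last_heartbeat) <= grace

now = datetime(2024, 5, 1, 9, 0, 0)
assert is_node_ready(now - timedelta(seconds=10), now)      # healthy kubelet
assert not is_node_ready(now - timedelta(seconds=75), now)  # missed heartbeats -> NotReady
```

The key point for this incident: the kubelet does not have to crash for a node to go NotReady — it just has to be starved long enough to miss one grace window.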
A node goes NotReady when the kubelet process is too overwhelmed to send these heartbeats, usually due to extreme resource pressure (CPU, memory, or disk).

The node was a t3a.medium instance on AWS. T3/T3a instances are burstable: they don't give you full CPU all the time. We initially suspected that the instance had exhausted its CPU credits and was being throttled, causing the kubelet to miss heartbeats. (Not familiar with AWS burstable instances and CPU credits? Read our deep dive: AWS Burstable Instances Explained: CPU Credits, Throttling, and Why Your t3 Instance Isn't What You Think.)

We checked the credit configuration:

aws ec2 describe-instance-credit-specifications \
  --instance-ids <instance-id>

CpuCredits: unlimited

T3 Unlimited mode was already enabled, meaning the instance could burst beyond its credit balance without throttling (you just pay for the extra usage). We verified with CloudWatch: CPU credits were at 0, but surplus credits were maxed at 576. The instance was not being throttled. CPU credits: ruled out.

Step 4: What's Eating the CPU?

But CloudWatch revealed something alarming: the instance had been running at ~100% CPU utilization for the entire day. We checked the CPU usage of all pods on the node using kubectl top:

service pod:        7m (0.007 CPUs)
weave-scope-agent: 40m
aws-node:          23m
kube-proxy:         7m
ebs-csi:            3m
efs-csi:            5m
──────────────────────────
Total:            ~85m (out of 2000m available)

Pods were barely using any CPU, yet CloudWatch showed 100% at the instance level. The CPU was being consumed by something outside of Kubernetes pods, at the operating system or kernel level.

Step 5: Getting Inside the Node

We needed to look at the node's operating system directly. We used AWS Systems Manager (SSM) to run commands on the instance without SSH:

aws ssm send-command \
  --instance-ids <instance-id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cat /proc/loadavg"]'

34.04 25.03 22.70

A load average of 34 on a 2-CPU machine. That's 17x the capacity. The system was completely overloaded.

What is Load Average?

Load average represents the average number of processes that are either currently running on a CPU or waiting in the queue to run. On a 2-CPU machine, a load average of 2.0 means both CPUs are fully utilized. A load average of 34 means there are 34 processes competing for 2 CPUs, so each process spends most of its time waiting. The three numbers represent the 1-minute, 5-minute, and 15-minute averages.
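Load per CPU is the number that matters, not the raw load figure. A small sketch (hypothetical `load_saturation` helper) reproduces the 17x figure from the numbers above:

```python
def load_saturation(loadavg_1m: float, cpu_count: int) -> float:
    """Load average divided by CPU count: 1.0 means the CPUs are exactly
    busy; anything above means runnable processes are queueing."""
    return loadavg_1m / cpu_count

# The affected node: load 34.04 on 2 vCPUs -> ~17x oversubscribed.
assert round(load_saturation(34.04, 2), 1) == 17.0

# The same node after the fix (load 0.39) is roughly 20% utilized.
assert load_saturation(0.39, 2) < 0.25
```

This is why a load of 34 is alarming on a t3a.medium but would be merely busy on a 64-core machine.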
All three being high meant this had been going on for a long time.

Step 6: Finding the Real Bottleneck with Linux PSI

Linux has a feature called Pressure Stall Information (PSI) that tells you exactly which resource is the bottleneck. We checked /proc/pressure/:

CPU:    some avg10=85.41 avg60=84.62 avg300=82.10
Memory: some avg10=98.98 avg60=98.90 avg300=98.38
        full avg10=62.85 avg60=63.91 avg300=63.33
IO:     some avg10=0.04  avg60=0.16  avg300=0.21

- some = percentage of time at least one process was stalled on this resource
- full = percentage of time ALL processes were stalled

The numbers told a clear story: 99% of the time, some process was waiting for memory, and 63% of the time, ALL processes were completely stalled. This wasn't a CPU problem at all. It was a memory problem that manifested as high CPU usage.

Step 7: Swap Thrashing — The Real Killer

The memory stats confirmed it:

MemTotal:      3,936 MB
MemFree:          86 MB  (only 86 MB free!)
MemAvailable:    735 MB  (after counting reclaimable cache)
SwapTotal:     1,048 MB
SwapFree:        549 MB  (500 MB of swap in use)
Committed_AS:  5,001 MB  (5 GB committed on a 4 GB machine!)

The node had 5 GB of memory committed on a machine with only 4 GB of RAM. The overflow was being handled by swap, a section of the disk used as overflow memory. But disk is orders of magnitude slower than RAM, and when the system constantly moves data between RAM and disk, you get swap thrashing: the CPU spends all its time waiting for disk I/O instead of doing useful work. (Want to understand swap, swap thrashing, and why memory problems cause CPU spikes? Read our explainer: Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You.)

This explained everything: swap thrashing -> kubelet can't send heartbeats -> NodeNotReady -> pod evicted. But where was all the memory going?

Step 8: The Hidden Memory Consumer — Kernel Slab

We checked memory distribution across all pods on the node:

PODS (total):            616 MB
├── main service         381 MB
├── weave-scope-agent     64 MB
├── aws-node (VPC CNI)    39 MB
├── promtail              30 MB
├── ebs-csi-node          26 MB
├── kube-proxy            24 MB
SYSTEM PROCESSES:         58 MB
PAGE CACHE:              831 MB
FREE:                     87 MB

That's only about 1.6 GB accounted for. On a 4 GB node, where was the other 2+ GB?

KERNEL SLAB:           2,194 MB
├── SReclaimable:         50 MB  (can be freed)
├── SUnreclaim:        2,143 MB  (CANNOT be freed!)

2.1 GB of non-reclaimable kernel memory. Over half the node's RAM was consumed by the Linux kernel itself, completely invisible to all Kubernetes monitoring tools. (What is kernel slab memory, and why can't Kubernetes see it? This is covered in detail in: Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You.)

Normal SUnreclaim on a healthy node is 50-200 MB. Our node had 2,143 MB. Something was leaking memory inside the kernel.

Step 9: Inside the Slab — 1.66 Million Leaked Network Packets

We examined /proc/slabinfo to see what was consuming the slab:

SLAB                 OBJECT COUNT   x    SIZE   =    TOTAL
──────────────────────────────────────────────────────────
kmalloc-1k              1,667,384   x  1,024 B  =  1,632 MB
skbuff_head_cache       1,657,980   x    256 B  =    414 MB
──────────────────────────────────────────────────────────
These two alone:                                   2,046 MB

1.66 million skbuff_head_cache entries, each one representing a network packet header in the Linux kernel.
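A PSI line is just key=value pairs, so it is easy to consume in a script. This sketch (hypothetical `parse_psi` helper) parses the memory line shown above:

```python
def parse_psi(line: str) -> dict:
    """Parse one /proc/pressure line, e.g.
    'some avg10=98.98 avg60=98.90 avg300=98.38' -> {'kind': 'some', 'avg10': 98.98, ...}."""
    kind, *fields = line.split()
    values = {k: float(v) for k, v in (f.split("=") for f in fields)}
    values["kind"] = kind
    return values

mem_some = parse_psi("some avg10=98.98 avg60=98.90 avg300=98.38")
assert mem_some["kind"] == "some"
assert mem_some["avg10"] == 98.98

# Rule of thumb from this incident: sustained memory 'some' near 100%
# means the node is effectively livelocked on memory reclaim.
assert mem_some["avg10"] > 90
```

(Real /proc/pressure lines also carry a trailing `total=` counter in microseconds; the parser above handles that field the same way.)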
And 1.67 million kmalloc-1k allocations (the associated packet data). The almost 1:1 ratio confirmed this was a network subsystem memory leak: millions of network packets stuck in kernel memory, never being cleaned up. (What is an skbuff, and how does it relate to network packets? Explained in: Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You.)

Step 10: It's Not Just One Node

Our affected pod ran on a dedicated node. Maybe this was a one-off? We checked two other nodes in the cluster:

                     affected node   large node     another small node
                     (t3a.medium)    (t3a.xlarge)   (t3a.medium)
──────────────────────────────────────────────────────────────────────
Total RAM                3,936 MB      16,207 MB        3,938 MB
Slab (SUnreclaim)        2,143 MB       4,533 MB        1,744 MB
skbuff count            1,667,384      3,309,501       1,310,669
Memory pressure            98.98%          0.00%           0.00%
Load average                32.64           2.11            0.12

Every node had the same leak. The t3a.xlarge node (16 GB) had an even bigger leak at 4.5 GB but survived because it had enough RAM headroom. The other t3a.medium nodes were ticking time bombs.

Step 11: The Culprit — An Abandoned Monitoring Tool

What was common across all nodes and was intercepting network traffic? A network visualization DaemonSet. We had Weave Scope running on every node — a tool that captures and analyzes network traffic to build a real-time map of your infrastructure.

kubectl get daemonsets -n weave

NAME                DESIRED   CURRENT   READY   AGE
weave-scope-agent   16        16        16      2y326d

- Installed 2 years and 326 days ago via raw kubectl apply (no Helm, no GitOps)
- Running weaveworks/scope:1.13.2 — the last version ever released
- Weaveworks, the company behind it, shut down in 2024
- The DaemonSet was running on all 16 nodes, intercepting all network traffic
- Its packet interception was creating socket buffers in kernel space that were never freed

Over weeks and months, these leaked buffers accumulated into the millions, consuming gigabytes of kernel memory on every node.

The Fix

We deleted the entire namespace:

kubectl delete namespace weave

The effect was immediate:

                      BEFORE        AFTER
──────────────────────────────────────────
Slab (SUnreclaim)    2,143 MB       74 MB
MemFree                 87 MB    1,937 MB
MemAvailable           735 MB    2,600 MB
Memory pressure        98.98%      0.00%
Load average            32.64       0.39

When the agent processes were killed, the kernel cleaned up all the orphaned socket buffers. 2 GB of memory was freed instantly. No node restart was even needed.

Lessons Learned

1. Your monitoring tools can be the problem

A monitoring tool designed to give visibility into our infrastructure was silently killing it. Tools that intercept network traffic at the kernel level can cause kernel-level resource leaks that are invisible to standard Kubernetes metrics.

2. Kubernetes metrics have a blind spot

kubectl top and Prometheus container metrics only show userspace memory used by containers. The 2.1 GB of kernel slab memory was completely invisible. We only found it by getting onto the node and checking /proc/meminfo and /proc/slabinfo. If you're running node-exporter, consider alerting on node_memory_SUnreclaim_bytes — it would have caught this early.
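The slab arithmetic is plain multiplication: objects times object size. This sketch reproduces the table's totals from the object counts (the results differ slightly from the table's MB figures, since /proc/slabinfo reports exact per-object sizes that can vary from the nominal cache sizes):

```python
def slab_total_mb(objects: int, obj_size_bytes: int) -> float:
    """Approximate memory held by one slab cache: object count x object size, in MiB."""
    return objects * obj_size_bytes / 2**20

# Figures from the affected node's /proc/slabinfo:
kmalloc_1k = slab_total_mb(1_667_384, 1024)  # leaked packet data allocations
skbuff     = slab_total_mb(1_657_980, 256)   # leaked packet header structs

assert round(kmalloc_1k) == 1628    # ~1.6 GB of kmalloc-1k
assert round(skbuff) == 405         # ~0.4 GB of skbuff_head_cache
assert kmalloc_1k + skbuff > 2000   # together: ~2 GB of unreclaimable slab
```

The near-identical object counts are the tell: every leaked packet holds one header struct plus one ~1 KB data buffer, which is exactly the 1:1 pattern in the table.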
3. Small nodes amplify kernel-level issues

A t3a.medium (4 GB RAM) leaves very little headroom after the kubelet, container runtime, CNI plugins, CSI drivers, DaemonSet pods, and OS overhead. Any kernel-level issue eats directly into the limited space available for your workloads.

4. Audit your DaemonSets regularly

DaemonSets run on every node. A single misbehaving DaemonSet multiplies its impact across your entire infrastructure. Review them periodically:

kubectl get daemonsets --all-namespaces

Ask: Is this still needed? Is it maintained? When was it last updated?

5. Don't run abandoned software

Running unmaintained software in production — especially software that operates at the kernel level — is a risk that's easy to forget about. If the maintainers or company behind a tool have moved on, you should too.

6. High CPU doesn't always mean high computation

Our node showed 100% CPU, but actual computation was negligible. The CPU was spent on memory management, swapping pages in and out of disk. When you see high CPU coupled with high memory usage, check for swap thrashing first.

7. Follow the evidence, not assumptions

Our investigation path: OOMKill? (no) -> CPU credits? (no) -> Node issue? (yes, NodeNotReady) -> What caused it? (memory pressure) -> Where's the memory? (kernel slab) -> What's in the slab? (leaked socket buffers) -> What's leaking? (abandoned DaemonSet). Each wrong hypothesis was eliminated with data, not guesswork.

The most dangerous problems in production aren't the ones that set off alarms. They're the ones that slowly accumulate in places you're not looking.
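If you want to automate the SUnreclaim check, the threshold logic might look like this sketch (hypothetical `sunreclaim_mb` helper and alert threshold; node-exporter exposes the same value as node_memory_SUnreclaim_bytes):

```python
def sunreclaim_mb(meminfo_text: str) -> float:
    """Extract SUnreclaim from /proc/meminfo-style text (values are in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("SUnreclaim:"):
            return int(line.split()[1]) / 1024
    raise ValueError("SUnreclaim not found")

# Sample in the shape of /proc/meminfo; 2,194,432 kB ≈ the leaking node.
sample = """\
MemTotal:        4030464 kB
SUnreclaim:      2194432 kB
"""

# Hypothetical alert threshold: healthy nodes sit at 50-200 MB of SUnreclaim.
THRESHOLD_MB = 500
assert sunreclaim_mb(sample) > THRESHOLD_MB  # this node would have alerted
```

Wired into an alerting pipeline (or expressed as a PromQL threshold on the node-exporter metric), this would have flagged the leak weeks before the node tipped over.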

Debugging Cheatsheet

Kubernetes-level

# Pod events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Node conditions
kubectl describe node <node-name> | grep -A5 Conditions
# All pods on a node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide
# Pod resource usage
kubectl top pods -n <namespace>
# List all DaemonSets
kubectl get daemonsets --all-namespaces

OS-level (via SSM or SSH)

# System pressure — which resource is the bottleneck?
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io
# Memory breakdown — look for SUnreclaim
grep -E "MemTotal|MemFree|MemAvailable|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree" /proc/meminfo
# Top kernel slab consumers
cat /proc/slabinfo | sort -k3 -rn | head -10
# Load average
cat /proc/loadavg
AWS-level

# CPU credit balance (burstable instances)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=<id> \
  --start-time <start> --end-time <end> \
  --period 300 --statistics Average
# Run commands on a node without SSH
aws ssm send-command \
  --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["your-command"]'

Further Reading

- AWS Burstable Instances Explained: CPU Credits, Throttling, and Why Your t3 Instance Isn't What You Think
- Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You