Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You

Your kubectl top says the node has plenty of free memory. The node crashes anyway. Here's what's hiding in the gap.

The Problem With Kubernetes Memory Metrics

When you run kubectl top node, you see something like:

```shell
$ kubectl top node
NAME           CPU   MEMORY
ip-10-2-1-35   45m   616Mi/3936Mi (15%)
```

15% memory usage. Looks healthy, right? But the node is swap thrashing, the load average is 34, and pods are being evicted. How?

Because Kubernetes only shows you userspace memory — the memory your containers are using. It doesn't show you what the Linux kernel is consuming behind the scenes. On the node we were debugging, the kernel was quietly eating 2.1 GB out of 4 GB — and kubectl had no idea.

This post explains the layers of Linux memory that Kubernetes can't see, and how to find them when things go wrong.

How Linux Organizes Memory

When you check /proc/meminfo on a Linux machine, you see dozens of entries. Here's how they fit together on a 4 GB node:

```
Total RAM: 4,096 MB
├── Used by applications (anonymous pages): 617 MB
│   ├── Container processes (what kubectl sees)
│   └── System processes (kubelet, containerd, etc.)
├── Page cache (file-backed pages): 831 MB
│   └── Cached file data (can be reclaimed)
├── Kernel slab: 2,194 MB   ← invisible to Kubernetes
│   ├── SReclaimable: 50 MB (can be freed)
│   └── SUnreclaim: 2,143 MB (cannot be freed!)
├── Kernel stack, page tables, etc.: 60 MB
└── Free: 87 MB
```

Kubernetes metrics cover the first bucket. Everything else is the OS and kernel. Let's break down each layer.

Layer 1: Application Memory (What Kubernetes Shows)

This is the memory your processes actively use — variables, heap allocations, stack frames. In Linux terms, these are anonymous pages (memory not backed by any file on disk).

```shell
# What Kubernetes reports
$ kubectl top pods -n live
NAME                         CPU   MEMORY
nightfort-688ccc5974-p47qs   7m    381Mi
```

This 381 MiB is the Resident Set Size (RSS) of the container's processes — the amount of physical RAM their memory allocations currently occupy.

Why This Number Isn't the Full Picture

RSS only counts memory your process asked for. It doesn't count:

- Memory the kernel allocated on behalf of your process (network buffers, file descriptors)
- Kernel data structures for managing your containers (cgroups, namespaces)
- Shared libraries loaded once but used by multiple containers

That kernel-side memory has awkward properties:

- kubectl top won't show it
- Prometheus container metrics won't show it
- Your pod's memory limit won't be hit by it
- But it still uses physical RAM on the node

Layer 2: Page Cache

The page cache is Linux's way of caching file data in RAM so that repeated reads don't hit the disk.

```
First read of a file:  Disk → RAM (page cache) → Process   [slow]
Second read:           Page cache → Process                [fast]
```

On our node, 831 MB was used for page cache. This sounds like a lot, but page cache is reclaimable — the kernel automatically frees it when applications need more RAM. It's essentially "free memory being used productively."

This is why MemAvailable is often much higher than MemFree:

```
MemFree:       87 MB  (truly unused)
MemAvailable: 735 MB  (free + reclaimable cache)
```

Key insight: if you see low MemFree but healthy MemAvailable, your system is fine — the kernel is just being smart about caching. Panic when MemAvailable is low.

Layer 3: Kernel Slab Memory (The Hidden Consumer)

This is where things get interesting — and where our production incident hid for months.
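
You can spot this hidden layer yourself by subtracting the memory you can account for from the total. A minimal awk sketch, fed the sample node's numbers as a here-doc so it runs anywhere (on a real node, feed it /proc/meminfo instead); note the remainder includes page tables and kernel stacks too, not just slab:

```shell
# Subtract the layers Kubernetes can account for from total RAM;
# what's left is kernel-side memory. Sample values echo this node's.
awk '
  { kb[$1] = $2 }   # e.g. kb["MemTotal:"] = 4194304
  END {
    accounted = kb["MemFree:"] + kb["Cached:"] + kb["AnonPages:"]
    printf "hidden (mostly kernel): %d MB\n", (kb["MemTotal:"] - accounted) / 1024
  }
' <<'EOF'
MemTotal:       4194304 kB
MemFree:          89088 kB
Cached:          850944 kB
AnonPages:       631808 kB
EOF
```

On this node the unaccounted remainder is about 2.5 GB — the gap kubectl top never shows.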

What is the Slab Allocator?

The Linux kernel constantly needs to create and destroy small data structures: file descriptors, inode objects, network packet headers, process descriptors, and hundreds of other internal types. Allocating and freeing these one at a time from the general-purpose memory allocator would be slow. The slab allocator solves this by maintaining pre-allocated pools for each object type. Think of it like a restaurant kitchen with separate prep stations:

```
Instead of:
  "I need an inode"  → malloc(sizeof(inode))        → slow, fragmentation

The kernel does:
  "I need an inode"  → grab one from the inode pool → fast, no fragmentation
  "Done with inode"  → return it to the pool        → ready for reuse
```

Each pool is called a slab cache. You can see all of them in /proc/slabinfo:

```shell
$ sudo cat /proc/slabinfo | sort -k3 -rn | head -10
kmalloc-1k          1,667,384  1024 bytes each  → 1,632 MB
skbuff_head_cache   1,657,980   256 bytes each  →   414 MB
dentry                  9,248   192 bytes each  →   1.7 MB
xfs_inode               9,649  1024 bytes each  →   9.4 MB
```

SReclaimable vs SUnreclaim

Slab memory is split into two categories:

SReclaimable — slab caches that hold cached data the kernel can regenerate. The biggest example is the dentry cache (directory entry cache), which caches filesystem path lookups. If memory is needed, the kernel can shrink these caches.

SUnreclaim — slab caches that hold active data the kernel is currently using: network packet buffers, open file descriptors, active inode structures. These cannot be freed until the code that created them explicitly releases them.

A healthy node looks like this:

```
SReclaimable: 200 MB  (caches, will shrink if needed)
SUnreclaim:   100 MB  (active kernel objects)
```

Our node looked like this:

```
SReclaimable:    50 MB
SUnreclaim:   2,143 MB  ← 21x normal!
```

Why Kubernetes Can't See Slab Memory

Kubernetes resource metrics come from cgroups (control groups), which track memory allocated by processes inside containers. Kernel slab allocations happen outside of any cgroup — they're charged to the kernel, not to any container.

Even if your container triggered the kernel allocation (by sending a network packet, for example), the slab memory shows up as kernel memory, not container memory. The only way to see it is by checking /proc/meminfo or using node-exporter's node_memory_SUnreclaim_bytes metric.

Layer 4: Swap — The Emergency Overflow

What is Swap?

Swap is a section of disk that Linux uses as overflow memory when physical RAM is full:

```
RAM (4 GB)  → fast (nanoseconds)   → expensive
Disk/Swap   → slow (milliseconds)  → cheap
```

When the kernel needs to free up RAM (because something needs more memory and there's nothing reclaimable left), it takes memory pages that haven't been accessed recently and writes them to the swap area on disk. This is called swapping out, or paging out.

A Step-by-Step Example

Stage 1: Everything fits in RAM

```
RAM  [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB]
Swap [empty]
```

All processes' memory is in RAM. Memory access is fast. No problems.

Stage 2: RAM fills up

```
RAM  [App 830MB] [Kubelet 200MB] [Other 800MB] [Cache 700MB] [Slab 2.1GB] [Free 87MB]
Swap [empty]
```

Free memory is nearly gone. The kernel starts shrinking the page cache, but slab (SUnreclaim) can't be freed.

Stage 3: Swap kicks in

```
RAM  [App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB]
Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used
```

The kernel identifies memory pages that haven't been accessed recently and moves them to disk. RAM now has room for active work.

Stage 4: Swap thrashing

This is where things go catastrophically wrong. When a process needs a page that was swapped out:

```
Normal access (page in RAM):
  CPU:    "Give me address 0x1234"
  RAM:    "Here you go"                                → 100 nanoseconds

Swapped access (page on disk):
  CPU:    "Give me address 0x1234"
  RAM:    "Not here — it's on disk"                    → PAGE FAULT
  Kernel: "I need to load it from swap"
  Kernel: "But RAM is full. Swap OUT another page first"
  Disk write: evict some other page to swap            → 1-5 milliseconds
  Disk read:  load the requested page                  → 1-5 milliseconds
  CPU:    "Finally!"                                   → 2-10 ms total (100,000x slower)
```

Now multiply this by dozens of processes, all needing pages that were swapped out:

```
Process A needs a page → it's on disk       → swap in A, swap out B → 5ms
Process B runs → needs its page → swapped out by A! → swap in B, swap out C → 5ms
Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms
Process A runs → needs its page → swapped out by C! → ...
```

This circular eviction is swap thrashing. The system does almost no useful work — all CPU time is spent managing page faults and disk I/O.

Why Swap Thrashing Looks Like a CPU Problem

CloudWatch and top will show 100% CPU utilization during swap thrashing. But the CPU isn't doing computation. Here's the breakdown:

```
Actual computation:       ~5%   (your app, kubelet, etc.)
Kernel swap management:  ~30%   (deciding what to evict, page table updates)
I/O wait:                ~65%   (waiting for disk reads/writes)
────────────────────────────────
Total:                  ~100%
```

The load average also skyrockets, because Linux counts processes in uninterruptible sleep (waiting for disk I/O) in the load average. If 30 processes are all waiting for swap pages, the load average shows 30 — even though very little CPU work is happening.

This is why our node showed a load average of 34 while pods were using only 85m of CPU. The CPUs weren't busy computing — they were busy waiting for the disk.

What is skbuff? (Socket Buffers)

sk_buff (socket buffer) is the data structure at the heart of Linux networking. Every network packet — in or out — is represented by an sk_buff.

Anatomy of a Network Packet in Linux

When your container sends an HTTP request:

```
Application: send("GET /health HTTP/1.1\r\n...")
    ↓
Kernel: allocate an sk_buff
    ├── skbuff_head_cache entry (256 bytes)  — metadata, pointers, protocol info
    └── kmalloc-1k entry (1024 bytes)        — the actual packet data
    ↓
Network stack: add TCP header, IP header, Ethernet header
    ↓
Network driver: transmit the packet
    ↓
Kernel: free the sk_buff   ← THIS is what wasn't happening
```

On a healthy system, sk_buff structures are allocated when a packet is created and freed when the packet is sent, received, or dropped. The slab pool recycles them efficiently.

What a Leak Looks Like

On our node, we found:

```
skbuff_head_cache: 1,657,980 objects   (414 MB)
kmalloc-1k:        1,667,384 objects (1,632 MB)
```

The almost 1:1 ratio between skbuff headers and 1 KB allocations is the signature of a network packet leak. Each packet consists of a header plus a data buffer. 1.66 million packets were stuck in kernel memory, never freed.

At a normal rate of ~1,000 packets/second, 1.66 million packets represents about 28 minutes of traffic that was captured and never released.
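
The arithmetic behind those numbers is easy to sanity-check. A quick shell sketch using the slab counts above (the ~1,000 packets/second rate is an assumption, not a measured figure):

```shell
# Each leaked packet pins one 256-byte sk_buff header plus one 1 KB data buffer.
packets=1657980      # skbuff_head_cache objects from /proc/slabinfo
head_bytes=256       # sk_buff metadata (skbuff_head_cache)
data_bytes=1024      # packet data (kmalloc-1k)
rate=1000            # assumed packets/second of intercepted traffic

echo "pinned: $(( packets * (head_bytes + data_bytes) / 1024 / 1024 )) MB"
echo "backlog: $(( packets / rate / 60 )) minutes of traffic"
```

Integer division lands on 27 minutes here; "~28 minutes" is the same figure rounded up. Either way, roughly 2 GB of unreclaimable kernel memory from packets that should have lived for microseconds.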

Over days and weeks, with the leaking tool constantly intercepting traffic, this accumulated to gigabytes.

How to Investigate Memory Issues on Kubernetes Nodes

Step 1: Check if the problem is even memory

Pressure Stall Information (PSI) tells you whether tasks are actually stalling on memory:

```shell
$ cat /proc/pressure/memory
some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078
full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996
```

How to read it:

- some > 50% → memory pressure exists
- full > 10% → severe memory pressure (all tasks stalling)
- full > 50% → critical — system is barely functional

Step 2: Get the full memory breakdown

```shell
$ grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS" /proc/meminfo
```

What each field means:

```
MemTotal         → Total physical RAM
MemFree          → Completely unused RAM
MemAvailable     → Free + reclaimable (what's actually available)
AnonPages        → Application memory (roughly what kubectl shows)
Cached + Buffers → Page cache (reclaimable, usually harmless)
Slab             → Kernel internal allocations
SReclaimable     → Kernel caches (can be freed)
SUnreclaim       → Active kernel objects (cannot be freed!)
SwapTotal        → Total swap space
SwapFree         → Unused swap (SwapTotal - SwapFree = swap used)
Committed_AS     → Total memory promised to all processes
```

Red flags:

- SUnreclaim > 500 MB on a small node → possible kernel memory leak
- Committed_AS > MemTotal + SwapTotal → the system is overcommitted
- SwapFree much less than SwapTotal → active swapping
- MemAvailable < 10% of MemTotal → trouble ahead

Step 3: If slab is high, find out what's in it

```shell
# Show top slab consumers by object count
$ sudo cat /proc/slabinfo | sort -k3 -rn | head -10
```

Common slab caches to recognize: skbuff_head_cache (network packet metadata), inode_cache and ext4_inode_cache (filesystem inodes), nf_conntrack (connection-tracking entries), dentry (cached path lookups).

Step 4: Check for swap thrashing

```shell
# Load average (should be < number of CPUs)
$ cat /proc/loadavg

# Swap usage
$ grep -E "SwapTotal|SwapFree" /proc/meminfo

# If swap is being actively used, check swap I/O
$ grep -E "pswpin|pswpout" /proc/vmstat
```

- pswpin = pages swapped in from disk (high = thrashing)
- pswpout = pages swapped out to disk (high = thrashing)

Monitoring: What to Alert On

If you're running Prometheus with node-exporter, set up alerts for these metrics:

```yaml
# Alert when non-reclaimable slab memory exceeds 500MB
- alert: HighKernelSlabMemory
  expr: node_memory_SUnreclaim_bytes > 500 * 1024 * 1024
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "High non-reclaimable kernel slab memory on {{ $labels.instance }}"

# Alert when swap usage exceeds 50%
- alert: HighSwapUsage
  expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5
  for: 15m
  labels:
    severity: warning

# Alert when memory pressure is high (PSI)
- alert: MemoryPressureHigh
  expr: rate(node_pressure_memory_stalled_seconds_total[5m]) > 0.5
  for: 5m
  labels:
    severity: critical

# Alert when available memory is critically low
- alert: LowAvailableMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 10m
  labels:
    severity: critical
```

Key Takeaways

- kubectl top only shows container memory. The kernel can consume gigabytes that are invisible to Kubernetes. Always check /proc/meminfo when debugging node-level memory issues.
- High SUnreclaim means something is wrong. Normal is 50-200 MB. If it's in the gigabytes, you have a kernel memory leak — find the leaking slab cache in /proc/slabinfo.
- Swap thrashing masquerades as a CPU problem. If you see high CPU + high load average + swap usage, the CPU isn't busy computing — it's busy waiting for disk I/O from swap.
- Page cache is not a problem. Low MemFree with healthy MemAvailable is normal — the kernel is caching files intelligently. Only worry when MemAvailable drops.
- Network monitoring tools can leak socket buffers. Any tool that intercepts packets at the kernel level (Weave Scope, long-running tcpdump, certain service mesh sidecars) can accumulate sk_buff objects in slab memory over time.
- Monitor node_memory_SUnreclaim_bytes. This is the one metric that would have caught our issue months before it caused an outage.

This post is part of a series on debugging Kubernetes pod terminations. Read the full incident story: Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill
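
The per-cache megabyte figures quoted in this post can be reproduced with a few lines of awk. A sketch fed sample rows in slabinfo's column order (name, active objects, total objects, object size); on a real node you'd pipe in `sudo cat /proc/slabinfo` instead:

```shell
# Multiply object count by object size to rank slab caches by memory use.
# Sample rows use this article's numbers; real slabinfo has extra columns,
# which the awk script simply ignores.
awk '{ printf "%-20s %9d objs %8.1f MB\n", $1, $2, $2 * $4 / 1024 / 1024 }' <<'EOF' | sort -k4 -rn
kmalloc-1k         1667384 1667384 1024
skbuff_head_cache  1657980 1657980  256
dentry                9248    9248  192
xfs_inode             9649    9649 1024
EOF
```

Small differences from the post's quoted figures come down to rounding; the ranking, and the conclusion, are the same: two caches dwarf everything else, and they grow in lockstep.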

Command

Copy

$ -weight: 500;">kubectl top -weight: 500;">kubectl top node NAME CPU MEMORY ip-10-2-1-35 45m 616Mi/3936Mi (15%) NAME CPU MEMORY ip-10-2-1-35 45m 616Mi/3936Mi (15%) NAME CPU MEMORY ip-10-2-1-35 45m 616Mi/3936Mi (15%) /proc/meminfo Total RAM: 4,096 MB ├── Used by applications (Anonymous pages): 617 MB │ ├── Container processes (what -weight: 500;">kubectl sees) │ └── System processes (kubelet, containerd, etc.) ├── Page Cache (file-backed pages): 831 MB │ └── Cached file data (can be reclaimed) ├── Kernel Slab: 2,194 MB ← invisible to k8s │ ├── SReclaimable: 50 MB (can be freed) │ └── SUnreclaim: 2,143 MB (cannot be freed!) ├── Kernel Stack, Page Tables, etc.: 60 MB └── Free: 87 MB Total RAM: 4,096 MB ├── Used by applications (Anonymous pages): 617 MB │ ├── Container processes (what -weight: 500;">kubectl sees) │ └── System processes (kubelet, containerd, etc.) ├── Page Cache (file-backed pages): 831 MB │ └── Cached file data (can be reclaimed) ├── Kernel Slab: 2,194 MB ← invisible to k8s │ ├── SReclaimable: 50 MB (can be freed) │ └── SUnreclaim: 2,143 MB (cannot be freed!) ├── Kernel Stack, Page Tables, etc.: 60 MB └── Free: 87 MB Total RAM: 4,096 MB ├── Used by applications (Anonymous pages): 617 MB │ ├── Container processes (what -weight: 500;">kubectl sees) │ └── System processes (kubelet, containerd, etc.) ├── Page Cache (file-backed pages): 831 MB │ └── Cached file data (can be reclaimed) ├── Kernel Slab: 2,194 MB ← invisible to k8s │ ├── SReclaimable: 50 MB (can be freed) │ └── SUnreclaim: 2,143 MB (cannot be freed!) 
├── Kernel Stack, Page Tables, etc.: 60 MB └── Free: 87 MB # What Kubernetes reports -weight: 500;">kubectl top pods -n live NAME CPU MEMORY nightfort-688ccc5974-p47qs 7m 381Mi # What Kubernetes reports -weight: 500;">kubectl top pods -n live NAME CPU MEMORY nightfort-688ccc5974-p47qs 7m 381Mi # What Kubernetes reports -weight: 500;">kubectl top pods -n live NAME CPU MEMORY nightfort-688ccc5974-p47qs 7m 381Mi First read of a file: Disk → RAM (page cache) → Process [slow] Second read: Page cache → Process [fast] First read of a file: Disk → RAM (page cache) → Process [slow] Second read: Page cache → Process [fast] First read of a file: Disk → RAM (page cache) → Process [slow] Second read: Page cache → Process [fast] MemAvailable MemFree: 87 MB (truly unused) MemAvailable: 735 MB (free + reclaimable cache) MemFree: 87 MB (truly unused) MemAvailable: 735 MB (free + reclaimable cache) MemFree: 87 MB (truly unused) MemAvailable: 735 MB (free + reclaimable cache) MemAvailable MemAvailable Instead of: "I need an inode" → malloc(sizeof(inode)) → slow, fragmentation The kernel does: "I need an inode" → grab one from the inode pool → fast, no fragmentation "Done with inode" → return it to the pool → ready for reuse Instead of: "I need an inode" → malloc(sizeof(inode)) → slow, fragmentation The kernel does: "I need an inode" → grab one from the inode pool → fast, no fragmentation "Done with inode" → return it to the pool → ready for reuse Instead of: "I need an inode" → malloc(sizeof(inode)) → slow, fragmentation The kernel does: "I need an inode" → grab one from the inode pool → fast, no fragmentation "Done with inode" → return it to the pool → ready for reuse /proc/slabinfo cat /proc/slabinfo | sort -k3 -rn | head -10 cat /proc/slabinfo | sort -k3 -rn | head -10 cat /proc/slabinfo | sort -k3 -rn | head -10 kmalloc-1k 1,667,384 1024 bytes each → 1,632 MB skbuff_head_cache 1,657,980 256 bytes each → 414 MB dentry 9,248 192 bytes each → 1.7 MB xfs_inode 9,649 1024 bytes each → 
9.4 MB kmalloc-1k 1,667,384 1024 bytes each → 1,632 MB skbuff_head_cache 1,657,980 256 bytes each → 414 MB dentry 9,248 192 bytes each → 1.7 MB xfs_inode 9,649 1024 bytes each → 9.4 MB kmalloc-1k 1,667,384 1024 bytes each → 1,632 MB skbuff_head_cache 1,657,980 256 bytes each → 414 MB dentry 9,248 192 bytes each → 1.7 MB xfs_inode 9,649 1024 bytes each → 9.4 MB SReclaimable: 200 MB (caches, will shrink if needed) SUnreclaim: 100 MB (active kernel objects) SReclaimable: 200 MB (caches, will shrink if needed) SUnreclaim: 100 MB (active kernel objects) SReclaimable: 200 MB (caches, will shrink if needed) SUnreclaim: 100 MB (active kernel objects) SReclaimable: 50 MB SUnreclaim: 2,143 MB ← 21x normal! SReclaimable: 50 MB SUnreclaim: 2,143 MB ← 21x normal! SReclaimable: 50 MB SUnreclaim: 2,143 MB ← 21x normal! -weight: 500;">kubectl top /proc/meminfo node_memory_SUnreclaim_bytes RAM (4 GB) → Fast (nanoseconds) → Expensive Disk/Swap → Slow (milliseconds) → Cheap RAM (4 GB) → Fast (nanoseconds) → Expensive Disk/Swap → Slow (milliseconds) → Cheap RAM (4 GB) → Fast (nanoseconds) → Expensive Disk/Swap → Slow (milliseconds) → Cheap RAM [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB] Swap [empty] RAM [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB] Swap [empty] RAM [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB] Swap [empty] RAM [App 830MB] [Kubelet 200MB] [Other 800MB] [Cache 700MB] [Slab 2.1GB] [Free 87MB] Swap [empty] RAM [App 830MB] [Kubelet 200MB] [Other 800MB] [Cache 700MB] [Slab 2.1GB] [Free 87MB] Swap [empty] RAM [App 830MB] [Kubelet 200MB] [Other 800MB] [Cache 700MB] [Slab 2.1GB] [Free 87MB] Swap [empty] RAM [App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB] Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used RAM [App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB] Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used RAM 
[App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB] Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used Normal access (page in RAM): CPU: "Give me address 0x1234" RAM: "Here you go" → 100 nanoseconds Swapped access (page on disk): CPU: "Give me address 0x1234" RAM: "Not here — it's on disk" → PAGE FAULT Kernel: "I need to load it from swap" Kernel: "But RAM is full. Let me swap OUT another page first" Disk write: Evict some other page to swap → 1-5 milliseconds Disk read: Load the requested page → 1-5 milliseconds CPU: "Finally!" → 2-10 milliseconds total (100,000x slower) Normal access (page in RAM): CPU: "Give me address 0x1234" RAM: "Here you go" → 100 nanoseconds Swapped access (page on disk): CPU: "Give me address 0x1234" RAM: "Not here — it's on disk" → PAGE FAULT Kernel: "I need to load it from swap" Kernel: "But RAM is full. Let me swap OUT another page first" Disk write: Evict some other page to swap → 1-5 milliseconds Disk read: Load the requested page → 1-5 milliseconds CPU: "Finally!" → 2-10 milliseconds total (100,000x slower) Normal access (page in RAM): CPU: "Give me address 0x1234" RAM: "Here you go" → 100 nanoseconds Swapped access (page on disk): CPU: "Give me address 0x1234" RAM: "Not here — it's on disk" → PAGE FAULT Kernel: "I need to load it from swap" Kernel: "But RAM is full. Let me swap OUT another page first" Disk write: Evict some other page to swap → 1-5 milliseconds Disk read: Load the requested page → 1-5 milliseconds CPU: "Finally!" → 2-10 milliseconds total (100,000x slower) Process A needs a page → it's on disk → swap in A, swap out B → 5ms Process B runs → needs its page → swapped out by A! → swap in B, swap out C → 5ms Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms Process A runs → needs its page → swapped out by C! → ... Process A needs a page → it's on disk → swap in A, swap out B → 5ms Process B runs → needs its page → swapped out by A! 
→ swap in B, swap out C → 5ms Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms Process A runs → needs its page → swapped out by C! → ... Process A needs a page → it's on disk → swap in A, swap out B → 5ms Process B runs → needs its page → swapped out by A! → swap in B, swap out C → 5ms Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms Process A runs → needs its page → swapped out by C! → ... Actual computation: ~5% (your app, kubelet, etc.) Kernel swap management: ~30% (deciding what to evict, page table updates) I/O wait: ~65% (waiting for disk reads/writes) ──────────────────────────────── Total: ~100% Actual computation: ~5% (your app, kubelet, etc.) Kernel swap management: ~30% (deciding what to evict, page table updates) I/O wait: ~65% (waiting for disk reads/writes) ──────────────────────────────── Total: ~100% Actual computation: ~5% (your app, kubelet, etc.) Kernel swap management: ~30% (deciding what to evict, page table updates) I/O wait: ~65% (waiting for disk reads/writes) ──────────────────────────────── Total: ~100% Application: send("GET /health HTTP/1.1\r\n...") ↓ Kernel: allocate an sk_buff ├── skbuff_head_cache entry (256 bytes) — metadata, pointers, protocol info └── kmalloc-1k entry (1024 bytes) — the actual packet data ↓ Network stack: add TCP header, IP header, Ethernet header ↓ Network driver: transmit the packet ↓ Kernel: free the sk_buff ← THIS is what wasn't happening Application: send("GET /health HTTP/1.1\r\n...") ↓ Kernel: allocate an sk_buff ├── skbuff_head_cache entry (256 bytes) — metadata, pointers, protocol info └── kmalloc-1k entry (1024 bytes) — the actual packet data ↓ Network stack: add TCP header, IP header, Ethernet header ↓ Network driver: transmit the packet ↓ Kernel: free the sk_buff ← THIS is what wasn't happening Application: send("GET /health HTTP/1.1\r\n...") ↓ Kernel: allocate an sk_buff ├── skbuff_head_cache entry (256 bytes) — metadata, pointers, 
protocol info └── kmalloc-1k entry (1024 bytes) — the actual packet data ↓ Network stack: add TCP header, IP header, Ethernet header ↓ Network driver: transmit the packet ↓ Kernel: free the sk_buff ← THIS is what wasn't happening skbuff_head_cache: 1,657,980 objects (414 MB) kmalloc-1k: 1,667,384 objects (1,632 MB) skbuff_head_cache: 1,657,980 objects (414 MB) kmalloc-1k: 1,667,384 objects (1,632 MB) skbuff_head_cache: 1,657,980 objects (414 MB) kmalloc-1k: 1,667,384 objects (1,632 MB) cat /proc/pressure/memory cat /proc/pressure/memory cat /proc/pressure/memory some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078 full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996 some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078 full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996 some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078 full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996 grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS" /proc/meminfo grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS" /proc/meminfo grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS" /proc/meminfo MemTotal → Total physical RAM MemFree → Completely unused RAM MemAvailable → Free + reclaimable (what's actually available) AnonPages → Application memory (what -weight: 500;">kubectl roughly shows) Cached + Buffers → Page cache (reclaimable, usually harmless) Slab → Kernel internal allocations SReclaimable → Kernel caches (can be freed) SUnreclaim → Active kernel objects (cannot be freed!) 
SwapTotal → Total swap space SwapFree → Unused swap (SwapTotal - SwapFree = swap used) Committed_AS → Total memory promised to all processes MemTotal → Total physical RAM MemFree → Completely unused RAM MemAvailable → Free + reclaimable (what's actually available) AnonPages → Application memory (what -weight: 500;">kubectl roughly shows) Cached + Buffers → Page cache (reclaimable, usually harmless) Slab → Kernel internal allocations SReclaimable → Kernel caches (can be freed) SUnreclaim → Active kernel objects (cannot be freed!) SwapTotal → Total swap space SwapFree → Unused swap (SwapTotal - SwapFree = swap used) Committed_AS → Total memory promised to all processes MemTotal → Total physical RAM MemFree → Completely unused RAM MemAvailable → Free + reclaimable (what's actually available) AnonPages → Application memory (what -weight: 500;">kubectl roughly shows) Cached + Buffers → Page cache (reclaimable, usually harmless) Slab → Kernel internal allocations SReclaimable → Kernel caches (can be freed) SUnreclaim → Active kernel objects (cannot be freed!) 
SwapTotal → Total swap space SwapFree → Unused swap (SwapTotal - SwapFree = swap used) Committed_AS → Total memory promised to all processes Committed_AS MemTotal + SwapTotal MemAvailable # Show top slab consumers by object count cat /proc/slabinfo | sort -k3 -rn | head -10 # Show top slab consumers by object count cat /proc/slabinfo | sort -k3 -rn | head -10 # Show top slab consumers by object count cat /proc/slabinfo | sort -k3 -rn | head -10 skbuff_head_cache inode_cache ext4_inode_cache nf_conntrack # Load average (should be < number of CPUs) cat /proc/loadavg # Swap usage grep -E "SwapTotal|SwapFree" /proc/meminfo # If swap is being actively used, check swap I/O cat /proc/vmstat | grep -E "pswpin|pswpout" # Load average (should be < number of CPUs) cat /proc/loadavg # Swap usage grep -E "SwapTotal|SwapFree" /proc/meminfo # If swap is being actively used, check swap I/O cat /proc/vmstat | grep -E "pswpin|pswpout" # Load average (should be < number of CPUs) cat /proc/loadavg # Swap usage grep -E "SwapTotal|SwapFree" /proc/meminfo # If swap is being actively used, check swap I/O cat /proc/vmstat | grep -E "pswpin|pswpout" # Alert when non-reclaimable slab memory exceeds 500MB - alert: HighKernelSlabMemory expr: node_memory_SUnreclaim_bytes > 500 * 1024 * 1024 for: 30m labels: severity: warning annotations: summary: "High non-reclaimable kernel slab memory on {{ $labels.instance }}" # Alert when swap usage exceeds 50% - alert: HighSwapUsage expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5 for: 15m labels: severity: warning # Alert when memory pressure is high (PSI) - alert: MemoryPressureHigh expr: node_pressure_memory_stalled_seconds_total rate > 0.5 for: 5m labels: severity: critical # Alert when available memory is critically low - alert: LowAvailableMemory expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 for: 10m labels: severity: critical # Alert when non-reclaimable slab memory exceeds 500MB - alert: 
HighKernelSlabMemory expr: node_memory_SUnreclaim_bytes > 500 * 1024 * 1024 for: 30m labels: severity: warning annotations: summary: "High non-reclaimable kernel slab memory on {{ $labels.instance }}" # Alert when swap usage exceeds 50% - alert: HighSwapUsage expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5 for: 15m labels: severity: warning # Alert when memory pressure is high (PSI) - alert: MemoryPressureHigh expr: node_pressure_memory_stalled_seconds_total rate > 0.5 for: 5m labels: severity: critical # Alert when available memory is critically low - alert: LowAvailableMemory expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 for: 10m labels: severity: critical # Alert when non-reclaimable slab memory exceeds 500MB - alert: HighKernelSlabMemory expr: node_memory_SUnreclaim_bytes > 500 * 1024 * 1024 for: 30m labels: severity: warning annotations: summary: "High non-reclaimable kernel slab memory on {{ $labels.instance }}" # Alert when swap usage exceeds 50% - alert: HighSwapUsage expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5 for: 15m labels: severity: warning # Alert when memory pressure is high (PSI) - alert: MemoryPressureHigh expr: node_pressure_memory_stalled_seconds_total rate > 0.5 for: 5m labels: severity: critical # Alert when available memory is critically low - alert: LowAvailableMemory expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 for: 10m labels: severity: critical -weight: 500;">kubectl top /proc/meminfo /proc/slabinfo MemAvailable MemAvailable node_memory_SUnreclaim_bytes - Memory the kernel allocated on behalf of your process (network buffers, file descriptors) - Kernel data structures for managing your containers (cgroups, namespaces) - Shared libraries loaded once but used by multiple containers - -weight: 500;">kubectl top won't show it - Prometheus container metrics won't show it - Your pod's memory limit won't be hit by it - But it still uses 
physical RAM on the node - some > 50% → memory pressure exists - full > 10% → severe memory pressure (all tasks stalling) - full > 50% → critical — system is barely functional - SUnreclaim > 500 MB on a small node → possible kernel memory leak - Committed_AS > MemTotal + SwapTotal → system is overcommitted - SwapFree much less than SwapTotal → active swapping - MemAvailable < 10% of MemTotal → trouble ahead - pswpin = pages swapped in from disk (high = thrashing) - pswpout = pages swapped out to disk (high = thrashing) - -weight: 500;">kubectl top only shows container memory. The kernel can consume gigabytes that are invisible to Kubernetes. Always check /proc/meminfo when debugging node-level memory issues. - High SUnreclaim means something is wrong. Normal is 50-200 MB. If it's in the gigabytes, you have a kernel memory leak — find the leaking slab cache in /proc/slabinfo. - Swap thrashing masquerades as a CPU problem. If you see high CPU + high load average + swap usage, the CPU isn't busy computing — it's busy waiting for disk I/O from swap. - Page cache is not a problem. Low MemFree with healthy MemAvailable is normal — the kernel is caching files intelligently. Only worry when MemAvailable drops. - Network monitoring tools can leak socket buffers. Any tool that intercepts packets at the kernel level (Weave Scope, long-running tcpdump, certain -weight: 500;">service mesh sidecars) can accumulate sk_buff objects in slab memory over time. - Monitor node_memory_SUnreclaim_bytes. This is the one metric that would have caught our issue months before it caused an outage.