kubectl top
kubectl top node
NAME CPU MEMORY
ip-10-2-1-35 45m 616Mi/3936Mi (15%)
/proc/meminfo
Total RAM: 4,096 MB
├── Used by applications (Anonymous pages): 617 MB
│ ├── Container processes (what kubectl sees)
│ └── System processes (kubelet, containerd, etc.)
├── Page Cache (file-backed pages): 831 MB
│ └── Cached file data (can be reclaimed)
├── Kernel Slab: 2,194 MB ← invisible to k8s
│ ├── SReclaimable: 50 MB (can be freed)
│ └── SUnreclaim: 2,143 MB (cannot be freed!)
├── Kernel Stack, Page Tables, etc.: 60 MB
└── Free: 87 MB
# What Kubernetes reports
kubectl top pods -n live
NAME CPU MEMORY
nightfort-688ccc5974-p47qs 7m 381Mi
First read of a file: Disk → RAM (page cache) → Process [slow]
Second read: Page cache → Process [fast]
MemAvailable
MemFree: 87 MB (truly unused)
MemAvailable: 735 MB (free + reclaimable cache)
Instead of:
  "I need an inode" → malloc(sizeof(inode)) → slow, fragmentation

The kernel does:
  "I need an inode" → grab one from the inode pool → fast, no fragmentation
  "Done with inode" → return it to the pool → ready for reuse
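The pool idea above can be sketched in a few lines. This is a toy user-space model, not the kernel's slab allocator; `ObjectPool` and its methods are hypothetical names chosen for illustration.

```python
class ObjectPool:
    """Toy model of a slab-style cache: fixed-size slots allocated once and reused."""

    def __init__(self, object_size, capacity):
        self.object_size = object_size
        # All slots are created up front, so later allocations cannot fragment memory.
        self._free = [bytearray(object_size) for _ in range(capacity)]
        self.in_use = 0

    def alloc(self):
        if not self._free:
            raise MemoryError("pool exhausted")
        self.in_use += 1
        return self._free.pop()

    def free(self, obj):
        # Returning the slot makes it immediately available for reuse.
        self.in_use -= 1
        self._free.append(obj)


pool = ObjectPool(object_size=1024, capacity=4)
a = pool.alloc()
pool.free(a)
b = pool.alloc()  # the slot that was just returned is handed out again
print(a is b, pool.in_use)
```

If `free` is never called, `in_use` only grows. That is the user-space analogue of a slab cache whose objects are allocated but never returned.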
/proc/slabinfo
cat /proc/slabinfo | sort -k3 -rn | head -10
Cache               Objects     Object size   Total
kmalloc-1k          1,667,384   1024 bytes    1,632 MB
skbuff_head_cache   1,657,980   256 bytes     414 MB
dentry              9,248       192 bytes     1.7 MB
xfs_inode           9,649       1024 bytes    9.4 MB
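Sorting by object count can hide a cache with few but large objects, so it may help to rank caches by approximate bytes instead. The sketch below assumes the standard slabinfo column order (name, active_objs, num_objs, objsize); `slab_sizes` is a hypothetical helper. The sample lines are built from the figures above and truncated to four columns, with active_objs assumed equal to num_objs. Objects times object size ignores per-slab padding, so treat the result as a lower bound that may differ slightly from the totals shown.

```python
def slab_sizes(slabinfo_text):
    """Return (cache name, approximate bytes) pairs, largest first."""
    rows = []
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Skip blank lines and the two header lines of /proc/slabinfo.
        if len(fields) < 4 or fields[0] in ("slabinfo", "#"):
            continue
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        rows.append((name, num_objs * objsize))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative input only: counts taken from the table above, other columns omitted.
sample = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize>
dentry 9248 9248 192
xfs_inode 9649 9649 1024
skbuff_head_cache 1657980 1657980 256
kmalloc-1k 1667384 1667384 1024
"""

top = slab_sizes(sample)
for name, size in top:
    print(f"{name:20s} {size / 2**20:10.1f} MiB")
```

On a live node the same function could be fed the contents of /proc/slabinfo, read as root.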
SReclaimable: 200 MB (caches, will shrink if needed)
SUnreclaim: 100 MB (active kernel objects)
SReclaimable: 50 MB
SUnreclaim: 2,143 MB ← 21x normal!
kubectl top
/proc/meminfo
node_memory_SUnreclaim_bytes
RAM (4 GB) → Fast (nanoseconds) → Expensive
Disk/Swap → Slow (milliseconds) → Cheap
RAM [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB]
Swap [empty]
RAM [App 830MB] [Kubelet 200MB] [Other 800MB] [Cache 700MB] [Slab 2.1GB] [Free 87MB]
Swap [empty]
RAM [App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB]
Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used
Normal access (page in RAM):
  CPU: "Give me address 0x1234"
  RAM: "Here you go" → 100 nanoseconds

Swapped access (page on disk):
  CPU: "Give me address 0x1234"
  RAM: "Not here, it's on disk" → PAGE FAULT
  Kernel: "I need to load it from swap"
  Kernel: "But RAM is full. Let me swap OUT another page first"
  Disk write: Evict some other page to swap → 1-5 milliseconds
  Disk read: Load the requested page → 1-5 milliseconds
  CPU: "Finally!" → 2-10 milliseconds total (20,000x to 100,000x slower)
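The slowdown factor follows directly from the latencies in the walkthrough. A minimal worked calculation, using only those figures (100 ns for a RAM access, 1-5 ms per disk operation, one write plus one read per fault); the names are illustrative:

```python
RAM_ACCESS_NS = 100      # RAM access latency from the walkthrough
DISK_OP_MS = (1, 5)      # best and worst case for a single swap read or write


def swapped_access_ns(disk_op_ms):
    # A fault on a full system costs one write (evict) plus one read (load).
    return 2 * disk_op_ms * 1_000_000


for ms in DISK_OP_MS:
    slowdown = swapped_access_ns(ms) / RAM_ACCESS_NS
    print(f"{ms} ms per disk op -> {slowdown:,.0f}x slower than RAM")
```

The best case works out to 20,000x and the worst case to 100,000x.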
Process A needs a page → it's on disk → swap in A, swap out B → 5ms
Process B runs → needs its page → swapped out by A! → swap in B, swap out C → 5ms
Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms
Process A runs → needs its page → swapped out by C! → ...
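The cycle above can be reproduced with a small simulation. This sketch assumes a plain least-recently-used eviction policy, which is a simplification of what the kernel actually does; `count_faults` is a hypothetical helper. The point is only that a working set one page larger than RAM makes every access a fault.

```python
from collections import OrderedDict


def count_faults(accesses, frames):
    """Count page faults for an access sequence under LRU with `frames` RAM slots."""
    ram = OrderedDict()  # resident pages, ordered from oldest to most recently used
    faults = 0
    for page in accesses:
        if page in ram:
            ram.move_to_end(page)      # hit: mark as most recently used
            continue
        faults += 1                    # miss: the page has to come back from swap
        if len(ram) >= frames:
            ram.popitem(last=False)    # evict the least recently used page
        ram[page] = True
    return faults


cycle = ["A", "B", "C"] * 100          # A, B and C take turns, as in the trace above
print(count_faults(cycle, frames=2))   # working set does not fit: every access faults
print(count_faults(cycle, frames=3))   # working set fits: only the first touches fault
```

With two frames all 300 accesses fault; with three frames only the first 3 do.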
Actual computation: ~5% (your app, kubelet, etc.)
Kernel swap management: ~30% (deciding what to evict, page table updates)
I/O wait: ~65% (waiting for disk reads/writes)
────────────────────────────────
Total: ~100%
Application: send("GET /health HTTP/1.1\r\n...")
    ↓
Kernel: allocate an sk_buff
    ├── skbuff_head_cache entry (256 bytes): metadata, pointers, protocol info
    └── kmalloc-1k entry (1024 bytes): the actual packet data
    ↓
Network stack: add TCP header, IP header, Ethernet header
    ↓
Network driver: transmit the packet
    ↓
Kernel: free the sk_buff ← THIS is what wasn't happening
skbuff_head_cache: 1,657,980 objects (414 MB)
kmalloc-1k: 1,667,384 objects (1,632 MB)
cat /proc/pressure/memory
some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078
full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996
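The PSI output is easy to parse and classify mechanically. The sketch below applies the article's own rules of thumb (some > 50%, full > 10%, full > 50%); applying them to the `avg10` window is an assumption, and `parse_psi` and `classify` are hypothetical helper names.

```python
def parse_psi(text):
    """Parse /proc/pressure/* content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {key: float(value) for key, value in (p.split("=") for p in pairs)}
    return result


def classify(psi):
    # Thresholds are the article's rules of thumb, checked against the 10-second average.
    if psi["full"]["avg10"] > 50:
        return "critical"
    if psi["full"]["avg10"] > 10:
        return "severe"
    if psi["some"]["avg10"] > 50:
        return "pressure"
    return "ok"


# The output shown above, verbatim.
sample = """\
some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078
full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996
"""
print(classify(parse_psi(sample)))
```

For the output above this reports `critical`, since `full avg10` is 62.85.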
grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS" /proc/meminfo
MemTotal → Total physical RAM
MemFree → Completely unused RAM
MemAvailable → Free + reclaimable (what's actually available)
AnonPages → Application memory (what kubectl roughly shows)
Cached + Buffers → Page cache (reclaimable, usually harmless)
Slab → Kernel internal allocations
  SReclaimable → Kernel caches (can be freed)
  SUnreclaim → Active kernel objects (cannot be freed!)
SwapTotal → Total swap space
SwapFree → Unused swap (SwapTotal - SwapFree = swap used)
Committed_AS → Total memory promised to all processes
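These fields can be checked programmatically against the article's red-flag thresholds (SUnreclaim above 500 MB, Committed_AS above MemTotal + SwapTotal, MemAvailable below 10% of MemTotal). A minimal sketch: `parse_meminfo` and `red_flags` are hypothetical helpers, and in the sample the SwapTotal and Committed_AS values are invented for illustration while the other three are the article's figures converted to kB.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def red_flags(m):
    flags = []
    if m["SUnreclaim"] > 500 * 1024:
        flags.append("SUnreclaim > 500 MB")
    if m["Committed_AS"] > m["MemTotal"] + m["SwapTotal"]:
        flags.append("overcommitted")
    if m["MemAvailable"] < 0.10 * m["MemTotal"]:
        flags.append("MemAvailable < 10% of MemTotal")
    return flags


# SwapTotal and Committed_AS are hypothetical; the rest mirror the article's numbers.
sample = """\
MemTotal:        4194304 kB
MemAvailable:     752640 kB
SUnreclaim:      2194432 kB
SwapTotal:       1048576 kB
Committed_AS:    6291456 kB
"""
print(red_flags(parse_meminfo(sample)))
```

With these inputs the slab and overcommit checks fire, while MemAvailable (about 18% of MemTotal) stays above the 10% line.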
Committed_AS
MemTotal + SwapTotal
MemAvailable
# Show top slab consumers by object count (column 3 is num_objs; /proc/slabinfo is readable only by root)
cat /proc/slabinfo | sort -k3 -rn | head -10
skbuff_head_cache
inode_cache
ext4_inode_cache
nf_conntrack
# Load average (should be < number of CPUs)
cat /proc/loadavg

# Swap usage
grep -E "SwapTotal|SwapFree" /proc/meminfo

# If swap is being actively used, check swap I/O
cat /proc/vmstat | grep -E "pswpin|pswpout"
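The `pswpin` and `pswpout` values in /proc/vmstat are cumulative counters since boot, so a single reading says little; what matters is how fast they grow. A minimal sketch of that rate calculation, where `swap_rates` is a hypothetical helper and the two snapshots are made-up values for illustration:

```python
def swap_rates(before, after, interval_s):
    """Pages swapped in/out per second between two /proc/vmstat snapshots."""

    def counters(text):
        found = {}
        for line in text.splitlines():
            key, _, value = line.partition(" ")
            if key in ("pswpin", "pswpout"):
                found[key] = int(value)
        return found

    first, second = counters(before), counters(after)
    return {key: (second[key] - first[key]) / interval_s for key in second}


# Hypothetical snapshots taken 10 seconds apart.
snapshot_1 = "pgfault 123456\npswpin 1000\npswpout 2000\n"
snapshot_2 = "pgfault 130000\npswpin 1500\npswpout 4000\n"
print(swap_rates(snapshot_1, snapshot_2, interval_s=10))
```

Rates that stay near zero mean swap holds cold pages and is not hurting; sustained non-zero rates in both directions are the thrashing signature.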
# Alert when non-reclaimable slab memory exceeds 500MB
- alert: HighKernelSlabMemory
  expr: node_memory_SUnreclaim_bytes > 500 * 1024 * 1024
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "High non-reclaimable kernel slab memory on {{ $labels.instance }}"

# Alert when swap usage exceeds 50%
- alert: HighSwapUsage
  expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5
  for: 15m
  labels:
    severity: warning

# Alert when memory pressure is high (PSI)
- alert: MemoryPressureHigh
  expr: rate(node_pressure_memory_stalled_seconds_total[5m]) > 0.5
  for: 5m
  labels:
    severity: critical

# Alert when available memory is critically low
- alert: LowAvailableMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 10m
  labels:
    severity: critical
kubectl top
/proc/meminfo
/proc/slabinfo
MemAvailable
node_memory_SUnreclaim_bytes

- Memory the kernel allocated on behalf of your process (network buffers, file descriptors)
- Kernel data structures for managing your containers (cgroups, namespaces)
- Shared libraries loaded once but used by multiple containers

- kubectl top won't show it
- Prometheus container metrics won't show it
- Your pod's memory limit won't be hit by it
- But it still uses physical RAM on the node

- some > 50% → memory pressure exists
- full > 10% → severe memory pressure (all tasks stalling)
- full > 50% → critical: system is barely functional

- SUnreclaim > 500 MB on a small node → possible kernel memory leak
- Committed_AS > MemTotal + SwapTotal → system is overcommitted
- SwapFree much less than SwapTotal → active swapping
- MemAvailable < 10% of MemTotal → trouble ahead

- pswpin = pages swapped in from disk (high = thrashing)
- pswpout = pages swapped out to disk (high = thrashing)

- kubectl top only shows container memory. The kernel can consume gigabytes that are invisible to Kubernetes. Always check /proc/meminfo when debugging node-level memory issues.
- High SUnreclaim means something is wrong. Normal is 50-200 MB. If it's in the gigabytes, you have a kernel memory leak — find the leaking slab cache in /proc/slabinfo.
- Swap thrashing masquerades as a CPU problem. If you see high CPU + high load average + swap usage, the CPU isn't busy computing — it's busy waiting for disk I/O from swap.
- Page cache is not a problem. Low MemFree with healthy MemAvailable is normal — the kernel is caching files intelligently. Only worry when MemAvailable drops.
- Network monitoring tools can leak socket buffers. Any tool that intercepts packets at the kernel level (Weave Scope, long-running tcpdump, certain service mesh sidecars) can accumulate sk_buff objects in slab memory over time.
- Monitor node_memory_SUnreclaim_bytes. This is the one metric that would have caught our issue months before it caused an outage.