```
Measured application time: 47ms
Total request time:        84ms
Mystery gap:               37ms (where the hell IS this?)
Traditional profilers:     *crickets*
Team morale:               circling the drain
```
```rust
// Your userspace profiler sees this and goes "yep, looks good!"
fn handle_request() -> Response {
    start_timer();
    let data = load_config();       // Profiler: "2ms, nice!"
    let result = process(data);     // Profiler: "42ms, fine"
    let response = format(result);  // Profiler: "3ms, whatever"
    end_timer();                    // Profiler's final verdict: "47ms total, ship it"
    response
}

// But the actual wall clock time? 84ms.
// Those missing 37ms? In the kernel, baby.
// Where your profiler literally cannot see.
```
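You don't need eBPF to notice that something lives outside userspace. A quick cross-check (a minimal sketch, assuming the handler above and the `libc` crate) is to compare wall-clock time against the user+sys CPU time the kernel reports; whatever your profiler can't account for is kernel or wait time:

```rust
use std::time::Instant;

// user + sys CPU seconds consumed by this process so far
fn cpu_time_secs() -> f64 {
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut ru) };
    let secs = |t: libc::timeval| t.tv_sec as f64 + t.tv_usec as f64 / 1e6;
    secs(ru.ru_utime) + secs(ru.ru_stime)
}

fn main() {
    let (wall0, cpu0) = (Instant::now(), cpu_time_secs());

    handle_request(); // the handler from the snippet above

    let wall = wall0.elapsed().as_secs_f64();
    let cpu = cpu_time_secs() - cpu0;
    println!(
        "wall: {:.1}ms  cpu: {:.1}ms  unaccounted: {:.1}ms",
        wall * 1e3, cpu * 1e3, (wall - cpu) * 1e3
    );
}
```

This tells you a gap exists, not where it is. For that you need to look inside the kernel.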
```rust
// eBPF program (this literally runs in kernel space, wild)
use aya_bpf::{
    helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns},
    macros::{kprobe, kretprobe, map},
    maps::HashMap,
    programs::ProbeContext,
};

// Storage for when syscalls started (per-thread)
#[map]
static mut SYSCALL_START: HashMap<u64, u64> =
    HashMap::with_max_entries(10240, 0); // one slot per thread, should be enough

// Accumulated latency by syscall type
#[map]
static mut SYSCALL_LATENCY: HashMap<u32, u64> =
    HashMap::with_max_entries(512, 0); // 512 syscall types is overkill but whatever

// This fires when ANY syscall starts
#[kprobe]
pub fn syscall_entry(_ctx: ProbeContext) -> u32 {
    // Get current thread ID (this is actually thread+process ID combined)
    let pid_tgid = unsafe { bpf_get_current_pid_tgid() };

    // Grab the timestamp right now, in nanoseconds since boot
    let ts = unsafe { bpf_ktime_get_ns() };

    // Store when this syscall started, we'll need it later
    // (if the insert fails, whatever, we miss one measurement)
    unsafe {
        let _ = SYSCALL_START.insert(&pid_tgid, &ts, 0);
    }
    0 // return 0 means "continue normal execution"
}

// This fires when the syscall returns
#[kretprobe]
pub fn syscall_return(ctx: ProbeContext) -> u32 {
    let pid_tgid = unsafe { bpf_get_current_pid_tgid() };

    // Find when this syscall started
    if let Some(start_ts) = unsafe { SYSCALL_START.get(&pid_tgid) } {
        // Calculate how long the syscall took
        let now = unsafe { bpf_ktime_get_ns() };
        let delta = now - *start_ts; // duration in nanoseconds

        // Figure out which syscall this was (stat, open, read, etc).
        // Caveat: argument registers aren't guaranteed intact in a
        // kretprobe, so a hardened tracer stashes the number at entry.
        let syscall_nr = ctx.arg::<u32>(0).unwrap_or(0); // can't get it? call it syscall 0

        unsafe {
            // Add this duration to the running total for this syscall type
            // (start from 0 the first time we see this syscall; a failed
            // insert just means we drop one data point)
            let current = SYSCALL_LATENCY.get(&syscall_nr).copied().unwrap_or(0);
            let _ = SYSCALL_LATENCY.insert(&syscall_nr, &(current + delta), 0);

            // Clean up the start entry so stale timestamps can't leak
            let _ = SYSCALL_START.remove(&pid_tgid);
        }
    }
    0 // all good, carry on
}
```
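For context on where the `syscall_trace.o` file used below comes from: the eBPF crate is compiled separately for the BPF target. With the aya toolchain that typically looks something like the following; the exact invocation is an assumption here, so follow the aya book for your toolchain version:

```sh
# Build the kernel-side crate for the BPF target (needs nightly + bpf-linker);
# illustrative invocation, not gospel
cargo +nightly build --target bpfel-unknown-none -Z build-std=core --release
```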
```rust
use aya::{maps::HashMap, programs::KProbe, Bpf};
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the compiled eBPF program (we compiled it separately)
    let mut bpf = Bpf::load_file("syscall_trace.o")?;

    // Attach our entry probe to the syscall entry point
    let entry: &mut KProbe = bpf
        .program_mut("syscall_entry") // find our entry function
        .unwrap()
        .try_into()?;
    entry.load()?;                 // load into kernel
    entry.attach("sys_enter", 0)?; // attach at the syscall entry hook

    // Attach our return probe to the syscall exit point
    let ret: &mut KProbe = bpf
        .program_mut("syscall_return") // find our return function
        .unwrap()
        .try_into()?;
    ret.load()?;
    ret.attach("sys_exit", 0)?; // attach at the syscall exit hook

    println!("Tracing syscalls. Press Ctrl-C to exit.");
    println!("(This is running in production btw, very cool)");

    // Get a handle to our latency map
    let latency_map: HashMap<_, u32, u64> =
        HashMap::try_from(bpf.map_mut("SYSCALL_LATENCY").unwrap())?;

    // Print results every second (could be configurable but whatever)
    loop {
        std::thread::sleep(Duration::from_secs(1));

        // Iterate over all syscalls we've seen
        // (the map iterator yields Results, so unwrap each entry with ?)
        for entry in latency_map.iter() {
            let (syscall_nr, total_ns) = entry?;

            // Convert syscall number to human-readable name
            let name = syscall_name(syscall_nr);

            // Convert nanoseconds to milliseconds for readability
            let total_ms = total_ns as f64 / 1_000_000.0;

            // Only print if it took more than 1ms total
            if total_ms > 1.0 {
                println!("{}: {:.2}ms", name, total_ms);
            }
        }
        println!("---"); // separator between time periods
    }
}
```
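The `syscall_name` helper isn't part of aya; it's just a lookup table. A minimal hypothetical version covering the syscalls this story cares about (numbers are the x86_64 ABI; a real version would cover the full table or use a crate):

```rust
// Hypothetical helper: map x86_64 syscall numbers to names
fn syscall_name(nr: u32) -> &'static str {
    match nr {
        0 => "read",
        1 => "write",
        2 => "open",
        3 => "close",
        4 => "stat",
        9 => "mmap",
        202 => "futex",
        _ => "other",
    }
}
```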
```
stat:  34,847.23ms
open:   2,341.45ms
close:  1,847.92ms
mmap:     847.34ms
futex:    234.21ms
---
```
```rust
use libloading::{Library, Symbol};

// This innocent-looking code was killing us
fn load_plugin(name: &str) -> Result<Plugin> {
    // Load a shared library (.so file)
    let lib = unsafe {
        Library::new(format!("./plugins/{}.so", name))?
    };

    // Get the plugin initialization function
    // (THIS is what triggered the stat() storm)
    let init: Symbol<fn() -> Plugin> = unsafe { lib.get(b"plugin_init")? };

    Ok(init()) // initialize and return
}
```
```rust
use once_cell::sync::Lazy;
use std::collections::HashMap;
use std::sync::Mutex;

// Cache plugins after first load (duh)
static PLUGIN_CACHE: Lazy<Mutex<HashMap<String, Plugin>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

fn load_plugin(name: &str) -> Result<Plugin> {
    let mut cache = PLUGIN_CACHE.lock().unwrap();

    // Check cache first (this is what we should have done from the start)
    if let Some(plugin) = cache.get(name) {
        return Ok(plugin.clone()); // boom, done, no syscalls
    }

    // Only load if we haven't seen this plugin before
    let lib = unsafe {
        libloading::Library::new(format!("./plugins/{}.so", name))?
    };

    let init: Symbol<fn() -> Plugin> = unsafe { lib.get(b"plugin_init")? };
    let plugin = init();

    // Keep the library mapped forever: dropping it would unload the .so
    // and leave the cached plugin's function pointers dangling
    std::mem::forget(lib);

    // Store in cache for next time
    cache.insert(name.to_string(), plugin.clone());
    Ok(plugin)
}
```
```rust
// Track TCP retransmissions (packet loss detection)
#[kprobe(function = "tcp_retransmit_skb")]
pub fn trace_tcp_retransmit(ctx: ProbeContext) -> u32 {
    // Get the socket struct pointer
    let sk = ctx.arg::<*const u8>(0).unwrap_or(core::ptr::null());

    // Extract connection details from the socket struct
    // (these offsets are kernel-version specific, be careful)
    let src_port = unsafe { read_sk_field(sk, SPORT_OFFSET) }; // source port
    let dst_port = unsafe { read_sk_field(sk, DPORT_OFFSET) }; // destination port

    // Track how many retransmits per connection
    unsafe {
        RETRANSMIT_COUNT.increment(1); // global counter

        // Per-connection tracking
        let key = (src_port, dst_port);
        let count = CONN_RETRANSMITS.get(&key).copied().unwrap_or(0);
        let _ = CONN_RETRANSMITS.insert(&key, &(count + 1), 0);
        // increment this connection's count
    }
    0
}
```
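This probe leans on pieces defined elsewhere in the program: `RETRANSMIT_COUNT` (a global counter wrapper), `read_sk_field` and the offset constants (the kernel-version-specific bits), and the per-connection map. As a hypothetical sketch, that last one could be declared like this:

```rust
use aya_bpf::{macros::map, maps::HashMap};

// Hypothetical declaration assumed by the retransmit probe:
// key = (source port, destination port), value = retransmit count
#[map]
static mut CONN_RETRANSMITS: HashMap<(u16, u16), u64> =
    HashMap::with_max_entries(4096, 0);
```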
```rust
// Hook into VFS read operations (all filesystem reads go through this)
#[kprobe(function = "vfs_read")]
pub fn trace_read(ctx: ProbeContext) -> u32 {
    let _bytes = ctx.arg::<usize>(2).unwrap_or(0);  // how many bytes being read
    let ts_start = unsafe { bpf_ktime_get_ns() };   // when did this start

    // Store start time for this thread
    unsafe {
        let _ = IO_START.insert(&bpf_get_current_pid_tgid(), &ts_start, 0);
    }
    0
}

// Hook the return from vfs_read
#[kretprobe(function = "vfs_read")]
pub fn trace_read_return(_ctx: ProbeContext) -> u32 {
    let pid_tgid = unsafe { bpf_get_current_pid_tgid() };

    if let Some(start) = unsafe { IO_START.get(&pid_tgid) } {
        // Calculate how long this read took
        let duration = unsafe { bpf_ktime_get_ns() } - *start;

        // Only care about slow reads (>10ms is concerning)
        if duration > 10_000_000 { // 10ms in nanoseconds
            unsafe {
                SLOW_READS.increment(1);

                // Build a histogram of slow read latencies
                let bucket = duration / 1_000_000; // convert to ms for bucketing
                let key = bucket as u32;
                let count = READ_HISTOGRAM.get(&key).copied().unwrap_or(0);
                let _ = READ_HISTOGRAM.insert(&key, &(count + 1), 0);
                // one more read in this latency bucket
            }
        }

        // Done with this entry; remove it so stale timestamps can't leak
        unsafe {
            let _ = IO_START.remove(&pid_tgid);
        }
    }
    0
}
```
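On the userspace side, draining that histogram is the same pattern as the syscall map earlier. A minimal sketch (the function name is ours, and it assumes the loader's `Bpf` handle from before):

```rust
use aya::maps::HashMap;

// Hypothetical helper: print the slow-read latency histogram
fn print_read_histogram(bpf: &mut aya::Bpf) -> Result<(), Box<dyn std::error::Error>> {
    let hist: HashMap<_, u32, u64> =
        HashMap::try_from(bpf.map_mut("READ_HISTOGRAM").unwrap())?;

    for entry in hist.iter() {
        let (bucket_ms, count) = entry?; // bucket key is the latency in ms
        println!("{:>5}ms: {} slow reads", bucket_ms, count);
    }
    Ok(())
}
```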
```rust
// Hook into kernel memory allocation
#[kprobe(function = "__kmalloc")]
pub fn trace_kmalloc(ctx: ProbeContext) -> u32 {
    let size = ctx.arg::<usize>(0).unwrap_or(0);  // allocation size
    let _flags = ctx.arg::<u32>(1).unwrap_or(0);  // allocation flags (GFP_KERNEL, etc)

    // Track large allocations (>1MB is suspicious)
    if size > 1024 * 1024 {
        unsafe {
            // Get the process name that's allocating
            let comm = bpf_get_current_comm().unwrap_or_default();
            let key = (comm, size);

            // Count how many times this process allocates this size
            let count = LARGE_ALLOCS.get(&key).copied().unwrap_or(0);
            let _ = LARGE_ALLOCS.insert(&key, &(count + 1), 0);
        }
    }
    0
}
```
```rust
// eBPF maps are strongly typed at compile time
#[map]
static mut LATENCY: HashMap<u32, u64> = HashMap::with_max_entries(1024, 0);

unsafe {
    // This compiles fine - types match
    LATENCY.insert(&syscall_nr, &duration, 0)?;

    // This won't compile - type mismatch
    // LATENCY.insert(&duration, &syscall_nr, 0)?;
    // The compiler catches this before you even run it
}
```
```rust
// Rust's fixed-size arrays (backed up by the kernel's BPF verifier at
// load time) make buffer overflows hard to even write
let comm: [u8; 16] = unsafe {
    bpf_get_current_comm().unwrap_or_default() // always exactly 16 bytes
};

// Can't overflow - the array size is compile-time known
let mut buf = [0u8; 16];
buf.copy_from_slice(&comm);

// And indexing comm[17] with a constant won't even compile
```
```rust
// This high-level iterator code...
for entry in map.iter() {
    let (key, value) = entry?; // each entry comes back as a Result
    process(key, value);
}
// ...compiles down to plain eBPF map lookups
// No hidden allocations, no overhead
// Just clean, fast code
```
```rust
pub struct SystemTracer {
    bpf: Bpf,                               // the loaded eBPF program
    syscall_map: HashMap<u32, u64>,         // syscall latencies
    network_map: HashMap<(u16, u16), u64>,  // per-connection stats
    io_histogram: HashMap<u32, u64>,        // I/O latency distribution
    // (map handle types simplified for the post)
}

impl SystemTracer {
    pub fn new() -> Result<Self> {
        // Load the compiled eBPF program from disk
        let mut bpf = Bpf::load_file("kernel_tracer.o")?;

        // Attach all our probes (this is the magic part)
        Self::attach_probe(&mut bpf, "syscall_entry", "sys_enter")?;           // syscall entry point
        Self::attach_probe(&mut bpf, "syscall_return", "sys_exit")?;           // syscall exit point
        Self::attach_probe(&mut bpf, "tcp_retransmit", "tcp_retransmit_skb")?; // TCP retransmits
        Self::attach_probe(&mut bpf, "vfs_read", "vfs_read")?;                 // filesystem reads

        // Get handles to all our eBPF maps before moving `bpf` into the struct
        Ok(Self {
            syscall_map: Self::get_map(&bpf, "SYSCALL_LATENCY")?,
            network_map: Self::get_map(&bpf, "CONN_RETRANSMITS")?,
            io_histogram: Self::get_map(&bpf, "READ_HISTOGRAM")?,
            bpf,
        })
    }

    pub fn collect_metrics(&self) -> Result<Metrics> {
        let mut metrics = Metrics::default();

        // Read syscall latencies from kernel
        for entry in self.syscall_map.iter() {
            let (nr, ns) = entry?;
            let name = syscall_name(nr);       // convert number to name
            let ms = ns as f64 / 1_000_000.0;  // nanoseconds to milliseconds
            metrics.syscalls.insert(name, ms);
        }

        // Read network retransmit stats
        for entry in self.network_map.iter() {
            let (_conn, count) = entry?;
            metrics.retransmits += count; // total across all connections
        }

        // Read I/O latency histogram
        for entry in self.io_histogram.iter() {
            let (bucket, count) = entry?;
            metrics.io_latency_histogram.insert(bucket, count);
        }

        Ok(metrics)
    }
}
```

- P99 latency: 84ms → 51ms (39% drop, finally!)
- That mystery gap: gone, just completely gone
- stat() syscalls: 127 → 3 per request (yeah, 127)
- Kernel time: 37ms → 4ms
- CPU utilization: 67% → 41% (servers basically started coasting)

- 127 stat() calls (checking file metadata over and over)
- 43 open()/close() operations (why though?)
- 28 mmap() calls for shared libraries (this seems excessive?)
- 14 futex() calls for locking (okay this one makes sense)
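Counts like these are also easy to sanity-check with boring tools: strace's summary mode tallies calls and time per syscall, at brutal overhead (so a one-off check on a staging box, never production):

```sh
# -c = per-syscall count/time summary, -f = follow threads, -p = attach to PID
strace -c -f -p <PID>
```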
- Trace every syscall (entry and exit, the whole journey)
- Measure kernel time with nanosecond precision (nanoseconds!)
- Capture stack traces (both userspace and kernel)
- Run in production with basically zero overhead (like, 2%)
- Aggregate data in the kernel (so no constant copying to userspace)

- aya — Pure Rust, no C dependencies (this is what we used; a minimal Cargo sketch follows this list)
- libbpf-rs — Safe Rust bindings if you want the C library underneath
- redbpf — Full-featured framework (heard good things)
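For the aya route, projects typically split into a userspace crate and a kernel-side crate. A hypothetical minimal setup (versions and layout are illustrative, check crates.io and the aya book for current practice):

```toml
# Userspace crate - talks to the kernel via the aya loader API
[dependencies]
aya = "0.12"   # illustrative version

# The kernel-side crate lives in its own Cargo.toml, builds for the
# BPF target, and has typically pulled aya-bpf straight from git:
# [dependencies]
# aya-bpf = { git = "https://github.com/aya-rs/aya" }
```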
- Scan all 16 directories in LD_LIBRARY_PATH (looking everywhere)
- Call stat() on every potential location (127 calls per load)
- Open and parse ELF headers (43 files opened)
- Map shared libraries into memory (28 mmap operations)

- stat() calls: 127 → 3 per request (98% reduction, oh my god)
- P99 latency: 84ms → 51ms (39% faster, finally breathing)
- CPU utilization: 67% → 41% (servers basically idling now)
- Throughput: 4,200 → 6,800 req/sec (62% increase, wild)

- Incorrect TCP window size tuning (oops)
- Buffer overflow in high-throughput path (not good)
- Specific to one AWS availability zone (???)

- Log rotation held a file lock (blocking everything)
- New writes blocked readers (cascading delays)
- Everything waiting on this one file

- Incorrect socket buffer size (we set it wrong)
- MTU mismatch causing fragmentation (network config issue)
- Memory pressure triggering GC (cascading effects)

- P99 latency: 51ms
- Throughput: 6,800 req/sec
- CPU utilization: 41%

- P99 latency: 52ms (+2% overhead)
- Throughput: 6,700 req/sec (-1.5% overhead)
- CPU utilization: 42% (+1% overhead)

- Kernel-side aggregation (no data copying per-event; sketched after this list)
- BPF verifier ensures safe, bounded execution (no runaway loops)
- JIT compilation (eBPF programs run as native code)
- Zero sampling overhead (we track 100% of events efficiently)
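A minimal sketch of that kernel-side aggregation idea, under the same aya_bpf assumptions as earlier (the map name and the `delta` value are illustrative):

```rust
use aya_bpf::{macros::map, maps::PerCpuArray};

// One accumulator slot per CPU: probes bump it locally, so there are
// no cross-CPU locks and no per-event copies to userspace. The
// userspace collector reads and sums the per-CPU slots on its own schedule.
#[map]
static mut TOTAL_NS: PerCpuArray<u64> = PerCpuArray::with_max_entries(1, 0);

// Inside a probe body (delta = nanoseconds measured for this event):
// if let Some(total) = unsafe { TOTAL_NS.get_ptr_mut(0) } {
//     unsafe { *total += delta }; // aggregate in kernel space
// }
```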
- Printf debugging: 34% overhead (LOL)
- Userspace profiling: 12% overhead
- DTrace-style tracing: 8% overhead
- eBPF: 2% overhead

- Performance problems that userspace profilers can’t see (like our case)
- Syscall overhead is suspected (stat(), open(), etc.)
- Network stack issues need diagnosis (retransmits, timeouts)
- Kernel-level bottlenecks affecting your app
- Production overhead must stay minimal (<5%)
- You need continuous traces, not samples

- Userspace profilers already show the issue (don’t overcomplicate)
- Problem is clearly application logic (fix your code first)
- Team lacks kernel knowledge (steep learning curve)
- Deployment complexity is a problem (it’s not trivial)
- Kernel version < 4.8 (no eBPF support, sorry)

- Hidden syscall bottlenecks: 34 found, 31 fixed
- Network retransmit issues: 12 identified, 12 resolved
- Filesystem hot spots: 8 discovered, 7 optimized
- Memory allocation waste: 847MB/sec → 52MB/sec

- P50: 32ms → 18ms (44% better, users noticed)
- P99: 84ms → 51ms (39% better, SLA safe now)
- P99.9: 234ms → 87ms (63% better, tail latency crushed)

- Server count: 32 → 24 instances (8 fewer servers!)
- Infrastructure: $47K/month → $35K/month
- Total savings: $144K/year (paid for a lot of engineer time)

- Incident resolution: 71% faster (we can see what’s wrong)
- False hunches: 89% reduction (data beats guesses)
- Team confidence: “Much higher” (actual survey result)

- “Maybe it’s the database?” (narrator: it wasn’t)
- “Could be network latency?” (nope)
- “Probably GC pauses?” (also no)

- “Show me syscall distribution” (here’s the exact data)
- “Where are retransmits occurring?” (this connection, right here)
- “Which files have slow reads?” (this one log file)