Tools: eBPF From Rust: The 34ms Ghost We Couldn’t See (2026)


When your profiler swears everything’s fast but your metrics are screaming bloody murder

Why Everything We Tried Failed (And Why That Makes Sense, Actually)

What eBPF Actually Does (The Thing That Saved Us)

The Code That Found Our 34ms Black Hole

The Userspace Side: Reading What We Found

The Root Cause (And Why We Felt Dumb)

Other Patterns We Found (Because Once You Start Tracing…)

Network Stack Debugging

Filesystem Operations (More Pain)

Memory Allocation Mysteries

The Overhead Question (Because You’re Thinking It)

Why Rust Makes This Better (Type Safety Nerds Unite)

Our Production Setup (The Full Stack)

When to Actually Use This (Decision Time)

The Real Impact (18 Months Later)

eBPF provides kernel-depth observability with negligible overhead: tracing from userspace through syscalls into kernel internals reveals bottlenecks that are invisible to traditional profilers.

So here's the thing that drove us absolutely insane for three weeks straight. Our API had this… ceiling. We'd hit an invisible wall at 84ms P99 latency, and nothing (and I mean nothing) we tried could push through it.

The really maddening part? Every profiler we threw at it said we were doing great. Database queries? Lightning fast. Network calls? Barely a blip. Application code? Optimized to hell and back. But somehow, somewhere, 37 milliseconds were just… vanishing into thin air.

The numbers never changed, and yes, I stared at them until my eyes crossed: 47ms of measured application time inside an 84ms request, a 37ms mystery gap, traditional profilers showing nothing, team morale circling the drain.

Three weeks. THREE WEEKS of this nonsense. And then we deployed an eBPF tracer written in Rust and found the problem in twenty minutes. Twenty. Minutes. Turns out every single request was triggering 127 filesystem stat() calls that we didn't even know existed. Our profilers never showed them, because they were happening in kernel space. Which is, you know, kind of a big blind spot.

So let me show you how we did this. Because honestly? This changed everything about how we debug.

Why Everything We Tried Failed (And Why That Makes Sense, Actually)

Standard profiling tools (perf, flamegraphs, all that good stuff) work in userspace. They see your application code perfectly fine, but the kernel is a total black box to them. And here's what really gets me: that load_config() function looked fast. 2ms, according to every tool. But under the hood it was kicking off a storm of kernel work that no userspace timer would ever attribute to it.

None of it showed up in perf. None of it showed in our flamegraphs. Our Rust profiler? Useless. They all just measured userspace execution time and called it a day. Which, I mean, fair. That's their job. But it's also why we were completely blind to the actual problem.

What eBPF Actually Does (The Thing That Saved Us)

Okay, so eBPF (extended Berkeley Packet Filter, which is a terrible name) lets you run programs inside the kernel without writing kernel modules. That sounds scary, but it's actually quite safe, because the kernel verifies everything before running it. The Rust eBPF ecosystem is pretty mature now; aya, libbpf-rs, and redbpf are the main options. We went with aya because of the type safety, and because honestly? Not dealing with C was worth it.

The Code That Found Our 34ms Black Hole

The eBPF program itself appears in the listings below. It runs in the kernel and tracks every syscall with nanosecond precision. No sampling, no guessing, just tracking every single syscall as it happens. This is the program that changed everything, and the BPF verifier makes sure it can't crash the kernel (which is reassuring when you're running it on production machines).

The Userspace Side: Reading What We Found

The eBPF program runs in the kernel, but we need userspace code to load it and read the results. We ran the loader and… holy shit: 34 seconds of accumulated stat() time per second of operation, with open(), close(), and mmap() trailing far behind. This was it. This was our ghost.

The Root Cause (And Why We Felt Dumb)

So we followed the trace and found… oh god, this is embarrassing. We were loading our dynamically loaded plugins fresh on every request. Seems fine, right? Except every time that ran, the dynamic linker had to walk the whole library search path, stat()-ing candidate files as it went, before it could open, mmap, and relink the library. And we were doing this. On. Every. Request.

The fix was so obvious once we could see the problem: cache the plugin after the first load. After this one-line fix (okay, more than one line, but conceptually), the 37ms gap simply disappeared.

The thing that kills me? Userspace profilers showed load_config taking 2ms. "Looks good!" But it was actually consuming 37ms of kernel time. The eBPF trace made that invisible work suddenly visible.

Other Patterns We Found (Because Once You Start Tracing…)

Network Stack Debugging

You can trace TCP internals too, and we found some wild stuff: 847 TCP retransmits per second on one specific connection. Fixing the TCP window scaling eliminated 99.2% of the retransmits and improved latency by 23ms. Not bad for a day's work.

Filesystem Operations (More Pain)

You can track file I/O with crazy precision. We found that 12% of our reads took over 10ms, all from one specific log file. Changing the logging to non-blocking mode eliminated 100% of the slow reads. Just… gone.

Memory Allocation Mysteries

You can even track kernel memory allocations. We were allocating 847MB/sec in kernel space for network buffers, because our socket buffer sizes didn't line up with the MTU. Tuning the socket buffers to match the MTU reduced kernel allocations by 94%. Our poor servers could finally breathe.

The Overhead Question (Because You're Thinking It)

eBPF bridges userspace and kernel observability: Rust programs deploy kernel tracers and aggregate the results without meaningful performance cost or safety concerns. "Yeah, but doesn't all this tracing slow things down?" Great question. Running eBPF continuously in production, tracing everything, we measured about 2% overhead. That's it. 2% overhead for complete kernel visibility. Compare that with ptrace-based tools like strace, which can slow a syscall-heavy process to a crawl.

Why Rust Makes This Better (Type Safety Nerds Unite)

Okay, so here's why doing this in Rust specifically is awesome: eBPF maps are strongly typed at compile time, buffer sizes are checked before the code ever runs, and zero-cost abstractions mean high-level iterator code compiles down to efficient eBPF map operations. The Rust compiler basically becomes your kernel safety net. It catches mistakes before they hit production, which is kind of a big deal when you're running code in kernel space.

Our Production Setup (The Full Stack)

Here's what we actually deployed: a SystemTracer that attaches every probe at startup and exposes the aggregated metrics. It runs 24/7 in production, just collecting data, aggregating in the kernel, minimal overhead. It's beautiful.

When to Actually Use This (Decision Time)

After 16 months of running this in production, here's my honest take: it's a power tool. Use it when you need a power tool.

The Real Impact (18 Months Later)

Okay, so here's what happened after we deployed this: a steady stream of performance discoveries and latency improvements. The stat() storm, the TCP retransmits, the blocking log reads, the oversized kernel allocations: each one found in minutes rather than weeks.

The Thing Nobody Talks About (But Should)

The most valuable lesson wasn't technical at all. It was cultural. Before eBPF, debugging was guess-driven. We'd spend days chasing hunches, and most of them were wrong. After eBPF, debugging became data-driven. We'd find root causes in minutes, and we'd be right.

The team's confidence transformed. Junior engineers could find kernel-level issues that used to require senior expertise. The eBPF traces just… spoke for themselves. The data was there, clear as day.

I think that's the real lesson. The best debugging tool isn't the one with the most features or the fanciest UI. It's the one that shows you the truth without interpretation. eBPF doesn't guess. It doesn't approximate. It traces every kernel operation with nanosecond precision. The performance bottleneck is in the data, if you know where to look.

Our mysterious 37ms gap? It wasn't mysterious at all. It was hiding in plain sight, in kernel space, where our userspace tools literally couldn't see. eBPF from Rust gave us x-ray vision into the kernel, and what we found completely transformed our performance profile.

Sometimes (and I think this is worth remembering) the bottleneck isn't in your code. It's in the foundation your code runs on. The kernel, the network stack, the filesystem. All the stuff you usually take for granted. Until you can't anymore.

Enjoyed the read? Let's stay connected! 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories. ❤️


```text
Measured application time: 47ms
Total request time:        84ms
Mystery gap:               37ms  (where the hell IS this?)
Traditional profilers:     *crickets*
Team morale:               circling the drain
```

```rust
// Your userspace profiler sees this and goes "yep, looks good!"
fn handle_request() -> Response {
    start_timer();
    let data = load_config();      // Profiler: "2ms, nice!"
    let result = process(data);    // Profiler: "42ms, fine"
    let response = format(result); // Profiler: "3ms, whatever"
    end_timer();                   // Profiler's final verdict: "47ms total, ship it"
    response
}
// But the actual wall clock time? 84ms.
// Those missing 37ms? In the kernel, baby.
// Where your profiler literally cannot see.
```
```rust
// eBPF program (this literally runs in kernel space, wild)
use aya_bpf::{
    helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns},
    macros::{kprobe, kretprobe, map},
    maps::HashMap,
    programs::ProbeContext,
};

// Storage for when syscalls started (per-thread)
#[map]
static mut SYSCALL_START: HashMap<u64, u64> =
    HashMap::with_max_entries(10240, 0); // one slot per thread, should be enough

// Accumulated latency by syscall type
#[map]
static mut SYSCALL_LATENCY: HashMap<u32, u64> =
    HashMap::with_max_entries(512, 0); // 512 syscall types is overkill but whatever

// This fires when ANY syscall starts
#[kprobe]
pub fn syscall_entry(_ctx: ProbeContext) -> u32 {
    // Get current thread ID (this is actually thread+process ID combined)
    let pid_tgid = unsafe { bpf_get_current_pid_tgid() };
    // Grab the timestamp right now, in nanoseconds since boot
    let ts = unsafe { bpf_ktime_get_ns() };

    // Store when this syscall started, we'll need it later
    unsafe {
        SYSCALL_START
            .insert(&pid_tgid, &ts, 0)
            .unwrap_or(()); // if insert fails, whatever, we'll miss one measurement
    }
    0 // return 0 means "continue normal execution"
}

// This fires when the syscall returns
#[kretprobe]
pub fn syscall_return(ctx: ProbeContext) -> u32 {
    let pid_tgid = unsafe { bpf_get_current_pid_tgid() };

    // Find when this syscall started
    if let Some(start_ts) = unsafe { SYSCALL_START.get(&pid_tgid) } {
        // Calculate how long the syscall took
        let now = unsafe { bpf_ktime_get_ns() };
        let delta = now - *start_ts; // duration in nanoseconds

        // Figure out which syscall this was (stat, open, read, etc)
        let syscall_nr = ctx.arg::<u32>(0).unwrap_or(0); // if we can't get it, call it syscall 0

        // Add this duration to the running total for this syscall type
        unsafe {
            let current = SYSCALL_LATENCY
                .get(&syscall_nr)
                .copied()
                .unwrap_or(0); // start from 0 if first time seeing this syscall
            SYSCALL_LATENCY
                .insert(&syscall_nr, &(current + delta), 0) // accumulate the time
                .unwrap_or(()); // again, if it fails, we just miss one data point
            // Clean up the start entry so the map doesn't fill with stale threads
            SYSCALL_START.remove(&pid_tgid).unwrap_or(());
        }
    }
    0 // all good, carry on
}
```
```rust
use aya::{maps::HashMap, programs::KProbe, Bpf};
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the compiled eBPF program (we compiled it separately)
    let mut bpf = Bpf::load_file("syscall_trace.o")?;

    // Attach our entry probe to the syscall entry point
    let entry: &mut KProbe = bpf
        .program_mut("syscall_entry") // find our entry function
        .unwrap()
        .try_into()?;
    entry.load()?;                 // load into kernel
    entry.attach("sys_enter", 0)?; // attach to syscall entry tracepoint

    // Attach our return probe to the syscall exit point
    let ret: &mut KProbe = bpf
        .program_mut("syscall_return") // find our return function
        .unwrap()
        .try_into()?;
    ret.load()?;
    ret.attach("sys_exit", 0)?; // attach to syscall exit tracepoint

    println!("Tracing syscalls. Press Ctrl-C to exit.");
    println!("(This is running in production btw, very cool)");

    // Get a handle to our latency map
    let latency_map: HashMap<_, u32, u64> =
        HashMap::try_from(bpf.map_mut("SYSCALL_LATENCY").unwrap())?;

    // Print results every second (could be configurable but whatever)
    loop {
        std::thread::sleep(Duration::from_secs(1));

        // Iterate over all syscalls we've seen
        for entry in latency_map.iter() {
            let (syscall_nr, total_ns) = entry?; // map iteration can fail, bubble it up

            // Convert syscall number to human-readable name
            let name = syscall_name(syscall_nr);
            // Convert nanoseconds to milliseconds for readability
            let total_ms = total_ns as f64 / 1_000_000.0;

            // Only print if it took more than 1ms total
            if total_ms > 1.0 {
                println!("{}: {:.2}ms", name, total_ms);
            }
        }
        println!("---"); // separator between time periods
    }
}
```

```text
stat:  34,847.23ms
open:   2,341.45ms
close:  1,847.92ms
mmap:     847.34ms
futex:    234.21ms
---
```

```rust
// This innocent-looking code was killing us
fn load_plugin(name: &str) -> Result<Plugin> {
    // Load a shared library (.so file)
    let lib = unsafe {
        libloading::Library::new(format!("./plugins/{}.so", name))?
    };

    // Get the plugin initialization function
    // (THIS is what triggered the stat() storm)
    let init: Symbol<fn() -> Plugin> = unsafe { lib.get(b"plugin_init")? };

    Ok(init()) // initialize and return
}
```
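The loader above calls a syscall_name() helper that never appears in the post. Here's a minimal sketch of what it might look like; the numbers are the x86_64 syscall numbers for the handful of calls the trace output mentions (on other architectures the numbers differ), and everything else falls through to "unknown":

```rust
// Minimal syscall-number-to-name table (x86_64 numbers; extend as needed).
// Only the syscalls that showed up in the trace output are mapped here.
fn syscall_name(nr: u32) -> &'static str {
    match nr {
        0 => "read",
        1 => "write",
        2 => "open",
        3 => "close",
        4 => "stat",
        9 => "mmap",
        202 => "futex",
        257 => "openat",
        _ => "unknown", // a real table has several hundred entries
    }
}

fn main() {
    assert_eq!(syscall_name(4), "stat");
    assert_eq!(syscall_name(202), "futex");
    assert_eq!(syscall_name(9999), "unknown");
}
```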
```rust
use once_cell::sync::Lazy;
use std::collections::HashMap;
use std::sync::Mutex;

// Cache plugins after first load (duh)
static PLUGIN_CACHE: Lazy<Mutex<HashMap<String, Plugin>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

fn load_plugin(name: &str) -> Result<Plugin> {
    let mut cache = PLUGIN_CACHE.lock().unwrap();

    // Check cache first (this is what we should have done from the start)
    if let Some(plugin) = cache.get(name) {
        return Ok(plugin.clone()); // boom, done, no syscalls
    }

    // Only load if we haven't seen this plugin before
    let lib = unsafe {
        libloading::Library::new(format!("./plugins/{}.so", name))?
    };
    let init: Symbol<fn() -> Plugin> = unsafe { lib.get(b"plugin_init")? };
    let plugin = init();

    // Store in cache for next time
    cache.insert(name.to_string(), plugin.clone());
    Ok(plugin)
}
```

```rust
// Track TCP retransmissions (packet loss detection)
#[kprobe(function = "tcp_retransmit_skb")]
pub fn trace_tcp_retransmit(ctx: ProbeContext) -> u32 {
    // Get the socket struct pointer
    let sk = ctx.arg::<*const u8>(0).unwrap_or(null());

    // Extract connection details from the socket struct
    // (these offsets are kernel-version specific, be careful)
    let src_port = unsafe { read_sk_field(sk, SPORT_OFFSET) }; // source port
    let dst_port = unsafe { read_sk_field(sk, DPORT_OFFSET) }; // destination port

    // Track how many retransmits per connection
    unsafe {
        RETRANSMIT_COUNT.increment(1); // global counter

        // Per-connection tracking
        let key = (src_port, dst_port);
        let count = CONN_RETRANSMITS.get(&key).copied().unwrap_or(0);
        CONN_RETRANSMITS
            .insert(&key, &(count + 1), 0) // increment this connection's count
            .unwrap_or(());
    }
    0
}
```
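The post says one specific connection stood out in the retransmit data but doesn't show how it was spotted. A sketch of the userspace side, with a plain std HashMap standing in for a snapshot of the CONN_RETRANSMITS eBPF map (the function name and the threshold of 100 are assumptions, not the author's code):

```rust
use std::collections::HashMap;

// Find connections whose retransmit count exceeds a threshold.
// `counts` stands in for a snapshot of the CONN_RETRANSMITS eBPF map:
// (src_port, dst_port) -> retransmits seen this interval.
fn retransmit_offenders(
    counts: &HashMap<(u16, u16), u64>,
    threshold: u64,
) -> Vec<((u16, u16), u64)> {
    let mut offenders: Vec<_> = counts
        .iter()
        .filter(|(_, &n)| n > threshold)
        .map(|(&conn, &n)| (conn, n))
        .collect();
    // Worst connection first
    offenders.sort_by(|a, b| b.1.cmp(&a.1));
    offenders
}

fn main() {
    let mut counts = HashMap::new();
    counts.insert((443u16, 51724u16), 847u64); // the kind of outlier we saw
    counts.insert((443u16, 51725u16), 3u64);
    counts.insert((80u16, 41000u16), 12u64);

    let offenders = retransmit_offenders(&counts, 100);
    assert_eq!(offenders, vec![((443, 51724), 847)]);
}
```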
```rust
// Hook into VFS read operations (all filesystem reads go through this)
#[kprobe(function = "vfs_read")]
pub fn trace_read(ctx: ProbeContext) -> u32 {
    let bytes = ctx.arg::<usize>(2).unwrap_or(0); // how many bytes being read
    let ts_start = unsafe { bpf_ktime_get_ns() }; // when did this start

    // Store start time for this thread
    unsafe {
        IO_START
            .insert(&bpf_get_current_pid_tgid(), &ts_start, 0)
            .unwrap_or(());
    }
    0
}

// Hook the return from vfs_read
#[kretprobe(function = "vfs_read")]
pub fn trace_read_return(_ctx: ProbeContext) -> u32 {
    let pid_tgid = unsafe { bpf_get_current_pid_tgid() };

    if let Some(start) = unsafe { IO_START.get(&pid_tgid) } {
        // Calculate how long this read took
        let duration = unsafe { bpf_ktime_get_ns() } - *start;

        // Only care about slow reads (>10ms is concerning)
        if duration > 10_000_000 { // 10ms in nanoseconds
            unsafe {
                SLOW_READS.increment(1);

                // Build a histogram of slow read latencies
                let bucket = duration / 1_000_000; // convert to ms for bucketing
                let key = bucket as u32;
                let count = READ_HISTOGRAM.get(&key).copied().unwrap_or(0);
                READ_HISTOGRAM
                    .insert(&key, &(count + 1), 0) // one more read in this latency bucket
                    .unwrap_or(());
            }
        }
    }
    0
}
```
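The post says the logging was switched to "non-blocking mode" without showing the change. One common pattern (an assumption here, not necessarily the author's actual fix) is to hand log lines to a dedicated writer thread over a channel, so request threads never touch the filesystem on the hot path:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::mpsc;
use std::thread;

// A logger whose callers never block on disk I/O: they push a line onto an
// in-memory channel, and a background thread does all the writing.
struct NonBlockingLogger {
    tx: mpsc::Sender<String>,
    handle: thread::JoinHandle<()>,
}

impl NonBlockingLogger {
    fn new(path: &str) -> std::io::Result<Self> {
        let file = File::create(path)?;
        let (tx, rx) = mpsc::channel::<String>();
        // Writer thread: drains the channel and owns all filesystem access
        let handle = thread::spawn(move || {
            let mut out = BufWriter::new(file);
            for line in rx {
                let _ = writeln!(out, "{}", line);
            }
            let _ = out.flush(); // channel closed: flush whatever is left
        });
        Ok(Self { tx, handle })
    }

    // Cheap for the caller: an in-memory send, no syscalls on the hot path
    fn log(&self, line: &str) {
        let _ = self.tx.send(line.to_string());
    }

    fn shutdown(self) {
        drop(self.tx);              // close the channel...
        let _ = self.handle.join(); // ...and wait for the writer to drain it
    }
}

fn main() -> std::io::Result<()> {
    // Hypothetical path, just for the demo
    let logger = NonBlockingLogger::new("/tmp/app.log")?;
    logger.log("request handled in 47ms");
    logger.log("request handled in 46ms");
    logger.shutdown();

    let contents = std::fs::read_to_string("/tmp/app.log")?;
    assert_eq!(contents.lines().count(), 2);
    Ok(())
}
```

The trade-off is durability: lines sitting in the channel are lost if the process dies before the writer drains them, which is usually acceptable for request logs.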
```rust
// Hook into kernel memory allocation
#[kprobe(function = "__kmalloc")]
pub fn trace_kmalloc(ctx: ProbeContext) -> u32 {
    let size = ctx.arg::<usize>(0).unwrap_or(0); // allocation size
    let flags = ctx.arg::<u32>(1).unwrap_or(0);  // allocation flags (GFP_KERNEL, etc)

    // Track large allocations (>1MB is suspicious)
    if size > 1024 * 1024 {
        unsafe {
            // Get the process name that's allocating
            let comm = bpf_get_current_comm();
            let key = (comm, size);

            // Count how many times this process allocates this size
            let count = LARGE_ALLOCS.get(&key).copied().unwrap_or(0);
            LARGE_ALLOCS.insert(&key, &(count + 1), 0).unwrap_or(());
        }
    }
    0
}
```
```rust
// eBPF maps are strongly typed at compile time
static mut LATENCY: HashMap<u32, u64> =
    HashMap::with_max_entries(1024, 0);

unsafe {
    // This compiles fine - types match
    LATENCY.insert(&syscall_nr, &duration, 0)?;

    // This won't compile - type mismatch
    // LATENCY.insert(&duration, &syscall_nr, 0)?;
    // The compiler catches this before you even run it
}
```

```rust
// The aya verifier ensures no buffer overflows at compile time
let comm: [u8; 16] = unsafe {
    bpf_get_current_comm() // always returns exactly 16 bytes
};

// Can't overflow - the array size is known at compile time
let mut buf = [0u8; 16];
buf.copy_from_slice(&comm); // lengths must match exactly

// If you try to access comm[17], compiler error
```

```rust
// This high-level iterator code...
for entry in map.iter() {
    let (key, value) = entry?;
    process(key, value);
}
// ...compiles to efficient eBPF map iteration
// No hidden allocations, no overhead
// Just clean, fast code
```
for (key, value) in map.iter() { process(key?, value?); } // ...compiles to efficient eBPF map iteration // No hidden allocations, no overhead // Just clean, fast code pub struct SystemTracer { bpf: Bpf, // the loaded eBPF program syscall_map: HashMap<u32, u64>, // syscall latencies network_map: HashMap<(u16, u16), u64>, // per-connection stats io_histogram: HashMap<u32, u64>, // I/O latency distribution } impl SystemTracer { pub fn new() -> Result<Self> { // Load the compiled eBPF program from disk let mut bpf = Bpf::load_file( "kernel_tracer.o" )?; // Attach all our probes (this is the magic part) Self::attach_probe(&mut bpf, "syscall_entry", "sys_enter")?; // syscall entry point Self::attach_probe(&mut bpf, "syscall_return", "sys_exit")?; // syscall exit point Self::attach_probe(&mut bpf, "tcp_retransmit", "tcp_retransmit_skb")?; // TCP retransmits Self::attach_probe(&mut bpf, "vfs_read", "vfs_read")?; // filesystem reads Ok(Self { bpf, // Get handles to all our eBPF maps syscall_map: Self::get_map(&bpf, "SYSCALL_LATENCY")?, network_map: Self::get_map(&bpf, "CONN_RETRANSMITS")?, io_histogram: Self::get_map(&bpf, "READ_HISTOGRAM")?, }) } pub fn collect_metrics(&self) -> Metrics { let mut metrics = Metrics::default(); // Read syscall latencies from kernel for (nr, ns) in self.syscall_map.iter() { let name = syscall_name(nr?); // convert number to name let ms = ns? 
as f64 / 1_000_000.0; // nanoseconds to milliseconds metrics.syscalls.insert(name, ms); } // Read network retransmit stats for (conn, count) in self.network_map.iter() { metrics.retransmits += count?; // total across all connections } // Read I/O latency histogram for (bucket, count) in self.io_histogram.iter() { metrics.io_latency_histogram .insert(bucket?, count?); } metrics } } pub struct SystemTracer { bpf: Bpf, // the loaded eBPF program syscall_map: HashMap<u32, u64>, // syscall latencies network_map: HashMap<(u16, u16), u64>, // per-connection stats io_histogram: HashMap<u32, u64>, // I/O latency distribution } impl SystemTracer { pub fn new() -> Result<Self> { // Load the compiled eBPF program from disk let mut bpf = Bpf::load_file( "kernel_tracer.o" )?; // Attach all our probes (this is the magic part) Self::attach_probe(&mut bpf, "syscall_entry", "sys_enter")?; // syscall entry point Self::attach_probe(&mut bpf, "syscall_return", "sys_exit")?; // syscall exit point Self::attach_probe(&mut bpf, "tcp_retransmit", "tcp_retransmit_skb")?; // TCP retransmits Self::attach_probe(&mut bpf, "vfs_read", "vfs_read")?; // filesystem reads Ok(Self { bpf, // Get handles to all our eBPF maps syscall_map: Self::get_map(&bpf, "SYSCALL_LATENCY")?, network_map: Self::get_map(&bpf, "CONN_RETRANSMITS")?, io_histogram: Self::get_map(&bpf, "READ_HISTOGRAM")?, }) } pub fn collect_metrics(&self) -> Metrics { let mut metrics = Metrics::default(); // Read syscall latencies from kernel for (nr, ns) in self.syscall_map.iter() { let name = syscall_name(nr?); // convert number to name let ms = ns? 
as f64 / 1_000_000.0; // nanoseconds to milliseconds metrics.syscalls.insert(name, ms); } // Read network retransmit stats for (conn, count) in self.network_map.iter() { metrics.retransmits += count?; // total across all connections } // Read I/O latency histogram for (bucket, count) in self.io_histogram.iter() { metrics.io_latency_histogram .insert(bucket?, count?); } metrics } } pub struct SystemTracer { bpf: Bpf, // the loaded eBPF program syscall_map: HashMap<u32, u64>, // syscall latencies network_map: HashMap<(u16, u16), u64>, // per-connection stats io_histogram: HashMap<u32, u64>, // I/O latency distribution } impl SystemTracer { pub fn new() -> Result<Self> { // Load the compiled eBPF program from disk let mut bpf = Bpf::load_file( "kernel_tracer.o" )?; // Attach all our probes (this is the magic part) Self::attach_probe(&mut bpf, "syscall_entry", "sys_enter")?; // syscall entry point Self::attach_probe(&mut bpf, "syscall_return", "sys_exit")?; // syscall exit point Self::attach_probe(&mut bpf, "tcp_retransmit", "tcp_retransmit_skb")?; // TCP retransmits Self::attach_probe(&mut bpf, "vfs_read", "vfs_read")?; // filesystem reads Ok(Self { bpf, // Get handles to all our eBPF maps syscall_map: Self::get_map(&bpf, "SYSCALL_LATENCY")?, network_map: Self::get_map(&bpf, "CONN_RETRANSMITS")?, io_histogram: Self::get_map(&bpf, "READ_HISTOGRAM")?, }) } pub fn collect_metrics(&self) -> Metrics { let mut metrics = Metrics::default(); // Read syscall latencies from kernel for (nr, ns) in self.syscall_map.iter() { let name = syscall_name(nr?); // convert number to name let ms = ns? 
as f64 / 1_000_000.0; // nanoseconds to milliseconds metrics.syscalls.insert(name, ms); } // Read network retransmit stats for (conn, count) in self.network_map.iter() { metrics.retransmits += count?; // total across all connections } // Read I/O latency histogram for (bucket, count) in self.io_histogram.iter() { metrics.io_latency_histogram .insert(bucket?, count?); } metrics } } - P99 latency: 84ms → 51ms (39% drop, finally!) - That mystery gap: gone, just completely gone - Stat() syscalls: 127 → 3 per request (yeah, 127) - Kernel time: 37ms → 4ms - CPU utilization: 67% → 41% (servers basically started coasting) - 127 stat() calls (checking file metadata over and over) - 43 open()/close() operations (why though?) - 28 mmap() calls for shared libraries (this seems excessive?) - 14 futex() calls for locking (okay this one makes sense) - Trace every syscall (entry and exit, the whole journey) - Measure kernel time with nanosecond precision (nanoseconds!) - Capture stack traces (both userspace and kernel) - Run in production with basically zero overhead (like, 2%) - Aggregate data in the kernel (so no constant copying to userspace) - aya — Pure Rust, no C dependencies (this is what we used) - libbpf-rs — Safe Rust bindings if you want the C library underneath - redbpf — Full-featured framework (heard good things) - Scan all 16 directories in LD_LIBRARY_PATH (looking everywhere) - Call stat() on every potential location (127 calls per load) - Open and parse ELF headers (43 files opened) - Map shared libraries into memory (28 mmap operations) - stat() calls: 127 → 3 per request (98% reduction, oh my god) - P99 latency: 84ms → 51ms (39% faster, finally breathing) - CPU utilization: 67% → 41% (servers basically idling now) - Throughput: 4,200 → 6,800 req/sec (62% increase, wild) - Incorrect TCP window size tuning (oops) - Buffer overflow in high-throughput path (not good) - Specific to one AWS availability zone (???) 
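Findings like those fall out of the per-connection retransmit counts — the kernel side just increments, and the ranking happens in userspace. Here's a minimal, self-contained sketch of that ranking step in plain std Rust, with made-up sample data and a hypothetical `worst_connections` helper (our real collector reads the CONN_RETRANSMITS eBPF map rather than a std `HashMap`):

```rust
use std::collections::HashMap;

// Sketch: rank per-connection retransmit counts to surface the
// worst offenders. Keys are (source_port, dest_port) pairs, as in
// our CONN_RETRANSMITS map; the data here is illustrative.
fn worst_connections(
    retransmits: &HashMap<(u16, u16), u64>,
    top_n: usize,
) -> Vec<((u16, u16), u64)> {
    let mut entries: Vec<_> = retransmits.iter().map(|(k, v)| (*k, *v)).collect();
    // Sort descending by retransmit count
    entries.sort_by(|a, b| b.1.cmp(&a.1));
    entries.truncate(top_n);
    entries
}

fn main() {
    let mut conns = HashMap::new();
    conns.insert((443u16, 52_114u16), 3u64);
    conns.insert((443, 52_118), 91); // the bad connection
    conns.insert((5432, 40_022), 7);

    for ((src, dst), count) in worst_connections(&conns, 2) {
        println!("{src}->{dst}: {count} retransmits");
    }
}
```

Once the counts are sorted like this, "which connection is retransmitting" stops being a guess and becomes a lookup.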
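As for the stat() storm itself — the loader's search-path work was being repaid on every request, so the fix boiled down to paying the cost once per process and caching the result. A minimal sketch of that caching pattern using std's `OnceLock` (the `Config` type and its field are illustrative, not our actual code):

```rust
use std::sync::OnceLock;

// Illustrative config type - stands in for whatever the expensive
// load actually produces (the thing that triggered stat/open/mmap)
#[derive(Debug)]
struct Config {
    max_connections: u32,
}

static CONFIG: OnceLock<Config> = OnceLock::new();

fn load_config() -> &'static Config {
    // The first call pays the filesystem cost; every later call
    // is a cheap atomic load with zero syscalls
    CONFIG.get_or_init(|| Config { max_connections: 512 })
}

fn main() {
    let a = load_config();
    let b = load_config();
    assert!(std::ptr::eq(a, b)); // same cached instance, no reload
    println!("{a:?}");
}
```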
- Log rotation held a file lock (blocking everything)
- New writes blocked readers (cascading delays)
- Everything waiting on this one file

- Incorrect socket buffer size (we set it wrong)
- MTU mismatch causing fragmentation (network config issue)
- Memory pressure triggering GC (cascading effects)

- P99 latency: 51ms
- Throughput: 6,800 req/sec
- CPU utilization: 41%

- P99 latency: 52ms (+2% overhead)
- Throughput: 6,700 req/sec (-1.5% overhead)
- CPU utilization: 42% (+1% overhead)

- Kernel-side aggregation (no data copying per-event)
- BPF verifier ensures safe, bounded execution (no runaway loops)
- JIT compilation (eBPF programs run as native code)
- Zero sampling overhead (we track 100% of events efficiently)

- Printf debugging: 34% overhead (LOL)
- Userspace profiling: 12% overhead
- DTrace-style tracing: 8% overhead
- eBPF: 2% overhead

- Performance problems that userspace profilers can’t see (like our case)
- Syscall overhead is suspected (stat(), open(), etc.)
- Network stack issues need diagnosis (retransmits, timeouts)
- Kernel-level bottlenecks affecting your app
- Production overhead must stay minimal (<5%)
- You need continuous traces, not samples

- Userspace profilers already show the issue (don’t overcomplicate)
- Problem is clearly application logic (fix your code first)
- Team lacks kernel knowledge (steep learning curve)
- Deployment complexity is a problem (it’s not trivial)
- Kernel version < 4.8 (no eBPF support, sorry)

- Hidden syscall bottlenecks: 34 found, 31 fixed
- Network retransmit issues: 12 identified, 12 resolved
- Filesystem hot spots: 8 discovered, 7 optimized
- Memory allocation waste: 847MB/sec → 52MB/sec

- P50: 32ms → 18ms (44% better, users noticed)
- P99: 84ms → 51ms (39% better, SLA safe now)
- P99.9: 234ms → 87ms (63% better, tail latency crushed)
- Server count: 32 → 24 instances (8 fewer servers!)
- Infrastructure: $47K/month → $35K/month
- Total savings: $144K/year (paid for a lot of engineer time)
- Incident resolution: 71% faster (we can see what’s wrong)
- False hunches: 89% reduction (data beats guesses)
- Team confidence: “Much higher” (actual survey result)

- “Maybe it’s the database?” (narrator: it wasn’t)
- “Could be network latency?” (nope)
- “Probably GC pauses?” (also no)

- “Show me syscall distribution” (here’s the exact data)
- “Where are retransmits occurring?” (this connection, right here)
- “Which files have slow reads?” (this one log file)
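That last question gets answered straight from the read-latency histogram the kretprobe builds. Here's a small, self-contained sketch of rendering that kind of bucket→count data in userspace — plain std Rust with made-up sample numbers and a hypothetical `render_histogram` helper (our real code iterates the READ_HISTOGRAM eBPF map instead of a `BTreeMap`):

```rust
use std::collections::BTreeMap;

// Sketch: render a slow-read latency histogram. Keys are latency
// buckets in milliseconds (as produced by the kretprobe's
// duration / 1_000_000 bucketing), values are event counts.
fn render_histogram(hist: &BTreeMap<u32, u64>) -> Vec<String> {
    // Scale bars relative to the busiest bucket (avoid div-by-zero)
    let max = hist.values().copied().max().unwrap_or(0).max(1);
    hist.iter()
        .map(|(bucket, count)| {
            let bar = "#".repeat(((count * 40) / max) as usize);
            format!("{:>4} ms | {:<40} {}", bucket, bar, count)
        })
        .collect()
}

fn main() {
    let mut hist = BTreeMap::new();
    hist.insert(11u32, 40u64); // most slow reads sit just over the 10ms threshold
    hist.insert(25, 10);
    hist.insert(120, 2);       // the long stalls worth investigating
    for line in render_histogram(&hist) {
        println!("{line}");
    }
}
```

A `BTreeMap` keeps the buckets sorted by latency, so the output reads top-to-bottom as an actual distribution.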