netdev_max_backlog
/proc/sys/net/core
sysctl -w net.core.rmem_max=16777216   # max recv - thought this would save me
sysctl -w net.core.wmem_max=16777216   # max send - narrator: it did not
netstat -s | grep -i drop
netdev_max_backlog
/proc/net/softnet_stat
net.core.rmem_max = 134217728        # 128MB ceiling (recv)
net.core.wmem_max = 134217728        # 128MB ceiling (send)
net.core.rmem_default = 16777216     # 16MB starting point
net.core.wmem_default = 16777216     # lets auto-tuning grow from here

# Per-socket TCP tuning: min, default, max in bytes
net.ipv4.tcp_rmem = 4096 87380 134217728   # recv can grow to 128MB
net.ipv4.tcp_wmem = 4096 65536 134217728   # send grows too

# CRITICAL: This is in 4KB pages, NOT bytes
net.ipv4.tcp_mem = 6291456 8388608 12582912   # ~24GB total across all sockets

net.core.netdev_max_backlog = 250000   # the hero that saved us
net.core.default_qdisc = fq              # fair queuing for pacing
net.ipv4.tcp_congestion_control = bbr    # goodbye CUBIC
ethtool -g eth0   # check current and max ring sizes

# What I saw:
# Pre-set maximums:
#   RX: 4096   ← NIC supports this
# Current hardware settings:
#   RX: 512    ← are you kidding me
ethtool -G eth0 rx 4096 tx 4096   # bump to hardware max
ethtool -c eth0   # see current coalescing settings

# What worked for bulk replication on our 10G boxes:
ethtool -C eth0 rx-usecs 128 rx-frames 64   # wait 128µs or 64 packets

# What you'd use for twitchy latency (we used something in between):
ethtool -C eth0 rx-usecs 0 rx-frames 1      # interrupt immediately
rx-usecs 10 rx-frames 8
netdev_max_backlog
ethtool -l eth0               # how many RSS queues are active
ethtool -L eth0 combined 16   # match your CPU count
cat /proc/interrupts | grep eth0   # see which CPUs handle which queues
/proc/interrupts
ethtool -g eth0
net.core.netdev_max_backlog
/proc/net/softnet_stat
ethtool -l/-L
/proc/interrupts
netdev_max_backlog
/proc/net/softnet_stat

- Zero packet loss under production bursts (was losing 15–20%)
- 10Gbps sustained (was averaging 7.2Gbps)
- WAN transfers: 3.5Gbps → 5.8Gbps with BBR
- P99 latency: 2.1ms → 180µs
- CPU cost per packet dropped 15% (fewer interrupts = more useful work)

- Check your RX ring: ethtool -g eth0 and bump it if you're still at 512.
- Check your backlog and drops: net.core.netdev_max_backlog + /proc/net/softnet_stat.
- Check RSS: ethtool -l/-L and /proc/interrupts to make sure all cores are actually getting traffic.

- 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
- 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
- ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.