Tools: The Router Couldn't See My NAS — A 3-Hour Debug Into a Silent Intel NIC Bug - Full Analysis

The Router Couldn't See My NAS

My TRIM NAS at 192.168.3.135 had been running Hermes Agent 24/7 for months, handling the Telegram gateway, proxy routing (mihomo), cron jobs, and file serving. It's a solid Debian 12 box with a custom 6.12.18 kernel.

Then one evening I noticed I couldn't reach Telegram. Then my Windows machine (T2) lost internet. Then even the router's device list showed the NAS as offline. Ping to the router gateway (192.168.3.1) worked fine from the NAS itself. But the router couldn't see it.

This is the story of what I found, and how the real culprit was a feature meant to save electricity.

Phase 1: The Obvious Suspect, Mihomo

The first thing I checked was mihomo, my proxy daemon. It's the gateway between my LAN and the outside world. If it crashes, everything behind it loses connectivity: Telegram, web browsing, API calls, the whole stack.

```
May 29 23:52:17 trim-0c5b mihomo[692755]: level=fatal msg="Parse config error: rules[14] [DOMAIN-SUFFIX,api.telegram.org,📱DEFAULT] error: proxy [📱DEFAULT] not found"
```

Clear enough. The config had rules referencing proxy groups (📱DEFAULT and 📱Telegram) that no longer existed, likely from a subscription update that replaced the proxy-groups section but left the old rules untouched. Mihomo refused to start.

The fix was straightforward: point those rules to an existing group (🚀节点选择, "node selection").

But this was the consequence, not the root cause. Why did I reset the NAS in the first place?

Phase 2: Looking Deeper, System Logs Tell a Story

The reset happened at 22:47. Working backwards through /var/log/syslog.1, I found the real timeline:

```
21:00  node[368932]: [fetch-timeout] fetch timeout after 10000ms url=https://api.telegram.org/bot***/getMe
21:01  same timeout
21:02  same timeout
...repeats every 60 seconds until 22:47...
```

This was the Hermes Telegram gateway, a Node.js process, unable to reach Telegram's API. Each request timed out after 10 seconds. Meanwhile:

```
21:09  mihomo error: 🇺🇸美国圣何塞06 failed health check: context deadline exceeded
21:19  🇯🇵日本东京03, 🇯🇵日本东京04 also failing
21:24  🇹🇼台湾, 🇺🇸洛杉矶 also failing
...proxy node health checks kept failing in waves...
```

The proxy nodes were dropping one by one. Curl through the proxy to Google worked fine, but sustained connections kept timing out.

Key insight: the NAS was technically online (I could SSH in), but under high proxy load the NIC was becoming invisible to the router. The router's ARP entry for the NAS expired, and the router couldn't re-resolve the NAS's MAC address.

No kernel panics. No OOM. No driver crash. No link-down events in dmesg, syslog, or kern.log. The NIC just stopped responding at layer 2 under sustained load.

Phase 3: Hardware Fingerprint

```
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection
Driver: igb (in-kernel, version 6.12.18-trim)
Firmware: 0.4-1
```

Intel I211. Not the notoriously buggy I225/I226 that plagues many 2.5GbE boards, just a plain, reliable old 1GbE chip. The I211 has been shipping since 2012. It should be bulletproof.

But the boot params told a different story:

```
GRUB_CMDLINE_LINUX="modprobe.blacklist=pcspkr pcie_aspm=off"
```

The pcie_aspm=off flag means someone had already run into PCIe power management issues with this NIC before me.

Phase 4: The Real Culprit, EEE

I checked the NIC's Energy Efficient Ethernet settings:

```
EEE status: enabled - inactive
Supported EEE link modes: 100baseT/Full, 1000baseT/Full
Advertised EEE link modes: 100baseT/Full, 1000baseT/Full
```

EEE (IEEE 802.3az) lets the NIC drop into a low-power idle state between packets. On paper it saves a fraction of a watt. In practice, on the igb driver with an I211, the NIC sometimes fails to come back cleanly when leaving low-power idle under high connection churn.

This matches what I saw:

- The Telegram gateway kept opening and closing connections (one timeout and retry every 60 seconds, over 100 attempts between 21:00 and 22:47), on top of the proxy's own health checks
- Each quiet gap lets the link enter low-power idle, and the next burst of traffic has to wake it up again
- At some point, the link doesn't come back cleanly
- The router sees the port as active (switch lights are on) but gets no response to ARP queries
- From the router's perspective: NAS is gone

EEE is a commonly reported cause of "link light is on but device is unreachable" on Intel NICs. It has been reported on I210, I211, I225, and I226 to varying degrees.

Phase 5: The Three Fixes

Fix 1: Kill EEE (immediate)

```
ethtool --set-eee enp3s0 eee off
```

This disables EEE immediately without a reboot. The NIC stays in full-power mode and never enters low-power idle. Confirmed:

```
EEE status: disabled
```

Fix 2: Enlarge the Ring Buffers

The default ring buffer on the igb driver is 256 descriptors for both RX and TX. Maximum is 4096. Under sustained proxy load with hundreds of concurrent connections, 256 is a bottleneck: the NIC runs out of buffer space and starts dropping packets.

```
ethtool -G enp3s0 rx 2048 tx 2048
```

This increases each ring from 256 to 2048 descriptors. The NIC now has 8× more room to queue packets before dropping them.

Fix 3: Make It Stick (persistence)

```
# /etc/default/grub
GRUB_CMDLINE_LINUX="modprobe.blacklist=pcspkr pcie_aspm=off igb.eee=0"
```

igb.eee=0 is meant to tell the igb kernel module to never enable EEE, regardless of what the link partner advertises. The change only takes effect after running update-grub and rebooting. It is worth confirming with modinfo igb that your driver build actually exposes this parameter; an in-kernel build may not, in which case the kernel ignores it and the systemd service described next does the real work.

For the ring buffer and EEE state, I created a systemd oneshot service (hermes-nic.service):

```
[Unit]
Description=NIC tuning - Intel I211 fixes
After=network.target
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool --set-eee enp3s0 eee off
ExecStart=/usr/sbin/ethtool -G enp3s0 rx 2048 tx 2048
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
```

This runs before the network is declared online, so every service that waits for network-online.target sees the tuned NIC.

What I Learned

The hardest bugs leave no logs. The NIC didn't crash, didn't report errors, didn't trigger a kernel oops. It just silently stopped responding to ARP. If I hadn't checked the EEE status, I'd still be blaming mihomo or the router.

EEE is a false economy. The power savings on a 1GbE desktop NIC are negligible, maybe 0.3-0.5 watts. The stability cost far exceeds the benefit. For any always-on server, NAS, or gateway running the igb driver: turn EEE off.

Ring buffer defaults are tuned for desktops, not servers. 256 descriptors is fine for a web browser but chokes under proxy load. Raising the count costs only a little extra memory and removes an entire class of packet-drop edge cases.

The mihomo config bug was a distraction. It was the symptom that I noticed, but the real problem was at layer 2. If I'd just fixed mihomo and moved on, the EEE drop would likely have come back within days.

Current State

```
EEE status:        disabled
Ring buffer RX/TX: 2048 / 2048
Boot param:        pcie_aspm=off igb.eee=0
Systemd service:   hermes-nic.service (enabled)
```

The router has seen the NAS continuously for 24+ hours since the fixes. The Telegram gateway is stable. Proxy health checks are clean.
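
The tuned state reported in this post (EEE disabled, 2048-descriptor RX and TX rings) can quietly revert after a kernel or driver update, so it may be worth re-checking from a script. The following is a minimal sketch, not the author's tooling: the function names (`eee_status`, `current_ring_sizes`, `nic_is_tuned`) are hypothetical, and the two sample strings are modeled on the usual layout of `ethtool --show-eee` and `ethtool -g` output rather than captured from the NAS. Exact formatting can vary between ethtool versions, so treat the regular expressions as assumptions to adjust.

```python
import re

# Sample text modeled on typical ethtool output (an assumption, not a capture).
SAMPLE_EEE = (
    "EEE settings for enp3s0:\n"
    "\tEEE status: disabled\n"
    "\tTx LPI: disabled\n"
)
SAMPLE_RING = (
    "Ring parameters for enp3s0:\n"
    "Pre-set maximums:\n"
    "RX:\t\t4096\n"
    "RX Mini:\tn/a\n"
    "TX:\t\t4096\n"
    "Current hardware settings:\n"
    "RX:\t\t2048\n"
    "RX Mini:\tn/a\n"
    "TX:\t\t2048\n"
)


def eee_status(show_eee_output: str) -> str:
    """Return the text after 'EEE status:', e.g. 'disabled'."""
    match = re.search(r"^\s*EEE status:\s*(.+?)\s*$", show_eee_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'EEE status:' line found")
    return match.group(1)


def current_ring_sizes(show_ring_output: str) -> dict[str, int]:
    """Return RX/TX sizes from the 'Current hardware settings' section only."""
    _, found, current = show_ring_output.partition("Current hardware settings:")
    if not found:
        raise ValueError("no 'Current hardware settings:' section found")
    sizes: dict[str, int] = {}
    for line in current.splitlines():
        # Matches 'RX:' and 'TX:' but not 'RX Mini:' or 'RX Jumbo:'.
        match = re.match(r"^\s*(RX|TX):\s*(\d+)\s*$", line)
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


def nic_is_tuned(eee_text: str, ring_text: str, want: int = 2048) -> bool:
    """True when EEE is disabled and both rings have the wanted size."""
    sizes = current_ring_sizes(ring_text)
    return (
        eee_status(eee_text).startswith("disabled")
        and sizes.get("RX") == want
        and sizes.get("TX") == want
    )


if __name__ == "__main__":
    print("tuned" if nic_is_tuned(SAMPLE_EEE, SAMPLE_RING) else "NOT tuned")
```

To run it against a live interface, the sample strings could be replaced with the captured output of `ethtool --show-eee enp3s0` and `ethtool -g enp3s0`, for example via `subprocess.run(..., capture_output=True, text=True).stdout`. Parsing only the "Current hardware settings" section matters because the pre-set maximums use the same `RX:` and `TX:` labels.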