Solving the Noisy Neighbor Problem: A Multi-Year Journey to IO Isolation on Kubernetes (2026)

Section 1: Introduction

Why I'm Writing This

The Context: A Platform Under Pressure

Section 2: The Problem - Kubernetes is IO-Unaware

The Fundamental Gap

The Symptoms

The Investigation

Testing the Xen Hypothesis

The IO-Blindness Trap

Why Containerd Helps (But Isn't Enough)

The Solution Requirements

Section 3: The Foundation - Preparing the Platform

1. Kubernetes v1.22 Minimum

2. Custom AMI with cgroup v2 Enabled

3. Dual Filesystem Compatibility

Section 4: The Solution - IO Isolation Architecture

Part A: System Overview

Component 1: Scheduler Plugin

Component 2: Node Agent

1. Bandwidth Profiling (Disk Only)

2. Pool Management

3. Dynamic Bandwidth Enforcement

Component 3: Custom Resource Definitions (CRDs)

Component 4: Aggregator (Optional)

Complete Workflow

Part B: Design Trade-offs: Flexibility and Robustness

1. Container Lifecycle Detection: OOB vs NRI

2. Bandwidth Monitoring: io.stat vs eBPF

3. Architecture: With Aggregator vs Without Aggregator

4. Network Pool Sizing: Conservative vs Aggressive

Section 5: Validation

Test Cluster Setup

Section 6: Conclusion

Section 7: Future Enhancements

Acknowledgments

TL;DR: Vanilla Kubernetes is IO-unaware, allowing noisy neighbors to hang the Docker daemon and trigger PLEG timeouts. We upgraded thousands of nodes to K8s v1.22, enabled cgroup v2, and partnered with Intel to build a custom scheduler plugin and node agent that throttle disk (io.max) and network (Linux TC) bursts. Result: validated technical readiness for safe stateful K8s migrations.

This is the story of a multi-year effort to solve a large-scale data company's noisy neighbor problem on Kubernetes - a fundamental limitation that blocked the migration of critical stateful workloads to our platform. By the end of my tenure, we had validated a solution through a partnership with Intel. The test cluster proved the approach worked, four cross-functional stakeholders approved the design, and a clear rollback strategy ensured operational safety.

I'm writing this blog to preserve the knowledge, recognize the collaborative effort, and help others facing similar challenges. Multi-year infrastructure transformations are hard. The biggest challenges aren't always technical - they're about identifying the right problem, convincing people the solution is necessary, and getting teams to work together. This writeup documents what we learned so the effort doesn't get lost.

The company's Kubernetes platform isn't just running stateless web applications - it's the engine for massive, stateful data systems. At this scale, vanilla Kubernetes starts to break down. We maintained our own internal fork with custom patches to survive our operational requirements. Solving the noisy neighbor problem required a multi-year transformation, and this is the story of that journey.

Kubernetes scheduling has a fundamental limitation: it is IO-unaware. The scheduler considers CPU and memory when placing pods but completely ignores IO capacity. This isn't a bug - it's a design gap. Kubernetes assumes IO resources are either infinite or managed externally. For many workloads, this assumption is fine.
Given our large scale and workload mix, it was a critical problem. The IO-unaware scheduling led to recurring production incidents.

A particularly insidious failure mode: if a high-IO workload caused a kernel-level IO hang, the Docker daemon would often block, triggering a PLEG timeout. This made the Kubelet think the node was dead, even if CPU was at 0%.

Multiple teams were filing incidents. Internal engineering jobs were disrupted. But the most critical impact was strategic: we couldn't migrate database-like applications to Kubernetes because the noisy neighbor risk was too high.

When the Docker daemon hung, restarting it didn't help. In fact, it often made things worse: when system CPU was high, the new Docker daemon could fail to start. Reactive recovery (restarting Docker) was unreliable.

I initially suspected AWS's Xen hypervisor might be the culprit. On older instance types (m4, c4, r4), Xen's dom0 handles disk I/O virtualization in software, which can cause severe CPU steal time during high I/O operations. So I tested the same workloads on Nitro instances (c5, r5), which offload EBS I/O and network virtualization to dedicated hardware cards; there is no dom0 stealing CPU cycles. The same failure occurred on Nitro. This confirmed that hardware offloading cannot fix OS-level kernel bottlenecks. Even with Nitro's hardware acceleration, the kernel still processes I/O interrupts, context switches, and completion handlers. When noisy neighbors saturated I/O, kernel CPU spiked, and the Docker daemon couldn't recover.

Moving to containerd (standard in Kubernetes v1.22+) removes Dockershim and the heavy Docker daemon. Containerd is a "lean" runtime - significantly more resilient to being locked up by a single pod's behavior. This improves node stability. But even with containerd, the underlying Linux kernel still has the same limitation under cgroup v1.
Containerd makes the management layer (the runtime) more stable, but it does nothing to protect the data layer (the performance of neighboring pods). Containerd reduces the risk of node-level failures, but it doesn't solve the noisy neighbor problem. Solving this required two things: IO-aware scheduling and preventive isolation.

IOIsolation had strict technical prerequisites that required significant preparation work. It required Kubernetes v1.22 or higher with systemd as the cgroup driver. We built a custom Amazon Machine Image (AMI) with the OS kernel configured to enable cgroup v2 by default, which meant setting kernel boot parameters and validating compatibility with existing workloads. Another requirement was ensuring the IOIsolation code worked with both cgroup v1 and v2 filesystems. Why? We couldn't flip the entire fleet to cgroup v2 overnight, different node pools might be at different migration stages, and rollback scenarios required v1 compatibility.

IOIsolation provides IO-aware scheduling and enforcement through cgroup v2. The system consists of four main components working together to ensure pods get the IO bandwidth they need while preventing noisy neighbors from impacting others.

Figure 1: High-level design of the IOIsolation framework. The system integrates a custom Kubernetes Scheduler Plugin with a node-level Enforcement Agent to solve the "noisy neighbor" problem. It utilizes CRDs for state management, cgroup v2 for throttling, and a specialized Aggregator to maintain scalability across thousands of nodes.

As shown in Figure 1, the Control Plane contains the Disk IO Scheduler Plugin and Resource IO Aggregator, while each Worker Node runs the Node IO Agent (which handles eBPF monitoring, cgroup enforcement, and NRI integration). The CRDs (NodeStaticIOInfo, NodeIOStatus) flow between these layers to maintain consistent state. Let's examine each component and how they work together.

The scheduler plugin extends Kubernetes' default scheduler with IO awareness. When a pod requests IO resources, the scheduler ensures it's placed on a node with sufficient available bandwidth:

1. Pod Submitted with IO Annotations
2. Scheduler Plugin Reads Annotations
3. Filter Phase (eliminates nodes without capacity)
4. Score Phase (LeastAllocated Strategy)

The Scheduler's View of Node Capacity: the scheduler maintains a local cache of node IO capacity by watching NodeIOStatus CRDs. It reads node capacity from the CRDs via standard Kubernetes Informers (watch mechanism).

The node agent is the "brain" of the system, running as a DaemonSet on each node. It works with ioi-service to monitor and enforce IO limits.

Note: In Figure 1, the "Node IO Agent" box represents the combination of two components working together: the Node Agent (running as a Kubernetes DaemonSet) and the ioi-service (running as a systemd service with root privileges). This separation allows the DaemonSet to handle Kubernetes integration while the privileged systemd service performs low-level cgroup operations.

The ioi-service is a privileged system service (run via systemd) that handles the low-level IO monitoring and enforcement. The node agent itself has three main responsibilities.

1. Bandwidth profiling (disk only). When the node-agent starts, it measures actual disk bandwidth capacity using fio (Flexible IO Tester). The agent tests multiple block sizes (512B, 1K, 4K, 8K, 16K, 32K) for both read and write operations. Results are stored and reused (profiling takes ~10 minutes per disk). Network bandwidth is simpler: the agent reads the link speed from sysfs (/sys/class/net/eth0/speed) or uses a configured value.

2. Pool management. The agent divides bandwidth into pools based on admin configuration.

3. Dynamic bandwidth enforcement. The node-agent continuously monitors actual IO usage and adjusts limits dynamically. It enforces these limits at two distinct layers:

A. Block Layer Enforcement (cgroups)
B. Network Layer Enforcement (tc & ifb)

Because cgroups don't throttle networking, the node agent uses Linux Traffic Control (tc). It creates an Intermediate Functional Block (ifb) device for the pod, which allows us to reliably shape both ingress and egress traffic, ensuring BE pods don't starve GA pods of network bandwidth. The system dynamically adjusts BE bandwidth based on actual GA/BE usage.
For disk IO, BE can be squeezed to almost nothing. For network IO, each BE pod maintains a minimum bandwidth to prevent TCP connection timeouts.

The system uses three CRDs to manage configuration and state:

1. NodeStaticIOInfo - static node capacity (from profiling)
2. NodeIOStatus - dynamic node status (updated frequently)
3. IOIPolicy - IO classes and policies

The aggregator sits between node-agents and the Kubernetes API server, collecting metrics and batching updates. Without the aggregator (direct mode), each node-agent writes directly to its own NodeIOStatus CRD. A complete workflow runs from pod creation to IO enforcement; the steps are listed in the appendix below.

While integrating IOIsolation, we faced several architectural decisions. Rather than viewing these as binary choices, the ideal approach is to implement both options and let administrators choose based on their specific context. Here's how we approached these trade-offs.

Container lifecycle detection (OOB vs NRI). Note: Figure 1 shows the NRI Client path for container lifecycle detection. OOB (via inotify) achieves the same goal - detecting when containers start and stop so IO limits can be applied - but through a different mechanism that doesn't require runtime integration. Our decision: OOB (via inotify). The better long-term solution is to implement both and make it configurable, because different environments have different constraints. From a robustness perspective, having both options increases flexibility and reduces deployment risk; administrators can choose based on their operational maturity and risk tolerance.

Bandwidth monitoring (io.stat vs eBPF). Our decision: io.stat. The critical issue with eBPF: sometimes eBPF metrics are unavailable (kernel version incompatibility, eBPF program failures). If you rely solely on eBPF, you lose monitoring when it fails. For the long term, we can use both with a fallback strategy.

The aggregator component collects bandwidth metrics from all node-agents via gRPC and writes batched updates to Kubernetes CRDs (NodeIOStatus). This reduces API server write load.
In direct mode, every node-agent writes its own CRD updates straight to the API server; in batched mode, the aggregator coalesces them first. Our approach: we planned for Aggregator High Availability - moving the Aggregator from a single instance to a replicated, highly-available deployment (via Kubernetes Leases) to eliminate the SPOF.

The network-sizing challenge: unlike disk bandwidth (which we measured through profiling), network bandwidth capacity is uncertain. Our approach: reserve bandwidth based on the number of BE pods and their minimum requirements. The trade-off: this conservative approach leaves potential network capacity on the table to ensure BE workloads remain functional even under worst-case network conditions.

We validated IOIsolation on a test cluster (15-20 nodes) to prove the technical approach before committing to production rollout.

Organizational approval: getting 4 cross-functional approvers (Security, Data Engineering, Compute Platform, Infrastructure) required addressing concerns about rollback and added platform complexity. The design doc was approved, clearing the path for production rollout.

We integrated IOIsolation into the company's Kubernetes platform. The system addresses the noisy neighbor problem by isolating IO resources between pods. Before production rollout could begin, the company made a strategic shift to a managed cloud Kubernetes offering, and Intel's team was impacted by organizational changes. The foundation is in place, and the knowledge is preserved here for teams facing similar challenges.

While this architecture successfully mitigated our noisy neighbor kernel lockups, infrastructure evolution never stops. Future teams could explore:

IOPS and Latency Awareness: currently, the system isolates based on raw bandwidth (MB/s). For strict database performance, extending io.max enforcement to include IOPS (riops/wiops) is the next logical step.

L3 Cache Isolation (Intel RDT/CAT): at high density, L3 cache contention causes P99 latency degradation even when CPU priority is correct.
Intel CAT (Cache Allocation Technology) can partition L3 cache between GA and BE workloads - already supported by the IOIsolation framework.

This project was a collaboration across teams and companies:

The company's Compute Platform team: provided the Kubernetes foundation and operational expertise.

Cross-functional stakeholders: asked hard questions about reliability and operational complexity that made the design stronger.


Appendix: code, configuration, and detail lists referenced above.

Pod IO annotations (example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ioi.intel.com/disk-bandwidth: "100MB/s read, 50MB/s write"
    ioi.intel.com/network-bandwidth: "200Mbps"
```

The scheduler's cached view of node capacity:

```go
// Simplified example of the scheduler's cached node view.
type Pool struct {
	In  int // available read (MB/s) or ingress (Mbps)
	Out int // available write (MB/s) or egress (Mbps)
}

type IOClassStatus struct {
	GA Pool // Guaranteed Allocation pool
	BE Pool // Best Effort pool
}

type NodeIOStatus struct {
	NodeName      string
	DisksStatus   map[string]IOClassStatus
	NetworkStatus IOClassStatus
}

// e.g. DisksStatus["/dev/sda"] = {GA: {In: 500, Out: 400}, BE: {In: 100, Out: 80}}
//      NetworkStatus           = {GA: {In: 800, Out: 800}, BE: {In: 100, Out: 100}}
```

Disk profiling with fio:

```bash
# Example: Test 4K random read on /dev/sda
fio --filename=/mnt/sda/test \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --size=20G \
    --runtime=60s \
    --output=results.json
```

Pool configuration:

```yaml
# Admin ConfigMap
diskpools: |
  ga=100  # Guaranteed Allocation: 100% of capacity
  be=20   # Best Effort: 20% of capacity
networkpools: |
  ga=95   # Guaranteed Allocation: 95% of capacity (950 Mbps)
  be=5    # Best Effort: 5% of capacity (50 Mbps)
```

Resulting pool sizes:

Disk capacity (from profiling): 650 MB/s read, 600 MB/s write
  GA pool: 650 MB/s × 100% = 650 MB/s read, 600 MB/s write
  BE pool: 650 MB/s × 20% = 130 MB/s read, 120 MB/s write

The three CRDs:

```yaml
# 1. NodeStaticIOInfo - static node capacity (from profiling)
apiVersion: ioi.intel.com/v1
kind: NodeStaticIOInfo
metadata:
  name: node1-staticinfo
spec:
  nodeName: node1
  disks:
    - id: "disk-sda"
      path: "/dev/sda"
      capacity:
        read: 650   # MB/s
        write: 600  # MB/s
  network:
    linkSpeed: 1000  # Mbps
```

```yaml
# 2. NodeIOStatus - dynamic node status (updated frequently)
apiVersion: ioi.intel.com/v1
kind: NodeIOStatus
metadata:
  name: node1-nodeioinfo
status:
  disksStatus:
    disk-sda:
      GA:
        In: 500   # 500 MB/s read available
        Out: 400  # 400 MB/s write available
      BE:
        In: 130   # 130 MB/s read available
        Out: 120  # 120 MB/s write available
  networkStatus:
    GA: { In: 800, Out: 800 }
    BE: { In: 50, Out: 50 }
```

```yaml
# 3. IOIPolicy - IO classes and policies
apiVersion: ioi.intel.com/v1
kind: IOIPolicy
metadata:
  name: default-policy
spec:
  ioClasses:
    - name: "GA"  # Guaranteed Allocation
      priority: 1
    - name: "BE"  # Best Effort
      priority: 2
```

Data flow with aggregator:

Node-agent (node1) ──┐
Node-agent (node2) ──┤
Node-agent (node3) ──┼──> Aggregator ──> Batch Update ──> NodeIOStatus CRDs
...                  │
Node-agent (nodeN) ──┘

Data flow without aggregator (direct mode):

Node-agent (node1) ──> Write NodeIOStatus CRD ──> API Server
Node-agent (node2) ──> Write NodeIOStatus CRD ──> API Server
Node-agent (node3) ──> Write NodeIOStatus CRD ──> API Server
...

Write load at scale:

Direct: Node-agent (500 nodes) → Write NodeIOStatus CRD → API Server, every 2-5 seconds per node
Batched: Node-agent (500 nodes) → gRPC → Aggregator → Batched CRD writes → API Server, batched every 5 seconds

Network pool sizing example:

Scenario: 20 BE pods × 5 Mbps minimum = 100 Mbps reserved for BE
GA available: 1000 Mbps - 100 Mbps = 900 Mbps maximum

The two transformation phases:
- Phase 1: The Foundation - upgrade Kubernetes from v1.18 to v1.22 to enable cgroup v2 support (required for IO throttling)
- Phase 2: The Solution - integrate IOIsolation to provide IO-aware scheduling and preventive isolation

What vanilla Kubernetes lacks:
- No visibility into node IO bandwidth (disk or network)
- No way for pods to request IO resources
- No mechanism to prevent multiple high-IO workloads from landing on the same node

The recurring incident chain:
- Multiple IO-hungry pods would be scheduled on the same node (no IO capacity awareness)
- These pods would consume excessive disk or network bandwidth
- Kernel CPU would spike due to interrupt handling and context switching
- The Docker daemon would hang, unable to respond to requests
- Kubelet's Pod Lifecycle Event Generator (PLEG) would time out trying to reconcile container state
- Nodes would be marked NotReady, triggering cascading alerts and pod evictions

What the investigation surfaced:
- IRQ counts spiking during IO-heavy workload activity
- Excessive context switching correlating with IO load
- Docker daemon behavior and recovery patterns

The three underlying issues:
- The Scheduler Issue: standard K8s is "IO-unaware." It treats a node with 10 pods and a node with 50 pods as equally "available" if the CPU/Mem metrics are the same, completely ignoring the IOPS/throughput saturation on the underlying EBS/disk.
- The Stack Issue (Dockershim): high IO workloads don't just slow down neighbors; they saturate the Docker daemon. Because Dockershim/Docker is a centralized bottleneck, a single pod's IO burst can cause the daemon to hang, leading to a NodeNotReady state.
- The cgroup v1 Limitation: you can't fix this with software tweaks because cgroup v1 cannot track or throttle buffered IO (writes that go to the page cache first).

The two solution requirements:
- IO-aware scheduling: place pods based on available IO capacity (scheduler plugin)
- Preventive isolation: throttle buffered I/O before it saturates the kernel (cgroup v2)

Building the custom AMI required:
- Setting kernel boot parameters to enable cgroup v2
- Validating compatibility with existing workloads

Why dual cgroup filesystem compatibility:
- We couldn't flip the entire fleet to cgroup v2 overnight
- Different node pools might be at different migration stages
- Rollback scenarios required v1 compatibility

The four components:
- Scheduler Plugin - makes IO-aware pod placement decisions
- Node Agent - monitors IO usage and enforces limits via cgroups
- Custom Resource Definitions (CRDs) - configuration and state management
- Aggregator (optional) - centralized monitoring and coordination

Scheduler plugin behavior:
- Parses IO requirements from pod annotations
- Determines IO class (GA = Guaranteed Allocation, BE = Best Effort)
- Eliminates nodes without sufficient IO capacity
- Checks the NodeIOStatus CRD for each node's available bandwidth
- Ranks remaining nodes by available IO capacity
- The LeastAllocated strategy acts as a natural dampener, spreading the pods out and preventing a "thundering herd" from piling onto a single node before the CRDs can update
- Assigns the pod to the selected node
- Updates the node's reserved bandwidth

ioi-service responsibilities:
- Monitors IO via eBPF or io.stat (cgroup v2)
- Writes bandwidth limits to cgroup files
- Communicates with the node-agent via gRPC over a Unix socket

Node-agent control loop:
- Receive bandwidth data from ioi-service (via gRPC)
- Calculate actual usage per pod, per QoS class
- Recalculate available bandwidth
- Update cgroup io.max limits via ioi-service

Aggregator responsibilities:
- Receives bandwidth metrics from all node-agents via gRPC
- Batches updates (every 5 seconds, configurable)
- Writes to NodeIOStatus CRDs in the Kubernetes API server
- Reduces API server write load by ~90% at scale

The complete workflow:
- Step 1: Pod Submission & Specification
- Step 2: Scheduler Plugin Filters Nodes
- Step 3: Node-Agent Registers Pod
- Step 4: IOI-Service Applies Limits
- Step 5: Continuous Monitoring - every 2-5 seconds:
  - ioi-service reads io.stat
  - Calculates bandwidth: 180 MB/s read, 95 MB/s write (under limit)
  - Sends to node-agent via gRPC
  - Node-agent updates NodeIOStatus CRD (directly or via aggregator)
  - Scheduler sees updated capacity for future scheduling decisions
- Step 6: Dynamic Adjustment

Container lifecycle detection options:
- NRI (Node Resource Interface): tighter integration with containerd, official Kubernetes API
- OOB (Out-of-Band): watching Kubernetes API and pod events from outside the container runtime

Why we chose OOB:
- Less invasive to the container runtime stack
- Simpler rollback path (no runtime modifications)
- Technically, there is still an asynchronous micro-race condition of a few milliseconds before io.max is written. However, a container doing unthrottled I/O for 50 milliseconds at startup will never cause a PLEG timeout or node lockup.

When to use which:
- Use NRI when: you want tighter runtime integration, have experience with NRI, or need lower-latency detection
- Use OOB when: you want simpler deployment, easier rollback, or don't want to modify the runtime stack

Bandwidth monitoring options:
- eBPF: kernel-level tracing, more granular visibility
- io.stat: cgroup v2 native statistics, simple file reads

Why we chose io.stat:
- Cgroup v2 native interface, simpler implementation
- Sufficient granularity for our use case
- Fewer moving parts, easier to debug

The fallback strategy:
- Try to collect eBPF metrics
- If eBPF fails, fall back to io.stat
- Ensure metrics are always available, regardless of eBPF status

Aggregator benefits:
- Dramatically reduces API server write load at scale (~90% reduction)
- Batched updates are more efficient
- Centralized monitoring view

Aggregator costs:
- Single point of failure: if the aggregator fails, monitoring stops
- Added complexity: one more component to deploy, monitor, and debug
- Not necessary at small scale: for <100 nodes, the API server can handle direct writes

Why network capacity is uncertain:
- AWS reports link speed: 1000 Mbps (from /sys/class/net/eth0/speed)
- But actual available capacity is unknown and varies with network congestion, switch limitations, etc.
- The TCP Measurement Trap: we intentionally decided not to measure live TCP bandwidth dynamically like we do for disk. Measuring live TCP throughput is computationally expensive and highly volatile. Instead, relying on a static, conservative ceiling for GA and a guaranteed minimum floor for BE pods protects the system with near-zero compute overhead.
- The EBS Dual-Constraint: in AWS, EBS volumes are network-attached. This means EBS burst limits are constrained by two factors: the disk's io.max AND the instance's network bandwidth. By strictly throttling network traffic, we inadvertently created a secondary safeguard against EBS burst depletion.

Why conservative reservation works:
- Guarantees TCP viability: each BE pod gets enough bandwidth
- Simple and predictable: easy to reason about and configure
- Scheduler-aware: the scheduler knows exactly how much bandwidth is available for GA pods
- Prevents starvation: GA cannot accidentally starve BE pods below their minimum

Test cluster setup:
- Cluster size: 15-20 nodes
- Instance types: mix of c5 and r5 instances (AWS Nitro)
- Workloads: IO-heavy applications similar to production (simulated database workloads, data processing jobs)
- Duration: several weeks of testing

What the validation showed:
- Scheduler correctly placed pods based on IO capacity
- Node agents profiled disks and applied bandwidth limits
- Aggregator collected metrics and updated CRDs
- System ran for several weeks without major failures

Stakeholder concerns addressed:
- Rollback strategy if the system caused issues
- Additional complexity in the platform

In summary:
- Foundation: Kubernetes v1.22+, cgroup v2 enabled, dual filesystem compatibility
- Components: scheduler plugin, node agent, aggregator, CRDs
- Validation: test cluster (15-20 nodes) proved the technical approach works
- Approval: design approved by 4 cross-functional stakeholders

Intel's team: built the core IOIsolation system and iterated on the design through extensive technical discussions.

- https://github.com/intel/cloud-resource-scheduling-and-isolation