```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ioi.intel.com/disk-bandwidth: "100MB/s read, 50MB/s write"
    ioi.intel.com/network-bandwidth: "200Mbps"
```
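Before the scheduler plugin can filter nodes, it has to turn these annotation strings into numbers. A minimal sketch of that parsing step; the function name `parseDiskBandwidth` and the exact grammar it accepts are assumptions for illustration, not the project's actual parser:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// DiskRequest holds a pod's parsed disk bandwidth request in MB/s.
type DiskRequest struct {
	ReadMBps  float64
	WriteMBps float64
}

// parseDiskBandwidth parses an annotation value such as
// "100MB/s read, 50MB/s write" into a DiskRequest.
func parseDiskBandwidth(v string) (DiskRequest, error) {
	re := regexp.MustCompile(`(\d+(?:\.\d+)?)MB/s\s+(read|write)`)
	var req DiskRequest
	matches := re.FindAllStringSubmatch(v, -1)
	if len(matches) == 0 {
		return req, fmt.Errorf("no bandwidth values in %q", v)
	}
	for _, m := range matches {
		n, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			return req, err
		}
		if m[2] == "read" {
			req.ReadMBps = n
		} else {
			req.WriteMBps = n
		}
	}
	return req, nil
}

func main() {
	req, err := parseDiskBandwidth("100MB/s read, 50MB/s write")
	if err != nil {
		panic(err)
	}
	fmt.Printf("read=%.0f write=%.0f\n", req.ReadMBps, req.WriteMBps)
}
```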
```go
// Simplified example
type IOPoolStatus struct {
	In  float64 // available inbound/read bandwidth
	Out float64 // available outbound/write bandwidth
}

type DiskStatus struct {
	GA IOPoolStatus // Guaranteed Allocation pool
	BE IOPoolStatus // Best Effort pool
}

type NodeIOStatus struct {
	NodeName      string
	DisksStatus   map[string]DiskStatus
	NetworkStatus DiskStatus
}

// Example values for one node:
var status = NodeIOStatus{
	NodeName: "node1",
	DisksStatus: map[string]DiskStatus{
		"/dev/sda": {
			GA: IOPoolStatus{In: 500, Out: 400}, // 500 MB/s read, 400 MB/s write available
			BE: IOPoolStatus{In: 100, Out: 80},  // 100 MB/s read, 80 MB/s write available
		},
	},
	NetworkStatus: DiskStatus{
		GA: IOPoolStatus{In: 800, Out: 800}, // 800 Mbps available
		BE: IOPoolStatus{In: 100, Out: 100}, // 100 Mbps available
	},
}
```
```bash
# Example: Test 4K random read on /dev/sda
fio --name=profile-randread \
    --filename=/mnt/sda/test \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --size=20G \
    --runtime=60s \
    --output-format=json \
    --output=results.json
```

(fio requires a `--name` for command-line jobs, and `--output-format=json` is needed for `results.json` to actually contain JSON.)
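The profiler then pulls the measured bandwidth back out of `results.json`. A sketch of that step, assuming fio's JSON layout where `jobs[0].read.bw` is reported in KiB/s; the helper name and the embedded sample value are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fioResult models only the fields we need from fio's JSON output
// (jobs[].read.bw is reported in KiB/s).
type fioResult struct {
	Jobs []struct {
		Read struct {
			BW int64 `json:"bw"` // KiB/s
		} `json:"read"`
	} `json:"jobs"`
}

// readCapacityMBps extracts the measured read bandwidth in MB/s.
func readCapacityMBps(data []byte) (float64, error) {
	var r fioResult
	if err := json.Unmarshal(data, &r); err != nil {
		return 0, err
	}
	if len(r.Jobs) == 0 {
		return 0, fmt.Errorf("no jobs in fio output")
	}
	return float64(r.Jobs[0].Read.BW) * 1024 / 1e6, nil
}

func main() {
	// Hypothetical sample mirroring fio's JSON layout:
	sample := []byte(`{"jobs":[{"read":{"bw":634765}}]}`)
	mbps, err := readCapacityMBps(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("measured read capacity: %.0f MB/s\n", mbps)
}
```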
```yaml
# Admin ConfigMap
diskpools: |
  ga=100  # Guaranteed Allocation: 100% of capacity
  be=20   # Best Effort: 20% of capacity
networkpools: |
  ga=95   # Guaranteed Allocation: 95% of capacity (950 Mbps)
  be=5    # Best Effort: 5% of capacity (50 Mbps)
```
```
Disk capacity (from profiling): 650 MB/s read, 600 MB/s write
        ↓
GA pool: 650 × 100% = 650 MB/s read; 600 × 100% = 600 MB/s write
BE pool: 650 × 20%  = 130 MB/s read; 600 × 20%  = 120 MB/s write
```
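The percentage split above is a one-line calculation; a sketch (function and variable names are illustrative, not from the project):

```go
package main

import "fmt"

// poolBandwidth applies an admin-configured pool percentage to a
// profiled disk capacity.
func poolBandwidth(capacityMBps, poolPercent float64) float64 {
	return capacityMBps * poolPercent / 100
}

func main() {
	readCap, writeCap := 650.0, 600.0
	fmt.Printf("GA: %.0f read / %.0f write MB/s\n",
		poolBandwidth(readCap, 100), poolBandwidth(writeCap, 100))
	fmt.Printf("BE: %.0f read / %.0f write MB/s\n",
		poolBandwidth(readCap, 20), poolBandwidth(writeCap, 20))
}
```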
```yaml
apiVersion: ioi.intel.com/v1
kind: NodeStaticIOInfo
metadata:
  name: node1-staticinfo
spec:
  nodeName: node1
  disks:
    - id: "disk-sda"
      path: "/dev/sda"
      capacity:
        read: 650   # MB/s
        write: 600  # MB/s
  network:
    linkSpeed: 1000  # Mbps
```
```yaml
apiVersion: ioi.intel.com/v1
kind: NodeIOStatus
metadata:
  name: node1-nodeioinfo
status:
  disksStatus:
    disk-sda:
      GA:
        In: 500   # 500 MB/s read available
        Out: 400  # 400 MB/s write available
      BE:
        In: 130   # 130 MB/s read available
        Out: 120  # 120 MB/s write available
  networkStatus:
    GA: { In: 800, Out: 800 }
    BE: { In: 50, Out: 50 }
```
```yaml
apiVersion: ioi.intel.com/v1
kind: IOIPolicy
metadata:
  name: default-policy
spec:
  ioClasses:
    - name: "GA"  # Guaranteed Allocation
      priority: 1
    - name: "BE"  # Best Effort
      priority: 2
```
```
Node-agent (node1) ──┐
Node-agent (node2) ──┤
Node-agent (node3) ──┼──> Aggregator ──> Batch Update ──> NodeIOStatus CRDs
       ...           │
Node-agent (nodeN) ──┘
```
```
Node-agent (node1) ──> Write NodeIOStatus CRD ──> API Server
Node-agent (node2) ──> Write NodeIOStatus CRD ──> API Server
Node-agent (node3) ──> Write NodeIOStatus CRD ──> API Server
       ...
```
```
Node-agent (500 nodes) → Write NodeIOStatus CRD → API Server
Frequency: every 2-5 seconds per node
```
```
Node-agent (500 nodes) → gRPC → Aggregator → Batched CRD writes → API Server
Frequency: batched every 5 seconds
```
```
Scenario: 20 BE pods × 5 Mbps minimum = 100 Mbps reserved for BE
GA available: 1000 Mbps - 100 Mbps = 900 Mbps maximum
```
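The reservation rule above is plain subtraction, but it is worth pinning down since the scheduler's GA ceiling depends on it (the helper name is illustrative):

```go
package main

import "fmt"

// gaCeilingMbps computes the maximum network bandwidth the GA pool may
// claim once every BE pod's minimum share has been reserved.
func gaCeilingMbps(linkMbps, beMinMbps float64, bePods int) float64 {
	return linkMbps - beMinMbps*float64(bePods)
}

func main() {
	// 20 BE pods, each guaranteed a 5 Mbps floor, on a 1000 Mbps link:
	fmt.Printf("GA ceiling: %.0f Mbps\n", gaCeilingMbps(1000, 5, 20))
}
```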
- Phase 1: The Foundation - Upgrade Kubernetes from v1.18 to v1.22 to enable cgroup v2 support (required for IO throttling)
- Phase 2: The Solution - Integrate IOIsolation to provide IO-aware scheduling and preventive isolation
- No visibility into node IO bandwidth (disk or network)
- No way for pods to request IO resources
- No mechanism to prevent multiple high-IO workloads from landing on the same node
- Multiple IO-hungry pods would be scheduled on the same node (no IO capacity awareness)
- These pods would consume excessive disk or network bandwidth
- Kernel CPU would spike due to interrupt handling and context switching
- The Docker daemon would hang, unable to respond to requests
- Kubelet's Pod Lifecycle Event Generator (PLEG) would timeout trying to reconcile container state
- Nodes would be marked NotReady, triggering cascading alerts and pod evictions
- IRQ counts spiking during IO-heavy workload activity
- Excessive context switching correlating with IO load
- Docker daemon behavior and recovery patterns
- The Scheduler Issue: Standard K8s is "IO-unaware." It treats a node with 10 pods and a node with 50 pods as equally "available" if the CPU/Mem metrics are the same, completely ignoring the IOPS/throughput saturation on the underlying EBS/disk.
- The Stack Issue (Dockershim): High IO workloads don't just slow down neighbors; they saturate the Docker daemon. Because Dockershim/Docker is a centralized bottleneck, a single pod's IO burst can cause the daemon to hang, leading to a NodeNotReady state.
- The cgroup v1 Limitation: You can't fix this with software tweaks because cgroup v1 cannot track or throttle buffered IO (writes that go to the page cache first).
- IO-aware scheduling: Place pods based on available IO capacity (scheduler plugin)
- Preventive isolation: Throttle buffered I/O before it saturates the kernel (cgroup v2)
- Setting kernel boot parameters to enable cgroup v2
- Validating compatibility with existing workloads
- We couldn't flip the entire fleet to cgroup v2 overnight
- Different node pools might be at different migration stages
- Rollback scenarios required v1 compatibility
- Scheduler Plugin - Makes IO-aware pod placement decisions
- Node Agent - Monitors IO usage and enforces limits via cgroups
- Custom Resource Definitions (CRDs) - Configuration and state management
- Aggregator (optional) - Centralized monitoring and coordination
- Parses IO requirements from pod annotations
- Determines IO class (GA = Guaranteed Allocation, BE = Best Effort)
- Eliminates nodes without sufficient IO capacity
- Checks NodeIOStatus CRD for each node's available bandwidth
- Ranks remaining nodes by available IO capacity
- The LeastAllocated strategy acts as a natural dampener, spreading the pods out and preventing a "thundering herd" from piling onto a single node before the CRDs can update.
- Assigns pod to selected node
- Updates node's reserved bandwidth
- Monitors IO via eBPF or io.stat (cgroup v2)
- Writes bandwidth limits to cgroup files
- Communicates with node-agent via gRPC over Unix socket
- Receives bandwidth data from ioi-service (via gRPC)
- Calculate actual usage per pod, per QoS class
- Recalculate available bandwidth
- Updates cgroup io.max limits via ioi-service
- Receives bandwidth metrics from all node-agents via gRPC
- Batches updates (every 5 seconds, configurable)
- Writes to NodeIOStatus CRDs in Kubernetes API server
- Reduces API server write load by ~90% at scale
- Step 1: Pod Submission & Specification
- Step 2: Scheduler Plugin Filters Nodes
- Step 3: Node-Agent Registers Pod
- Step 4: IOI-Service Applies Limits
- Step 5: Continuous Monitoring - Every 2-5 seconds:
  - ioi-service reads io.stat
  - Calculates bandwidth: 180 MB/s read, 95 MB/s write (under limit)
  - Sends to node-agent via gRPC
  - Node-agent updates NodeIOStatus CRD (directly or via aggregator)
  - Scheduler sees updated capacity for future scheduling decisions
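The monitoring side of this loop boils down to reading cgroup v2 `io.stat` counters, diffing them across the sampling interval, and, for enforcement, formatting an `io.max` rule. A self-contained sketch with made-up counter values; the `8:0` device major:minor and all helper names are illustrative:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIOStat extracts rbytes/wbytes for one device from a cgroup v2
// io.stat snapshot (lines like "8:0 rbytes=1459200 wbytes=314773504 ...").
func parseIOStat(snapshot, device string) (rbytes, wbytes uint64) {
	for _, line := range strings.Split(snapshot, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != device {
			continue
		}
		for _, f := range fields[1:] {
			kv := strings.SplitN(f, "=", 2)
			if len(kv) != 2 {
				continue
			}
			n, _ := strconv.ParseUint(kv[1], 10, 64)
			switch kv[0] {
			case "rbytes":
				rbytes = n
			case "wbytes":
				wbytes = n
			}
		}
	}
	return rbytes, wbytes
}

// bandwidthMBps converts a byte-counter delta over an interval to MB/s.
func bandwidthMBps(prev, curr uint64, intervalSec float64) float64 {
	return float64(curr-prev) / intervalSec / 1e6
}

// ioMaxLine formats a cgroup v2 io.max rule limiting read/write
// bytes per second for a device, e.g. "8:0 rbps=209715200 wbps=104857600".
func ioMaxLine(device string, readMBps, writeMBps float64) string {
	return fmt.Sprintf("%s rbps=%d wbps=%d",
		device, int64(readMBps*1e6), int64(writeMBps*1e6))
}

func main() {
	// Two hypothetical io.stat snapshots taken 5 seconds apart:
	prev := "8:0 rbytes=1000000000 wbytes=500000000 rios=10 wios=5"
	curr := "8:0 rbytes=1900000000 wbytes=975000000 rios=20 wios=9"

	r0, w0 := parseIOStat(prev, "8:0")
	r1, w1 := parseIOStat(curr, "8:0")
	fmt.Printf("read %.0f MB/s, write %.0f MB/s\n",
		bandwidthMBps(r0, r1, 5), bandwidthMBps(w0, w1, 5))
	// A 200 MB/s read / 100 MB/s write limit would be written to io.max as:
	fmt.Println(ioMaxLine("8:0", 200, 100))
}
```

With these sample counters the loop reports 180 MB/s read and 95 MB/s write, matching the numbers in Step 5 above.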
- Step 6: Dynamic Adjustment - the same 2-5 second loop repeats continuously, keeping io.max limits and NodeIOStatus current as pods come and go.
- NRI (Node Resource Interface): Tighter integration with containerd, official Kubernetes API
- OOB (Out-of-Band): Watching Kubernetes API and pod events from outside the container runtime
- Less invasive to the container runtime stack
- Simpler rollback path (no runtime modifications)
- Technically, there is still an asynchronous micro-race condition of a few milliseconds before io.max is written. However, a container doing unthrottled I/O for 50 milliseconds at startup will never cause a PLEG timeout or node lockup.
- Use NRI when: You want tighter runtime integration, have experience with NRI, need lower latency detection
- Use OOB when: You want simpler deployment, easier rollback, or don't want to modify the runtime stack
- eBPF: Kernel-level tracing, more granular visibility
- io.stat: Cgroup v2 native statistics, simple file reads
- Cgroup v2 native interface, simpler implementation
- Sufficient granularity for our use case
- Fewer moving parts, easier to debug
- Try to collect eBPF metrics
- If eBPF fails, fall back to io.stat
- Ensure metrics are always available, regardless of eBPF status
- Dramatically reduces API server write load at scale (~90% reduction)
- Batched updates are more efficient
- Centralized monitoring view
- Single point of failure: If the aggregator fails, monitoring stops
- Added complexity: One more component to deploy, monitor, debug
- Not necessary at small scale: For <100 nodes, the API server can handle direct writes
- AWS reports link speed: 1000 Mbps (from /sys/class/net/eth0/speed)
- But actual available capacity is unknown and varies with network congestion, switch limitations, etc.
- The TCP Measurement Trap: We intentionally decided not to measure live TCP bandwidth dynamically like we do for disk. Measuring live TCP throughput is computationally expensive and highly volatile. Instead, relying on a static, conservative ceiling for GA and a guaranteed minimum floor for BE pods protects the system with near-zero compute overhead.
- The EBS Dual-Constraint: In AWS, EBS volumes are network-attached. This means EBS burst limits are constrained by two factors: the disk's io.max AND the instance's network bandwidth. By strictly throttling network traffic, we inadvertently created a secondary safeguard against EBS burst depletion.
- Guarantees TCP viability: Each BE pod gets enough bandwidth
- Simple and predictable: Easy to reason about and configure
- Scheduler-aware: The scheduler knows exactly how much bandwidth is available for GA pods
- Prevents starvation: GA cannot accidentally starve BE pods below their minimum
- Cluster size: 15-20 nodes
- Instance types: Mix of c5 and r5 instances (AWS Nitro)
- Workloads: IO-heavy applications similar to production (simulated database workloads, data processing jobs)
- Duration: Several weeks of testing
- Scheduler correctly placed pods based on IO capacity
- Node agents profiled disks and applied bandwidth limits
- Aggregator collected metrics and updated CRDs
- System ran for several weeks without major failures
- Rollback strategy if the system caused issues
- Additional complexity in the platform
- Foundation: Kubernetes v1.22+, cgroup v2 enabled, dual filesystem compatibility
- Components: Scheduler plugin, node agent, aggregator, CRDs
- Validation: Test cluster (15-20 nodes) proved the technical approach works
- Approval: Design approved by 4 cross-functional stakeholders
- IOPS and Latency Awareness: Currently, the system isolates based on raw bandwidth (MB/s). For strict database performance, extending io.max enforcement to include IOPS (riops/wiops) is the next logical step.
- L3 Cache Isolation (Intel RDT/CAT): At high density, L3 cache contention causes P99 latency degradation even when CPU priority is correct. Intel CAT (Cache Allocation Technology) can partition L3 cache between GA and BE workloads - already supported by the IOIsolation framework.
- Built the core IOIsolation system and iterated on design through extensive technical discussions.
- https://github.com/intel/cloud-resource-scheduling-and-isolation