Tools: Kubernetes Advanced Scheduling (Hidden Gems of Kubernetes)
Source: Dev.to
Kubernetes is often described as a scheduler's operating system. The default scheduler (kube-scheduler) does an excellent job of placing pods on nodes based on basic resource requests and limits. However, in complex, production-grade environments, the default "fit and spread" logic is often not enough. While Custom Resource Definitions (CRDs) and Operators get most of the spotlight in the ecosystem, the control plane itself hides a treasure trove of powerful scheduling levers. Here is a deep dive into the advanced scheduling features that separate a functional cluster from a finely tuned, production-ready one.

Problem 1: Pod Topology Spread Constraints: Orchestrating Disaster Recovery

The Problem:
By default, a ReplicaSet ensures a certain number of replicas are running, but it doesn't care where they run. In a cloud environment, if all pods land in the same Availability Zone (AZ) and that zone fails, you experience a full outage. Similarly, if pods are packed onto the same node, a node failure takes out the entire service.

The Solution:
Pod Topology Spread Constraints allow you to enforce strict or best-effort distribution rules across arbitrary failure domains (e.g., zones, nodes, or even custom host labels).

How it works:
You define a topologySpreadConstraints field in your Pod spec. You specify:
• topologyKey: The label key on nodes that defines the domain (e.g., topology.kubernetes.io/zone).
• maxSkew: The maximum allowable difference in the number of pods across domains.
• whenUnsatisfiable: What to do if the skew can't be met (DoNotSchedule vs. ScheduleAnyway).

Technical Deep Dive:
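Putting the three fields together, a Deployment's pod template might carry a constraint like this (a sketch; the app name and label values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # illustrative name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - topologyKey: topology.kubernetes.io/zone  # the failure domain
          maxSkew: 1                                # at most 1 pod difference between zones
          whenUnsatisfiable: DoNotSchedule          # strict: leave pods Pending rather than violate
          labelSelector:
            matchLabels:
              app: web                              # which pods to count per domain
      containers:
        - name: web
          image: nginx:1.25
```

With three zones and six replicas, maxSkew: 1 forces a 2/2/2 distribution; swapping in ScheduleAnyway would make the spread a soft preference instead of a hard rule.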
Unlike PodAffinity/PodAntiAffinity, which are binary (a candidate node either satisfies the rule or it doesn't), Spread Constraints are quantitative: they look at the distribution delta across domains.
• Use Case 1: Strict HA: You can ensure that if you have 3 zones, a deployment of 6 pods lands exactly 2 per zone. If a zone becomes unhealthy and its pods must be replaced, a DoNotSchedule constraint keeps the replacements Pending rather than letting them pile into the surviving zones beyond the allowed skew.
• Use Case 2: Rolling Updates: During a rolling update, new pods are created. Without spread constraints, the scheduler might fill up one zone first to bin-pack. With spread constraints, the new pods are spread evenly from the start, maintaining balance during the transition.

Problem 2: Pod Overhead: Accounting for the Invisible Resources

The Problem:
When using "microVMs" (Kata Containers, gVisor) or sidecar-heavy meshes (Istio), the user container is not the only thing consuming resources. The runtime shim or the sidecar proxy consumes CPU and memory before the main process starts. If the scheduler ignores this, it will overcommit the node, leading to throttling or node pressure.

The Solution:
Pod Overhead is a feature used primarily with RuntimeClasses. It allows you to define the resources consumed by the infrastructure (the runtime) per pod.

How it works:
You define a RuntimeClass that includes an overhead field. Any pod using that RuntimeClass automatically adds this overhead to its scheduling calculations.

Technical Deep Dive:
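A minimal sketch of the two objects involved (the handler name and overhead values are illustrative; the overhead lives under overhead.podFixed in the node.k8s.io API):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata               # illustrative name
handler: kata              # must match a handler configured in the CRI runtime
overhead:
  podFixed:                # added to every pod that uses this class
    memory: "50Mi"
    cpu: "250m"
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app      # illustrative name
spec:
  runtimeClassName: kata   # pulls in the overhead above automatically
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          memory: "512Mi"  # scheduler fits this pod as 512Mi + 50Mi = 562Mi
          cpu: "500m"      # and as 500m + 250m = 750m
```

Note that users never set overhead on the pod themselves; the RuntimeClass admission controller injects it into the pod spec at creation time.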
Scheduling is a math problem. The scheduler sums pod.spec.containers[*].resources.requests to determine node fit.
Pod Overhead injects an additional "invisible container" into this calculation.
• Example: A Kata container might need 50MB of memory and 5% of a CPU for the VM kernel and agent. The user requests 512MB for their app. The scheduler therefore sees a total demand of 562MB. Without this, the node would be overcommitted, and the VM might crash.
• Monitoring Impact: This also shows up in node accounting: the overhead counts against the node's allocatable resources (visible, for example, in kubectl describe node), giving a more accurate picture of node utilization.

Problem 3: Scheduler Profiles & Extenders: Custom Logic Without Custom Code

The Problem:
You need a specific scheduling rule (e.g., "Don't schedule GPU pods on nodes with Spot instances," or "Prefer nodes with SSDs for databases"). Rewriting the entire Kubernetes scheduler from scratch is a daunting and fragile task.

The Solution:
Scheduler Profiles (introduced in v1.18) and Scheduler Extenders (legacy but powerful) allow you to inject custom logic into the scheduling pipeline.
• Scheduler Profiles: Allow a single kube-scheduler binary to run multiple scheduling configurations side by side. Each profile enables its own set of plugins (default or custom), and a pod selects a profile via spec.schedulerName.
• Scheduler Extenders: A process external to the scheduler that acts as a "webhook" for scheduling. The scheduler sends it the list of nodes that passed filtering, and the extender filters or prioritizes them further.

Technical Deep Dive:
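Both mechanisms are wired up in the KubeSchedulerConfiguration file passed to kube-scheduler via --config. A sketch with two profiles and one extender (profile names and the extender URL are illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler     # bin-packs general workloads
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated          # prefer fuller nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
  - schedulerName: spread-scheduler      # pods opt in via spec.schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated         # prefer emptier nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
extenders:                               # the legacy HTTP webhook mechanism
  - urlPrefix: "http://scheduler-extender.kube-system.svc:8888"  # illustrative
    filterVerb: filter                   # scheduler POSTs candidates to <urlPrefix>/filter
    prioritizeVerb: prioritize
    weight: 1
    nodeCacheCapable: false
```

A pod that sets spec.schedulerName: spread-scheduler is scored by the second profile; everything else falls through to the default.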
The scheduling cycle is split into phases: Filtering (Predicates), Scoring (Priorities), and Binding.
• Multi-scheduling: With Profiles, you could have one profile for general workloads that bin-packs tightly, and another profile for critical workloads that spreads thinly across nodes.
• Extenders: Imagine you have a hardware accelerator connected via USB to some nodes. The scheduler doesn't know about USB devices. An extender can look at a pod annotation, check an inventory database, and filter out nodes that don't have the USB device plugged in, allowing the pod to land only on physically capable hardware.

Problem 4: NodeResourcesFit with Pod Overhead: The Hidden Math

The Problem:
We covered Pod Overhead above, but it doesn't act in isolation. The real magic happens when the scheduler's NodeResourcesFit plugin interacts with it. Many engineers assume that if they set Pod Overhead, the scheduler just "adds it." But understanding how it adds it reveals potential pitfalls.

The Solution:
The NodeResourcesFit plugin is the component that checks whether a node has enough resources. Its integration with Pod Overhead ensures that overhead is treated as a first-class citizen during the Filtering and Scoring phases.

Technical Deep Dive:
There is a nuance here regarding Scoring. The scoring algorithm usually calculates a score based on the ratio of pod requests to node allocatable resources. With Pod Overhead:
• Total Pod Compute: (Sum of Container Requests) + (Pod Overhead).
• Node Consumption Calculation: When scoring, the scheduler looks at the current node usage + Total Pod Compute.
• MostRequested vs. LeastRequested (MostAllocated/LeastAllocated in the current NodeResourcesFit args): If you are using a most-requested (bin-packing) strategy, the inclusion of overhead means the scheduler will actually pack nodes tighter, because it accounts for the "wasted" overhead resource, keeping the user payload dense relative to the total claimed resources. For example, a container requesting 512MB with 50MB of overhead is scored as 562MB against the node's allocatable, not 512MB. If you forget this, your bin-packing math will be off by the margin of your overhead, potentially leaving CPU on the table.

These scheduling features represent the difference between "running Kubernetes" and "engineering Kubernetes." By leveraging Topology Spread Constraints, you ensure business continuity. By using Pod Overhead, you maintain financial and operational accuracy in mixed runtime environments. And by utilizing Scheduler Profiles, you unlock the ability to tailor the control plane to your specific hardware and business logic without fighting the upstream project. The scheduler is not a black box; it is a configurable engine. These "hidden gems" are the keys to unlocking its full potential.