apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64
spec:
  template:
    metadata:
      labels:
        # -----------------------------------------------
        # These labels land on the EC2 node.
        # Your pod affinity rules match against these.
        # -----------------------------------------------
        node-pool: spot-arm64
        capacity-type: spot
        arch: arm64
        workload-class: standard
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # c6g, c7g, c8g: compute optimized Graviton
          # m6g, m7g, m8g: general purpose Graviton
          # r6g, r7g, r8g: memory optimized Graviton
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]  # Graviton2+ only (gen 6,7,8)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h  # 7 days, shorter for spot nodes
  limits:
    cpu: 500
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 100
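The comment in the NodePool manifest says pod scheduling rules match against the template labels. As a minimal sketch of what that could look like on the workload side, the Deployment below pins its pods to this pool with a `nodeSelector`. The Deployment name, image, replica count, and resource requests are hypothetical placeholders and do not come from our actual setup.

```yaml
# Hypothetical workload: name, image, and resource figures are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      nodeSelector:
        node-pool: spot-arm64        # custom label set in the NodePool template
        kubernetes.io/arch: arm64    # well-known label; keeps amd64-only images off these nodes
      containers:
        - name: app
          # Placeholder image; it must be built for arm64 (or be multi-arch).
          image: registry.example.com/example-api:1.0.0
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```

Selecting on the custom `node-pool` label ties the workload to this one pool, so it is best reserved for pods that tolerate Spot interruptions. Karpenter sizes nodes from the pod's resource requests, so leaving them unset undermines right-sizing.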

- Rising and unpredictable costs
- Compliance constraints
- The need for deeper infrastructure control

- Over-provisioned workloads leading to higher bills
- No access to Spot/Preemptible node strategies with the same level of flexibility
- Very few cost optimization knobs to tune

- Fine-grained IAM control at the workload level
- Strict network isolation between services
- Audit-level visibility into infrastructure activity

- Enforcing per-pod IAM permissions cleanly was non-trivial
- Network policy enforcement had gaps in our specific setup
- Generating audit-ready logs tied to individual workload actions required workarounds

- CPU vs. memory-optimized instance selection
- ARM-based workloads on Graviton processors
- Custom AMIs or low-level networking tuning

- Instance families: CPU-optimized, memory-optimized, ARM (Graviton)
- Custom AMIs: hardened images meeting our internal security baseline
- Networking: VPC-native networking with fine-grained subnet and security group control

- Watches for unschedulable pods in real time
- Selects the right-sized instance based on actual pod requirements
- Prioritizes Spot instances where workloads allow, falling back to On-Demand seamlessly
- Bin-packs nodes efficiently, reducing idle capacity

- IRSA (IAM Roles for Service Accounts): precise, per-pod IAM permissions with no shared credentials
- VPC-level isolation: full control over ingress, egress, and inter-service communication
- CloudTrail integration: every API call and node action is auditable out of the box
- AWS Config + Security Hub: continuous compliance checks against CIS benchmarks and custom rules

- Spot-first provisioning: workloads that tolerate interruptions run on Spot; stateful services stay on On-Demand
- Multiple instance families: Karpenter picks the cheapest right-sized option across families
- Interruption handling: we use the Karpenter interruption queue (SQS) to gracefully drain Spot nodes before AWS reclaims them
- consolidateAfter: 2m, so nodes are deprovisioned 2 minutes after going idle, eliminating ghost capacity

- Spot instances covering the majority of our non-critical workloads
- Karpenter's bin-packing eliminating idle node waste
- Right-sized instances instead of over-provisioned static node groups

- Faster pod scheduling: Karpenter provisions new nodes in under 60 seconds in most cases
- Better workload isolation through custom node selectors and taints
- Graviton (ARM) instances for compatible workloads gave us a meaningful price-performance improvement

- Audit reports now generated directly from CloudTrail without custom tooling
- IRSA eliminated shared IAM credential risks
- Security Hub provides continuous posture monitoring against our compliance framework

- Managed ≠ always optimal at scale. Autopilot is excellent for getting started, but production-grade platforms eventually need control surfaces that fully managed offerings deliberately hide.
- Cost optimization requires infrastructure access. You can't tune what you can't see.
- Autoscaling strategy matters more than cluster size. Karpenter's approach of provisioning for the pod rather than scaling a group changed how we think about capacity planning entirely.
- Compliance is easier when the platform is designed for it. AWS's native compliance tooling removed a category of work that we were previously solving with custom scripts and log forwarding pipelines.
- Migration should always be incremental. Running a parallel environment, cutting DNS over gradually, and using canary deployments meant we caught issues in staging before they became production incidents.
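
The Spot-first strategy with On-Demand fallback described earlier can be expressed as a second, lower-weight NodePool next to `spot-arm64`. The manifest below is a sketch under stated assumptions, not our exact configuration: the pool name `on-demand-arm64`, its `weight` of 10, and its CPU limit are hypothetical values chosen for illustration. It reuses the `default` EC2NodeClass and the same instance requirements as the Spot pool.

```yaml
# Hypothetical fallback pool: name, weight, and limits are illustrative assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-arm64
spec:
  template:
    metadata:
      labels:
        node-pool: on-demand-arm64
        capacity-type: on-demand
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 200  # assumed cap to bound On-Demand spend
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  # Lower than the Spot pool's weight of 100, so Karpenter tries Spot first.
  weight: 10
```

Karpenter considers higher-weight NodePools first, so this pool only receives pods that the Spot pool cannot accommodate. Pods that pin themselves to `node-pool: spot-arm64` cannot fall back to it; workloads that should fall back need a broader selector, such as `kubernetes.io/arch: arm64` alone.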