# Bare Metal vs. AWS RDS: CPU/NUMA Pinning and HugePages — How We Beat Aurora on Write Throughput (Part 2)
In this article:

- Setup Recap
- The 3-Layer Tuning Stack
  - Layer 1: KVM Hypervisor (Bare Metal Host)
  - Layer 2: VM / OS Level
  - Layer 3: Kubernetes Pod Spec
- Why All Three Layers Matter
- Tuning 1 Results
- Tuning 2: HugePages
- Tuning 2 Results
- Full Tuning Journey
- Final Comparison: CNPG Tuning 2 vs AWS
- Average RW TPS (All Environments)
- ⚠️ The RDS Standard Caveat: Burstable CPU
- Per-Client Write TPS Breakdown
- Per-Client Write Latency
- Why Aurora Loses on Write Latency
- Platform Selection Guide
- Key Takeaways
In Part 1, we established storage baselines: Local SSD vs Longhorn vs AWS managed PostgreSQL. This article goes deeper: CPU/NUMA pinning and HugePages push bare-metal write performance past Aurora IO-Optimized at every concurrency level.

## Setup Recap

In Part 1, we ended with CNPG Local SSD: bare metal with direct-attached storage and an AWS-matched PostgreSQL config, already leading Aurora on write TPS at baseline. The question was: how much further can we push it without adding hardware? Two steps. Significant results.

Same constraints as Part 1:

- 2 vCPU / 8 GB RAM, single instance, no HA
- Same PostgreSQL config, matched to AWS defaults
- Same benchmark: pgbench · scale factor 100 · 60s per run · 39 runs · ap-southeast-3

Where we left off: CNPG Local SSD (Baseline).

## The 3-Layer Tuning Stack

Most PostgreSQL performance articles stop at database config. This one goes deeper. The gains in this article come from tuning three layers simultaneously:

- Layer 1: the KVM hypervisor (bare metal host)
- Layer 2: the VM / OS level
- Layer 3: the Kubernetes pod spec

Each layer is required for the next to work correctly.

## Why All Three Layers Matter

HugePages must be pre-allocated at OS boot, before PostgreSQL starts; they cannot be allocated on demand. 8192 × 2 MB = 16 GB pre-allocated, enough to cover the 8Gi of hugepages requested by the pod, with headroom. The performance governor eliminates clock-speed throttling for bursty query patterns. Remove any one layer and performance degrades. This is why the benchmark results are reproducible, but not trivially so: you need all three layers configured correctly.

## Tuning 1: CPU/NUMA Pinning

At baseline, PostgreSQL was allocated 2 vCPU with no CPU affinity: it ran on whatever cores the kernel scheduled, potentially crossing NUMA boundaries on every memory access, with clock speed throttled by the default powersave governor. Three changes were applied simultaneously:

1. **Dedicated CPU cores** (Kubernetes CPU Manager `static` policy). Pins the PostgreSQL pod to specific physical cores and eliminates context switching with other workloads.
2. **CPU governor: powersave → performance.** The default governor throttles clock speed at low load, so every short transaction pays a ramp-up penalty.
3. **NUMA pinning.** The PostgreSQL process is pinned to cores on the same NUMA node as its memory allocation. Cross-NUMA memory access adds 30–40% latency on NUMA-enabled systems, and our 32-core host is NUMA-aware.

## Tuning 1 Results

The single-client write latency drop from 7.48 ms to 1.81 ms is the most dramatic result: this is the NUMA penalty being eliminated. Short transactions no longer wait for cross-NUMA memory access.

## Tuning 2: HugePages

HugePages reduce TLB (Translation Lookaside Buffer) pressure by mapping PostgreSQL's shared buffer pool with 2 MB pages instead of the default 4 KB. Fewer TLB entries mean fewer TLB misses under concurrent access. This is enabled at three levels of the stack.

Why requests = limits? This gives the pod the Guaranteed QoS class: Kubernetes will not evict or throttle it under resource pressure. It also enables the CPU Manager static policy to pin exclusive physical cores to this pod, which is what makes NUMA affinity effective.

## Tuning 2 Results

An incremental improvement: HugePages reduce TLB contention at high concurrency. The impact is smaller than NUMA pinning but consistent across all workload types.

## Full Tuning Journey

Write latency progression (1 client): a 75% write latency reduction from Baseline to Tuning 2. Same hardware, same config.

## Final Comparison: CNPG Tuning 2 vs AWS

- Overall avg TPS: CNPG Tuning 2 (3,351) nearly matches Aurora IO-Optimized (3,480), a difference of just -3.7%.
- Write TPS: CNPG Tuning 2 (1,706) beats Aurora IO-Optimized (1,234) by +38%.
- Write latency: CNPG Tuning 2 (24.44 ms) beats Aurora IO-Optimized (29.72 ms) by 17.7%.

The honest picture: with 2 vCPU and a mid-range SAS SSD, CNPG Tuning 2 matches Aurora IO-Optimized on overall throughput (-3.7%) while beating it by 38% on write TPS. Aurora leads on reads (~14% higher avg RO TPS); this reflects its distributed read-cache architecture, not a config gap. We verified by pushing PostgreSQL to its limit (shared_buffers=6GB, random_page_cost=1.1, effective_io_concurrency=200), and the read ceiling held. For write-intensive OLTP, bare metal wins. For read-heavy analytical workloads, Aurora's distributed cache is worth paying for.

## ⚠️ The RDS Standard Caveat: Burstable CPU

RDS Standard (t3.large) leads the benchmark at 4,826 overall avg, but this number requires an important caveat: t3 instances use a CPU credit system. Each benchmark run is 60 seconds, and the full test suite takes ~50 minutes, within the burst window for t3.large. Our results therefore reflect peak burst performance, which is valid for this benchmark duration. In a production workload running continuously 24/7, however, RDS Standard t3.large performance will drop once CPU credits are depleted.

If you need sustained, predictable performance on AWS, consider a non-burstable instance class: for a truly fair comparison against bare metal, RDS m6i.large or m7g.large would be the appropriate AWS counterpart, not t3.large.

Bottom line: our benchmark results for RDS Standard are valid; the ~50-minute test ran within the burst window. But a production workload running continuously 24/7 on t3.large will eventually underperform these numbers once CPU credits exhaust. CNPG Tuning 2's 3,351 overall avg is consistent regardless of duration: no burst credits, no performance cliffs.

## Per-Client Write TPS Breakdown

CNPG Tuning 2 beats Aurora IO-Optimized on RW TPS at every concurrency level. RDS Standard leads at 25–100 clients due to t3 burst credits.

## Per-Client Write Latency

Bare metal wins at every concurrency level vs Aurora.
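To make the OS-layer HugePages step concrete, the boot-time pre-allocation described earlier might look like the sysctl fragment below. The file path is illustrative; some setups use kernel boot parameters instead:

```
# /etc/sysctl.d/90-hugepages.conf  (illustrative path; Layer 2: VM / OS)
# Pre-allocate 8192 x 2 MB HugePages = 16 GB before PostgreSQL starts,
# covering the pod's 8Gi hugepages request with headroom.
vm.nr_hugepages = 8192
```

PostgreSQL then needs `huge_pages = on` (or `try`) in `postgresql.conf` so the shared buffer pool actually maps onto the reserved pages.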
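At the pod layer, the "requests = limits" rule can be sketched as a standard Kubernetes resource stanza. The values come from the setup described above; the surrounding manifest is omitted:

```yaml
# Illustrative pod/container resource stanza.
# requests must equal limits (Guaranteed QoS), and CPU must be a whole
# number, for the kubelet's CPU Manager static policy to grant
# exclusive physical cores.
resources:
  requests:
    cpu: "2"
    memory: 8Gi
    hugepages-2Mi: 8Gi
  limits:
    cpu: "2"
    memory: 8Gi
    hugepages-2Mi: 8Gi
```

This only takes effect on nodes whose kubelet runs with `--cpu-manager-policy=static`; with the default `none` policy, the pod still gets Guaranteed QoS but no exclusive cores.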
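To put a rough number on the burst caveat, here is a back-of-envelope model of t3 credit depletion under sustained full load. The t3.large figures used (2 vCPU, 36 credits earned per hour, 864 maximum credit balance, 30% per-vCPU baseline) are assumptions taken from AWS's published specs and should be verified against current documentation:

```python
# Rough model of t3 CPU-credit depletion under sustained 100% load.
# Assumed t3.large figures (verify against current AWS docs):
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 36
MAX_CREDIT_BALANCE = 864

# One CPU credit = one vCPU at 100% for one minute, so full
# utilization burns 60 credits per vCPU per hour.
spend_per_hour = 60 * VCPUS                          # 120 credits/hour
net_burn = spend_per_hour - CREDITS_EARNED_PER_HOUR  # 84 credits/hour

hours_until_throttle = MAX_CREDIT_BALANCE / net_burn
print(f"~{hours_until_throttle:.1f} h of sustained 100% load before throttling")

# After depletion, the instance is pinned to its earn-rate baseline:
baseline_fraction = CREDITS_EARNED_PER_HOUR / spend_per_hour
print(f"post-depletion ceiling: {baseline_fraction:.0%} of full CPU")
```

Under these assumptions, a full credit balance survives roughly ten hours of flat-out load, then throughput falls to the ~30% baseline. A 50-minute benchmark never gets near that cliff; a 24/7 OLTP workload does.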
## Why Aurora Loses on Write Latency

Aurora replicates every write to its distributed storage fleet before acknowledging the commit. On bare metal with NUMA-pinned CPUs and local SSD, the write path never leaves the box. At 1 client, the Aurora write path is 3.51 ms vs 1.78 ms on bare metal: nearly 2× faster. At 100 clients the gap narrows as both become network/IO bound, but bare metal still leads.

← Part 1: Storage Baseline (Longhorn vs Local SSD vs Managed Cloud)

Iwan Setiawan, Hybrid Cloud & Platform Architect · portfolio.kangservice.cloud