How We Cut Rails on GKE Costs by 60%: The "Efficiency First" Roadmap
The vicious cycle: why the cluster needed so many Pods
How the cost actually came down
Part 1: Static Optimization
1.1 GKE Node generation upgrade: n1-highmem-2 → n2d-highmem-2
1.2 Rails process model: from 33 threads to 4 workers
Why Rails App Memory Bloat Happens: Causes and Solutions (and How I Cut It by 20%)
Part 2: Dynamic Optimization
2.0 Prerequisites: making Pods safe to autoscale
2.1 KEDA with Cron trigger
2.2 GKE Node Autoscaling
Results
Lessons learned
tl;dr: We reduced Google Kubernetes Engine (GKE) costs by 60%. The biggest wins came not from Kubernetes tuning, but from understanding why our Rails app needed so many Pods in the first place:

- Rails was running 1 Puma worker with 33 threads. Ruby's GVL made this effectively single-core. We switched to 4 workers with 8 threads each.
- API authentication used bcrypt on every request. We replaced it with a lighter method.
- The GKE node generation was outdated: upgrading from n1 to n2d gave 56% more CPU and 23% more RAM for 3% less cost.
- Only after fixing per-Pod efficiency did we add KEDA Cron autoscaling and GKE node autoscaling.

The order mattered. We improved per-Pod efficiency first, then used autoscaling to stop paying for idle capacity. The interesting part was not any one change by itself, but why these four changes reinforced each other. The rest of this article walks through the reasoning behind each one.

I work on a B2B SaaS platform that runs on Google Kubernetes Engine. The API server is a Rails application, and it was the biggest cost driver on our GKE cluster. Over the course of about nine months, we reduced GKE costs by roughly 60%, with most of the core technical changes concentrated in July and August 2025.

Most of the savings came from understanding how Rails actually uses CPU and memory, and from fixing misconfigurations that had been in place since launch. The Kubernetes-level changes (autoscaling, node upgrades) were important, but they only worked well because we fixed the application layer first. The key was looking at the system holistically, from Ruby's runtime behavior to Kubernetes scaling. This article explains why the savings came from four changes applied in the right order: better GKE node price-performance, a Rails process model that matched Ruby's concurrency model, removing unnecessary per-request CPU work, and only then introducing autoscaling.

The vicious cycle: why the cluster needed so many Pods

The high cost was not caused by a single problem. It was a cycle:

- Inefficiency: Rails performance was poor, partly due to misconfigured concurrency settings and a CPU-heavy authentication path.
- Over-provisioning: to handle traffic, we compensated by running more Pods.
- Waste: there was no autoscaling, so those Pods ran 24/7 regardless of actual demand.
- High unit cost: the nodes were old-generation instances with worse price-performance.

Each problem amplified the others. Inefficient Pods meant we needed more of them, which meant more nodes to host them, all running around the clock. The reduction did not come from one dramatic optimization. It came from breaking this cycle at multiple points: making each Pod more efficient, reducing unnecessary CPU work, and only then allowing the cluster to scale with demand.
How the cost actually came down

Before diving into the technical details, here is what the cost trajectory looked like. The chart below shows monthly GKE spending by SKU, and the trend is clear. This was not primarily a traffic reduction: API traffic remained broadly stable during this period, and the savings came mainly from efficiency improvements.

The reduction happened in two phases. In the first phase (early 2025 through June 2025), we simply reduced Pod and node counts that were clearly excessive. This brought costs down from the peak, but it was just trimming fat; the underlying inefficiencies remained. The second phase (July 2025 onward) is where the real optimization happened. It breaks down into two parts:

- Part 1: Static optimization, making each Pod and each node as efficient as possible.
- Part 2: Dynamic optimization, scaling capacity up and down with demand.

Part 1: Static Optimization

1.1 GKE Node generation upgrade: n1-highmem-2 → n2d-highmem-2

Our GKE nodes had been running on n1-highmem-2 since the early days of the service. The n1 family is Google Cloud's first generation of general-purpose instances, based on older Intel Skylake/Broadwell processors. We migrated to n2d-highmem-2, which uses AMD EPYC processors.

The performance difference is significant. According to Google's official CoreMark benchmarks, the move gave us 56% more CPU performance and 23% more RAM for 3% lower cost. The newer generation is faster, has more memory, and is cheaper. For us, this was one of the rare optimizations that improved both performance and cost at the same time.

The migration itself was straightforward: update the node pool configuration, cordon and drain the old nodes, and let the new pool take over. We did this in July 2025, and it shows up as a clear inflection point in the billing data.

This time, we chose x86-based N2D rather than Arm to preserve binary compatibility and minimize ecosystem risk around Ruby, native extensions, and container images. It was the most practical low-risk step. That said, Arm remains a possible next step for further cost reduction.

If your GKE nodes are still running on n1-series instances, check the benchmarks and pricing for n2d or newer. This was probably the biggest single contributor to the cost reduction.
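To make the pool swap concrete, here is a sketch of the steps with gcloud and kubectl. Cluster, pool names, and node counts are placeholders, not our actual configuration:

```shell
# Illustrative sketch of the node pool migration (names are placeholders).
# 1. Create a new node pool on the newer machine type.
gcloud container node-pools create pool-n2d \
  --cluster=prod-cluster --machine-type=n2d-highmem-2 --num-nodes=3

# 2. Cordon the old nodes so no new Pods are scheduled onto them.
kubectl cordon -l cloud.google.com/gke-nodepool=pool-n1

# 3. Drain the old nodes; Pods reschedule onto the n2d pool.
kubectl drain -l cloud.google.com/gke-nodepool=pool-n1 \
  --ignore-daemonsets --delete-emptydir-data

# 4. Delete the old pool once it is empty.
gcloud container node-pools delete pool-n1 --cluster=prod-cluster
```

With PodDisruptionBudgets in place, the drain step proceeds without dropping below your availability floor.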
1.2 Rails process model: from 33 threads to 4 workers

Our Rails application was running Puma with the following configuration:

- WEB_CONCURRENCY was not set (defaults to 1, meaning a single worker process)
- RAILS_MAX_THREADS=33

So we had one process with 33 threads. On the surface, that sounds like it should handle 33 concurrent requests efficiently. In practice, it did not give us the throughput we expected, because of how Ruby works.

Ruby's Global VM Lock (GVL)

Ruby has a Global VM Lock (GVL): only one thread can execute Ruby code at a time within a single process. Multiple threads help when threads are waiting on I/O (database queries, HTTP calls, file reads), because a waiting thread releases the GVL and lets another thread run. But for CPU-bound work, 33 threads in one process give you essentially the same CPU throughput as 1 thread. This meant our 33-thread setup was an illusion. We were paying for the memory overhead of 33 thread stacks without getting the CPU parallelism we thought we had.

On top of the GVL issue, our API used bcrypt for token authentication on every request. bcrypt is designed to be slow; that is its purpose as a password hashing algorithm. It is intentionally CPU-intensive to make brute-force attacks impractical. But we were running bcrypt on every API call, not just on login. Modern versions of bcrypt-ruby release the GVL during computation, so it does not block other threads. However, the CPU cost itself remains significant: every API request was burning CPU cycles on an expensive hashing operation, eating into the Pod's overall processing capacity. With hundreds of API calls per minute, this added up.

We replaced bcrypt with a lighter authentication method for API token validation. The specifics of the new approach are not relevant here, but the key point is: bcrypt is the right tool for password hashing at login. It is the wrong tool for per-request API authentication.

We changed the Puma configuration to:

- WEB_CONCURRENCY=4 (4 worker processes)
- RAILS_MAX_THREADS=8

Four worker processes, each with its own GVL, running on separate CPU cores. This gives actual CPU parallelism.
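The article deliberately does not disclose the replacement scheme. For illustration only, one common lighter pattern for high-entropy API tokens is to store a SHA-256 digest and compare digests in constant time; everything below (function names included) is a hypothetical sketch, not our actual implementation:

```ruby
require "digest"

# Hypothetical sketch: unlike low-entropy passwords, API tokens are long and
# random, so bcrypt's deliberate slowness buys nothing. A fast cryptographic
# digest plus a constant-time comparison is a common alternative.

def digest_token(token)
  Digest::SHA256.hexdigest(token)
end

# Constant-time comparison to avoid leaking match position via timing.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

stored = digest_token("tok_live_abc123")                 # persisted at creation
secure_compare(stored, digest_token("tok_live_abc123"))  # => true
secure_compare(stored, digest_token("wrong-token"))      # => false
```

This costs microseconds per request instead of bcrypt's tens of milliseconds, while bcrypt stays where it belongs: hashing passwords at login.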
Thread count was reduced to 8 per worker, which is enough for I/O concurrency without the memory overhead of 33 threads. That said, 8 is still conservative. Since Rails 7.2, the default Puma thread count is 3 per worker, reduced from the previous default of 5. We may reduce this further in the future, which would lower per-worker memory consumption even more. For guidance on tuning worker and thread counts for your environment, see Puma's deployment documentation.

The memory impact was counterintuitive. You might expect 4 workers to use 4x the memory of 1 worker. They do not, thanks to Copy-on-Write (CoW). When Puma forks worker processes, the child processes share the parent's memory pages and only allocate new memory when they write to a page. In practice, a large portion of the Rails application code, gems, and boot-time data remains shared.
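Putting the process-model changes together, the resulting puma.rb looks something like this. A minimal sketch, assuming the environment variables described above; the article does not show the actual file:

```ruby
# config/puma.rb — minimal sketch of the 4-worker / 8-thread process model.
# WEB_CONCURRENCY=4 and RAILS_MAX_THREADS=8 are set in the Pod's environment.

workers Integer(ENV.fetch("WEB_CONCURRENCY", 1))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 8))
threads threads_count, threads_count

# Load the app in the parent process before forking, so Copy-on-Write can
# share application code, gems, and boot-time data across workers.
preload_app!

on_worker_boot do
  # Each forked worker needs its own database connections.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end
```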
One important requirement for this to work is to use preload_app! in your Puma configuration. Without it, each worker loads the application independently after forking, and memory sharing does not happen.

We also set the environment variable MALLOC_ARENA_MAX=2 to reduce glibc's memory arena overhead. You can think of it as a small but useful guardrail against glibc holding on to too much memory. By default, glibc creates up to 8 * CPU cores malloc arenas, each of which holds onto freed memory independently. In a multi-process Rails setup, this leads to significant memory waste. Setting MALLOC_ARENA_MAX=2 cut per-Pod memory consumption by roughly 20%. I wrote a more detailed article about this: Why Rails App Memory Bloat Happens: Causes and Solutions (and How I Cut It by 20%).

Between CoW, MALLOC_ARENA_MAX, and fewer threads, our Pod memory request actually decreased from 4.2Gi to 3.5Gi, even after moving from one worker process to four. That means the memory cost of achieving real CPU parallelism dropped significantly.

But the memory improvement was secondary. The real cost impact was the reduction in Pod count. Previously, each Pod had only one Puma worker, which limited CPU-parallel Ruby execution and forced us to run more Pods to handle traffic. With 4 workers per Pod, the same traffic can be served by far fewer Pods. Fewer Pods means fewer nodes, and that is where the cost savings came from.

Part 2: Dynamic Optimization

2.0 Prerequisites: making Pods safe to autoscale

With each Pod running more efficiently, the next step was to stop paying for capacity we did not need outside of peak hours. But before turning on autoscaling, we needed to make sure Pods could be safely started and stopped at any time. Autoscaling means Kubernetes is constantly creating and destroying Pods. If your Pods are not set up for this, you trade cost savings for reliability problems. We configured the following before enabling any autoscaling:

Startup probe. Our Rails application takes a while to boot. It loads the framework, initializes gems, establishes database connections, and warms up caches.
Without a startup probe, Kubernetes may decide the Pod is unhealthy and kill it before it finishes starting. The startup probe gives the Pod the time it needs to complete initialization before liveness checks begin.

Readiness probe. This tells Kubernetes whether a Pod is ready to accept traffic. If a Pod becomes temporarily unable to serve requests (for example, during a heavy background job or a lost database connection), the readiness probe fails and the Pod is removed from the Service's endpoints until it recovers.

Liveness probe. This detects Pods that are running but stuck (a hung Rails process, for example). Kubernetes automatically restarts them. This is important for long-running Pods.

terminationGracePeriodSeconds. When KEDA scales in Pods during the evening, in-flight requests need time to complete. The grace period tells Kubernetes to wait before forcefully killing the Pod, preventing dropped requests during scale-in. In practice, setting the grace period alone may not be enough: Kubernetes removes the Pod from the Service endpoints in parallel with sending the termination signal, so some requests can still arrive after shutdown begins. Adding a short preStop hook (e.g., sleeping for a few seconds) gives the endpoints update time to propagate. See this article on graceful Kubernetes deployments for a detailed explanation of the timing issue.

These are not optional extras. They are prerequisites. Without them, autoscaling works on paper but fails in production. For details on configuring each probe type, see the Kubernetes documentation on liveness, readiness, and startup probes.

Before KEDA, our Pods ran at peak capacity 24/7. The replica count was set to handle the busiest time of day, and it stayed there overnight, on weekends, and on holidays. Our platform is a business application. Usage concentrates on weekday business hours (roughly 8:30 to 19:30 JST). Evenings and weekends are quiet. This traffic pattern is stable and predictable.
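Concretely, the prerequisites from 2.0 end up as fields on the Deployment's Pod template. The fragment below is an illustrative sketch; the health-check path, port, and timings are assumptions, not our actual values:

```yaml
# Illustrative Pod template fragment (paths, ports, and timings assumed).
spec:
  terminationGracePeriodSeconds: 60   # let in-flight requests finish on scale-in
  containers:
    - name: rails
      env:
        - name: MALLOC_ARENA_MAX      # glibc arena cap discussed in Part 1
          value: "2"
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30          # up to 30 x 10s for Rails to boot
        periodSeconds: 10
      readinessProbe:
        httpGet: { path: /healthz, port: 3000 }
        periodSeconds: 5
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        periodSeconds: 10
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # let endpoint removal propagate
```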
2.1 KEDA with Cron trigger

Given this predictability, we chose KEDA's Cron trigger. KEDA is an event-driven autoscaler for Kubernetes, and the Cron trigger is one of its simplest scaling options: it adjusts the replica count on a time-based schedule, rather than reacting to metrics like CPU or memory usage. Our schedule:

- Weekdays 08:00-20:00 JST: about 3–4x more replicas than the baseline off-hours level
- All other times: baseline replica count

When you know the traffic pattern in advance, this is simpler and more reliable than reactive scaling. There is no lag waiting for metrics to cross a threshold, no risk of flapping, and the configuration is easy to understand. No metrics, no thresholds, no reactive logic. For our traffic pattern, that simplicity was a feature, not a limitation.

One thing to note: Cron-based scaling assumes the traffic pattern stays stable. As the business grows, the current replica counts may not be enough. To catch this early, we monitor API response times with external synthetic monitoring. If response times start creeping up during business hours, that is the signal to revisit the replica counts. Cron scaling is not set-and-forget; it just changes infrequently.

2.2 GKE Node Autoscaling

With Pod counts now fluctuating on KEDA's schedule, we enabled GKE's Cluster Autoscaler. The logic is straightforward: when KEDA scales Pods down in the evening, some nodes end up underutilized, so the Cluster Autoscaler drains those nodes and removes them. When KEDA scales Pods back up in the morning, the autoscaler provisions new nodes to accommodate them. We configured a minimum node count for availability-zone redundancy and a maximum to cap costs. The combination of KEDA controlling Pod count and the Cluster Autoscaler controlling node count means we only pay for the capacity we are actually using.

Results

Comparing the H2 2024 average (before optimization) to the Q1 2026 average (after all changes were in place): the cost declined steadily from February 2025 through September 2025 as each optimization was rolled out. The N1 to N2D node migration in July 2025 is visible as a clear step down in the billing data. Costs stabilized from late 2025 onward.
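For reference, the Cron-trigger schedule described in 2.1 maps onto a KEDA ScaledObject like the sketch below. The resource names and replica counts are illustrative, not our production values:

```yaml
# Illustrative KEDA ScaledObject for weekday business-hours scaling.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rails-api
spec:
  scaleTargetRef:
    name: rails-api          # the Deployment to scale (name assumed)
  minReplicaCount: 4         # baseline off-hours level (assumed)
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Tokyo
        start: 0 8 * * 1-5       # weekdays 08:00 JST
        end: 0 20 * * 1-5        # weekdays 20:00 JST
        desiredReplicas: "14"    # ~3-4x baseline during business hours (assumed)
```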
Lessons learned

None of these were particularly glamorous optimizations. Most were the kind of changes teams postpone because each one seems too small to matter. But together, they changed the economics of the entire system. The ordering mattered as much as the individual changes.

The biggest lesson was that cost optimization is not primarily about scaling policies. It is about unit economics. First make each Pod cheaper and more productive, then let autoscaling amplify that efficiency. If we had introduced autoscaling first, we would simply have scaled an inefficient system up and down. The real savings only appeared after we improved per-Pod efficiency.

Understand your runtime's concurrency model. Ruby's GVL means threads do not give you CPU parallelism. If your Rails app has a high thread count and a single Puma worker, you are paying for memory without getting the concurrency you expect. Use multiple workers for CPU parallelism, and set thread count based on I/O concurrency needs only.

bcrypt is for passwords, not for API tokens. bcrypt is intentionally slow. Running it on every API request is a CPU tax that adds up quickly. Use it where it belongs (password hashing at login) and use something lighter for per-request token validation.

Boring autoscaling is good autoscaling. KEDA supports dozens of trigger types, including metrics-based, queue-based, and custom triggers. But for a business application with a predictable traffic pattern, a Cron trigger is the right choice. Do not build complexity you do not need.

Review your resource allocations regularly. If your Pod requests and limits have not changed since the service launched, they are almost certainly wrong. Workloads evolve, but resource specifications tend to stay frozen.

Node generations matter more than you think. Upgrading from n1 to n2d gave us 56% more CPU, 23% more RAM, and 3% lower cost. The migration effort was minimal. If you are still on n1, just do it.
Looking back, the 60% cost reduction came from being willing to look across the full stack, from Ruby's runtime model to Kubernetes scaling policies, rather than optimizing within a single layer.

For more on the Rails memory topic, see my article: Why Rails App Memory Bloat Happens: Causes and Solutions (and How I Cut It by 20%).