Tools: Understanding AWS Autoscaling with Grafana
2026-02-10
admin
Repository: Shireenbanu / AI-recipe-finder

## Architecture Overview
My application is deployed on AWS as a containerized system:
- React frontend served by Nginx
- Node.js backend deployed as a Docker container

The backend relies heavily on three external services: RDS, S3, and the Gemini API. This architecture is intentionally realistic: it represents the type of stack many modern apps use today.
## The Goal: High-Stress Scaling Through Load Testing
I wanted to validate autoscaling behavior under pressure. Specifically:
- Can ECS scale out when traffic spikes?
- How fast does it scale?
- Does latency stay stable?
- Does error rate increase under stress?
- Which dependency becomes the bottleneck first?

## Load Testing Strategy (k6)
To make the test realistic, I didn't just hit a single endpoint repeatedly. Instead, I created a k6 test with two parallel scenarios. The load stages and UI steps for each are listed near the end of this post.

## 1) Backend Load Scenario (Triggers Scaling)
This scenario generates the high traffic volume needed to push the backend and observe ECS behavior, including scale-in behavior once the load drops.

## 2) UI Monitoring Scenario (Real User Flow)
This scenario runs a small number of browser-based users to monitor actual UI behavior while the system is under stress. This helped validate whether the UI stayed usable during the stress event.

## The First Surprise: 500 VUs Did Not Spike CPU
At 500 virtual users, I expected ECS CPU utilization to become the main bottleneck. Instead, the CPU stayed surprisingly low, barely crossing 18%, even while the load test pushed close to 20,000 requests through the system. At first, this felt wrong, and I genuinely questioned whether my load test was working. But the test was fine; my assumption was not.

After digging deeper, I realized the application workload simply wasn't CPU-intensive. Most of the request time was spent waiting on external dependencies like RDS reads/writes, S3 uploads/downloads, and Gemini API responses. This made the system primarily I/O-bound, which explains why CPU-based autoscaling did not react strongly, even under heavy traffic. This was one of the biggest lessons from the experiment: high traffic does not always mean high CPU.

The request graph for this run shows close to 20,000 requests, while the CPU and memory utilization graph for the same period stays below 20%. According to my autoscaling policy, CPU utilization must cross 70% before CloudWatch triggers the alarm. Since my application isn't naturally CPU-intensive, I wasn't sure how else to push CPU high enough to test scaling properly. So I manually generated CPU stress inside the running ECS container by running infinite shell loops through `aws ecs execute-command` (the full command appears in the code listing later in this post). Within a few minutes, this forced the container CPU to spike aggressively, reaching a consistent ~99% utilization.

Now the real question becomes: how long does scale-out actually take? Based on the autoscaling configuration, CloudWatch evaluates CPU in 60-second periods and requires a sustained breach before the alarm enters the ALARM state. Once the alarm is triggered, ECS detects it and begins launching new tasks. CPU crossed the 70% threshold at 12:09, but the CloudWatch alarm didn't trigger until 12:13.
ECS then increased the desired task count at 12:14, and the new task became fully running by 12:15, meaning the full scale-out process took roughly 6 minutes from threshold breach to a healthy new task.

So autoscaling doesn't react the instant CPU crosses the 70% threshold. CloudWatch evaluates CPU in 1-minute datapoints, and my alarm required 3 breaching datapoints within 3 minutes. Only after the alarm entered the ALARM state did ECS trigger scale-out and launch new Fargate tasks.

Now let's look at the scale-in process. Notice how the Low alarm triggers 15 minutes after the CPU fell below the threshold, matching the Low alarm rule of 15 datapoints in 15 minutes.

## Closing Note
Autoscaling ensures your application can handle spikes, but it comes with temporary performance trade-offs:
- During scale-out: When CPU spikes and new Fargate tasks are being launched, your application may briefly return 5xx errors or slower responses. In our experiment, we did see 5% errors for a few minutes during the initial warm-up period before the new tasks fully came online. This "warm-up latency" is an inherent part of reactive autoscaling.
- During scale-in: ECS gradually terminates idle tasks once the Low alarm confirms sustained low CPU. This process is intentionally slow to avoid task flapping, ensuring that users aren't suddenly impacted if traffic spikes again.

Observing CPU, alarm state, and task events together shows exactly how long users may experience degraded performance during scaling, and informs decisions about pre-warming, thresholds, and evaluation periods to minimize those user-facing impacts.

## Recipe Finder Application
This application helps users manage their health by securely storing medical history, lab reports, and personal profile information. Based on a patient's conditions, it generates personalized healthy recipes using a recommendation engine integrated with the Gemini API. The goal is to provide actionable nutrition guidance while maintaining HIPAA compliance, data privacy, and secure storage. It also caches generated recipes for quick retrieval and a seamless user experience.

## 1) Profile (CRUD + Database Reads/Writes)
Users can view and update profile information. This workflow represents the most typical web-app traffic pattern: read and write operations to the database.

## 2) Medical History (Uploads + Processing + AI Pipeline)
Users can upload lab reports and add medical conditions. Once submitted, the backend processes the medical data and sends it to a recommendation engine, which then forwards structured…
CPU stress command, run inside the running task through ECS Exec:

```bash
# Starts two busy loops (one in the background) to pin the container's CPU.
aws ecs execute-command --cluster recipe-finder-prod-cluster \
  --task a55518997ca84f24bc2fd614cbc18f20 \
  --container recipe-finder-api \
  --interactive \
  --command "/bin/sh -c 'while true; do :; done & while true; do :; done'"
```
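With CPU pinned by the loops above, the alarm rules described in this post (3 breaching datapoints within 3 minutes for scale-out, 15 within 15 for scale-in) set a lower bound on how soon each alarm can change state. The sketch below models only that counting rule and then does the arithmetic on the reported timestamps. It is not CloudWatch's implementation: the function names, the example CPU series, and the use of 70% as the Low alarm threshold are assumptions made for illustration.

```javascript
// Minimal model of an "M out of N datapoints" alarm rule.
// Each array element stands for one 1-minute CPU datapoint (percent).
function firstAlarmIndex(datapoints, rule) {
  const { threshold, datapointsToAlarm, evaluationPeriods, direction } = rule;
  const breaching = datapoints.map((cpu) =>
    direction === 'above' ? cpu > threshold : cpu < threshold
  );
  for (let i = 0; i < breaching.length; i++) {
    // Count breaching datapoints in the trailing evaluation window.
    const start = Math.max(0, i - evaluationPeriods + 1);
    const count = breaching.slice(start, i + 1).filter(Boolean).length;
    if (count >= datapointsToAlarm) return i;
  }
  return -1; // the rule is never satisfied
}

// Rules as described in the post; the Low threshold of 70 is an assumption.
const highRule = { threshold: 70, datapointsToAlarm: 3, evaluationPeriods: 3, direction: 'above' };
const lowRule = { threshold: 70, datapointsToAlarm: 15, evaluationPeriods: 15, direction: 'below' };

// Hypothetical series: idle, then pinned near 99%.
const spike = [18, 18, 99, 99, 99, 99];
const highIndex = firstAlarmIndex(spike, highRule);

// Hypothetical series: pinned, then 15 quiet minutes.
const coolDown = [99, 99, ...Array(15).fill(10)];
const lowIndex = firstAlarmIndex(coolDown, lowRule);

// Arithmetic on the timestamps reported in the post (same day, "HH:MM").
function minutesBetween(from, to) {
  const toMinutes = (t) => {
    const [h, m] = t.split(':').map(Number);
    return h * 60 + m;
  };
  return toMinutes(to) - toMinutes(from);
}

const observed = { breach: '12:09', alarm: '12:13', desiredCountRaised: '12:14', taskRunning: '12:15' };
const phases = {
  breachToAlarm: minutesBetween(observed.breach, observed.alarm),
  alarmToDesiredCount: minutesBetween(observed.alarm, observed.desiredCountRaised),
  desiredCountToRunning: minutesBetween(observed.desiredCountRaised, observed.taskRunning),
  total: minutesBetween(observed.breach, observed.taskRunning),
};

console.log({ highIndex, lowIndex, phases });
```

In this model the High alarm fires on the third consecutive breaching datapoint and the Low alarm on the fifteenth. The observed breach-to-alarm gap was four minutes, so the rule alone does not account for all of it. Most of the six-minute scale-out was spent waiting for the alarm rather than launching the task.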
Infrastructure commands (Terraform):

```bash
terraform apply
terraform destroy   # to destroy the infra
```

Backend dependencies:
- RDS (reads/writes)
- S3 (uploads/downloads)
- Gemini API (LLM inference)

Backend load scenario stages:
- Warm-up at 20 users
- Spike instantly to 500 users
- Hold for 9 minutes
- Drop back down

UI monitoring scenario steps:
- Navigating to medical history
- Viewing a PDF report
- Adding a condition
- Requesting recipes
- Uploading a report
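The load stages and UI steps listed above could be expressed as two parallel k6 scenarios. The sketch below only builds the configuration object; it is not the author's script. In a real k6 script the object would be exported as `options`, and the functions named in `exec` would be exported as well. The scenario names, function names, warm-up length, post-spike level, scale-in observation window, and browser VU count are all assumptions; only the 20-user warm-up, the instant spike to 500, and the 9-minute hold come from the post.

```javascript
// Backend load stages. Durations marked "assumed" are placeholders.
const backendStages = [
  { duration: '1m', target: 20 },   // warm-up at 20 users (length assumed)
  { duration: '0s', target: 500 },  // zero-length stage: jump straight to 500
  { duration: '9m', target: 500 },  // hold for 9 minutes
  { duration: '0s', target: 20 },   // drop back down (level assumed)
  { duration: '20m', target: 20 },  // stay low to observe scale-in (length assumed)
];

// Convert a k6-style duration such as "9m" or "0s" into seconds.
function toSeconds(duration) {
  const match = /^(\d+)(s|m|h)$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  const unit = { s: 1, m: 60, h: 3600 }[match[2]];
  return Number(match[1]) * unit;
}

const totalSeconds = backendStages.reduce((sum, s) => sum + toSeconds(s.duration), 0);
const peakVUs = Math.max(...backendStages.map((s) => s.target));

// UI flow steps from the list above, in order.
const uiSteps = [
  'navigate to medical history',
  'view a PDF report',
  'add a condition',
  'request recipes',
  'upload a report',
];

const options = {
  scenarios: {
    // Scenario 1: high-volume API traffic intended to trigger scaling.
    backend_load: {
      executor: 'ramping-vus',
      exec: 'backendLoad',            // hypothetical function name
      startVUs: 20,
      stages: backendStages,
    },
    // Scenario 2: a few browser users running for the whole backend test.
    ui_monitoring: {
      executor: 'constant-vus',
      exec: 'uiFlow',                 // hypothetical function name
      vus: 2,                         // "a small number" of users (count assumed)
      duration: `${totalSeconds}s`,
      options: { browser: { type: 'chromium' } },
    },
  },
};

console.log(JSON.stringify({ totalSeconds, peakVUs, steps: uiSteps.length }));
```

Deriving the browser scenario's duration from the backend stages keeps both scenarios running for the same window, so UI measurements cover the spike, the hold, and the scale-in observation period.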