# I cut my AWS bill by 93% by ditching Fargate for a single Lightsail VM (2026)



I built ToolMango, an AI tools directory, on AWS Fargate. The bill came back at $345/mo before any traffic. I migrated to a single $12 Lightsail VM in an afternoon and cut costs by 93% while keeping the same Next.js + Postgres + Redis + BullMQ stack alive. Here's exactly what I changed, what broke, and what I'd do differently.

## What ToolMango is (so the cost numbers make sense)

ToolMango is an editorial directory of AI tools. It scores tools on an ROI Score (cost, time-to-value, output quality, free-tier generosity, category fit, reader engagement) and ranks them before knowing whether the tool has an affiliate program. Tools we don't earn from frequently outrank tools we do. Pre-revenue. Brand-new domain. ~106 tools indexed at the time of writing.

The stack:

- Next.js 14 App Router
- Postgres 16
- Redis (BullMQ for the agent job queue)
- Anthropic Claude Sonnet for editorial agents (research, SEO sweep, social drafts)
- A worker process running 5 cron schedules

## The original Fargate setup

I started on AWS because I had CDK boilerplate from another project. The architecture was over-engineered for a directory site getting zero traffic:

- CloudFront → ALB → Fargate (web ×2 tasks, worker ×1)
- Aurora Serverless v2 (writer)
- ElastiCache (Redis, t4g.small ×2)
- NAT ×2 (multi-AZ)
- VPC + interface endpoints
- WAF (managed rule sets)

The CDK code is clean. It deploys with one command. It autoscales. It survives an AZ failure. It's exactly what a series-A SaaS would run. It's also $345/mo for zero users.

## What was actually costing money

I broke the bill down with `aws ce get-cost-and-usage` and a few `aws ecs describe-task-definition` calls.
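For the curious, the breakdown came from something like this. A sketch, not my exact invocation: the dates and the `tm-prod-web` task family name are placeholders.

```bash
# Monthly cost per service, largest first
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
| jq -r '.ResultsByTime[0].Groups[]
         | [.Keys[0], .Metrics.UnblendedCost.Amount] | @tsv' \
| sort -t$'\t' -k2 -rn

# The task sizing that drives the Fargate line item
aws ecs describe-task-definition --task-definition tm-prod-web \
  --query 'taskDefinition.{cpu: cpu, memory: memory}'
```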
The killer insight: about $87/mo of that bill is "infrastructure plumbing" — NAT, ALB, ElastiCache, VPC endpoints. None of it is doing real work for the application. It's all there to support the architecture itself. That's the floor on a Fargate setup. For a pre-revenue project, it's nuts.

## Phase 1: Skeleton mode on AWS

Before migrating, I tried to make Fargate cheap. CDK changes I shipped:

```ts
// Aurora: enable auto-pause when idle
const cfnCluster = cluster.node.defaultChild as rds.CfnDBCluster;
cfnCluster.serverlessV2ScalingConfiguration = {
  minCapacity: 0,  // was 0.5 — auto-pause after 5 min idle
  maxCapacity: 2,  // was 4
  secondsUntilAutoPause: 300,
};

// Network: 1 NAT instead of 2
natGateways: 1,  // was 2 (multi-AZ)

// Web: smaller, fewer tasks, autoscale up if needed
desiredCount: 1,       // was 2
cpu: 512,              // was 1024
memoryLimitMiB: 1024,  // was 2048

// Worker on Fargate Spot
capacityProviderStrategies: [
  { capacityProvider: "FARGATE_SPOT", weight: 4 },
  { capacityProvider: "FARGATE", weight: 1 },
],

// Container Insights off
containerInsightsV2: ecs.ContainerInsights.DISABLED,

// Backup retention
backup: { retention: cdk.Duration.days(1) },  // was 14

// WAF: removed entirely (CloudFront has free Shield Standard)
```

Result: $345/mo → ~$140/mo. Better, but still ridiculous for a pre-revenue project. The reason it stopped at $140: NAT, ALB, ElastiCache, VPC endpoints, and Aurora storage all have hard floors. You can't make Fargate genuinely cheap, because the architecture itself isn't designed for cheap.
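Skeleton mode only pays off if the cluster actually pauses. A quick way to check, sketched below; `tm-prod-cluster` is a placeholder for the real cluster identifier, and the exact response fields may vary by CLI version.

```bash
# Did the scaling config take? (cluster name is a placeholder)
aws rds describe-db-clusters \
  --db-cluster-identifier tm-prod-cluster \
  --query 'DBClusters[0].ServerlessV2ScalingConfiguration'

# Capacity over the last hour; an Average of 0.0 means the cluster is paused
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ServerlessDatabaseCapacity \
  --dimensions Name=DBClusterIdentifier,Value=tm-prod-cluster \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average
```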
## Phase 2: The honest migration

Lightsail is AWS's "give me a Linux VM and stop overthinking it" tier. $12/mo for 2 vCPU, 2GB RAM, 60GB SSD, 3TB transfer — and it includes a static IP and a firewall. The plan: run everything on one VM in Docker Compose:

```yaml
services:
  postgres:
    image: postgres:16-alpine
    volumes: [./data/postgres:/var/lib/postgresql/data]
    # service_healthy below requires an explicit healthcheck (elided in the original)
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U tmadmin -d toolmango"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy: { resources: { limits: { memory: 512M } } }

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy noeviction
    volumes: [./data/redis:/data]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy: { resources: { limits: { memory: 192M } } }

  web:
    image: tm-web:latest
    ports: ["127.0.0.1:3000:3000"]  # only Caddy on the host talks to it
    env_file: .env
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    deploy: { resources: { limits: { memory: 768M } } }

  worker:
    image: tm-worker:latest
    env_file: .env
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    deploy: { resources: { limits: { memory: 384M } } }
```

For HTTPS termination: Caddy, which auto-issues Let's Encrypt certs on first request. Configuration is one stanza:

```
toolmango.com, www.toolmango.com {
    reverse_proxy 127.0.0.1:3000
    encode gzip zstd
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
    }
}
```

Caddy reloads, Caddy gets the cert. Total setup time: 30 seconds.

## Migrating Aurora data to local Postgres

Aurora is in a private subnet (PRIVATE_ISOLATED), so I couldn't pg_dump from outside. The workaround: spin up a one-off ECS Fargate task in the existing VPC that runs pg_dump and uploads the result to S3:

```bash
aws ecs run-task \
  --cluster tm-prod-compute \
  --task-definition tm-prod-pgdump \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-...],securityGroups=[sg-...],assignPublicIp=DISABLED}'
```

The task definition uses postgres:16-alpine, installs aws-cli on the fly, and runs:

```bash
pg_dump --no-owner --no-acl --clean --if-exists -h $DB_HOST -U $DB_USER -d toolmango \
  | gzip > /tmp/dump.sql.gz \
  && aws s3 cp /tmp/dump.sql.gz s3://tm-prod-assets/migration/dump.sql.gz
```

On the Lightsail VM, pull the dump from S3 via a presigned URL (Lightsail VMs don't have IAM roles by default).
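The presigned-URL hop, sketched: run the first command anywhere that has AWS credentials (e.g. your laptop), the second on the VM.

```bash
# Anywhere with AWS credentials: mint a short-lived (15 min) download URL
aws s3 presign s3://tm-prod-assets/migration/dump.sql.gz --expires-in 900

# On the Lightsail VM: fetch the dump with the URL printed above
PRESIGNED_URL='<paste the URL here>'
curl -fSL -o /tmp/dump.sql.gz "$PRESIGNED_URL"
```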
Then gunzip and pipe it into the local Postgres container:

```bash
gunzip -c /tmp/dump.sql.gz | docker compose exec -T postgres psql -U tmadmin -d toolmango
```

64 published tools transferred cleanly. ~485KB of data total (it's a directory site — barely any data).

## Building images on the VM

Lightsail is x86_64. Fargate was ARM64. So I had to rebuild for x86 anyway, which was the perfect excuse to build directly on the VM and skip the registry-push dance:

```bash
docker build --network=host -f Dockerfile.web -t tm-web:latest \
  --build-arg NEXT_PUBLIC_SITE_URL=https://toolmango.com \
  --build-arg NEXT_PUBLIC_PLAUSIBLE_DOMAIN=toolmango.com \
  .
```

Next.js builds need ~1.5-2GB of peak memory, and Lightsail's "small_3_0" bundle has 2GB of RAM. Tight, but adding 2GB of swap solved it:

```bash
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
echo "/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab  # persist across reboots
```

First build: ~6 min. Subsequent builds with Docker layer cache: ~2 min. Acceptable.

## Tearing down AWS Fargate

After Lightsail was serving traffic, I tore down the Fargate stacks via CDK:

```bash
# Take a final Aurora snapshot first (safety rollback)
aws rds create-db-cluster-snapshot --db-cluster-identifier ... --db-cluster-snapshot-identifier tm-prod-final-...

# Disable deletion protection
aws rds modify-db-cluster --no-deletion-protection ...

# Delete Aurora cluster + writer
aws rds delete-db-instance --skip-final-snapshot ...
aws rds delete-db-cluster --skip-final-snapshot ...

# CDK destroy stacks in reverse dependency order
cdk destroy tm-prod-edge --force     # CloudFront, WAF
cdk destroy tm-prod-compute --force  # Fargate, ALB, ECS cluster
cdk destroy tm-prod-data --force     # ElastiCache (S3 retains via RemovalPolicy.RETAIN)
cdk destroy tm-prod-network --force  # VPC, NAT, subnets
```

CloudFront's destroy is the slowest — disabling a distribution and then deleting it takes 15-20 min. The Aurora delete is 5-10 min; compute and network are 3-5 min each. Total teardown: ~30-40 min, unattended.
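Not from the original post, but worth doing after any teardown: a quick sweep for leftovers that quietly keep billing. A sketch:

```bash
# NAT gateways still alive? (each one bills hourly)
aws ec2 describe-nat-gateways --filter Name=state,Values=available \
  --query 'NatGateways[].NatGatewayId'

# Load balancers left behind?
aws elbv2 describe-load-balancers --query 'LoadBalancers[].LoadBalancerName'

# Snapshots accumulate storage charges; the final safety snapshot is expected
aws rds describe-db-cluster-snapshots \
  --query 'DBClusterSnapshots[].DBClusterSnapshotIdentifier'

# Buckets kept via RemovalPolicy.RETAIN will still be listed
aws s3 ls
```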
## Auto-deploy via GitHub Actions

The piece that ties it all together: a workflow that, on every push to main, rsyncs the source, rebuilds the images on the VM, runs Prisma migrations, and restarts the containers:

```yaml
- name: Rsync source to Lightsail
  run: |
    rsync -az --delete --exclude='node_modules' --exclude='.next' --exclude='.git' \
      -e "ssh -i ~/.ssh/id_ed25519" \
      ./ ${{ secrets.LIGHTSAIL_USER }}@${{ secrets.LIGHTSAIL_HOST }}:/home/ubuntu/toolmango/src/

- name: Build images on Lightsail
  run: |
    ssh ... 'cd /home/ubuntu/toolmango/src && \
      sg docker -c "docker build -f Dockerfile.web -t tm-web:latest ." && \
      sg docker -c "docker build -f Dockerfile.worker -t tm-worker:latest ."'

- name: Run prisma migrate + restart services
  run: |
    ssh ... 'cd /home/ubuntu/toolmango && \
      sg docker -c "docker compose run --rm --no-deps web npx prisma migrate deploy" && \
      sg docker -c "docker compose up -d --force-recreate web worker"'

- name: Smoke test
  run: |
    for i in {1..6}; do
      [ "$(curl -s -o /dev/null -w '%{http_code}' https://toolmango.com/api/healthz)" = "200" ] && exit 0
      sleep 5
    done
    exit 1
```

First successful auto-deploy: 3m50s end-to-end, from git push to a verified 200 OK from Caddy.

## What it costs now

Roughly $25/mo all-in. That's a 93% cut from $345/mo. Same site, same functionality, same automation pipeline. The site is at https://toolmango.com if you want to verify it's actually working.

## What I gave up

Honest list of what's worse on Lightsail:

- No multi-AZ HA. Single VM = single point of failure. An AZ-level outage means downtime.
- No Aurora point-in-time restore. Just a nightly pg_dump to S3, so the RPO is up to 24h. Acceptable for a content site, not for transactional data. (A sketch of such a backup script follows this list.)
- No autoscaling. Vertical only — bump to a bigger Lightsail bundle if traffic grows. The next tier is $24/mo for 4GB / 80GB; past that, $44/mo for 8GB. At those numbers you should rethink Lightsail versus going back to managed services.
- Manual ops. No service auto-restart on host failure. If the VM dies, the uptime monitor notifies me and I SSH in. That's the trade.
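The post doesn't show the nightly backup script, so here is a minimal sketch of what it could look like. Assumptions: the compose project lives at /home/ubuntu/toolmango, dumps go to the same tm-prod-assets bucket, and the VM has an IAM user's access key configured (no instance role on Lightsail).

```bash
#!/usr/bin/env bash
# /home/ubuntu/toolmango/backup.sh: hypothetical nightly dump-to-S3 script
set -euo pipefail
STAMP="$(date -u +%Y%m%d)"
OUT="/tmp/toolmango-${STAMP}.sql.gz"

cd /home/ubuntu/toolmango
docker compose exec -T postgres \
  pg_dump --no-owner --no-acl -U tmadmin -d toolmango | gzip > "$OUT"

# Requires credentials on the VM; a put-only IAM user key limits the blast radius
aws s3 cp "$OUT" "s3://tm-prod-assets/backups/toolmango-${STAMP}.sql.gz"
rm -f "$OUT"
```

Wire it to cron with something like `0 3 * * * /home/ubuntu/toolmango/backup.sh >> /home/ubuntu/backup.log 2>&1`.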
## What I'd do differently

- Skip Fargate entirely for pre-revenue projects. Start on Lightsail. The migration took 4 hours; if I'd started there, that's 4 hours and ~$700 of avoided bills (the 6 days I was on Fargate).
- Don't enable Container Insights "just because." It's $5-15/mo, and you'll never look at it on a small project.
- Don't let CDK enable WAF by default. WAF is real money ($12-15/mo) for a pre-revenue site that isn't under attack. CloudFront's free Shield Standard is enough.
- Don't pre-provision multi-AZ NAT. A single NAT is fine until you have customers.
- Use Aurora `minCapacity: 0` from day 1. The auto-pause feature added in 2024 makes Aurora Serverless v2 actually serverless. Most CDK examples still default to 0.5.

## When to migrate back to Fargate

The CDK code is still in the repo. `cdk deploy --all` brings the production-grade Fargate stack back up; restore the latest Lightsail backup to a fresh Aurora cluster; cut over DNS. I'll do that when any of these hits:

- Sustained traffic > 200 req/sec (a single VM saturates)
- A need for multi-AZ HA (revenue at risk from a single-AZ outage)
- DB > 20 GB (Postgres on local SSD becomes risky for backup/recovery)
- A compliance requirement (SOC 2, etc.)

Until then: $25/mo. The CDK code waits patiently.

If you want to look at what this stack actually serves: ToolMango. The methodology behind the editorial ROI Score is at /about, and the affiliate disclosure is at /disclosure. The step-by-step writeup of the migration is in docs/lightsail-migration.md in the repo.

Happy to answer questions in the comments. If you found this useful and you've done a similar AWS-to-cheap-VM migration, I'd love to hear what you cut and what burned.