Compute:
- Stateless web servers → ECS (Fargate) or EC2 Auto Scaling Groups
- Scheduled jobs → ECS Scheduled Tasks or Lambda (if under 15 min)
- Long-running workers → ECS with SQS trigger

Database:
- PostgreSQL / MySQL → RDS with Multi-AZ enabled
- Redis cache → ElastiCache (Redis)
- File storage → S3 with CloudFront CDN
- Search → OpenSearch Service

Networking:
- Load balancing → Application Load Balancer (ALB)
- DNS → Route 53
- CDN → CloudFront
- Secrets → AWS Secrets Manager (never hardcode credentials)
    # terraform/main.tf (example: ECS cluster + RDS)
    resource "aws_ecs_cluster" "app" {
      name = "${var.app_name}-${var.environment}"

      setting {
        name  = "containerInsights"
        value = "enabled"
      }
    }

    resource "aws_db_instance" "postgres" {
      identifier              = "${var.app_name}-${var.environment}"
      engine                  = "postgres"
      engine_version          = "15.3"
      instance_class          = var.db_instance_class
      allocated_storage       = var.db_storage_gb
      multi_az                = var.environment == "production"
      deletion_protection     = var.environment == "production"
      backup_retention_period = 7
      db_subnet_group_name    = aws_db_subnet_group.main.name
      vpc_security_group_ids  = [aws_security_group.rds.id]
    }
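The RDS example refers to `aws_db_subnet_group.main` and `aws_security_group.rds` without defining them. A minimal sketch of what those two resources might look like follows. The variables `vpc_id`, `private_subnet_ids`, and `app_security_group_id` are hypothetical inputs that do not appear in the original example, and the port assumes PostgreSQL's default of 5432.

```hcl
# Hypothetical inputs: none of these variables appear in the original example.
variable "vpc_id" {
  type        = string
  description = "VPC that hosts the application and the database"
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Private subnets in at least two Availability Zones"
}

variable "app_security_group_id" {
  type        = string
  description = "Security group attached to the application's ECS tasks"
}

# Subnet group referenced as aws_db_subnet_group.main by the RDS instance.
resource "aws_db_subnet_group" "main" {
  name       = "${var.app_name}-${var.environment}"
  subnet_ids = var.private_subnet_ids
}

# Security group referenced as aws_security_group.rds by the RDS instance.
resource "aws_security_group" "rds" {
  name_prefix = "${var.app_name}-${var.environment}-rds-"
  vpc_id      = var.vpc_id

  # Only the application tasks may open connections to the database port.
  ingress {
    description     = "PostgreSQL from application tasks only"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [var.app_security_group_id]
  }
}
```

Limiting ingress to the application's security group, rather than to a CIDR range, keeps the database reachable only from the tasks that need it.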
ECS task definition snippet:

    {
      "family": "app-production",
      "containerDefinitions": [
        {
          "name": "web",
          "image": "your-account.dkr.ecr.us-east-1.amazonaws.com/app:latest",
          "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
            "interval": 30,
            "timeout": 5,
            "retries": 3,
            "startPeriod": 60
          }
        }
      ],
      "requiresCompatibilities": ["FARGATE"]
    }
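If the task definition is managed in Terraform alongside the cluster, the same snippet could be written roughly as below. This is a sketch under assumptions, not a complete definition: Fargate also requires the `awsvpc` network mode and task-level CPU and memory, so those fields are added here with placeholder sizes. The variables `image_uri` and `execution_role_arn` and the port mapping on 3000 are hypothetical.

```hcl
# Hypothetical inputs, not part of the original snippet.
variable "image_uri" {
  type        = string
  description = "Full ECR image URI, including the tag"
}

variable "execution_role_arn" {
  type        = string
  description = "IAM role ECS uses to pull the image and write logs"
}

resource "aws_ecs_task_definition" "web" {
  family                   = "app-production"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc" # required for Fargate tasks
  cpu                      = "256"    # placeholder size: 0.25 vCPU
  memory                   = "512"    # placeholder size: 512 MiB
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = "web"
      image     = var.image_uri
      essential = true

      # Assumes the app listens on the same port the health check probes.
      portMappings = [
        { containerPort = 3000, protocol = "tcp" }
      ]

      # Same health check as the JSON snippet above.
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])
}
```

The health check command runs inside the container, so it only works if `curl` is installed in the image.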
- [ ] List every service, process, and scheduled job running on your current infrastructure
- [ ] Map all external dependencies: third-party APIs, payment processors, email providers
- [ ] Document every port, protocol, and network path between services
- [ ] Identify stateful vs stateless components: They migrate differently

- [ ] Catalog every database: type, size, read/write patterns, peak load times
- [ ] Identify data with compliance constraints (PII, PHI, PCI): This affects your AWS region choices and service selections
- [ ] Measure your RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements: These drive your backup and replication strategy
- [ ] Document any data that cannot be in certain geographic regions (US-only requirements are common for government and healthcare clients)

- [ ] Capture 30 days of traffic patterns: requests/second, peak times, geographic distribution
- [ ] Profile database query patterns: identify slow queries, N+1 problems, missing indexes
- [ ] Measure current response times as your benchmark: You need to match or beat these after migration

- Set up an RDS instance and run a full dump/restore to establish the baseline
- Enable continuous replication from old DB to RDS (AWS DMS handles this)
- Run both databases in parallel and validate data consistency
- Switch application read traffic to RDS while writes keep going to the old DB
- Switch write traffic to RDS
- Monitor for 48 hours
- Decommission old database

- [ ] Response times at or below pre-migration baseline
- [ ] Error rates within normal range for 72 hours
- [ ] Database query performance profiled on RDS: Slow query log enabled
- [ ] CloudWatch alarms configured for: CPU, memory, database connections, error rates, 5xx responses
- [ ] Cost Explorer reviewed: Confirm you're not running over-provisioned instances
- [ ] Security: all services in private subnets, no public RDS endpoints, WAF in front of ALB
- [ ] Backup restore tested: Not just that backups run, but that you can actually restore from them
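
The CloudWatch alarm item in the checklist above could be expressed in Terraform roughly as follows. This sketch covers only two of the listed signals, RDS CPU and ALB 5xx responses. The thresholds are placeholders to tune against your own baseline, and the variables `alarm_topic_arn` and `alb_arn_suffix` are hypothetical inputs that are not part of the original example.

```hcl
# Hypothetical inputs, not part of the original example.
variable "alarm_topic_arn" {
  type        = string
  description = "SNS topic that receives alarm notifications"
}

variable "alb_arn_suffix" {
  type        = string
  description = "arn_suffix attribute of the Application Load Balancer"
}

# Fires when average RDS CPU stays above 80% for three 5-minute periods.
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "${var.app_name}-${var.environment}-rds-cpu-high"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80 # placeholder threshold
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.postgres.identifier
  }
}

# Fires when targets return more than 10 5xx responses per minute
# for five consecutive minutes.
resource "aws_cloudwatch_metric_alarm" "alb_target_5xx" {
  alarm_name          = "${var.app_name}-${var.environment}-alb-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 10 # placeholder threshold
  treat_missing_data  = "notBreaching" # no traffic means no data, not an outage
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }
}
```

Alarms for memory, database connections, and error rates follow the same pattern with a different namespace, metric name, and dimensions.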