Tools: Production Lab: ECS Fargate + Prometheus + Grafana + Loki + Alloy + Node Exporter (2026)
Part 1: What Each Tool Does
Part 2: EC2 Monitoring Server Check
Step 1: Check all services
Why we check this
SRE/DevOps checks
Part 3: Fix Prometheus for Remote Write
Step 2: Enable Prometheus remote write receiver
Why we do this
Part 4: EC2 Security Group
Why we do this
Part 5: Configure Alloy on EC2
Important correction
Part 6: Create ECS IAM Roles
Role 1: ECS Task Execution Role
Role 2: ECS Task Role
Part 7: Create ECS Cluster
Part 8: Create Simple Application Container
Part 9: Create Fargate Task Definition
Important note
Part 10: Alloy Fargate Config
Part 11: Run ECS Service
What to check
Part 12: Verify in Prometheus
Part 13: Verify in Grafana
Part 14: Grafana Explore Queries
Part 15: What SRE Must Monitor
1. EC2 monitoring server health
2. Disk usage
3. Fargate task memory
4. Application request rate
5. Error rate
Part 16: What DevOps Must Check
Part 17: Troubleshooting
Problem: ECS task running but no metrics
Problem: Grafana shows no Loki logs
Problem: Node Exporter works but Fargate metrics missing
Final Teaching Summary

You will build this architecture:

Officially, ECS Fargate tasks use task execution roles for ECS actions like pulling images and logging, and task roles for application AWS permissions. (AWS Documentation) Alloy supports ECS/Fargate container metrics using the ECS Task Metadata Endpoint v4 and should run as a sidecar inside the task. (Grafana Labs)

Your EC2 already has:

Before we connect ECS, the central monitoring server must be healthy.

Fargate tasks are dynamic. Their private IPs change whenever a task is replaced. So instead of Prometheus scraping every task IP, Alloy inside Fargate will push metrics to Prometheus.

Open the Prometheus service file:

Fargate containers cannot easily be scraped by fixed IP because tasks start and stop dynamically. Remote write lets Alloy push metrics to Prometheus.

Do not open 9090, 3100, or 9100 to 0.0.0.0/0. Prometheus and Loki ship without authentication enabled by default, so they must not be exposed like a public website. Keep them private.

This allows ECS/Fargate to:

For this lab, start with no extra permissions. If the app needs S3 later, add only the exact S3 permissions it requires.

The task role is for your application container, not for ECS itself.

A cluster is the logical place where ECS services and tasks run.

For the easiest lab, use a demo app that exposes Prometheus metrics on port 8080.

For a real production setup, store the Alloy config in:

For a class or demo, a custom Alloy image is easiest.

Then search in Graph:

Check Alloy internal metrics:

Check app request metrics:

Check Fargate container metrics:

Metric names may vary depending on the Alloy/OpenTelemetry conversion.

For ECS logs, first check CloudWatch logs:

DevOps engineer checks:

This lab demonstrates a real DevOps/SRE production pattern:

The most important SRE mindset:
ECS Fargate Application
        |
        | metrics/logs
        v
Alloy sidecar
        |
        | remote_write metrics
        | push logs
        v
EC2 Monitoring Server
  - Prometheus :9090
  - Grafana :3000
  - Loki :3100
  - Alloy
  - Node Exporter
Prometheus
Grafana
Node Exporter
Loki
Alloy
sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status loki
sudo systemctl status alloy
sudo systemctl status node_exporter
active (running)
sudo ss -tulnp | grep -E '3000|9090|9100|3100'
3000 Grafana
9090 Prometheus
9100 Node Exporter
3100 Loki
curl http://localhost:9090/-/ready
curl http://localhost:3100/ready
curl http://localhost:9100/metrics
Prometheus ready
Loki ready
Node metrics visible
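The three curl checks above can also be scripted so the health check is repeatable. The following is a minimal Python sketch, not part of the original lab: the names `ENDPOINTS`, `is_ready`, and `check_all` are hypothetical, and it assumes the services listen on the default localhost ports used in this lab.

```python
import urllib.error
import urllib.request

# Endpoints taken from the curl commands in this lab.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
    "node_exporter": "http://localhost:9100/metrics",
}


def is_ready(url, fetch=None, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise.

    `fetch` is an optional callable (url -> status code) so the check can be
    exercised without a live server; by default urllib is used.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.status
    try:
        return fetch(url) == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response raised by urllib.
        return False


def check_all(fetch=None):
    """Check every endpoint and return a {service_name: bool} mapping."""
    return {name: is_ready(url, fetch) for name, url in ENDPOINTS.items()}
```

Running `check_all()` on the monitoring server should return `True` for all three services when they are healthy. A `False` entry points at the service to inspect with `systemctl status`.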
sudo systemctl edit prometheus
[Service]
ExecStart=
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver
sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl status prometheus
curl http://localhost:9090/-/ready
EC2 → Instances → Select monitoring EC2 → Security → Security Group
10.0.0.0/16
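The rule "keep 9090, 3100, and 9100 private" can be checked mechanically against a list of inbound rules. This is a small sketch under stated assumptions: the VPC range is the `10.0.0.0/16` used in this lab, the rule format `(port, source_cidr)` is invented for illustration, and the function names are hypothetical. It does not call any AWS API.

```python
import ipaddress

VPC_CIDR = "10.0.0.0/16"            # source range used in this lab
PRIVATE_PORTS = {9090, 3100, 9100}  # Prometheus, Loki, Node Exporter


def rule_is_private(source_cidr, vpc_cidr=VPC_CIDR):
    """True if the inbound source range lies entirely inside the VPC range."""
    source = ipaddress.ip_network(source_cidr, strict=False)
    vpc = ipaddress.ip_network(vpc_cidr, strict=False)
    return source.subnet_of(vpc)


def find_exposed(rules, vpc_cidr=VPC_CIDR):
    """Return the (port, source_cidr) rules that open a private port
    to a source range outside the VPC."""
    return [
        (port, cidr)
        for port, cidr in rules
        if port in PRIVATE_PORTS and not rule_is_private(cidr, vpc_cidr)
    ]
```

For example, `find_exposed([(9090, "10.0.0.0/16"), (3100, "0.0.0.0/0")])` flags only the Loki rule, because `0.0.0.0/0` is not contained in the VPC range.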
sudo nano /etc/alloy/config.alloy
prometheus.exporter.unix "local_host" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "netdev", "loadavg"]
}

prometheus.scrape "local_host" {
  targets    = prometheus.exporter.unix.local_host.targets
  forward_to = [prometheus.remote_write.local_prom.receiver]
}

prometheus.remote_write "local_prom" {
  endpoint {
    url = "http://127.0.0.1:9090/api/v1/write"
  }
}

loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog"},
    {__path__ = "/var/log/auth.log", job = "auth"},
    {__path__ = "/var/log/nginx/access.log", job = "nginx_access"},
    {__path__ = "/var/log/nginx/error.log", job = "nginx_error"},
  ]
  forward_to = [loki.write.local_loki.receiver]
}

loki.write "local_loki" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}
sudo alloy fmt --write /etc/alloy/config.alloy
sudo systemctl restart alloy
sudo systemctl status alloy
IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task
AmazonECSTaskExecutionRolePolicy
ecsTaskExecutionRole
Pull image from ECR
Send logs to CloudWatch
Read Secrets Manager if needed
IAM → Roles → Create role → ECS Task
ecsAppTaskRole
ECS → Clusters → Create cluster
AWS Fargate
prod-observability-cluster
ghcr.io/brancz/prometheus-example-app:v0.5.0
ECS → Task Definitions → Create new task definition → Create new task definition with JSON
{
  "family": "fargate-observability-lab",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsAppTaskRole",
  "containerDefinitions": [
    {
      "name": "demo-app",
      "image": "ghcr.io/brancz/prometheus-example-app:v0.5.0",
      "essential": true,
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "demo-app",
          "awslogs-create-group": "true"
        }
      }
    },
    {
      "name": "alloy-sidecar",
      "image": "grafana/alloy:latest",
      "essential": false,
      "command": [
        "run",
        "--stability.level=experimental",
        "--server.http.listen-addr=0.0.0.0:12345",
        "/etc/alloy/fargate.alloy"
      ],
      "environment": [
        { "name": "ALLOY_STABILITY_LEVEL", "value": "experimental" },
        { "name": "EC2_PROMETHEUS_URL", "value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write" },
        { "name": "EC2_LOKI_URL", "value": "http://<EC2_PRIVATE_IP>:3100/loki/api/v1/push" }
      ],
      "portMappings": [
        { "containerPort": 12345, "protocol": "tcp" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "alloy",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
<ACCOUNT_ID>
<EC2_PRIVATE_IP>
us-east-2 if your region is different
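Replacing the placeholders by hand is easy to get wrong, so it can help to render the task definition from a template and fail loudly if a placeholder is left behind. The sketch below is illustrative only: `render_task_definition` is a hypothetical helper, and it assumes the template uses exactly the `<ACCOUNT_ID>` and `<EC2_PRIVATE_IP>` placeholders and the `us-east-2` region string from this lab.

```python
import json
import re


def render_task_definition(template, account_id, ec2_private_ip, region="us-east-2"):
    """Fill in the lab's placeholders and return the parsed task definition.

    Raises ValueError if any <UPPER_CASE> placeholder remains, and
    json.JSONDecodeError if the rendered text is not valid JSON.
    """
    rendered = (
        template
        .replace("<ACCOUNT_ID>", account_id)
        .replace("<EC2_PRIVATE_IP>", ec2_private_ip)
        .replace("us-east-2", region)
    )
    leftover = re.findall(r"<[A-Z_]+>", rendered)
    if leftover:
        raise ValueError(f"unreplaced placeholders: {sorted(set(leftover))}")
    return json.loads(rendered)
```

The returned dictionary can be written back out with `json.dumps(..., indent=2)` and pasted into the ECS console's JSON editor.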
EFS
S3 pulled at startup
custom Alloy image
fargate.alloy
prometheus.scrape "app_metrics" {
  targets = [
    {"__address__" = "127.0.0.1:8080", "job" = "demo-app"},
  ]
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

otelcol.receiver.awsecscontainermetrics "fargate_metrics" {
  collection_interval = "30s"

  output {
    metrics = [otelcol.exporter.prometheus.fargate_to_prom.receiver]
  }
}

otelcol.exporter.prometheus "fargate_to_prom" {
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

prometheus.remote_write "ec2_prometheus" {
  endpoint {
    url = env("EC2_PROMETHEUS_URL")
  }
}
Application /metrics
Fargate task CPU
Fargate task memory
Container-level metrics
ECS → Clusters → prod-observability-cluster → Services → Create
Launch type: Fargate
Task definition: fargate-observability-lab
Service name: demo-app-service
Desired tasks: 1
VPC: same VPC as EC2 monitoring server
Subnets: private subnets preferred
Security group: allow outbound to EC2 private IP ports 9090 and 3100
Public IP: disabled if private subnet has NAT
ECS → Cluster → Service → Tasks
Task status: Running
Containers: demo-app running, alloy-sidecar running
http://<EC2_PUBLIC_IP>:9090
Status → TSDB Status
alloy_component_controller_running_components
rate(node_cpu_seconds_total[5m])
node_memory_MemAvailable_bytes
http_requests_total
ecs_task_memory_utilized
container_memory_usage_bytes
http://<EC2_PUBLIC_IP>:3000
Connections → Data sources
URL: http://localhost:9090
URL: http://localhost:3100
Save & test
Data source is working
Grafana → Explore → Prometheus
rate(node_network_receive_bytes_total[1m])
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
rate(http_requests_total[5m])
Grafana → Explore → Loki
{job="syslog"}
{job="auth"}
{job="nginx_access"}
CloudWatch → Log groups → /ecs/fargate-observability-lab
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
Memory > 85%
If monitoring server dies, you lose visibility.
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})
Prometheus and Loki can fill disk quickly.
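To make the arithmetic of the disk usage query concrete, here is the same calculation as a small Python sketch. The function name and the sample byte values are hypothetical. The two arguments stand in for `node_filesystem_avail_bytes` and `node_filesystem_size_bytes` for the `/` mountpoint.

```python
def disk_used_percent(avail_bytes, size_bytes):
    """Mirror the PromQL expression: 100 - (avail * 100 / size)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100 - (avail_bytes * 100) / size_bytes


# Hypothetical sample: a 20 GiB root volume with 5 GiB still available.
GIB = 1024 ** 3
usage = disk_used_percent(5 * GIB, 20 * GIB)
print(f"root filesystem used: {usage:.1f}%")
```

With these sample values a quarter of the volume is free, so the expression evaluates to 75% used.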
ecs_task_memory_utilized / ecs_task_memory_reserved * 100
> 85% for 3 minutes
Fargate kills containers when memory limit is reached.
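The "above 85% for 3 minutes" condition means the threshold must be breached continuously, not just once. The sketch below illustrates that idea in Python. It is a simplification and not how Prometheus evaluates alert rules internally: it assumes one sample per 30-second collection interval (the interval used in the Alloy config in this lab), and the function names are hypothetical.

```python
def memory_percent(utilized, reserved):
    """Same ratio as ecs_task_memory_utilized / ecs_task_memory_reserved * 100."""
    return utilized / reserved * 100


def breaches_for(samples, threshold=85.0, interval_s=30, for_s=180):
    """True if every sample in the most recent `for_s` seconds is above threshold.

    `samples` is a list of percentages ordered oldest to newest,
    one per `interval_s` seconds.
    """
    needed = for_s // interval_s  # 6 samples for 3 minutes at 30s
    if len(samples) < needed:
        return False  # not enough history yet to call it sustained
    return all(s > threshold for s in samples[-needed:])
```

A single dip below the threshold inside the window resets the condition, which is what keeps short spikes from paging anyone.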
sum(rate(http_requests_total[5m]))
If traffic drops to zero, app or routing may be broken.
sum(rate(http_requests_total{code=~"5.."}[5m]))
5xx errors show application or dependency failure.
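For readers new to counters, it may help to see roughly what `rate()` computes over a window. The Python sketch below is a deliberate simplification: it takes only the first and last sample and ignores the counter resets and extrapolation that PromQL's `rate()` handles. The `error_ratio` helper is an illustrative extension that is not one of the lab's queries, and all names and sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp_s, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (v1 - v0) / (t1 - t0)


def error_ratio(total_samples, error_samples):
    """Fraction of requests that were errors over the window (0.0 if no traffic)."""
    total = simple_rate(total_samples)
    if total == 0:
        return 0.0
    return simple_rate(error_samples) / total


# Hypothetical 5-minute window: 300 requests in total, 30 of them 5xx.
total = [(0, 100), (300, 400)]
errors = [(0, 0), (300, 30)]
print(simple_rate(total), error_ratio(total, errors))
```

Comparing the 5xx rate against the total request rate puts the raw error count in context: 0.1 errors per second means something very different at 1 request per second than at 1,000.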
1. IAM roles are correct
2. ECS task is running
3. Security groups allow only needed ports
4. Fargate can reach EC2 private IP
5. Prometheus remote write is enabled
6. Loki is receiving logs
7. Grafana data sources work
8. No public access to Prometheus/Loki/Node Exporter
9. ECS service has desired count = running count
10. CloudWatch logs exist for both containers
ECS → Task → alloy-sidecar → Logs
connection refused
timeout
remote write failed
EC2 security group blocks port 9090
Wrong EC2 private IP
Prometheus remote write receiver not enabled
Alloy config error
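The first two causes are both network reachability problems, and a plain TCP connect test can tell them apart from configuration errors. Here is a minimal Python sketch. The helper name `can_connect` is hypothetical, and it assumes you can run Python from a host in the same subnets and security group as the Fargate task.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Calling `can_connect("<EC2_PRIVATE_IP>", 9090)` with the real private IP substituted should return `True`. A timeout usually points at a security group or routing issue, while an immediate refusal suggests Prometheus is not listening on that address. If the connection succeeds but remote write still fails, check the receiver flag and the Alloy config next.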
curl http://localhost:3100/ready
sudo journalctl -u alloy -f
sudo journalctl -u loki -f
Loki not running
Wrong Loki URL
Alloy cannot read log files
No permissions on /var/log/*
Node Exporter monitors EC2 only.
It cannot monitor Fargate hosts.
Use Alloy sidecar with ECS container metrics receiver.
ECS Fargate runs application containers.
IAM secures container permissions.
Alloy collects telemetry.
Prometheus stores metrics.
Loki stores logs.
Grafana visualizes everything.
Node Exporter monitors the EC2 monitoring server.
Metrics tell you what is happening.
Logs tell you why it happened.
Grafana helps you see the story.
IAM and security groups control who can access what.