Tools: Production Lab: ECS Fargate + Prometheus + Grafana + Loki + Alloy + Node Exporter (2026)

2026-05-25 · admin

Part 1: What Each Tool Does

Part 2: EC2 Monitoring Server Check

Step 1: Check all services

Why we check this

SRE/DevOps checks
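The readiness checks for this part (Prometheus on 9090, Loki on 3100, Node Exporter on 9100) can also be scripted so they run the same way every time. This is a minimal Python sketch using only the standard library. The names `READINESS_URLS`, `is_ready`, and `check_all` are illustrative, and it assumes the services listen on localhost as they do in this lab.

```python
import http.client
import urllib.error
import urllib.request

# Readiness URLs used in this lab; adjust host or ports if your server differs.
READINESS_URLS = {
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
    "node_exporter": "http://localhost:9100/metrics",
}


def is_ready(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a broken response.
        return False


def check_all(urls: dict[str, str], timeout: float = 3.0) -> dict[str, bool]:
    """Map each service name to its readiness result."""
    return {name: is_ready(url, timeout) for name, url in urls.items()}
```

Calling `check_all(READINESS_URLS)` on the monitoring server should return `True` for all three services before you move on to ECS.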

Part 3: Fix Prometheus for Remote Write

Step 2: Enable Prometheus remote write receiver

Why we do this
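A common mistake in this step is editing the override but leaving the `--web.enable-remote-write-receiver` flag off the final `ExecStart=` line. The sketch below is one way to check for it in Python. It assumes you pass in the unit text (for example, what `systemctl cat prometheus` prints). The helper names are made up for this example, and the parsing is simplified: it only handles backslash line continuations and the "empty `ExecStart=` resets earlier values" rule.

```python
import shlex

REMOTE_WRITE_FLAG = "--web.enable-remote-write-receiver"


def exec_start_args(unit_text: str) -> list[str]:
    """Return the argv of the last non-empty ExecStart= line in the unit text."""
    # Join lines that end with a backslash continuation.
    joined = unit_text.replace("\\\n", " ")
    argv: list[str] = []
    for line in joined.splitlines():
        line = line.strip()
        if line.startswith("ExecStart="):
            value = line[len("ExecStart="):].strip()
            if value:  # an empty ExecStart= only clears earlier values
                argv = shlex.split(value)
    return argv


def remote_write_enabled(unit_text: str) -> bool:
    """True if the effective ExecStart command carries the receiver flag."""
    return REMOTE_WRITE_FLAG in exec_start_args(unit_text)
```

If `remote_write_enabled(...)` returns `False` after your edit, the receiver flag did not make it into the effective command.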

Part 4: EC2 Security Group

Why we do this
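The rule for this part is that 9090, 3100, and 9100 must never be reachable from 0.0.0.0/0. The sketch below audits a list of inbound rules against that rule. The rule format (a dict with `port` and `cidr`) is an assumption made for the example, not the shape of any AWS API response, and `10.0.0.0/16` is the VPC range used in this lab.

```python
import ipaddress

PRIVATE_PORTS = {9090, 3100, 9100}  # Prometheus, Loki, Node Exporter
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")  # VPC range used in this lab


def risky_rules(rules: list[dict]) -> list[dict]:
    """Return rules that expose a private port to a source outside the VPC."""
    flagged = []
    for rule in rules:
        source = ipaddress.ip_network(rule["cidr"])
        # subnet_of() is True only when the source range sits inside the VPC.
        if rule["port"] in PRIVATE_PORTS and not source.subnet_of(VPC_CIDR):
            flagged.append(rule)
    return flagged
```

An empty result means none of the three monitoring ports is open to a source outside the VPC. The sketch handles IPv4 rules only.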

Part 5: Configure Alloy on EC2

Important correction

Part 6: Create ECS IAM Roles

Role 1: ECS Task Execution Role

Role 2: ECS Task Role

Part 7: Create ECS Cluster

Part 8: Create Simple Application Container

Part 9: Create Fargate Task Definition

Important note
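The task definition JSON contains the placeholders `<ACCOUNT_ID>` and `<EC2_PRIVATE_IP>`, and registering it with one still unfilled is an easy mistake. Here is a small sketch that fills the placeholders and refuses to continue if any remain. The function name is illustrative, and the account ID and IP used in the usage note are made-up sample values.

```python
import json
import re


def render_task_definition(template: str, values: dict[str, str]) -> dict:
    """Replace <NAME> placeholders and return the parsed task definition."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace(f"<{name}>", value)
    # Anything that still looks like <UPPER_CASE> was not filled in.
    leftover = re.findall(r"<[A-Z0-9_]+>", rendered)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    return json.loads(rendered)
```

For example, `render_task_definition(text, {"ACCOUNT_ID": "123456789012", "EC2_PRIVATE_IP": "10.0.1.25"})` returns a dict you can dump back to a file before registering the task definition.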

Part 10: Alloy Fargate Config

Part 11: Run ECS Service

What to check

Part 12: Verify in Prometheus

Part 13: Verify in Grafana

Part 14: Grafana Explore Queries

Part 15: What SRE Must Monitor

1. EC2 monitoring server health

2. Disk usage
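The disk usage query for this item is `100 - ((avail * 100) / size)` on the root filesystem. The same arithmetic as a Python sketch makes it easy to sanity-check a dashboard value by hand. The function name is illustrative.

```python
def disk_used_percent(avail_bytes: float, size_bytes: float) -> float:
    """Mirror the PromQL expression: 100 - ((avail * 100) / size)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100 - (avail_bytes * 100) / size_bytes
```

A 100 GiB root volume with 25 GiB available gives 75 percent used.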

3. Fargate task memory
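The alert condition for this item is memory utilization above 85% for 3 minutes, where utilization is `utilized / reserved * 100`. The sketch below shows what "for 3 minutes" means in practice: one high sample is not enough, the breach has to hold. It is a simplified approximation of a Prometheus alerting rule's hold duration, not its exact evaluation logic, and the sample format is an assumption for the example.

```python
def sustained_breach(samples, threshold_pct=85.0, hold_seconds=180):
    """Return True if utilization stays above the threshold for hold_seconds.

    samples: list of (unix_ts, utilized, reserved) tuples, oldest first.
    """
    breach_start = None
    for ts, utilized, reserved in samples:
        pct = utilized / reserved * 100
        if pct > threshold_pct:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip resets the clock
    return False
```

A task that spikes to 90% and drops back within a minute never fires, while a task that sits at 90% for a full three minutes does.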

4. Application request rate
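The request rate query for this item, `sum(rate(http_requests_total[5m]))`, turns an ever-growing counter into requests per second. The sketch below shows the core idea for a single series: add up the increases between samples, treat a drop as a counter reset, and divide by the elapsed time. It is deliberately simplified. Prometheus's real `rate()` also extrapolates to the edges of the window, which this sketch leaves out. The function name is illustrative.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the given samples.

    samples: list of (unix_ts, counter_value) tuples, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A lower value means the counter restarted from zero.
        increase += value - prev if value >= prev else value
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

If the result is zero while you expect traffic, the app or the routing in front of it may be broken.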

5. Error rate

Part 16: What DevOps Must Check
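One item on this checklist is that Fargate can reach the EC2 private IP on the remote write and Loki ports (9090 and 3100). A plain TCP connect test separates "network or security group problem" from "application problem". This is a minimal sketch. The function name is illustrative, and the host in the usage note is a placeholder.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False
```

Run from something inside the same subnets and security group as the task, `can_connect("<EC2_PRIVATE_IP>", 9090)` and `can_connect("<EC2_PRIVATE_IP>", 3100)` should both be `True`.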

Part 17: Troubleshooting

Problem: ECS task running but no metrics
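For this problem, the Alloy sidecar logs are the first place to look, and the phrases to search for are "connection refused", "timeout", and "remote write failed". The sketch below counts those phrases in a batch of log lines, for example lines exported from CloudWatch. The function name is illustrative, and the sample lines in the usage note are invented rather than copied from real Alloy output.

```python
from collections import Counter

# Phrases this lab tells you to look for in the alloy-sidecar logs.
ERROR_PATTERNS = ("connection refused", "timeout", "remote write failed")


def count_push_errors(log_lines):
    """Count how many lines contain each known error phrase (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for pattern in ERROR_PATTERNS:
            if pattern in lowered:
                counts[pattern] += 1
    return counts
```

A run of "connection refused" usually points at the security group or a wrong private IP, which matches the causes listed for this problem.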

Problem: Grafana shows no Loki logs

Problem: Node Exporter works but Fargate metrics missing

You will build this architecture:

    ECS Fargate Application
            |
            | metrics/logs
            v
    Alloy sidecar
            |
            | remote_write metrics
            | push logs
            v
    EC2 Monitoring Server
      - Prometheus :9090
      - Grafana :3000
      - Loki :3100
      - Alloy
      - Node Exporter

Officially, ECS Fargate tasks use a task execution role for ECS actions such as pulling images and writing logs, and a task role for the application's own AWS permissions (AWS Documentation). Alloy supports ECS/Fargate container metrics through the ECS Task Metadata Endpoint v4 and should run as a sidecar inside the task (Grafana Labs).

Your EC2 already has:

    Prometheus
    Grafana
    Node Exporter
    Loki
    Alloy

Check all services:

    sudo systemctl status prometheus
    sudo systemctl status grafana-server
    sudo systemctl status loki
    sudo systemctl status alloy
    sudo systemctl status node_exporter

Expected for each service:

    active (running)

Before we connect ECS, the central monitoring server must be healthy.

Check the listening ports:

    sudo ss -tulnp | grep -E '3000|9090|9100|3100'

    3000 Grafana
    9090 Prometheus
    9100 Node Exporter
    3100 Loki

Check the endpoints:

    curl http://localhost:9090/-/ready
    curl http://localhost:3100/ready
    curl http://localhost:9100/metrics

    Prometheus ready
    Loki ready
    Node metrics visible

Fargate tasks are dynamic, and their private IPs change. So instead of Prometheus scraping every task IP, Alloy inside Fargate will push metrics to Prometheus.

Open the Prometheus service file:

    sudo systemctl edit prometheus

    [Service]
    ExecStart=
    ExecStart=/usr/local/bin/prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/var/lib/prometheus \
      --web.listen-address=:9090 \
      --web.enable-lifecycle \
      --web.enable-remote-write-receiver

    sudo systemctl daemon-reload
    sudo systemctl restart prometheus
    sudo systemctl status prometheus

    curl http://localhost:9090/-/ready

Fargate containers cannot easily be scraped by a fixed IP because tasks start and stop dynamically. Remote write lets Alloy push metrics to Prometheus.

Open the security group of the monitoring server:

    EC2 → Instances → Select monitoring EC2 → Security → Security Group

Source for the monitoring ports (the VPC CIDR in this lab):

    10.0.0.0/16

Do not open 9090, 3100, or 9100 to 0.0.0.0/0. Prometheus and Loki do not enforce authentication by default, so anyone who can reach those ports can read your data and, with the remote write receiver enabled, write to it. Keep them private.

Edit the Alloy config on EC2:

    sudo nano /etc/alloy/config.alloy

    prometheus.exporter.unix "local_host" {
      set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "netdev", "loadavg"]
    }

    prometheus.scrape "local_host" {
      targets    = prometheus.exporter.unix.local_host.targets
      forward_to = [prometheus.remote_write.local_prom.receiver]
    }

    prometheus.remote_write "local_prom" {
      endpoint {
        url = "http://127.0.0.1:9090/api/v1/write"
      }
    }

    loki.source.file "system_logs" {
      targets = [
        {__path__ = "/var/log/syslog", job = "syslog"},
        {__path__ = "/var/log/auth.log", job = "auth"},
        {__path__ = "/var/log/nginx/access.log", job = "nginx_access"},
        {__path__ = "/var/log/nginx/error.log", job = "nginx_error"},
      ]
      forward_to = [loki.write.local_loki.receiver]
    }

    loki.write "local_loki" {
      endpoint {
        url = "http://127.0.0.1:3100/loki/api/v1/push"
      }
    }

    sudo alloy fmt --write /etc/alloy/config.alloy
    sudo systemctl restart alloy
    sudo systemctl status alloy

Create the task execution role:

    IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task

    AmazonECSTaskExecutionRolePolicy

    ecsTaskExecutionRole

This allows ECS/Fargate to:

    Pull image from ECR
    Send logs to CloudWatch
    Read Secrets Manager if needed

The managed policy itself covers only ECR pulls and writing to existing CloudWatch log groups. Reading secrets, or creating the log group through awslogs-create-group as the task definition in this lab does, needs extra permissions on this role (secretsmanager:GetSecretValue and logs:CreateLogGroup respectively). Alternatively, create the log group beforehand.

Create the task role:

    IAM → Roles → Create role → ECS Task

    ecsAppTaskRole

For this lab, start with no extra permissions. If the app needs S3 later, add only the exact S3 permissions it requires. The task role is for your application container, not for ECS itself.

Create the cluster:

    ECS → Clusters → Create cluster

    AWS Fargate

    prod-observability-cluster

The cluster is the logical place where ECS services and tasks run.

For the easiest lab, use a demo app that exposes Prometheus metrics on port 8080:

    ghcr.io/brancz/prometheus-example-app:v0.5.0

Create the task definition:

    ECS → Task Definitions → Create new task definition → Create new task definition with JSON

    {
      "family": "fargate-observability-lab",
      "networkMode": "awsvpc",
      "requiresCompatibilities": ["FARGATE"],
      "cpu": "512",
      "memory": "1024",
      "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
      "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsAppTaskRole",
      "containerDefinitions": [
        {
          "name": "demo-app",
          "image": "ghcr.io/brancz/prometheus-example-app:v0.5.0",
          "essential": true,
          "portMappings": [
            { "containerPort": 8080, "protocol": "tcp" }
          ],
          "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
              "awslogs-group": "/ecs/fargate-observability-lab",
              "awslogs-region": "us-east-2",
              "awslogs-stream-prefix": "demo-app",
              "awslogs-create-group": "true"
            }
          }
        },
        {
          "name": "alloy-sidecar",
          "image": "grafana/alloy:latest",
          "essential": false,
          "command": [
            "run",
            "--server.http.listen-addr=0.0.0.0:12345",
            "/etc/alloy/fargate.alloy"
          ],
          "environment": [
            { "name": "ALLOY_STABILITY_LEVEL", "value": "experimental" },
            { "name": "EC2_PROMETHEUS_URL", "value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write" },
            { "name": "EC2_LOKI_URL", "value": "http://<EC2_PRIVATE_IP>:3100/loki/api/v1/push" }
          ],
          "portMappings": [
            { "containerPort": 12345, "protocol": "tcp" }
          ],
          "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
              "awslogs-group": "/ecs/fargate-observability-lab",
              "awslogs-region": "us-east-2",
              "awslogs-stream-prefix": "alloy",
              "awslogs-create-group": "true"
            }
          }
        }
      ]
    }

Replace:

    <ACCOUNT_ID>
    <EC2_PRIVATE_IP>
    us-east-2 if your region is different

For a real production setup, store the Alloy config in:

    EFS
    S3 pulled at startup
    custom Alloy image

For a class or demo, a custom Alloy image is easiest.

The Alloy config for Fargate:

    fargate.alloy

    prometheus.scrape "app_metrics" {
      targets = [
        {"__address__" = "127.0.0.1:8080", "job" = "demo-app"}
      ]
      forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
    }

    otelcol.receiver.awsecscontainermetrics "fargate_metrics" {
      collection_interval = "30s"
      output {
        metrics = [otelcol.exporter.prometheus.fargate_to_prom.receiver]
      }
    }

    otelcol.exporter.prometheus "fargate_to_prom" {
      forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
    }

    prometheus.remote_write "ec2_prometheus" {
      endpoint {
        url = env("EC2_PROMETHEUS_URL")
      }
    }

This collects:

    Application /metrics
    Fargate task CPU
    Fargate task memory
    Container-level metrics

Create the service:

    ECS → Clusters → prod-observability-cluster → Services → Create

    Launch type: Fargate
    Task definition: fargate-observability-lab
    Service name: demo-app-service
    Desired tasks: 1

    VPC: same VPC as EC2 monitoring server
    Subnets: private subnets preferred
    Security group: allow outbound to EC2 private IP ports 9090 and 3100
    Public IP: disabled if private subnet has NAT

What to check:

    ECS → Cluster → Service → Tasks

    Task status: Running
    Containers: demo-app running, alloy-sidecar running

Open Prometheus:

    http://<EC2_PUBLIC_IP>:9090

    Status → TSDB Status

Then search in Graph.

Check Alloy internal metrics:

    alloy_component_controller_running_components

Check EC2 host metrics:

    rate(node_cpu_seconds_total[5m])

    node_memory_MemAvailable_bytes

Check app request metrics:

    http_requests_total

Check Fargate container metrics:

    ecs_task_memory_utilized

    container_memory_usage_bytes

Metric names may vary depending on the Alloy/OpenTelemetry conversion.

Open Grafana:

    http://<EC2_PUBLIC_IP>:3000

    Connections → Data sources

Prometheus data source:

    URL: http://localhost:9090

Loki data source:

    URL: http://localhost:3100

    Save & test

    Data source is working

Prometheus queries in Explore:

    Grafana → Explore → Prometheus

    rate(node_network_receive_bytes_total[1m])

    100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

    rate(http_requests_total[5m])

Loki queries in Explore:

    Grafana → Explore → Loki

    {job="syslog"}

    {job="auth"}

    {job="nginx_access"}

For ECS logs, first check CloudWatch logs:

    CloudWatch → Log groups → /ecs/fargate-observability-lab

EC2 monitoring server health:

    100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

    Memory > 85%

If the monitoring server dies, you lose visibility.

Disk usage:

    100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})

Prometheus and Loki can fill the disk quickly.

Fargate task memory:

    ecs_task_memory_utilized / ecs_task_memory_reserved * 100

    > 85% for 3 minutes

Fargate kills containers when the memory limit is reached.

Application request rate:

    sum(rate(http_requests_total[5m]))

If traffic drops to zero, the app or routing may be broken.

Error rate:

    sum(rate(http_requests_total{code=~"5.."}[5m]))

5xx errors show an application or dependency failure.

DevOps engineer checks:

    1. IAM roles are correct
    2. ECS task is running
    3. Security groups allow only needed ports
    4. Fargate can reach EC2 private IP
    5. Prometheus remote write is enabled
    6. Loki is receiving logs
    7. Grafana data sources work
    8. No public access to Prometheus/Loki/Node Exporter
    9. ECS service has desired count = running count
    10. CloudWatch logs exist for both containers

If the ECS task is running but there are no metrics, check the sidecar logs:

    ECS → Task → alloy-sidecar → Logs

Look for:

    connection refused
    timeout
    remote write failed

Common causes:

    EC2 security group blocks port 9090
    Wrong EC2 private IP
    Prometheus remote write receiver not enabled
    Alloy config error

If Grafana shows no Loki logs, check on EC2:

    curl http://localhost:3100/ready
    sudo journalctl -u alloy -f
    sudo journalctl -u loki -f

Common causes:

    Loki not running
    Wrong Loki URL
    Alloy cannot read log files
    No permissions on /var/log/*

If Node Exporter works but Fargate metrics are missing: Node Exporter monitors EC2 only. It cannot monitor Fargate hosts. Use the Alloy sidecar with the ECS container metrics receiver.

Final Teaching Summary

This lab demonstrates a real DevOps/SRE production pattern:

ECS Fargate runs application containers.
IAM secures container permissions.
Alloy collects telemetry.
Prometheus stores metrics.
Loki stores logs.
Grafana visualizes everything.
Node Exporter monitors the EC2 monitoring server.

The most important SRE mindset:

Metrics tell you what is happening.
Logs tell you why it happened.
Grafana helps you see the story.
IAM and security groups control who can access what.
