Tools: Build Production-Ready GCP Infrastructure from Scratch Part 04

Source: Dev.to
# Build Production-Ready GCP Infrastructure from Scratch: A Complete Console Guide

## Part 4: Observability & Load Balancer

A 4-Part Series for Complete Beginners

## Overview

In this final part, you'll complete your infrastructure with observability and external access: Prometheus for metrics, Loki for logs, Grafana for dashboards, and an Application Load Balancer for external traffic.

- Estimated time: 45-60 minutes
- Estimated cost: ~$64/month
- Final cumulative cost: ~$301/month

## Prerequisites

Before continuing, ensure you've completed Parts 1-3. If you missed them, start with Part 1: Foundation.

## Step 1: Create Static Internal IPs for Observability

Static internal IPs give the observability VMs predictable addresses, which makes configuration easier: no configs need updating if the VMs are recreated. Reserve one IP each for Prometheus and Loki; afterwards you should see 2 reserved IPs.

## Step 2: Create Prometheus VM

Prometheus is a metrics collection and storage system.

- Why self-hosted: full control, no vendor lock-in, predictable costs.
- Why a 50 GB disk: 7 days of metrics at a 15 s scrape interval requires roughly 30-40 GB; 50 GB provides headroom.

VM creation time: 2-3 minutes.

## Step 3: Create Loki VM (with Grafana)

- Why combined: cost optimization. A single VM runs both services (~$23/month).
- Why a public IP: Grafana needs to be accessible from your browser. In production, use IAP instead of a public IP.

VM creation time: 2-3 minutes. Copy the external IP; you'll need it for Grafana access.

## Step 4: Access Grafana Dashboard

Open your browser and navigate to http://[LOKI_EXTERNAL_IP]:3000. Troubleshooting steps for unhealthy datasources are listed below.

## Step 5: Update Prometheus Targets

Prometheus "scrapes" metrics from targets. We need to add our backend and cache VMs.

## Step 6: Create External Application Load Balancer

An Application Load Balancer (ALB) distributes external traffic across your backends. Named ports link a name (like "http") to a port number (3000); the load balancer uses named ports for routing. To set one, edit the MIG and scroll to the "Named ports" section. When creating the load balancer, click "Application Load Balancer (HTTP/S)" → "Configure".

- Backend type: Instance group
- Frontend protocol: HTTPS (plus an HTTP frontend that redirects). Click "Add frontend IP and port" for each. Note: for testing you can use HTTP only (skip the certificate); for production, add your domain and create a Google-managed certificate.
- Routing rules: leave host rules and path matcher at their defaults (all hosts, all paths) and select the backend service created above.

Review the configuration and click "Create load balancer". Creation time: 5-10 minutes.

## Step 7: Test End-to-End Connectivity

- Health check: expect HTTP 200 with a response like {"status":"ok"}.
- Full request flow: expect a response from the backend application.

## Troubleshooting Symptoms (covered below)

- Datasource shows "Could not connect"
- No healthy backend instances
- VMs constantly at 90%+ CPU
- "Permission denied" accessing secrets

## Congratulations!

You've built a complete, production-ready GCP infrastructure:

- ✅ Network: VPC with 5 subnets, Cloud NAT, firewall rules
- ✅ Security: Secret Manager, bastion with IAP, OS Login
- ✅ Data: Cloud SQL PostgreSQL with regional HA
- ✅ Compute: Managed Instance Group with autoscaling
- ✅ Cache: Redis + PgBouncer
- ✅ Observability: Prometheus, Loki, Grafana
- ✅ Access: External Application Load Balancer

Your infrastructure is ready for production deployment! Series Complete!
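The IP reservations in Step 1 can also be made from Cloud Shell. A hedged sketch: the address names (`dev-prometheus-ip`, `dev-loki-ip`) are illustrative assumptions, while the subnet (`private-obs`), region (`europe-west1`), and addresses (10.0.5.10/10.0.5.11) follow this guide — adjust to your project:

```shell
# Reserve the two static internal IPs for the observability VMs.
# Address names here are assumptions, not from the console walkthrough.
gcloud compute addresses create dev-prometheus-ip \
  --region=europe-west1 \
  --subnet=private-obs \
  --addresses=10.0.5.10

gcloud compute addresses create dev-loki-ip \
  --region=europe-west1 \
  --subnet=private-obs \
  --addresses=10.0.5.11

# Confirm both reservations exist
gcloud compute addresses list --filter="region:europe-west1"
```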
You now have a fully functional, production-ready GCP infrastructure. From here, you can deploy your NestJS application and scale as needed.

## Startup Script: Prometheus VM

```bash
#!/bin/bash
# Prometheus Startup Script
set -e
echo "=== Prometheus Startup Script Begin $(date) ==="

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install Node Exporter for self-monitoring
NODE_EXPORTER_VERSION="1.6.1"
echo "Installing Node Exporter ${NODE_EXPORTER_VERSION}..."
useradd --no-create-home --shell /bin/false node_exporter || true
wget -q "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz" -O /tmp/node_exporter.tar.gz
tar xzf /tmp/node_exporter.tar.gz -C /tmp
cp /tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
chmod +x /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

# Create Prometheus configuration.
# Note: retention is NOT a valid key under "global:" -- Prometheus rejects
# unknown config fields. The 7-day retention is set via the
# --storage.tsdb.retention.time flag on the container instead.
cat > /opt/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  # Backend VMs (update IPs after MIG creation)
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']
        labels:
          tier: backend

  # Cache VM
  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']
        labels:
          tier: cache
EOF

# Run Prometheus with 7-day retention (passing args overrides the image's
# default CMD, so --config.file must be restated)
docker run -d \
  --name prometheus \
  --restart unless-stopped \
  -p 9090:9090 \
  -v /opt/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=7d

echo "=== Prometheus Startup Script Complete $(date) ==="
echo "Prometheus running on port 9090"
echo "Node Exporter running on port 9100"
```
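Once the startup script finishes, you can sanity-check both services with the standard Prometheus and Node Exporter HTTP endpoints. Run these on the Prometheus VM (e.g. via the bastion); `jq` is assumed to be installed for the last command:

```shell
# Prometheus liveness endpoint
curl -sf http://localhost:9090/-/healthy && echo "Prometheus: OK"

# Node Exporter should serve metrics on 9100
curl -sf http://localhost:9100/metrics | head -n 3

# List scrape targets and their health
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'
```

Targets for the backend and cache jobs will show as down until those IPs are updated in Step 5.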
## Startup Script: Loki VM (Loki + Grafana)

```bash
#!/bin/bash
# Loki and Grafana Startup Script
set -e
echo "=== Loki/Grafana Startup Script Begin $(date) ==="

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install Docker Compose
apt-get update
apt-get install -y docker-compose

# Create docker-compose.yml
cat > /opt/docker-compose.yml <<'EOF'
version: '3.8'
services:
  loki:
    # NOTE: the config below uses the Loki 2.x schema (boltdb-shipper,
    # table_manager); consider pinning a 2.x tag instead of latest.
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - /opt/loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    volumes:
      - grafana-storage:/var/lib/grafana
      - /opt/grafana-provisioning:/etc/grafana/provisioning
    restart: unless-stopped

volumes:
  loki-data:
  grafana-storage:
EOF

# Create Loki config
cat > /opt/loki-config.yml <<'EOF'
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_target_size: 1048576
  chunk_retain_period: 30s

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOF

# Create Grafana provisioning
mkdir -p /opt/grafana-provisioning/datasources

cat > /opt/grafana-provisioning/datasources/loki.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: false
    editable: false
EOF

cat > /opt/grafana-provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://10.0.5.10:9090
    isDefault: true
    editable: false
EOF

# Start services
cd /opt
docker-compose up -d

echo "=== Loki/Grafana Startup Script Complete $(date) ==="
echo "Loki running on port 3100"
echo "Grafana running on port 3000 (http://EXTERNAL_IP:3000)"
echo "Login: admin / admin123"
```

## Access Grafana

```
http://[LOKI_EXTERNAL_IP]:3000
```

## Update Prometheus Targets

```bash
# From your local machine
gcloud compute ssh dev-bastion --tunnel-through-iap

# From bastion, SSH to Prometheus
ssh 10.0.5.10

# Edit Prometheus config
sudo nano /opt/prometheus.yml
```

Update the scrape targets with the actual VM IPs:

```yaml
scrape_configs:
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']  # Update with actual IPs
        labels:
          tier: backend

  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']  # Update with actual IP
        labels:
          tier: cache
```

Then restart Prometheus:

```bash
docker restart prometheus
```
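The health-check and named-port parts of Step 6 are console-driven in this guide. For reference, a hedged gcloud equivalent — the health-check name (`dev-backend-hc`) and threshold values are assumptions; the port (3000), path (`/api/health`), MIG name, and region follow the guide:

```shell
# HTTP health check against the app's /api/health on port 3000
# (name and thresholds are illustrative assumptions)
gcloud compute health-checks create http dev-backend-hc \
  --port=3000 \
  --request-path=/api/health \
  --check-interval=10s \
  --healthy-threshold=2 \
  --unhealthy-threshold=3

# Map the named port "http" to 3000 on the backend MIG,
# so the load balancer knows where to route
gcloud compute instance-groups set-named-ports dev-backend-mig \
  --named-ports=http:3000 \
  --region=europe-west1
```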
## Test the Load Balancer

```bash
# Get LB IP
gcloud compute forwarding-rules list --filter="name=dev-lb*"

# Test health endpoint
curl http://[LB_IP]/api/health
```

```bash
# Test through load balancer
curl http://[LB_IP]/api/test
```

## Verify the Monitoring Pipeline

```bash
# From Loki VM
docker ps  # Check containers running

# Test Prometheus
curl http://10.0.5.10:9090/-/healthy

# Test Loki
curl http://localhost:3100/ready
```

## Troubleshooting Commands

```bash
# From backend VM
curl localhost:3000/health
netstat -tlnp | grep 3000
```

```bash
# Check CPU usage
top -bn1 | head -20

# Check Node.js process
pm2 monit

# Check autoscaler status
gcloud compute instance-groups managed describe dev-backend-mig \
  --region europe-west1
```

```bash
# From VM with correct SA
gcloud secrets versions list db-credentials-dev

# Add IAM role if missing
gcloud secrets add-iam-policy-binding db-credentials-dev \
  --member='serviceAccount:backend-dev-sa@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/secretmanager.secretAccessor'
```

## End-to-End Verification Tests

```bash
# 1. Get Load Balancer IP
gcloud compute forwarding-rules list --filter="name=dev-lb*"

# 2. Send test request
curl http://[LB_IP]/api/health
# Expected: {"status":"ok","timestamp":"..."}
```

```bash
# From backend VM (via bastion)
gcloud compute ssh dev-bastion --tunnel-through-iap
ssh 10.0.2.2  # Backend VM IP

# Test Cloud SQL connection
psql -h 10.100.0.2 -U backend-dev-sa -d appdb
```

```bash
# From backend VM
redis-cli -h 10.0.4.2 PING
# Expected: PONG

# Test PgBouncer
psql -h 10.0.4.2 -p 6432 -U app_admin -d appdb
```

```bash
# Simulate instance failure
gcloud compute instances delete [ONE_BACKEND_INSTANCE] --quiet

# Verify:
# - MIG auto-heals (new instance appears)
# - Load balancer continues serving
# - No data loss
```

## Series Navigation

- Part 1: Foundation - Project Setup, VPC & Networking
- Part 2: Security Services - Secrets, Bastion & IAM
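Since the load balancer takes 5-10 minutes to start serving after creation, the first `curl` often fails even when everything is configured correctly. A small bash helper (a sketch, not part of the original guide) that polls until the endpoint returns 200:

```shell
# Poll a URL until it returns HTTP 200, or give up after N tries
# (default 30 tries, 10 s apart ~= 5 minutes).
wait_for_200() {
  local url=$1 tries=${2:-30}
  local i code
  for i in $(seq 1 "$tries"); do
    code=$(curl -s -o /dev/null -w "%{http_code}" "$url" || true)
    if [ "$code" = "200" ]; then
      echo "healthy after ${i} tries"
      return 0
    fi
    sleep 10
  done
  echo "gave up"
  return 1
}

# Usage: wait_for_200 "http://[LB_IP]/api/health"
```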
- Part 3: Database & Compute Resources
- Part 4: Observability & Load Balancer ← You are here

## What You'll Build in Part 4

- Prometheus VM for metrics collection (7-day retention)
- Loki VM for log aggregation with Grafana
- External Application Load Balancer with SSL
- End-to-end health checks and monitoring

## Prerequisites Checklist

- [ ] VPC and 5 subnets exist (including private-obs subnet)
- [ ] Cloud SQL with private IP is running
- [ ] Backend MIG has 2+ healthy VMs
- [ ] Cache VM with Redis/PgBouncer is running
- [ ] Firewall rules allow health check IPs (35.191.0.0/16, 130.211.0.0/22)

## Step 1: Reserve Static Internal IPs

- Navigate to VPC networks → Internal IP addresses
- Click "Reserve static internal IP address"

## Step 2: Prometheus VM

Prometheus:
- Scrapes metrics from Node Exporter (every VM)
- Stores 7 days of metrics data
- Provides a query API for Grafana

Create the VM:
- Navigate to Compute Engine → VM instances
- Click "Create instance"
- Name: dev-prometheus

Verify:
- Status: Running
- Internal IP: 10.0.5.10
- External IP: None

## Step 3: Loki VM (with Grafana)

- Loki: log aggregation system (like Prometheus, but for logs)
- Grafana: visualization dashboard for metrics and logs

Create the VM:
- Navigate to Compute Engine → VM instances
- Click "Create instance"
- Name: dev-loki

Verify:
- Status: Running
- Internal IP: 10.0.5.11
- External IP: (Assigned IP)

## Step 4: Access Grafana

- Navigate to Compute Engine → VM instances
- Find the dev-loki VM and copy its External IP address
- In Grafana, click "Configuration" (gear icon) → Data sources
- Verify Prometheus shows "Healthy" (green checkmark)
- Verify Loki shows "Healthy"

If datasources are unhealthy:
- Check Prometheus is running: docker ps on the Loki VM
- Verify network connectivity: curl http://10.0.5.10:9090/-/healthy
- Check Docker logs: docker logs loki or docker logs grafana

## Step 5: Update Prometheus Targets

- SSH to the Prometheus VM via the bastion
- Update the targets with the actual backend VM IPs
- Restart Prometheus

Verify targets:
- Navigate to http://[LOKI_EXTERNAL_IP]:3000
- Go to Explore → select the Prometheus datasource
- Query: up{job="backend"}
- You should see metrics from the backend VMs

## Step 6: Create the Load Balancer

The ALB:
- Distributes traffic across backend VMs
- Provides a single public IP for external access
- Handles SSL termination
- Health-checks backend availability

Health check:
- Navigate to Compute Engine → Health checks
- Click "Create health check"

Named port:
- Navigate to Compute Engine → Instance groups
- Click on dev-backend-mig
- Click "Edit group"

Load balancer:
- Navigate to Network Services → Load balancing
- Click "Create load balancer"
- Name: dev-lb

Verify:
- Status: (Checkmark) Active
- IP address: (Reserved IP)
- Backends: dev-backend-mig (healthy)

## Step 7: Verify Metrics and Logs

Prometheus metrics:
- Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
- Go to Explore → Prometheus
- Query: rate(http_requests_total[5m])
- You should see request metrics

Loki logs:
- Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
- Go to Explore → Loki
- Query: {job="nestjs"}
- You should see application logs

## Part 4 Verification Checklist

- [ ] Prometheus VM running at 10.0.5.10:9090
- [ ] Loki VM running with external IP assigned
- [ ] Grafana accessible at http://[LOKI_EXTERNAL_IP]:3000
- [ ] Prometheus datasource shows "Healthy"
- [ ] Loki datasource shows "Healthy"
- [ ] Prometheus scraping backend and cache VMs
- [ ] Load balancer has a reserved static IP
- [ ] Backend service shows healthy instances
- [ ] curl http://[LB_IP]/api/health returns 200
- [ ] Full request flow works (LB → MIG → App)

## Cost Optimization Tips

- Use preemptible VMs for non-critical workloads (save 80%)
- Reduce Cloud SQL tier to db-g1-small for dev (save ~$70/month)
- Scale the MIG to 0 during off-hours (save ~$46/month)
- Disable Flow Logs on non-critical subnets (save ~$5-10/month)

## Comprehensive Troubleshooting

Grafana cannot connect to datasources:
- Check Prometheus is running
- Verify network connectivity
- Check firewall rules

Load balancer shows 0/0 healthy:
- Check the health check configuration
- Verify the app is running on port 3000
- Check the firewall for health check IPs
- Increase the health check unhealthy threshold to 5
- Increase the initial delay for MIG autohealing to 300
- Verify a firewall rule allows 35.191.0.0/16 and 130.211.0.0/22

High CPU on backend VMs:
- Check application logs
- Verify autoscaling thresholds
- Profile application performance
- Increase the machine type (e2-medium → e2-highcpu-4)
- Optimize application code
- Increase max replicas to 6 or 8

Secrets not accessible from VMs:
- Verify the service account has the Secret Accessor role
- Check the secret exists and has versions

## Next Steps Beyond This Guide

1. Domain & SSL
- Purchase a domain from a registrar
- Configure DNS to point to the LB IP
- Update the load balancer certificate with your domain
- Enable a Google-managed certificate

2. CI/CD Pipeline
- Set up GitHub Actions
- Auto-deploy on push
- Run tests in a container
- Automated rollback on failure

3. Monitoring Enhancements
- Add alert rules in Prometheus
- Configure PagerDuty/Slack integration
- Create custom Grafana dashboards
- Set up uptime monitoring

4. Security Hardening
- Enable VPC Service Controls
- Configure Organization Policies
- Set up Security Command Center
- Implement workload identity

5. Multi-Environment
- Create a staging environment
- Use Shared VPC
- Implement environment isolation
- Set up service directory

## Final Verification

- Generate traffic: ab -n 1000 http://[LB_IP]/api/health
- Open Grafana and check dashboards for metrics
- Verify Loki has logs

## References

- GCP Documentation
- Prometheus Documentation
- Grafana Documentation
- Loki Documentation
- Load Balancer Documentation
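As a wrap-up, the individual verification checks from Step 7 can be strung into one smoke-test script. This is a sketch: `LB_IP` is a placeholder you must supply, the Prometheus IP follows this guide, and the Loki check assumes you run the script from the Loki VM:

```shell
#!/bin/bash
# Consolidated smoke test for the Part 4 stack (sketch).
set -u
: "${LB_IP:?set LB_IP to your load balancer IP first}"

echo "1) Load balancer health endpoint"
curl -sf "http://${LB_IP}/api/health" && echo " ... OK"

echo "2) Prometheus health endpoint"
curl -sf "http://10.0.5.10:9090/-/healthy" && echo " ... OK"

echo "3) Loki readiness (run from the Loki VM)"
curl -sf "http://localhost:3100/ready" && echo " ... OK"

echo "Smoke test finished"
```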