Tools: DevOps Monitoring & Alerting — Real-World Lab (Prometheus + Grafana)

Source: Dev.to

## 1) Why DevOps sets up email notifications

Dashboards are passive; alerts + email are active. You need email notifications when:

- You are on-call and must know about incidents immediately
- The system is unattended (night/weekend)
- You need evidence for SLAs and incident reports

The goals are to:

- Detect problems before users complain
- Reduce MTTR (mean time to recovery)
- Avoid "silent failure" (monitoring is broken but nobody knows)

## 2) What must be true before email notifications can work

Email notification depends on 4 layers:

1. Exporter / metrics exist (node_exporter is up)
2. Prometheus scrapes (Targets show UP)
3. Grafana alert rule fires (Normal → Pending → Firing)
4. Notification delivery works (SMTP + contact point + a policy that routes alerts)

In real life, most failures happen at layer 4.

## 3) Step-by-step: Configure SMTP on the Grafana server (DevOps setup)

This is done on the machine running Grafana (your "monitor" instance).

### Step 3.1 — SSH to the Grafana server

```
ssh -i ~/Downloads/keypaircalifornia.pem ubuntu@<GRAFANA_PUBLIC_IP>
```

### Step 3.2 — Edit the Grafana config

```
sudo nano /etc/grafana/grafana.ini
```

### Step 3.3 — Add/enable the SMTP section

For Gmail SMTP (lab-friendly):

```ini
[smtp]
enabled = true
host = smtp.gmail.com:587
user = [email protected]
password = YOUR_GMAIL_APP_PASSWORD
from_address = [email protected]
from_name = Grafana Alerts
skip_verify = true
startTLS_policy = OpportunisticStartTLS
```

### DevOps notes (what matters)

- `host`: SMTP server + port
- `user`: the mailbox used to send alerts (the sender)
- `password`: an App Password, not your normal Gmail password
- `from_address`: must match the sender for best deliverability
- `startTLS_policy`: enables encryption for SMTP

### Step 3.4 — Restart Grafana to load the changes

```
sudo systemctl restart grafana-server
sudo systemctl status grafana-server
```

If Grafana fails to start, your config has a syntax problem.

### Step 3.5 — Watch the Grafana logs while testing (a DevOps habit)

```
sudo journalctl -u grafana-server -f
```

Keep this open while testing notifications.

## 4) Step-by-step: Gmail App Password (the most common failure)

Your error:

```
535 5.7.8 Username and Password not accepted (BadCredentials)
```

That means you used a normal password, or Gmail blocked the sign-in.

### Step 4.1 — Enable 2-Step Verification (required)

Google Account → Security → 2-Step Verification → ON

### Step 4.2 — Create an App Password

Google Account → Security → App passwords → create one for "Mail". Copy the 16-character app password.

### Step 4.3 — Put that App Password in grafana.ini

Paste it without spaces, then restart Grafana again.

### DevOps tip

- `535 BadCredentials` → wrong password / app password missing
- `534-5.7.9 Application-specific password required` → needs an app password
- Connection timeout → network egress blocked, or wrong SMTP host/port

## 5) Step-by-step: Configure the Grafana UI (contact point + policy)

SMTP is server-side; the UI decides WHO gets notified.

### Step 5.1 — Create a contact point

Grafana → Alerting → Contact points → Create contact point

- Type: Email
- Addresses: your receiver email (example: [email protected])

### Step 5.2 — Test the contact point (mandatory)

Expected:

- UI: "Test notification sent"
- Inbox: "Grafana test notification"
- Logs: show the email send attempt

If it fails:

- Look at the UI error + the logs
- Fix SMTP first

### Step 5.3 — Configure the notification policy (routing)

Grafana → Alerting → Notification policies. Ensure there is a policy that routes alerts to your contact point. Options:

- Put your email contact point in the Default policy, or
- Create a policy that matches labels like:
  - severity = critical
  - team = devops

### DevOps rule

No policy route → no notification, even if the contact point exists.

## 6) Step-by-step: Create a "real" alert and trigger it

### Step 6.1 — Create the alert rule (example: High CPU)

Use the Prometheus datasource and this query:

```
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Condition: IS ABOVE 80.

Labels (important for routing):

- severity = warning or critical
- team = devops

### Step 6.2 — Trigger CPU load on the target machine

```
sudo apt update
sudo apt install -y stress
stress --cpu 2 --timeout 180
```

### Step 6.3 — Watch the alert state

Grafana → Alerting → Active alerts: Normal → Pending → Firing.

### Step 6.4 — Confirm the email arrives

- FIRING email
- RESOLVED email after the load ends

## 7) How DevOps reads an alert email (what matters)

When an alert email arrives, DevOps must answer:

### A) What is the problem?

For example "Node down" or "Disk almost full". This tells you the urgency and the type of incident.

### B) Which system/server?

- instance label (IP:port)
- job label (node/prometheus)
- environment label (prod/dev), if you use it

In your lab, the most important is: instance="172.31.x.x:9100"

### C) How bad is it?

- Severity label: warning vs critical
- The actual value (CPU 92%, disk 95%)

### D) Is it new or recurring?

- "For 1m" or "For 5m" indicates persistence
- Similar previous emails

### E) What action should I take first?

DevOps initial actions should be fast.

High CPU:

- SSH to the server
- Check the top processes:

```
top
ps aux --sort=-%cpu | head
```

- Identify the cause: deployment? runaway job? attack?
- Mitigation: restart the service, scale out, stop the job

Node down:

- Check if the host is reachable (ping/ssh)
- AWS instance status checks
- Security group changes?
- node_exporter service status

Disk almost full:

- Find the biggest usage:

```
df -h
sudo du -xh / | sort -h | tail
```

- Clean logs / expand the disk / rotate logs

## 8) What DevOps must pay attention to (best practices)

### 1) Always alert on monitoring failures

```
up{job="node"} == 0
```

Because if node_exporter dies, you become blind.

### 2) Avoid noisy alerts

- Use FOR 1m or FOR 5m
- Use avg / rate windows

Otherwise you get spam and start ignoring alerts.

### 3) Include context in alerts

Use labels/annotations:

- summary: "CPU above 80% on {{ $labels.instance }}"
- description: "Check top, deployments, scaling"

### 4) Test notifications regularly

DevOps must re-test after:

- SMTP changes
- Grafana upgrades
- Firewall changes
- Password rotations

### 5) Separate "Warning" vs "Critical"

- warning: CPU > 80% for 5m
- critical: CPU > 95% for 2m

## 9) Mini checklist

- ✅ SMTP configured in /etc/grafana/grafana.ini
- ✅ Gmail App Password (not the normal password)
- ✅ Grafana restarted
- ✅ Contact point created + test succeeded
- ✅ Notification policy routes alerts to the contact point
- ✅ Alert rule has the correct query + labels
- ✅ Trigger event causes Firing + email received

## 🧪 PromQL LAB: Why Node Exporter Is Mandatory for DevOps

### 🔁 Architecture Reminder (Before Lab)

```
[ Linux Server ]
  └── node_exporter (system metrics)
        ↓
   Prometheus (scrapes metrics)
        ↓
   Grafana (query + alert + notify)
```

### LAB PART 1 — What Prometheus Knows WITHOUT Node Exporter

#### Step 1 — Open the Prometheus UI

```
http://<PROMETHEUS_IP>:9090
```

#### Step 2 — Run this query

```
up
```

Expected result — you will see something like:

```
up{job="prometheus"} = 1
```

DevOps explanation:

- Prometheus knows itself
- It knows nothing about CPU, memory, or disk
- `up` only means "can I scrape this endpoint?"

👉 Important DevOps truth: Prometheus by itself only knows if targets are reachable, not how the system behaves.

#### Step 3 — Try this query (WITHOUT node_exporter)

```
node_cpu_seconds_total
```

Expected result: no data.

- Prometheus does not collect OS metrics
- Prometheus is not an agent
- It only pulls what is exposed

👉 DevOps conclusion: Prometheus is a collector, not a sensor.

### LAB PART 2 — What Node Exporter Adds

Now node_exporter is installed and running on the target machine.

#### Step 4 — Confirm node_exporter is scraped

```
up{job="node"}
```

Expected result:

```
up{instance="172.31.x.x:9100", job="node"} = 1
```

DevOps meaning:

- Prometheus can reach node_exporter
- Metrics are available
- Monitoring is alive

### LAB PART 3 — CPU Metrics (Most Common Incident)

#### Step 5 — Raw CPU metric

```
node_cpu_seconds_total
```

What you see:

- Multiple time series
- Labels: cpu="0", mode="idle" | "user" | "system" | "iowait"

DevOps explanation:

- Linux CPU time is cumulative
- The metrics grow forever
- We must use rate() to make sense of them

#### Step 6 — CPU usage percentage (the REAL DevOps query)

```
100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)
```

What this shows: CPU usage %.

DevOps interpretation:

- 0–30% → normal
- 50–70% → watch
- > 80% → alert
- > 95% → incident

👉 Why DevOps needs this: high CPU causes slow apps, timeouts, and failed deployments.

### LAB PART 4 — Memory Metrics (Silent Killers)

#### Step 7 — Total memory

```
node_memory_MemTotal_bytes
```

Interpretation: physical RAM installed — it does NOT change.

#### Step 8 — Available memory

```
node_memory_MemAvailable_bytes
```

DevOps meaning: how much memory apps can still use — much better than "free memory".

#### Step 9 — Memory usage percentage

```
(
  1 - (
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
  )
) * 100
```

DevOps interpretation:

- Memory > 80% → danger
- Memory leaks show as a slow increase
- OOM kills happen suddenly

👉 Why DevOps needs this: memory issues crash apps without warning if not monitored.

### LAB PART 5 — Disk Metrics (Most Dangerous)

#### Step 10 — Disk usage %

```
100 - (
  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
) * 100
```

DevOps interpretation:

- Disk full = apps crash
- Databases stop
- Logs can't write
- The OS can become unstable

### LAB PART 6 — Network Metrics (Hidden Bottlenecks)

#### Step 11 — Network receive rate

```
rate(node_network_receive_bytes_total[5m])
```

#### Step 12 — Network transmit rate

```
rate(node_network_transmit_bytes_total[5m])
```

DevOps interpretation:

- Sudden spikes → traffic surge or attack
- Drops → network issues
- Used in DDoS detection and load-testing validation

### LAB PART 7 — Proving Why Node Exporter Is REQUIRED

Question to students: "Why can't Prometheus do this alone?"

Answer — Prometheus alone:

- ❌ Does not know CPU
- ❌ Does not know memory
- ❌ Does not know disk
- ❌ Does not know network
- ❌ Does not run on every server

node_exporter:

- ✅ Reads /proc and /sys
- ✅ Exposes OS internals safely
- ✅ Is lightweight
- ✅ Is the industry standard

👉 DevOps conclusion: Prometheus without exporters is blind, and if an exporter dies, monitoring dies silently. "Prometheus collects metrics, node_exporter exposes system data, PromQL turns numbers into insight, alerts turn insight into action."

### LAB PART 8 — Real Incident Simulation

#### Step 13 — Generate CPU load

```
stress --cpu 2 --timeout 120
```

#### Step 14 — Watch the PromQL graph change

```
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
```

DevOps observation:

- The alert transitions to Firing
- An email notification is sent

### WHAT DEVOPS MUST PAY ATTENTION TO

#### 1️⃣ Always monitor the exporters themselves

```
up{job="node"} == 0
```

👉 This alert is mandatory in production.

#### 2️⃣ Use time windows correctly

- rate(...[1m]) → fast reaction
- rate(...[5m]) → stable alerts

#### 3️⃣ Avoid raw counters

Use:

```
rate(node_cpu_seconds_total[5m])
```

not:

```
node_cpu_seconds_total
```

#### 4️⃣ Labels matter

- instance → which server
- job → which role
- mountpoint → which disk

## 🧪 LAB: Monitor KIND Kubernetes using EC2 Prometheus (Central Monitoring)

### 🧱 Final Architecture (Explain First)

```
EC2 (Prometheus + Grafana)
 |
 | scrape metrics
 v
KIND cluster
 ├─ control-plane node (node-exporter pod)
 ├─ worker node (node-exporter pod)
```

👉 Prometheus stays on EC2.
👉 KIND is just another "target".

### PHASE 0 — Prerequisites

On your laptop:

- KIND cluster running (`kubectl get nodes`)
- kubectl configured

On EC2:

- Prometheus already running
- Prometheus UI accessible
- You know the EC2 private IP of Prometheus

### PHASE 1 — Deploy Node Exporter in KIND (DaemonSet)

Why a DaemonSet?

- It needs host access
- One per node, not per pod
- "If something must run on every node → DaemonSet"

#### STEP 1 — Create the monitoring namespace

```
kubectl create namespace monitoring
```

#### STEP 2 — Node Exporter DaemonSet for KIND

Create file: node-exporter-kind.yaml

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          args:
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            - "--path.rootfs=/host/root"
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
```

Apply it:

```
kubectl apply -f node-exporter-kind.yaml
```

#### STEP 3 — Verify the node exporter pods

```
kubectl get pods -n monitoring -o wide
```

Expected: one pod per KIND node, each on a different node.

👉 DevOps rule: if a node has no exporter → you are blind on that node.

### PHASE 2 — Expose Node Exporter to EC2 Prometheus

Key concept (VERY IMPORTANT): KIND runs locally, and EC2 Prometheus cannot directly reach 127.0.0.1, so we expose node_exporter via a NodePort.

#### STEP 4 — Create the NodePort Service

Create file: node-exporter-svc.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
      nodePort: 30910
```

```
kubectl apply -f node-exporter-svc.yaml
```

#### STEP 5 — Verify the NodePort

```
kubectl get svc -n monitoring
```

Expected:

```
node-exporter   NodePort   9100:30910/TCP
```

#### STEP 6 — Test metrics locally (sanity check)

```
curl http://localhost:30910/metrics | head
```

You should see metrics such as:

```
node_cpu_seconds_total
node_memory_MemAvailable_bytes
```

👉 If this fails → Prometheus will fail too.

### PHASE 3 — Configure EC2 Prometheus to Scrape KIND

#### STEP 7 — Edit the Prometheus config on EC2

SSH into the EC2 Prometheus server:

```
ssh -i keypair.pem ubuntu@<EC2_IP>
```

```
sudo nano /etc/prometheus/prometheus.yml
```

Add a scrape job:

```yaml
- job_name: "kind-node-exporter"
  static_configs:
    - targets:
        - "<YOUR_LAPTOP_PUBLIC_IP>:30910"
```

⚠️ Replace <YOUR_LAPTOP_PUBLIC_IP> (use `curl ifconfig.me` on the laptop).

#### STEP 8 — Reload Prometheus

```
sudo systemctl restart prometheus
```

Or, if using the reload endpoint:

```
curl -X POST http://localhost:9090/-/reload
```

#### STEP 9 — Verify the targets in the Prometheus UI

Status → Targets should show:

```
kind-node-exporter   UP
```

👉 This is the big success moment.

### PHASE 4 — PromQL Labs (KIND Nodes)

Now PromQL works unchanged.

#### LAB 1 — Is the KIND node visible?

```
up{job="kind-node-exporter"}
```

- 1 → node reachable
- 0 → cluster blind

#### LAB 2 — CPU usage of the KIND node

```
100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)
```

This is host CPU — it includes the kubelet, containers, and the OS.

#### LAB 3 — Memory usage

```
(
  1 - (
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
  )
) * 100
```

High memory → pod OOMKills. Kubernetes hides this unless you look.

#### LAB 4 — Disk usage (CRITICAL)

```
100 - (
  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
) * 100
```

Disk full → the kubelet stops and pods fail silently.

### PHASE 5 — Create Alerts

Alerts go to the same email as before.

Node exporter down (MANDATORY):

```
up{job="kind-node-exporter"} == 0
```

High CPU:

```
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
```

Disk almost full:

```
node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024
```

### PHASE 6 — Incident Simulation

Scenario: pods are restarting randomly.

#### Step 1 — Kubernetes view

```
kubectl get pods
```

#### Step 2 — Node metrics (Prometheus)

Check CPU, memory, and disk with the PromQL queries above.

👉 Node exporter revealed the real cause.

#### Step 3 — DevOps action

```
kubectl cordon <node>
kubectl drain <node>
```
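As an aside to the CPU lab: the idle-CPU query only clicks once you see what `rate()` does to a cumulative counter. Here is a minimal Python sketch of that math using made-up counter samples (all values below are hypothetical, not from a real node_exporter):

```python
# Sketch: why PromQL needs rate() on cumulative counters.
# node_cpu_seconds_total is a counter: total seconds a CPU has spent in a
# mode since boot, so it only ever grows. These synthetic samples are taken
# every 15 s for one CPU in mode="idle".
samples = [
    (0, 1000.0),    # (timestamp_seconds, idle_seconds_total)
    (15, 1012.0),   # +12 idle s in 15 s -> mostly idle
    (30, 1024.0),
    (45, 1027.0),   # +3 idle s in 15 s -> load spike (mostly busy)
    (60, 1030.0),
]

def idle_rate(samples):
    """Per-second increase over the whole window, like rate(...[1m])."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# rate() yields idle seconds per wall-clock second, i.e. the idle
# fraction (0..1). Subtracting from 100% gives CPU usage, mirroring:
#   100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
rate = idle_rate(samples)
cpu_usage_pct = 100 - rate * 100

print(rate, cpu_usage_pct)  # -> 0.5 50.0
```

Over the 60 s window the counter rose by 30 idle-seconds, so the idle fraction is 0.5 and usage is 50% — the raw counter value (1030) on its own tells you nothing about current load, which is why the labs insist on `rate()` rather than raw counters.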
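The memory lab's formula is simple enough to check by hand against what node_exporter actually serves. This sketch applies `(1 - MemAvailable/MemTotal) * 100` to a hypothetical `/metrics` payload in Prometheus exposition format (the byte values are invented; a real payload comes from `http://<host>:9100/metrics`):

```python
# Sketch: the memory-usage PromQL applied to a hypothetical /metrics
# payload. The numbers are made up (8 GiB total, 2 GiB available).
metrics_text = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.589934592e+09
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.147483648e+09
"""

def scrape_value(text, metric):
    """Return the first sample value for an unlabeled metric, skipping # comments."""
    for line in text.splitlines():
        if line.startswith(metric + " "):
            return float(line.split()[1])
    raise KeyError(metric)

total = scrape_value(metrics_text, "node_memory_MemTotal_bytes")
available = scrape_value(metrics_text, "node_memory_MemAvailable_bytes")

# Mirrors: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
usage_pct = (1 - available / total) * 100
print(usage_pct)  # -> 75.0
```

With 2 GiB of 8 GiB still available, usage is 75% — already in the "danger" band the lab defines (> 80% is close), which is exactly the kind of number a MemAvailable-based alert catches before OOM kills start.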
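The "Avoid noisy alerts" advice (use FOR 1m / FOR 5m) can also be illustrated in code. This is a simplified model, not Grafana's actual implementation: it assumes an alert only moves from Pending to Firing once the threshold breach has persisted for the configured "for" duration, and the tick times and CPU values are hypothetical:

```python
# Sketch: how a "for" duration turns threshold breaches into state
# transitions (Normal -> Pending -> Firing). Simplified model of the
# behavior described in the alerting steps; timings are hypothetical.
PENDING_FOR = 60  # seconds, i.e. the rule's "for: 1m"
THRESHOLD = 80    # CPU %, i.e. "IS ABOVE 80"

def alert_states(ticks, threshold=THRESHOLD, pending_for=PENDING_FOR):
    """ticks: list of (t_seconds, cpu_pct) at each evaluation."""
    states, breach_start = [], None
    for t, cpu in ticks:
        if cpu <= threshold:
            breach_start = None          # breach over -> back to Normal
            states.append("Normal")
        else:
            if breach_start is None:
                breach_start = t         # breach just began
            # Fire only once the breach has persisted for pending_for.
            fired = (t - breach_start) >= pending_for
            states.append("Firing" if fired else "Pending")
    return states

ticks = [(0, 35), (30, 92), (60, 95), (90, 97), (120, 40)]
print(alert_states(ticks))
# -> ['Normal', 'Pending', 'Pending', 'Firing', 'Normal']
```

Note the payoff: a spike that breaches the threshold at only a single evaluation never reaches Firing, so no email is sent — which is precisely why "FOR 1m / FOR 5m" is the standard defense against alert spam.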