Tools: DevOps Monitoring & Alerting — Real-World Lab (Prometheus + Grafana)
2026-02-04
admin
## 1) Why DevOps sets up email notifications

Dashboards are passive. Alerts + email are active. You need email notifications when:

- You are on-call and must know about incidents immediately
- The system is unattended (night/weekend)
- You need evidence for SLAs and incident reports

The goals:

- Detect problems before users complain
- Reduce MTTR (mean time to recovery)
- Avoid "silent failure" (monitoring is broken but nobody knows)

## 2) What must be true before email notifications can work

Email notification depends on 4 layers:

1. Exporter / metrics exist (node_exporter up)
2. Prometheus scrapes (Targets show UP)
3. Grafana alert rule fires (Normal → Pending → Firing)
4. Notification delivery (SMTP works + contact point + policy routes alerts)

In real life, most failures happen at layer 4.

## 3) Step-by-step: Configure SMTP on the Grafana server (DevOps setup)

This is done on the machine running Grafana (your "monitor" instance).

### Step 3.1 — SSH to the Grafana server

```bash
ssh -i ~/Downloads/keypaircalifornia.pem ubuntu@<GRAFANA_PUBLIC_IP>
```

### Step 3.2 — Edit Grafana config

```bash
sudo nano /etc/grafana/grafana.ini
```

### Step 3.3 — Add/enable the SMTP section

For Gmail SMTP (lab-friendly):

```ini
[smtp]
enabled = true
host = smtp.gmail.com:587
user = [email protected]
password = YOUR_GMAIL_APP_PASSWORD
from_address = [email protected]
from_name = Grafana Alerts
skip_verify = true
startTLS_policy = OpportunisticStartTLS
```

### DevOps notes (what matters)

- `host`: SMTP server + port
- `user`: mailbox used to send alerts (the sender)
- `password`: App Password, not your normal Gmail password
- `from_address`: must match the sender for best deliverability
- `startTLS_policy`: enables encryption for SMTP

### Step 3.4 — Restart Grafana to load changes

```bash
sudo systemctl restart grafana-server
sudo systemctl status grafana-server
```

If Grafana fails to start, your config has a syntax problem.

### Step 3.5 — Watch Grafana logs while testing (DevOps habit)

```bash
sudo journalctl -u grafana-server -f
```

You keep this open when testing notifications.

## 4) Step-by-step: Gmail App Password (most common failure)

A typical error:

```
535 5.7.8 Username and Password not accepted (BadCredentials)
```

That means you used a normal password or Gmail blocked the sign-in.

### Step 4.1 — Enable 2-Step Verification (required)

Google Account → Security → 2-Step Verification → ON

### Step 4.2 — Create an App Password

Google Account → Security → App passwords → create one for "Mail"

### Step 4.3 — Put that App Password in grafana.ini

Copy the 16-character app password. Paste it without spaces. Restart Grafana again.

### DevOps tip

Common SMTP errors and what they mean:

- `535 BadCredentials` → wrong password / app password missing
- `534-5.7.9 Application-specific password required` → needs an app password
- connection timeout → network egress blocked / wrong SMTP host or port

## 5) Step-by-step: Configure the Grafana UI (contact point + policy)

SMTP is server-side. The UI decides WHO gets notified.

### Step 5.1 — Create a Contact Point

Grafana → Alerting → Contact points → Create contact point

- Type: Email
- Addresses: your receiver email (example: [email protected])

### Step 5.2 — Test the Contact Point (mandatory)

Expected:

- UI: "Test notification sent"
- Inbox: "Grafana test notification"
- Logs: show the email send attempt

If it fails:

- Look at the UI error + the logs
- Fix SMTP first

### Step 5.3 — Configure the Notification Policy (routing)

Grafana → Alerting → Notification policies. Ensure there is a policy that routes alerts to your contact point. Options:

- Put your email contact point in the Default policy, or
- Create a policy that matches labels like:
  - `severity = critical`
  - `team = devops`

### DevOps rule

No policy route → no notification, even if the contact point exists.

## 6) Step-by-step: Create a "real" alert and trigger it

### Step 6.1 — Create an alert rule (example: High CPU)

Use the Prometheus datasource and this query:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Condition: IS ABOVE 80. Labels (important for routing):

- `severity = warning` or `critical`
- `team = devops`

### Step 6.2 — Trigger CPU load on the target machine

```bash
sudo apt update
sudo apt install -y stress
stress --cpu 2 --timeout 180
```

### Step 6.3 — Watch the alert state

Grafana → Alerting → Active alerts:

- Normal → Pending → Firing

### Step 6.4 — Confirm the email arrives

- FIRING email
- RESOLVED email after the load ends

## 7) How DevOps reads an alert email (what matters)

When an alert email comes, DevOps must answer:

### A) What is the problem?

This tells you the urgency and the type of incident. In your lab, the most important ones are:

- "Node down"
- "Disk almost full"

### B) Which system/server?

- `instance` label (IP:port), e.g. `instance="172.31.x.x:9100"`
- `job` label (node/prometheus)
- `environment` label (prod/dev), if you use it

### C) How bad is it?

- Severity label: warning vs critical
- Actual value (CPU 92%, disk 95%)
- "For 1m" or "For 5m" indicates persistence

### D) Is it new or recurring?

- Similar previous emails

### E) What action should I take first?

DevOps initial actions should be fast. For high CPU:

- SSH to the server
- Check top processes:

```bash
top
ps aux --sort=-%cpu | head
```

- Identify the cause: deployment? runaway job? attack?
- Mitigation: restart the service, scale out, stop the job

For a node-down alert:

- Check if the host is reachable (ping/ssh)
- AWS instance status checks
- Security group changes?
- node_exporter service status

For a disk alert, find the biggest usage:

```bash
df -h
sudo du -xh / | sort -h | tail
```

- Clean logs / expand the disk / rotate logs
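Before blaming Grafana, it can help to test the SMTP credentials directly. This is a minimal Python sketch (not part of the lab; the address and password below are placeholders) that normalizes a Gmail app password the way grafana.ini expects it and, optionally, attempts the same STARTTLS login on smtp.gmail.com:587:

```python
import smtplib


def normalize_app_password(raw: str) -> str:
    """Gmail displays app passwords as 'abcd efgh ijkl mnop';
    grafana.ini needs the 16 characters without spaces."""
    pw = raw.replace(" ", "")
    if len(pw) != 16:
        raise ValueError("Gmail app passwords are 16 characters")
    return pw


def smtp_login_check(user: str, app_password: str) -> None:
    # Same parameters the grafana.ini example uses: smtp.gmail.com:587 + STARTTLS.
    # A bad credential raises smtplib.SMTPAuthenticationError (the 535 error above).
    with smtplib.SMTP("smtp.gmail.com", 587, timeout=10) as conn:
        conn.starttls()
        conn.login(user, app_password)


if __name__ == "__main__":
    pw = normalize_app_password("abcd efgh ijkl mnop")  # placeholder, not a real password
    # smtp_login_check("you@gmail.com", pw)  # uncomment with real values; needs outbound 587
```

If this script fails with a 535 error, Grafana will fail the same way, so fix the app password first.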
## 8) What DevOps must pay attention to (best practices)

### 1) Always alert on monitoring failures

Because if node_exporter dies, you become blind:

```promql
up{job="node"} == 0
```

### 2) Avoid noisy alerts

- Use `for: 1m` or `for: 5m`
- Use avg / rate windows

Otherwise you get spam and start ignoring alerts.

### 3) Include context in alerts

Use labels/annotations:

- summary: "CPU above 80% on {{ $labels.instance }}"
- description: "Check top, deployments, scaling"

### 4) Test notifications regularly

DevOps must test after:

- SMTP changes
- Grafana upgrades
- firewall changes
- password rotations

### 5) Separate "Warning" vs "Critical"

- warning: CPU > 80% for 5m
- critical: CPU > 95% for 2m

## 9) Mini checklist

- ✅ SMTP configured in /etc/grafana/grafana.ini
- ✅ Gmail App Password (not the normal password)
- ✅ Grafana restarted
- ✅ Contact point created + test succeeded
- ✅ Notification policy routes alerts to the contact point
- ✅ Alert rule has the correct query + labels
- ✅ Trigger event causes Firing + email received

## 🧪 PromQL LAB: Why Node Exporter Is Mandatory for DevOps

### 🔁 Architecture Reminder (Before Lab)

```
[ Linux Server ]
   └── node_exporter (system metrics)
            ↓
   Prometheus (scrapes metrics)
            ↓
   Grafana (query + alert + notify)
```

### LAB PART 1 — What Prometheus Knows WITHOUT Node Exporter

#### Step 1 — Open the Prometheus UI

```
http://<PROMETHEUS_IP>:9090
```

#### Step 2 — Run this query

```promql
up
```

Expected result:

```
up{job="prometheus"} = 1
```

DevOps explanation:

- Prometheus knows itself
- It knows nothing about CPU, memory, disk
- `up` only means "can I scrape this endpoint?"

#### Step 3 — Try this query (WITHOUT node_exporter)

```promql
node_cpu_seconds_total
```

Expected result: no data.

👉 Important DevOps truth: Prometheus by itself only knows if targets are reachable, not how the system behaves.

- Prometheus does not collect OS metrics
- Prometheus is not an agent
- It only pulls what is exposed

👉 DevOps conclusion: Prometheus is a collector, not a sensor.

### LAB PART 2 — What Node Exporter Adds

Now node_exporter is installed and running on the target machine.

#### Step 4 — Confirm node exporter is scraped

```promql
up{job="node"}
```

Expected result:

```
up{instance="172.31.x.x:9100", job="node"} = 1
```

DevOps meaning:

- Prometheus can reach node_exporter
- Metrics are available
- Monitoring is alive

### LAB PART 3 — CPU Metrics (Most Common Incident)

#### Step 5 — Raw CPU metric

```promql
node_cpu_seconds_total
```

What students see:

- Multiple time series
- Labels: `cpu="0"`, `mode="idle" | user | system | iowait`

DevOps explanation:

- Linux CPU time is cumulative
- The metric grows forever
- We must use `rate()` to make sense of it

#### Step 6 — CPU usage percentage (REAL DEVOPS QUERY)

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

What this shows: CPU usage %.

DevOps interpretation:

- 0–30% → normal
- 50–70% → watch
- \> 80% → alert
- \> 95% → incident

High CPU causes: slow apps, timeouts, failed deployments.

### LAB PART 4 — Memory Metrics (Silent Killers)

#### Step 7 — Total memory

```promql
node_memory_MemTotal_bytes
```

Interpretation:

- Physical RAM installed
- Does NOT change

#### Step 8 — Available memory

```promql
node_memory_MemAvailable_bytes
```

DevOps meaning:

- How much memory apps can still use
- Much better than "free memory"

#### Step 9 — Memory usage percentage

```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

DevOps interpretation:

- Memory > 80% → danger
- Memory leaks show a slow increase
- OOM kills happen suddenly

👉 Why DevOps needs this: memory issues crash apps without warning if not monitored.

### LAB PART 5 — Disk Metrics (Most Dangerous)

#### Step 10 — Disk usage %

```promql
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

DevOps interpretation:

- Disk full = app crashes
- Databases stop
- Logs can't write
- The OS can become unstable

### LAB PART 6 — Network Metrics (Hidden Bottlenecks)

#### Step 11 — Network receive rate

```promql
rate(node_network_receive_bytes_total[5m])
```

#### Step 12 — Network transmit rate

```promql
rate(node_network_transmit_bytes_total[5m])
```

DevOps interpretation:

- Sudden spikes → traffic surge or attack
- Drops → network issues
- Used in: DDoS detection, load testing validation

### LAB PART 7 — Proving Why Node Exporter Is REQUIRED

Question to students: "Why can't Prometheus do this alone?"

Answer. Prometheus alone:

- ❌ Does not know CPU
- ❌ Does not know memory
- ❌ Does not know disk
- ❌ Does not know network
- ❌ Does not run on every server

node_exporter:

- ✅ Reads /proc, /sys
- ✅ Exposes OS internals safely
- ✅ Lightweight
- ✅ Industry standard

👉 DevOps conclusion: Prometheus without exporters is blind.

### LAB PART 8 — Real Incident Simulation

#### Step 13 — Generate CPU load

```bash
stress --cpu 2 --timeout 120
```

#### Step 14 — Watch the PromQL graph change

```promql
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
```

DevOps observation:

- The alert transitions to Firing
- An email notification is sent

### WHAT DEVOPS MUST PAY ATTENTION TO

#### 1️⃣ Always monitor exporters themselves

```promql
up{job="node"} == 0
```

If the exporter dies, monitoring dies silently.

#### 2️⃣ Use time windows correctly

- `rate(...[1m])` → fast reaction
- `rate(...[5m])` → stable alerts

#### 3️⃣ Avoid raw counters

Query `rate(node_cpu_seconds_total[5m])`, not the raw `node_cpu_seconds_total`.

#### 4️⃣ Labels matter

- `instance` → which server
- `job` → which role
- `mountpoint` → which disk

"Prometheus collects metrics, node_exporter exposes system data, PromQL turns numbers into insight, alerts turn insight into action."
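The `100 - avg(rate(idle)) * 100` pattern can be re-derived with plain arithmetic. This is an illustrative sketch (the function name is ours, not part of the lab): since `node_cpu_seconds_total` is cumulative, `rate()` is just the counter delta divided by the window, and each CPU accrues at most one idle-second per second:

```python
def cpu_usage_percent(idle_start: float, idle_end: float,
                      window_s: float, n_cpus: int) -> float:
    """Mirror of: 100 - avg(rate(node_cpu_seconds_total{mode='idle'}[w])) * 100.

    rate() = (counter delta) / window. Averaging over CPUs gives the
    idle fraction; subtracting from 100% gives the busy percentage.
    """
    idle_rate_per_cpu = (idle_end - idle_start) / window_s / n_cpus
    return 100.0 - idle_rate_per_cpu * 100.0


# 2 CPUs, 300 s window, idle counters grew by 150 s in total:
# average idle fraction = 150 / 300 / 2 = 0.25 → 75% busy
print(cpu_usage_percent(1000.0, 1150.0, 300.0, 2))  # → 75.0
```

This is why raw counters are useless on their own: the counter value (here 1150) means nothing until you take its rate over a window.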
## 🧪 LAB: Monitor KIND Kubernetes using EC2 Prometheus (Central Monitoring)

### 🧱 Final Architecture (Explain First)

```
EC2 (Prometheus + Grafana)
        |
        | scrape metrics
        v
KIND cluster
 ├─ control-plane node (node-exporter pod)
 ├─ worker node (node-exporter pod)
```

👉 Prometheus stays on EC2
👉 KIND is just another "target"

### PHASE 0 — Prerequisites

On your laptop:

- KIND cluster running
- kubectl configured

```bash
kubectl get nodes
```

On EC2:

- Prometheus already running
- Prometheus UI accessible
- You know the EC2 private IP of Prometheus

### PHASE 1 — Deploy Node Exporter in KIND (DaemonSet)

Why a DaemonSet?

- node_exporter needs host access
- One per node
- Not per pod

"If something must run on every node → DaemonSet"

#### STEP 1 — Create the monitoring namespace

```bash
kubectl create namespace monitoring
```

#### STEP 2 — Node Exporter DaemonSet for KIND

Create file: node-exporter-kind.yaml

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          args:
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            - "--path.rootfs=/host/root"
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
```

Apply it:

```bash
kubectl apply -f node-exporter-kind.yaml
```

#### STEP 3 — Verify node exporter pods

```bash
kubectl get pods -n monitoring -o wide
```

Expected:

- One pod per KIND node
- Each pod on a different node

👉 DevOps rule: if a node has no exporter → you are blind on that node.
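The "one pod per node" check can be expressed as a tiny script. This is a hypothetical helper (the function and sample node names are ours); in practice you would feed it the NODE column of `kubectl get pods -n monitoring -o wide`:

```python
from collections import Counter


def check_daemonset_coverage(nodes: list[str], pod_nodes: list[str]) -> list[str]:
    """Return nodes that do NOT have exactly one exporter pod
    (either blind, or running duplicates)."""
    counts = Counter(pod_nodes)
    return [n for n in nodes if counts.get(n, 0) != 1]


nodes = ["kind-control-plane", "kind-worker"]      # from: kubectl get nodes
pod_nodes = ["kind-control-plane"]                 # NODE column of kubectl get pods -o wide
print(check_daemonset_coverage(nodes, pod_nodes))  # → ['kind-worker']  (blind node!)
```

An empty result means the DaemonSet landed everywhere; any name in the result is a node you are blind on.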
### PHASE 2 — Expose Node Exporter to EC2 Prometheus

Key Concept (VERY IMPORTANT): KIND runs locally, and EC2 Prometheus cannot directly reach 127.0.0.1. So we expose node exporter via a NodePort.

#### STEP 4 — Create a NodePort Service

Create file: node-exporter-svc.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
      nodePort: 30910
```

Apply it:

```bash
kubectl apply -f node-exporter-svc.yaml
```

#### STEP 5 — Verify the NodePort

```bash
kubectl get svc -n monitoring
```

Expected:

```
node-exporter   NodePort   9100:30910/TCP
```

#### STEP 6 — Test metrics locally (sanity check)

```bash
curl http://localhost:30910/metrics | head
```

You should see metrics like:

```
node_cpu_seconds_total
node_memory_MemAvailable_bytes
```

👉 If this fails → Prometheus will fail too.
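The `curl | head` eyeball check can be automated. This sketch (our own helper, fed an inline sample of the Prometheus text exposition format) extracts metric family names and confirms the two metrics the lab looks for are present:

```python
def metric_families(exposition: str) -> set[str]:
    """Collect metric family names from Prometheus text exposition,
    ignoring comment lines and label sets."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Lines look like: name{labels} value   or: name value
        names.add(line.split("{")[0].split(" ")[0])
    return names


# Sample shaped like the output of: curl http://localhost:30910/metrics | head
sample = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_memory_MemAvailable_bytes 8.1e+09
"""

fams = metric_families(sample)
print("node_cpu_seconds_total" in fams)          # → True
print("node_memory_MemAvailable_bytes" in fams)  # → True
```

If either name is missing from the real output, the exporter is not mounting the host filesystems correctly and Prometheus will scrape useless data.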
### PHASE 3 — Configure EC2 Prometheus to Scrape KIND

#### STEP 7 — Edit the Prometheus config on EC2

SSH into the EC2 Prometheus server:

```bash
ssh -i keypair.pem ubuntu@<EC2_IP>
sudo nano /etc/prometheus/prometheus.yml
```

Add a scrape job:

```yaml
- job_name: "kind-node-exporter"
  static_configs:
    - targets:
        - "<YOUR_LAPTOP_PUBLIC_IP>:30910"
```

⚠️ Replace `<YOUR_LAPTOP_PUBLIC_IP>` (use `curl ifconfig.me` on the laptop).

#### STEP 8 — Reload Prometheus

```bash
sudo systemctl restart prometheus
```

Or, if using the reload endpoint:

```bash
curl -X POST http://localhost:9090/-/reload
```

#### STEP 9 — Verify targets in the Prometheus UI

Status → Targets should show:

```
kind-node-exporter   UP
```

👉 This is the big success moment.
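A common lab mistake is leaving the `<YOUR_LAPTOP_PUBLIC_IP>` placeholder in the scrape config. A small sketch (the template string and helper are ours, not a Prometheus tool) that renders the job with a real IP and refuses an unfilled placeholder:

```python
SCRAPE_JOB_TEMPLATE = """\
- job_name: "kind-node-exporter"
  static_configs:
    - targets:
        - "{ip}:30910"
"""


def render_scrape_job(laptop_ip: str) -> str:
    """Render the scrape job, rejecting an unreplaced placeholder."""
    if not laptop_ip or laptop_ip.startswith("<"):
        raise ValueError("replace the placeholder with the output of: curl ifconfig.me")
    return SCRAPE_JOB_TEMPLATE.format(ip=laptop_ip)


# 203.0.113.7 is a documentation-range IP standing in for your laptop's public IP.
print(render_scrape_job("203.0.113.7"))
```

The rendered block is what you paste under `scrape_configs:` in /etc/prometheus/prometheus.yml before reloading.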
### PHASE 4 — PromQL Labs (KIND Nodes)

Now PromQL works unchanged.

#### LAB 1 — Is the KIND node visible?

```promql
up{job="kind-node-exporter"}
```

- 1 → node reachable
- 0 → cluster blind

#### LAB 2 — CPU usage of a KIND node

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

- This is host CPU
- It includes kubelet, containers, and the OS

#### LAB 3 — Memory usage

```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

- High memory → pod OOMKills
- Kubernetes hides this unless you look

#### LAB 4 — Disk usage (CRITICAL)

```promql
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

- Disk full → kubelet stops
- Pods fail silently

### PHASE 5 — Create Alerts

Alerts go to the same email.

Node exporter down (MANDATORY — 👉 this alert is mandatory in production):

```promql
up{job="kind-node-exporter"} == 0
```

High CPU:

```promql
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
```

Disk almost full:

```promql
node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024
```

### PHASE 6 — Incident Simulation

Scenario: pods restarting randomly.

#### Step 1 — Kubernetes view

```bash
kubectl get pods
```

#### Step 2 — Node metrics (Prometheus)

Check CPU, memory, and disk on the node with the PromQL labs from PHASE 4.

👉 Node exporter revealed the real cause.

#### Step 3 — DevOps action

```bash
kubectl cordon <node>
kubectl drain <node>
```
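The disk alert above mixes two views: a percentage (PromQL LAB 4) and an absolute floor of 10 GiB. This sketch (our own helpers, mirroring the two PromQL expressions) shows how the same disk can look fine in one view and critical in the other:

```python
GIB = 1024 ** 3  # the 1024 * 1024 * 1024 factor from the alert expression


def disk_usage_percent(avail_bytes: int, size_bytes: int) -> float:
    """Mirror of: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100."""
    return 100.0 - (avail_bytes / size_bytes) * 100.0


def disk_alert_firing(avail_bytes: int, threshold_gib: int = 10) -> bool:
    """Mirror of: node_filesystem_avail_bytes < 10 * 1024 * 1024 * 1024."""
    return avail_bytes < threshold_gib * GIB


# A 100 GiB root disk with 8 GiB left: 92% used, and below the 10 GiB floor.
print(disk_usage_percent(8 * GIB, 100 * GIB))  # → 92.0
print(disk_alert_firing(8 * GIB))              # → True
```

On very large disks, a percentage threshold alone can fire too late (5% free of 4 TiB is still 200 GiB), which is why an absolute floor like 10 GiB is a useful companion alert.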
Tags: tools, utilities, security tools, devops, monitoring, alerting, prometheus, grafana