# Tools: Building a Production-Grade Observability Platform with the LGTM Stack, DORA Metrics & SLOs (2026)

## Introduction

Modern software teams don't just need to know when something is down — they need to understand why it broke, how long users were affected, how fast they recovered, and whether their engineering practices are improving over time. This is the gap between basic monitoring and true observability.

For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability and reliability platform from scratch using the LGTM stack — Loki, Grafana, Tempo, and Prometheus — alongside DORA metrics, SLI/SLO/Error Budget frameworks, and a fully automated alerting pipeline routing to Slack.

Everything runs as native systemd services — no Docker, no containers. Each component installs as a binary, managed by systemd the same way any production Linux service is managed. One command brings the entire stack up on any Ubuntu server: `sudo bash install.sh`.

GitHub Repository: https://github.com/AirFluke/meetmind-observability

## Why LGTM Over Managed Alternatives?

The observability market offers managed alternatives — Datadog, New Relic, Grafana Cloud. So why self-host the LGTM stack?

- **Cost at scale.** Managed platforms charge per host, per metric, per log line. At scale this becomes a significant infrastructure cost. The LGTM stack runs on a single server with no per-metric pricing.
- **Data sovereignty.** Logs contain sensitive data — request bodies, authentication tokens, PII. Shipping these to a third-party SaaS introduces compliance risk. Self-hosted Loki keeps logs within your own infrastructure.
- **No vendor lock-in.** The Prometheus exposition format and OpenTelemetry are open standards. Every instrumented service, every dashboard, every alert rule is portable. Switching providers means changing an endpoint URL, not rewriting your entire observability layer.
- **Full control over retention.** We configured 30-day retention for both metrics and logs at no additional cost.
- **Learning depth.** Operating the stack yourself forces genuine understanding of how metrics collection, log aggregation, and distributed tracing work — knowledge that transfers regardless of which tools your next employer uses.

## Architecture Overview

How data flows through the platform:

- Node Exporter and Blackbox Exporter expose metrics → Prometheus scrapes every 15 seconds
- GitHub Actions pushes deployment metrics → Pushgateway → Prometheus
- Applications send traces via OpenTelemetry → OTel Collector → Tempo
- Applications send logs via OpenTelemetry → OTel Collector → Loki
- Grafana queries all three — Prometheus, Loki, Tempo — enabling correlated drill-down from a single dashboard
- Prometheus evaluates alert rules → fires to Alertmanager → routes to Slack
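Once everything is up, the whole pipeline can be smoke-tested from the shell. A minimal sketch, assuming each component's default port (the configs later in this post confirm 9090, 3100, 3200, and 3000; the rest are upstream defaults, so adjust if your install overrides them):

```bash
#!/usr/bin/env bash
# Smoke test: poll each component's health endpoint and report OK/FAIL.
# The OTel Collector is omitted because its health endpoint depends on
# whether the health_check extension is enabled in its config.
set -u
declare -A endpoints=(
  [prometheus]="http://localhost:9090/-/healthy"
  [alertmanager]="http://localhost:9093/-/healthy"
  [pushgateway]="http://localhost:9091/-/healthy"
  [loki]="http://localhost:3100/ready"
  [tempo]="http://localhost:3200/ready"
  [grafana]="http://localhost:3000/api/health"
  [node_exporter]="http://localhost:9100/metrics"
  [blackbox_exporter]="http://localhost:9115/metrics"
)
for svc in "${!endpoints[@]}"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "${endpoints[$svc]}")
  printf '%-20s %s\n' "$svc" "$([ "$code" = 200 ] && echo OK || echo "FAIL ($code)")"
done
```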

## Part 1: Deploying the Full LGTM Stack as Systemd Services

The platform runs as nine native systemd services on Ubuntu 24.04, all with automatic restart policies.

📸 [Screenshot: All 9 services showing running status]

### Why systemd over Docker?

Running services as native systemd units means:

- No container runtime dependency
- Services start on boot automatically
- Logs go directly to journald — `journalctl -u prometheus -f`
- Standard Linux process management — `systemctl start/stop/restart/status`
- No networking complexity — all services talk via localhost

### One-command deployment

```bash
git clone https://github.com/AirFluke/meetmind-observability.git
cd meetmind-observability
sudo SLACK_WEBHOOK=https://hooks.slack.com/services/YOUR/WEBHOOK bash install.sh
```

The install script handles everything automatically:

- Installs system dependencies
- Downloads all binaries from GitHub releases
- Creates dedicated system users for each service
- Copies configs to /etc/
- Creates data directories in /var/lib/
- Installs systemd unit files to /etc/systemd/system/
- Enables and starts all services
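For a sense of what those steps look like in practice, here is a hypothetical sketch of the pattern for a single component. The version number is an arbitrary example and the real logic lives in the repository's install.sh; this only illustrates the shape of it, run from the repo root:

```bash
#!/usr/bin/env bash
# Hypothetical single-component installer mirroring the layout above.
set -euo pipefail
VERSION="2.53.0"   # assumption: any current Prometheus release
URL="https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"

# Dedicated system user with no login shell
id prometheus &>/dev/null || \
  sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus

# Fetch the release tarball and place the binary in /usr/local/bin
curl -fsSL "$URL" | tar -xz -C /tmp
sudo install -m 0755 "/tmp/prometheus-${VERSION}.linux-amd64/prometheus" /usr/local/bin/

# Config in /etc, data in /var/lib, matching the file layout shown below
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

# Install the unit file, then enable and start the service
sudo cp systemd/prometheus.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
```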

### Systemd unit files — the core of the deployment

Each service has a unit file that defines how it runs. Here is Prometheus as an example:

```ini
# systemd/prometheus.service
[Unit]
Description=Prometheus Metrics Server
Documentation=https://prometheus.io/docs
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver \
  --web.listen-address=0.0.0.0:9090
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` with `RestartSec=5s` is the systemd equivalent of Docker's `restart: unless-stopped`. Every service has this.
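Unit files are just text, so they can be linted before they ever run. `systemd-analyze verify` parses a unit and reports unknown directives or bad values without starting anything:

```bash
# Catch unit-file typos before they bite
systemd-analyze verify /etc/systemd/system/prometheus.service

# After editing any unit, reload systemd's view of it and restart the service
sudo systemctl daemon-reload
sudo systemctl restart prometheus
```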

### Checking service status

```bash
# Check all 9 at once
sudo bash scripts/status.sh

# Check individual service
sudo systemctl status prometheus

# Follow logs in real time
journalctl -u prometheus -f
journalctl -u grafana-server -f
journalctl -u loki -f
```
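The repository's scripts/status.sh does the real work; a minimal sketch of what such a script can boil down to, with the service list inferred from the unit files named in this post:

```bash
#!/usr/bin/env bash
# Print the systemd state (active/inactive/failed) of every stack service.
services=(prometheus alertmanager pushgateway loki tempo grafana-server
          node_exporter blackbox_exporter otelcol)
for svc in "${services[@]}"; do
  state=$(systemctl is-active "$svc")
  printf '%-20s %s\n' "$svc" "$state"
done
```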

### File layout on the server

```
/usr/local/bin/          ← all binaries
  prometheus, loki, tempo, alertmanager,
  node_exporter, blackbox_exporter, pushgateway, otelcol

/etc/                    ← all configs
  prometheus/prometheus.yml
  alertmanager/alertmanager.yml
  alertmanager/slack.tmpl
  loki/loki-config.yaml
  tempo/tempo.yaml
  otelcol/otel-collector.yaml
  blackbox_exporter/config.yml

/var/lib/                ← all data (30d retention)
  prometheus/  loki/  tempo/

/etc/systemd/system/     ← unit files
  prometheus.service  loki.service  tempo.service  ... (9 total)
```

### Retention periods

- Prometheus metrics: 30 days (`--storage.tsdb.retention.time=30d`)
- Loki logs: 30 days (`retention_period: 30d`)
- Tempo traces: 30 days (`block_retention: 720h`)
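A quick way to see what that retention actually costs on disk, using Prometheus's TSDB status endpoint plus plain du (the jq path assumes the standard status payload):

```bash
# Head-block statistics (series count, chunk count, time range)
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'

# On-disk footprint of each store
sudo du -sh /var/lib/prometheus /var/lib/loki /var/lib/tempo
```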

### Infrastructure as Code — non-negotiable

Every configuration file is version-controlled in the repository:

```
meetmind-observability/
├── install.sh              ← one-command deploy
├── uninstall.sh            ← clean teardown
├── scripts/status.sh       ← check all services
├── systemd/                ← 9 unit files
├── config/                 ← all service configs
├── alerts/                 ← alert rules (.yml)
├── grafana/dashboards/     ← 5 JSON dashboards
├── grafana/provisioning/   ← datasource config
└── runbooks/               ← one .md per alert
```

Nothing requires manual configuration to reproduce. Clone the repo, run the install script, and the entire platform is up.

📸 [Screenshot: Prometheus targets page showing all scrapers green]

## Part 2: The Four Golden Signals as SLIs

Before writing a single PromQL query or building any dashboard, we defined what reliability means for MeetMind using Google's Four Golden Signals framework.

### Why Four Golden Signals beat CPU/RAM monitoring

Traditional monitoring asks "is the server healthy?" The Four Golden Signals ask "is the user experiencing a healthy service?" A server can sit at 10% CPU and still serve every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That is the difference.

### Signal 1 — Latency

How long does it take to serve a request? We distinguish successful from error latency — a fast error is not a success.

```promql
# p95 latency for successful requests
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)

# p95 latency for error requests
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le, job)
)
```

### Signal 2 — Traffic

How much demand is the system handling?

```promql
# Requests per second
sum(rate(http_requests_total[1m])) by (job)
```

### Signal 3 — Errors

The rate of failed requests — explicit 5xx errors, implicit failures like wrong content, and policy failures such as timeouts.

```promql
# Error rate as a ratio (0 = perfect, 1 = everything failing)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
```

### Signal 4 — Saturation

How full is the service? We track CPU, memory, and disk.

```promql
# Memory saturation
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU saturation
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Disk saturation
1 - (
  node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
  /
  node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)
```

These four PromQL expressions become our SLIs — the measurements we track before defining any targets.
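Any of these SLIs can be sanity-checked from the shell through Prometheus's HTTP query API before it ever reaches a dashboard. A sketch using the error-ratio SLI (the jq path assumes at least one series comes back):

```bash
# Evaluate the error-ratio SLI directly against the query API
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)' \
  | jq -r '.data.result[] | "\(.metric.job): \(.value[1])"'
```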

## Part 3: SLOs and Error Budgets

### The philosophy

An SLI is a measurement. An SLO is a target for that measurement. An Error Budget is the allowable gap between perfect and the SLO target.

This framework changes how engineering teams make decisions. Instead of arguing about whether a deployment is safe enough, the question becomes: "Do we have enough error budget to absorb the risk of this deployment?" It converts a subjective conversation into an objective one.

### Our SLO targets

**Why 99.5% availability?** This gives us 216 minutes of error budget per month — enough for one planned maintenance window without exhausting the budget. A stricter 99.9% would leave only 43 minutes, making any deployment risky.

**Why 99% for the error-rate SLO?** One percent failure tolerance allows for transient errors during rolling deployments. Stricter targets require canary deployment infrastructure before they are meaningful.

**Why 500ms p95 latency?** This is a common threshold for interactive APIs; beyond it, user experience degrades measurably. We chose p95 rather than p99 because optimising for the 99th percentile often requires disproportionate infrastructure investment.
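Those budget numbers fall straight out of the SLO arithmetic, error budget = (1 − SLO) × window:

```bash
# Error budget in minutes for a 30-day window: (1 - SLO) * 30 * 24 * 60
for slo in 0.995 0.999; do
  awk -v slo="$slo" 'BEGIN { printf "SLO %.1f%% -> %.1f min/month\n", slo*100, (1-slo)*30*24*60 }'
done
# SLO 99.5% -> 216.0 min/month
# SLO 99.9% -> 43.2 min/month
```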

### Recording rules — pre-computing SLIs

```yaml
# alerts/slo-burnrate.yml
groups:
  - name: slo.recording_rules
    interval: 30s
    rules:
      - record: slo:availability:ratio_rate1h
        expr: avg_over_time(probe_success[1h])
      - record: slo:availability:ratio_rate6h
        expr: avg_over_time(probe_success[6h])
      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(probe_success[30d])

      # Burn rate = how fast we consume the error budget
      # Error budget = 1 - 0.995 = 0.005
      - record: slo:availability:burn_rate1h
        expr: (1 - slo:availability:ratio_rate1h) / 0.005
      - record: slo:availability:burn_rate6h
        expr: (1 - slo:availability:ratio_rate6h) / 0.005
```
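Recording rules are code, so they can be unit-tested. A sketch using promtool's rule-testing harness, run from the repo root so the rule_files path resolves; the instance label is illustrative:

```bash
# Unit-test the burn-rate rule. With probe_success pinned at 0, the 1h
# burn rate must come out to (1 - 0) / 0.005 = 200.
cat > slo_test.yml <<'EOF'
rule_files:
  - alerts/slo-burnrate.yml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      - series: 'probe_success{instance="meetmind"}'
        values: '0x240'            # two hours of failed probes
    promql_expr_test:
      - expr: slo:availability:burn_rate1h
        eval_time: 1h
        exp_samples:
          - labels: 'slo:availability:burn_rate1h{instance="meetmind"}'
            value: 200
EOF
promtool test rules slo_test.yml
```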

### Error Budget Policy

- Budget > 50% → Deploy freely, feature work continues
- Budget 25–50% → Investigate incidents, no major changes
- Budget < 25% → Reliability sprint, senior review on all deploys
- Budget 0% → Feature freeze until budget recovers

Who owns the freeze decision: Engineering lead. Review cadence: first Monday of each month.

📸 [Screenshot: SLO & Error Budget dashboard]
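Acting on those tiers needs a live "budget remaining" number. One way to derive it from the recording rules above (this expression is our illustration, not necessarily the dashboard's exact query):

```bash
# Fraction of the 30-day error budget remaining: 1 = untouched, 0 = spent.
# Assumes the 30d recording rule is returning a series.
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=1 - ((1 - slo:availability:ratio_rate30d) / 0.005)' \
  | jq -r '.data.result[0].value[1]'
```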

## Part 4: DORA Metrics and CI/CD Observability

### Why DORA metrics connect to business outcomes

The four DORA metrics (deployment frequency, lead time for changes, change failure rate, and mean time to recovery) measure how fast a team ships and how quickly it recovers when a change goes wrong. They connect daily engineering decisions to business outcomes: a team that deploys frequently, with short lead times and fast recovery, delivers value faster and at lower risk.

### DORA benchmarks

Our dashboard classifies each metric against the benchmark tiers published in DORA's State of DevOps research (Elite, High, Medium, Low), so a number like MTTR is read in context rather than in isolation.

### GitHub Actions pushing DORA metrics to Pushgateway

```yaml
# .github/workflows/deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Record deploy start time
        id: timing
        run: echo "start_ts=$(date +%s)" >> $GITHUB_OUTPUT

      - name: Build and deploy
        run: echo "Your deploy steps here"

      - name: Push metrics on success
        if: success()
        run: |
          LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
          WORKFLOW="${{ github.workflow }}"
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="success",workflow="${WORKFLOW}"} 1
          deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
          EOF

      - name: Push metrics on failure
        if: failure()
        run: |
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="failure",workflow="${{ github.workflow }}"} 1
          EOF
```
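The same path can be exercised by hand: push a fake deployment record to the Pushgateway, then derive DORA numbers in PromQL. Port 9091 is the Pushgateway default, and these queries are illustrative rather than the dashboard's exact ones:

```bash
# Simulate one successful deployment record
echo 'deployment_total{status="success",workflow="manual-test"} 1' \
  | curl --data-binary @- http://localhost:9091/metrics/job/github_actions

# Deployment frequency: successful deploys over the last 7 days
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(increase(deployment_total{status="success"}[7d]))'

# Change failure rate: failed deploys as a share of all deploys (30d)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(increase(deployment_total{status="failure"}[30d])) / sum(increase(deployment_total[30d]))'
```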

### Toil identified and automated

**Toil 1 — Manual alert acknowledgement.** Engineers read a Slack alert, open a browser, navigate to Grafana, and search for the relevant dashboard. Automation: every alert payload includes a direct link to the exact dashboard. Saves 2–3 minutes per alert.

**Toil 2 — Certificate renewal reminders.** SSL expiry was tracked via calendar reminders. Automation: Blackbox Exporter monitors SSL expiry continuously, and the SSLCertExpiringSoon alert fires 14 days before expiry automatically.

📸 [Screenshot: DORA metrics dashboard]

## Part 5: Five Grafana Dashboards — All Provisioned as Code

All dashboards are provisioned from JSON files. The Grafana UI was never used to create or modify any panel. The key config that enables metric → log → trace drill-down:

```yaml
# grafana/provisioning/datasources/datasources.yaml
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    isDefault: true

  - name: Loki
    type: loki
    url: http://localhost:3100
    jsonData:
      derivedFields:
        # Makes traceID= in log lines a clickable link to Tempo
        - name: TraceID
          matcherRegex: 'traceID=(\w+)'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open in Tempo"

  - name: Tempo
    type: tempo
    url: http://localhost:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        customQuery: true
        query: '{service_name="${__span.tags.service.name}"} |= "${__trace.traceId}"'
```

### Dashboard 1 — Node Exporter

CPU utilisation total and per-core, memory used/cached/available, disk I/O, network I/O, and load averages at 1/5/15 minutes.

📸 [Screenshot: Node Exporter dashboard with live data]

### Dashboard 2 — Blackbox Exporter

External probing: uptime/downtime timeline, HTTP response time, SSL certificate expiry countdown, probe success rate.

📸 [Screenshot: Blackbox Exporter dashboard]

### Dashboard 3 — DORA Metrics

Deployment frequency trend, lead time, CFR rolling percentage, MTTR with DORA benchmark classification.

📸 [Screenshot: DORA dashboard]

### Dashboard 4 — SLO & Error Budget

SLI vs SLO gauges, error budget remaining coloured by urgency, burn rate time series with fast/slow thresholds, compliance history.

📸 [Screenshot: SLO dashboard]

### Dashboard 5 — Unified Observability (the most important)

A metric spike → click through to Loki → see logs from that exact time window → click the trace ID → Tempo opens the waterfall → identify exactly which service and span caused the failure.

This drill-down — metric spike → correlated logs → causing trace — is what separates observability from monitoring.

📸 [Screenshot: Unified dashboard]

📸 [Screenshot: Loki logs panel with clickable trace ID]

## Part 6: The Alerting System

### All alert rules are version-controlled

Zero alert rules live in Grafana. Every rule is in a .yml file under alerts/.

### Infrastructure alerts

```yaml
# alerts/infrastructure.yml
groups:
  - name: infrastructure.rules
    rules:
      - record: sli:node_cpu_saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: sli:node_memory_saturation
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      - alert: HighCPUWarning
        expr: sli:node_cpu_saturation > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} (threshold 80%)"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HighCPUCritical
        expr: sli:node_cpu_saturation > 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} for 10+ minutes"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HostDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "Blackbox probe failed for 2+ consecutive minutes"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/host-down.md"
```
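Because every rule lives in a file, linting fits naturally into the workflow. promtool catches YAML errors, bad PromQL, and unknown fields before Prometheus loads anything, and the lifecycle endpoint (enabled in the unit file above) reloads rules without a restart:

```bash
# Lint every rule file in the repo
promtool check rules alerts/*.yml

# Reload Prometheus in place (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```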

### Burn rate alerting — how it reduces alert fatigue

Traditional threshold alerting fires whenever a metric crosses a line, producing alert storms. Engineers learn to ignore them. Burn rate alerting asks a different question: "At this rate of failure, how long until our error budget is exhausted?" Two alerts replace an entire category of noise:

```yaml
# alerts/slo-burnrate.yml
  - name: slo.alerts
    rules:
      # Fast burn — act immediately
      # 14.4x = 2% of monthly budget gone in 1 hour
      - alert: SLOAvailabilityFastBurn
        expr: slo:availability:burn_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn — act immediately"
          description: >
            Burn rate is {{ $value | humanize }}x.
            2% of the 30-day budget will be consumed in 1 hour.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"

      # Slow burn — investigate before it escalates
      # 5x ≈ 4% of monthly budget gone in 6 hours
      - alert: SLOAvailabilitySlowBurn
        expr: slo:availability:burn_rate6h > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn — investigate soon"
          description: >
            Burn rate is {{ $value | humanize }}x over 6h.
            About 4% of the 30-day budget will be consumed in 6 hours.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"
```
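The same promtool harness from Part 3 can assert that the alert itself fires. A sketch reusing the synthetic probe_success series (file name and labels illustrative, run from the repo root):

```bash
cat > burn_alert_test.yml <<'EOF'
rule_files:
  - alerts/slo-burnrate.yml        # contains both recording and alert rules
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      - series: 'probe_success{instance="meetmind"}'
        values: '0x240'            # total outage -> burn rate 200x
    alert_rule_test:
      - eval_time: 1h              # well past the 2m "for:" clause
        alertname: SLOAvailabilityFastBurn
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: meetmind
EOF
promtool test rules burn_alert_test.yml
```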

### Alertmanager routing and inhibition

```yaml
# config/alertmanager.yml
route:
  receiver: slack-default
  group_by: [alertname, severity, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: slack-critical
      group_wait: 10s
      repeat_interval: 4h

inhibit_rules:
  # When a host is completely down, suppress CPU/memory/latency noise
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"
    equal: [instance]

  # Critical suppresses warning for the same alert on the same host
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
```
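Routing can be tested end to end without breaking anything: POST a synthetic alert to Alertmanager's v2 API and watch which Slack channel it lands in. The alert name and labels here are made up for the test:

```bash
# Fire a synthetic warning alert at Alertmanager
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "RoutingSmokeTest",
          "severity": "warning",
          "instance": "test-host"
        },
        "annotations": {
          "summary": "Synthetic alert, safe to ignore",
          "description": "Verifying Alertmanager to Slack routing"
        }
      }]'

# Confirm Alertmanager accepted it
amtool alert query --alertmanager.url=http://localhost:9093
```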

### Structured Slack template

Alertmanager uses Go templates. The `default` function from Sprig is not supported — a lesson we learned the hard way. Every field must be referenced directly:

```
# config/slack.tmpl
{{ define "slack.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.body" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Status:* {{ if eq $.Status "resolved" }}✅ RESOLVED{{ else }}🔥 FIRING{{ end }}
*Host:* {{ .Labels.instance }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Links:*
• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>
• <{{ .Annotations.runbook_url }}|📖 Runbook>
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ if eq $.Status "resolved" }}*Resolved:* {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
---
{{ end }}
{{- end }}
```

📸 [Screenshot: Slack showing firing alert with full structured payload]
📸 [Screenshot: Slack showing RESOLVED alert]

## Part 7: Runbooks and Incident Management

### A runbook for every alert

Every alert links directly to its runbook. Each runbook answers six questions: what is this alert, what likely caused it, what are the first three investigation steps, how do you resolve it, when do you roll back, and when do you escalate. Here is the high-CPU runbook:

```markdown
# Runbook: High CPU Usage

## What is this alert?
HighCPUWarning fires when CPU exceeds 80% for 5+ minutes.

## Likely causes
1. Traffic spike
2. Runaway process
3. Post-deployment regression

## First 3 investigation steps
1. Check running processes:
   top -bn1 | head -20
   ps aux --sort=-%cpu | head -10
2. Check if the spike correlates with a deployment:
   check GitHub Actions for recent workflow runs.
3. Check the system journal:
   journalctl -n 100 --since "10 minutes ago"

## Resolution
- Runaway process: kill -9 <PID>
- Traffic spike: scale horizontally
- Deployment regression: roll back

## Roll back when?
If the CPU spike started within 30 minutes of a deployment
and correlates with an increased error rate.

## Escalation
Senior engineer if unresolved after 20 minutes.
```

### Blameless Post-Incident Review

We documented a simulated incident where a missing environment variable caused 35% of requests to return 503 for 47 minutes.

- **Root cause:** a new environment variable was added to the code but not to the service configuration.
- **Detection gap:** a 6-minute lag between incident start and alert. Action item: reduce the fast-burn `for:` clause from 2m to 1m.

This review is blameless — we focus on systems and processes, not individuals.

## Part 8: Game Day Results

### Scenario 1 — Deployment Failure

We added `exit 1` to the GitHub Actions workflow and pushed. The workflow failed and pushed `deployment_total{status="failure"}` to the Pushgateway. CICDDeploymentFailed fired in Slack within 2 minutes, and the DORA dashboard showed the CFR increase. We immediately reverted.

📸 [Screenshot: GitHub Actions showing red failed run]
📸 [Screenshot: CICDDeploymentFailed in Slack]

### Scenario 2 — Latency Injection

We injected 600ms of network latency using tc netem:

```bash
sudo tc qdisc add dev ens5 root netem delay 600ms

# Remove latency
sudo tc qdisc del dev ens5 root
```

HighLatencyWarning fired, confirming the alerting pipeline for latency SLO breaches works end-to-end. The RESOLVED message confirmed recovery detection works.

📸 [Screenshot: HighLatencyWarning in Slack]
📸 [Screenshot: RESOLVED in Slack after tc removed]
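While the delay is active, the regression is visible from any client outside the host (netem on ens5 only delays traffic crossing that interface, so probing localhost from the server itself would show nothing). curl's timing variables make a crude latency probe; the URL is a placeholder:

```bash
# Sample end-to-end latency ten times; with netem adding 600ms the jump
# is obvious. Replace the URL with your service's public endpoint.
for i in $(seq 10); do
  curl -s -o /dev/null -w '%{time_total}s\n' https://your-service.example.com/health
  sleep 1
done
```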

### Scenario 3 — CPU Saturation

We used stress-ng to drive CPU above 90%:

```bash
stress-ng --cpu 0 --cpu-method matrixprod --timeout 600s &

# Recovery
pkill stress-ng
```

- HighCPUWarning entered pending state after CPU sustained above 80%
- After 5 minutes → HighCPUWarning turned firing in Prometheus
- Alert arrived in Slack with full structured payload
- HighCPUCritical entered pending state (needs 10 min sustained above 90%)
- After killing stress → both alerts RESOLVED in Slack

This confirmed the full warning → critical → recovery sequence and proved the inhibition rules work — critical suppresses the warning notification.

📸 [Screenshot: Prometheus alerts page showing Warning firing]
📸 [Screenshot: Node Exporter dashboard with CPU spike at 92%]
📸 [Screenshot: HighCPUWarning in Slack]
📸 [Screenshot: RESOLVED in Slack]

## Key Lessons Learned

1. **Systemd is production-grade.** Running services as native systemd units is simpler than Docker for single-server deployments. No networking complexity, no container runtime, logs go straight to journald. `journalctl -u prometheus -f` is all you need.
2. **Alertmanager templates have limits.** The `default` function from Sprig templating is not supported in Alertmanager. Any `| default "value"` in your template will crash Alertmanager silently. Always test templates before deploying.
3. **Port conflicts happen.** Tempo and the OTel Collector both want port 4317 (OTLP gRPC). When running as bare processes on the same host, one must move. We moved the OTel Collector to 4319/4320. In Docker this was hidden by container networking.
4. **Observability is not monitoring.** Monitoring tells you something is wrong. Observability tells you why, where, and when — without needing to SSH into a server. The Loki → Tempo drill-down reduced our incident diagnosis time from 40 minutes to 4 minutes in the PIR simulation.
5. **SLOs make reliability decisions objective.** "Is this deployment safe?" is subjective. "Do we have 100 minutes of error budget remaining?" is objective. SLOs turn reliability from a conversation into a measurement.
6. **Burn rate alerting eliminates alert fatigue.** Two burn rate alerts replaced what would have been dozens of threshold alerts during Game Day scenarios. Engineers respond to meaningful signals, not noise.
7. **Everything as code is non-negotiable.** Every dashboard, alert rule, and config that lives only in a UI is technical debt. Clone the repo, run install.sh, and the entire platform is back — no manual steps, no memory required.

## Conclusion

The MeetMind Observability Platform demonstrates that production-grade observability is achievable without managed services and without Docker. Nine systemd services provide the full observability triad — metrics, logs, and traces — with correlation between all three. SLOs convert vague reliability goals into measurable targets. DORA metrics connect daily engineering decisions to business outcomes. Burn rate alerting replaces alert storms with two meaningful signals.

The entire platform deploys with one command on any Ubuntu 24.04 server. Every component is version-controlled. Every alert links to a runbook. Every metric spike links to correlated logs and traces.

GitHub Repository: https://github.com/AirFluke/meetmind-observability

Built by Team MeetMind for HNG DevOps Track Stage 6
