
Tools: ⚙️ Monitoring MinIO with Prometheus and Grafana — the right way for production (2026)

2026-05-19 · admin

🔧 Prerequisites — What You Need

📊 Prometheus Setup — Scraping Metrics

🔐 Securing the Scrape

🧠 Understanding Metric Cardinality

🎨 Grafana Dashboard — Turning Data into Insight

📈 Key Visualizations to Add

⚠️ Avoiding Dashboard Overload

🚦 Alerting — Preventing Outages

🟩 Final Thoughts

❓ Frequently Asked Questions

Can I monitor standalone MinIO instances?

How often does MinIO emit metrics?

Does monitoring impact MinIO performance?

📚 References & Further Reading

A full monitoring setup can generate zero actionable alerts when its metrics track only resource usage instead of system invariants. The issue isn't the dashboard; it's that CPU and memory alone can't tell you whether your object storage is actually working.

🔧 Prerequisites: What You Need

You need four components to monitor MinIO with Prometheus and Grafana: a running MinIO tenant, a Prometheus server, a Grafana instance, and network connectivity between them.

MinIO exposes metrics via its built-in Prometheus endpoint at /minio/v2/metrics/cluster. This endpoint emits service-level indicators (SLIs) like minio_bucket_objects_total, minio_disk_usage, and minio_s3_requests_duration_seconds. These are not host-level metrics; they reflect object storage behavior across the entire tenant. Metric names differ between MinIO releases, so treat the names in this article as examples and confirm them against your own endpoint's output.

Ensure your MinIO deployment is in distributed mode (at least 4 nodes) and running a recent version (RELEASE.-xx-xx or later). Older versions lack critical instrumentation for cluster-wide metrics.

Verify the metrics endpoint is accessible:

```bash
$ curl -s http://minio-tenant:9000/minio/v2/metrics/cluster | head -5
# HELP minio_bucket_objects_total Total number of objects in a bucket
# TYPE minio_bucket_objects_total gauge
minio_bucket_objects_total{bucket="logs"} 24892
minio_bucket_objects_total{bucket="backups"} 512
# HELP minio_disk_usage Total disk usage in bytes
```

If you see metric lines, the endpoint is live. If you get a 401 or 403, the endpoint is enforcing authentication. By default MinIO expects a JWT bearer token on metrics requests, not HTTP basic auth. Running mc admin prometheus generate <alias> prints a scrape config that contains the token; generating it requires admin credentials. An unauthenticated request like the one above only succeeds when MINIO_PROMETHEUS_AUTH_TYPE is set to public. Otherwise Prometheus must supply the token in the scrape job.

📊 Prometheus Setup: Scraping Metrics

Prometheus must be configured to scrape MinIO's cluster metrics endpoint every 30 seconds, using secure credentials and relabeling that sets stable instance and job labels.

Here's the scrape job configuration for prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'minio-cluster'
    scrape_interval: 30s
    metrics_path: /minio/v2/metrics/cluster
    static_configs:
      - targets: ['minio-tenant-1.example.com:9000']
    # Token printed by `mc admin prometheus generate <alias>`
    bearer_token: 'your-bearer-token'
    relabel_configs:
      # Keep the scraped address as the instance label
      - source_labels: [__address__]
        target_label: instance
      # Use a fixed job label for queries and alerts
      - target_label: job
        replacement: minio_cluster
```

This job scrapes the /minio/v2/metrics/cluster path, which aggregates metrics across all nodes in the tenant. That's key: you're not scraping individual nodes, but the cluster view, avoiding duplication and gaps.

Prometheus uses HTTP polling: every 30 seconds, it makes a GET request, receives the plain-text exposition format, and parses it into time series. Each sample gets a timestamp and is stored in Prometheus's local TSDB using a write-optimized structure (WAL + memory-mapped chunks). This design minimizes disk seeks but requires compaction later.

Reload or restart Prometheus to apply the change:

```bash
$ sudo systemctl reload prometheus
# OR if using Docker:
$ docker restart prometheus
```

Verify the target is up in the Prometheus web UI at http://prometheus:9090/targets. You should see minio-cluster with state "UP". Query a sample metric:

```bash
$ curl -G http://prometheus:9090/api/v1/query \
    --data-urlencode 'query=minio_bucket_objects_total' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "minio_bucket_objects_total",
          "bucket": "logs",
          "instance": "minio-tenant-1.example.com:9000",
          "job": "minio_cluster"
        },
        "value": [1700000000, "24892"]
      }
    ]
  }
}
```

The value array contains [timestamp, string_value]. Prometheus stores all sample values as float64 internally and serializes them as strings in JSON responses.

🔐 Securing the Scrape

Never expose MinIO's metrics endpoint publicly. Use either:

- Mutual TLS (mTLS) between Prometheus and MinIO
- Or a sidecar reverse proxy with IP filtering

For mTLS, generate client certs and update the scrape config:

```yaml
tls_config:
  ca_file: /etc/prometheus/minio-ca.crt
  cert_file: /etc/prometheus/prom-client.crt
  key_file: /etc/prometheus/prom-client.key
  insecure_skip_verify: false
```

The job also needs scheme: https for the TLS settings to take effect. This ensures authentication and encryption at the transport layer, preventing credential leakage and tampering.

🧠 Understanding Metric Cardinality

MinIO metrics include labels like bucket, node, and operation. High cardinality (e.g., thousands of buckets) can explode Prometheus memory usage. Monitor prometheus_tsdb_head_series; if it grows beyond 10M series, consider:

- Aggregating metrics in Grafana (e.g., sum by (operation))
- Or using recording rules to pre-aggregate

Example recording rule:

```yaml
groups:
  - name: minio-aggregated
    rules:
      - record: job:minio_bucket_objects_total:sum
        expr: sum by (job) (minio_bucket_objects_total)
```

This pre-sums object counts per job, so dashboards query one series per job instead of one per bucket, which lowers query load. It does not shrink the raw series already stored; to reduce those, drop unneeded labels or metrics with metric_relabel_configs.

"Monitoring MinIO with Prometheus and Grafana isn't about collecting data; it's about isolating failure modes before they isolate you."

🎨 Grafana Dashboard: Turning Data into Insight

A Grafana dashboard should answer three questions: Is my MinIO tenant healthy? Are objects being written and read reliably? Is erasure coding balanced?

Start by adding Prometheus as a data source in Grafana. Then download MinIO's official dashboard (ID: 18085) from Grafana.com:

```bash
$ curl -o minio-dashboard.json \
    https://grafana.com/api/dashboards/18085/revisions/1/download
```

Then import it via the UI or API. The dashboard shows:

- Bucket object counts and growth rate
- S3 request rates and error ratios
- Disk usage and free space per node
- Replication and healing queue depths

Under the hood, Grafana runs PromQL queries every 30 seconds. For example, object growth uses:

```promql
sum(deriv(minio_bucket_objects_total[5m]))
```

deriv() estimates the per-second change over a 5-minute window, then sum() aggregates across all buckets. The endpoint declares minio_bucket_objects_total as a gauge (the count falls when objects are deleted), so deriv() is the right function here. rate() is meant for counters, which only increase; on a gauge it would misread every decrease as a counter reset.

📈 Key Visualizations to Add

The default dashboard is good, but production needs deeper insight. Add these panels:

1. Erasure Set Imbalance:

```promql
max by (set) (minio_erasure_set_drives_online)
  / on(set) group_left
max by (set) (minio_erasure_set_drives_total)
```

This shows the ratio of online drives per erasure set. Below 1.0 means the set is degraded because drives are missing or failed.

2. Healing Queue Lag:

```promql
max(minio_healing_queue_length)
```

If this is >0 for more than 10 minutes, background healing is falling behind, which could indicate disk failures or sustained I/O pressure.

3. S3 Error Rate:

```promql
sum(rate(minio_s3_requests_duration_seconds_count{code=~"5.."}[5m]))
  /
sum(rate(minio_s3_requests_duration_seconds_count[5m]))
```

This computes the HTTP 5xx error ratio over a 5-minute sliding window. Values above 1% indicate potential service degradation.

⚠️ Avoiding Dashboard Overload

Don't add every metric. Focus on SLO-relevant signals:

- Object durability (replication/healing)
- Read/write availability (error rates)
- Capacity planning (growth trends)

Too many graphs create noise. A clean dashboard with 6-8 panels is better than 50.

🚦 Alerting: Preventing Outages

Alerts must be specific, actionable, and based on symptoms rather than raw resource thresholds. Monitoring MinIO with Prometheus and Grafana means alerting on what users experience, not just what the system reports.

Use Prometheus alerting rules in a dedicated file:

```yaml
groups:
  - name: minio-alerts
    rules:
      - alert: MinIOHighS3ErrorRate
        expr: |
          sum(rate(minio_s3_requests_duration_seconds_count{code=~"5.."}[5m]))
          /
          sum(rate(minio_s3_requests_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High S3 error rate on MinIO"
          description: "Error rate is {{ $value }} over 5m"

      - alert: MinIOErasureSetDegraded
        expr: minio_erasure_set_drives_online < minio_erasure_set_drives_total
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Erasure set partially offline"
          description: "One or more drives offline for over 10m"

      - alert: MinIODiskAlmostFull
        expr: minio_disk_usage / minio_disk_total > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "MinIO disk usage >85%"
          description: "Disk {{ $labels.instance }} is running out of space"
```

These alerts fire only after the condition has held for the full for: duration, which prevents flapping. Prometheus sends alerts to Alertmanager, which deduplicates, groups, and routes them via email, Slack, or PagerDuty.

Monitoring MinIO with Prometheus and Grafana turns reactive firefighting into proactive resilience.

🟩 Final Thoughts

Monitoring MinIO with Prometheus and Grafana isn't just a DevOps checkbox; it's how you prove your object storage is reliable. Metrics like bucket growth, healing queues, and S3 error rates expose issues long before users notice. The system doesn't just react; it anticipates.

Too many teams treat monitoring as a sidecar, something added after the fact. But in distributed systems, observability is part of the design. You wouldn't deploy a database without backups; don't deploy MinIO without instrumentation.

The real win isn't the dashboard. It's knowing, at any moment, whether your data is safe, accessible, and consistent, because the metrics say so.

❓ Frequently Asked Questions

Can I monitor standalone MinIO instances?

Yes, but the /minio/v2/metrics/cluster endpoint only works in distributed mode. For standalone, use /minio/metrics/instance, but you'll miss tenant-wide aggregation.

How often does MinIO emit metrics?

MinIO updates metrics every 5 seconds in memory. Prometheus typically scrapes every 30s, so there's no data loss. The values are gauges and counters, not sampled.

Does monitoring impact MinIO performance?

Negligibly. The metrics endpoint reads from in-memory counters, with no disk I/O or locking. Even under heavy load, response time is under 10ms. Scrape every 30s to minimize overhead.

📚 References & Further Reading

- MinIO Monitoring Guide: official documentation on metrics, alerts, and dashboards (docs.min.io)
- Prometheus Configuration: detailed syntax for scrape jobs, relabeling, and TLS (prometheus.io)
- Grafana Dashboard Best Practices: how to build effective, maintainable dashboards (grafana.com)
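To make the endpoint check and the recording rule a little more concrete, here is a minimal Python sketch that is not part of MinIO or Prometheus. It parses the sample lines from the curl check earlier in this article and sums minio_bucket_objects_total across buckets, which is roughly what the sum by (job) recording rule produces for a single job. The names parse_samples and total_for are hypothetical, and the parser is an assumption-laden simplification: it only handles plain `name{labels} value` lines like those in the sample, not the full exposition format (no escaping, timestamps, or histograms).

```python
import re

# Sample lines copied from the curl check earlier in the article.
SAMPLE = """\
# HELP minio_bucket_objects_total Total number of objects in a bucket
# TYPE minio_bucket_objects_total gauge
minio_bucket_objects_total{bucket="logs"} 24892
minio_bucket_objects_total{bucket="backups"} 512
# HELP minio_disk_usage Total disk usage in bytes
"""

# metric name, optional {label="value",...} block, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def total_for(samples, metric):
    """Sum every sample of one metric, ignoring its labels."""
    return sum(value for name, _, value in samples if name == metric)


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    for name, labels, value in parsed:
        print(name, labels, value)
    print("total objects:", total_for(parsed, "minio_bucket_objects_total"))
```

Summing away the bucket label is the whole point of the pre-aggregation: a dashboard panel that only needs the tenant-wide total reads one number instead of one series per bucket.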
