Tools: Monitoring MinIO with Prometheus and Grafana, the right way for production (2026)
Prerequisites: What You Need
Prometheus Setup: Scraping Metrics
Securing the Scrape
Understanding Metric Cardinality
Grafana Dashboard: Turning Data into Insight
Key Visualizations to Add
Avoiding Dashboard Overload
Alerting: Preventing Outages
Final Thoughts
Frequently Asked Questions
Can I monitor standalone MinIO instances?
How often does MinIO emit metrics?
Does monitoring impact MinIO performance?

A full monitoring setup can still generate zero actionable alerts when its metrics track only resource usage instead of system invariants. The issue isn't the dashboard; it's that CPU and memory alone can't tell you whether your object storage is actually working.

Prerequisites: What You Need

You need four components to monitor MinIO with Prometheus and Grafana: a running MinIO tenant, a Prometheus server, a Grafana instance, and network connectivity between them.

MinIO exposes metrics via its built-in Prometheus endpoint at /minio/v2/metrics/cluster. This endpoint emits service-level indicators (SLIs) like minio_bucket_objects_total, minio_disk_usage, and minio_s3_requests_duration_seconds. These are not host-level metrics; they reflect object storage behavior across the entire tenant. Exact metric names vary between MinIO releases, so confirm them against your own endpoint's output.

Ensure your MinIO deployment is in distributed mode (at least 4 nodes) and running a recent version (RELEASE.-xx-xx or later). Older versions lack critical instrumentation for cluster-wide metrics.

Verify that the metrics endpoint is accessible by requesting it with your credentials. If you see metric lines, the endpoint is live. If you get a 401, check that the credentials are correct. By default the endpoint requires authentication: MinIO expects a bearer token (JWT), which mc admin prometheus generate can produce from admin credentials, so Prometheus must supply that token in the scrape job.

Prometheus Setup: Scraping Metrics

Prometheus must be configured to scrape MinIO's cluster metrics endpoint every 30 seconds, using secure credentials and proper relabeling to extract tenant and bucket labels.

The scrape job in prometheus.yml targets the /minio/v2/metrics/cluster path, which aggregates metrics across all nodes in the tenant. That's key: you're not scraping individual nodes but the cluster view, which avoids duplication and gaps.

Prometheus uses HTTP polling: every 30 seconds it makes a GET request, receives the plain-text exposition format, and parses it into time series. Each sample gets a timestamp and is stored in Prometheus's local TSDB using a write-optimized block structure (WAL + memory-mapped chunks).
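
A minimal sketch of the scrape job described above may help make the moving parts concrete. It assumes bearer-token authentication (the kind of token mc admin prometheus generate emits) and a Prometheus version recent enough to support the authorization block. The host name, token path, and tenant label value are hypothetical placeholders, not values from this article.

```yaml
# Sketch only: host, token path, and tenant label are placeholders.
scrape_configs:
  - job_name: minio-cluster                  # the name expected on the /targets page
    metrics_path: /minio/v2/metrics/cluster  # cluster-wide view, not per-node
    scheme: https
    scrape_interval: 30s
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/minio-token   # hypothetical path to the JWT
    static_configs:
      - targets: ['minio.example.internal:9000']      # hypothetical MinIO address
        labels:
          tenant: example-tenant                      # hypothetical static label
```

Reading the token from a file rather than inlining it keeps the credential out of prometheus.yml, which tends to end up in version control.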
The TSDB's write-optimized design minimizes disk seeks but requires compaction later.

Restart Prometheus so that it loads the new job, then verify the target is up in the Prometheus web UI at http://prometheus:9090/targets. You should see minio-cluster with state "UP".

Query a sample metric through the Prometheus HTTP API. In the JSON response, the value array contains [timestamp, string_value]: Prometheus stores all values as float64 internally but serializes every sample value as a string in JSON.

Securing the Scrape

Never expose MinIO's admin port publicly. Use either mutual TLS (mTLS) or a sidecar reverse proxy with IP filtering.

For mTLS, generate client certs and update the scrape config:

```yaml
tls_config:
  ca_file: /etc/prometheus/minio-ca.crt
  cert_file: /etc/prometheus/prom-client.crt
  key_file: /etc/prometheus/prom-client.key
  insecure_skip_verify: false
```

This ensures authentication and encryption at the transport layer, preventing credential leakage and tampering.

Understanding Metric Cardinality

MinIO metrics include labels like bucket, node, and operation. High cardinality (e.g., thousands of buckets) can explode Prometheus memory usage. Monitor prometheus_tsdb_head_series; if it grows beyond 10M series, consider mitigations such as using recording rules to pre-aggregate.

Example recording rule:

```yaml
groups:
  - name: minio-aggregated
    rules:
      - record: job:minio_bucket_objects_total:sum
        expr: sum by (job) (minio_bucket_objects_total)
```

This pre-sums object counts per job, so dashboards and alerts query one low-cardinality series instead of fanning out across every bucket, which lowers query load and query-time memory pressure.

"Monitoring MinIO with Prometheus and Grafana isn't about collecting data; it's about isolating failure modes before they isolate you."

Grafana Dashboard: Turning Data into Insight

A Grafana dashboard should answer: Is my MinIO tenant healthy? Are objects being written and read reliably? Is erasure coding balanced?

Start by adding Prometheus as a data source in Grafana. Then import MinIO's official dashboard (ID: 18085) from Grafana.com, via the UI or API.

Key Visualizations to Add

The default dashboard is good, but production needs deeper insight. Add these panels:

1. Erasure Set Imbalance:
```promql
max by (set) (minio_erasure_set_drives_online)
  / on(set) group_left
max by (set) (minio_erasure_set_drives_total)
```

This shows the ratio of online drives per erasure set. A value below 1.0 means the set is degraded because drives are missing or have failed.

2. Healing Queue Lag:

```promql
max(minio_healing_queue_length)
```

If this is >0 for more than 10 minutes, background healing is falling behind, which could indicate disk failures or sustained I/O pressure.

3. S3 Error Rate:
```promql
sum(rate(minio_s3_requests_duration_seconds_count{code=~"5.."}[5m]))
  /
sum(rate(minio_s3_requests_duration_seconds_count[5m]))
```

This computes the HTTP 5xx error ratio over a 5-minute sliding window. Values above 1% indicate potential service degradation.

Avoiding Dashboard Overload

Don't add every metric. Focus on SLO-relevant signals.

Alerting: Preventing Outages

Alerts must be specific, actionable, and based on symptoms, not on raw resource thresholds. Monitoring MinIO with Prometheus and Grafana means alerting on what users experience, not just what the system reports.

Define Prometheus alerting rules in a dedicated file. Each rule should fire only after a sustained condition (for:), which prevents flapping. Prometheus sends alerts to Alertmanager, which deduplicates, groups, and routes them via email, Slack, or PagerDuty.

Monitoring MinIO with Prometheus and Grafana turns reactive firefighting into proactive resilience.

Final Thoughts

Monitoring MinIO with Prometheus and Grafana isn't just a DevOps checkbox; it's how you prove your object storage is reliable. Metrics like bucket growth, healing queues, and S3 error rates expose issues long before users notice. The system doesn't just react; it anticipates.

Too many teams treat monitoring as a sidecar, something added after the fact. But in distributed systems, observability is part of the design. You wouldn't deploy a database without backups; don't deploy MinIO without instrumentation.

The real win isn't the dashboard. It's knowing, at any moment, whether your data is safe, accessible, and consistent, because the metrics say so.

Frequently Asked Questions

Can I monitor standalone MinIO instances?

Yes, but the /minio/v2/metrics/cluster endpoint only works in distributed mode. For a standalone instance, use the per-node endpoint /minio/v2/metrics/node, but you'll miss tenant-wide aggregation.

How often does MinIO emit metrics?

MinIO updates metrics every 5 seconds in memory. Prometheus typically scrapes every 30s, so there's no data loss. The values are gauges and counters rather than sampled events, so a slower scrape still sees the cumulative totals.

Does monitoring impact MinIO performance?

Negligibly. The metrics endpoint reads from in-memory counters, with no disk I/O or locking.
Even under heavy load, response time is under 10ms. Scrape every 30s to minimize overhead.
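
The alerting section refers to a dedicated rules file without showing one. As a rough sketch, the three panel expressions from this article can be turned into alerting rules as follows. The 10-minute healing window and the 1% error threshold come from the text above. The alert names, severity labels, and the 5-minute hold times are illustrative assumptions, and the metric names are the article's own, so confirm they exist in your MinIO release before relying on them.

```yaml
# Sketch only: alert names, severities, and the 5m hold times are assumptions.
groups:
  - name: minio-alerts
    rules:
      - alert: MinioErasureSetDegraded
        # Ratio of online to total drives per erasure set drops below 1.0
        expr: |
          max by (set) (minio_erasure_set_drives_online)
            / on(set) group_left
          max by (set) (minio_erasure_set_drives_total) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Erasure set {{ $labels.set }} has offline drives"

      - alert: MinioHealingBacklog
        # Healing queue has stayed non-empty for 10 minutes
        expr: max(minio_healing_queue_length) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MinIO background healing is falling behind"

      - alert: MinioHigh5xxRatio
        # More than 1% of S3 requests returned a 5xx over the last 5 minutes
        expr: |
          sum(rate(minio_s3_requests_duration_seconds_count{code=~"5.."}[5m]))
            / sum(rate(minio_s3_requests_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MinIO S3 5xx error ratio is above 1%"
```

A file like this can be checked with promtool check rules before Prometheus loads it through the rule_files setting in prometheus.yml.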