Remetric: find waste in self-hosted Prometheus, Grafana, and Loki
The four waste patterns
Why a separate tool
Does this work for Grafana Cloud?
What you get per finding
Running a scan
What's next
Get started

Self-hosted Prometheus stacks degrade in predictable ways: a label
explosion that quietly doubles TSDB head size, a metric scraped by every node and queried by none, an alert rule that has not fired in nine months, a dashboard panel pointing at a metric that was renamed last quarter. The signals are all there in the existing APIs, but writing the queries, running them on a schedule, and turning the answers into actionable fixes is enough friction that nobody does it until something breaks.

Remetric is a single static binary that does this. v0.1 has already shipped.

The four waste patterns

Remetric ships five analyzers covering four categories of waste.

Cardinality explosion. A label with high uniqueness produces one series per value. A trace_id label on a request counter generates a new series for every request and never reuses any of them. Within hours a single metric can carry hundreds of thousands of dead series, consuming TSDB head memory and slowing every query that touches it. Remetric ranks labels by uniqueness ratio and series-count contribution, flags hot labels, and produces a fix snippet that drops the offending label via metric_relabel_configs.

Unused metrics. Exporters scrape thousands of metric names. Dashboards, alerting rules, and recording rules reference a subset. The leftover is dead weight in head series. Remetric walks every Grafana dashboard and every Prometheus rule expression, collects the set of referenced metric names, diffs against the ingested set, and emits one finding per unused metric. The fix is a metric_relabel_configs drop rule per metric.

Alert hygiene. Two failure modes: alerting rules that fire constantly (noise that everyone scrolls past) and rules that have never fired in the available retention window (broken? threshold wrong? failure mode no longer present?). Both need a human decision, but neither announces itself. Remetric queries rule history and surfaces rules in each state with lookback window and observed fire-count as evidence.

Broken panels. Panel queries that reference metrics no longer in head series or in recording-rule outputs.
The panel renders empty; nobody notices, because empty looks the same as fine. Remetric parses every PromQL target across every dashboard, diffs against the existence set (head series union recording-rule outputs), and emits one finding per (dashboard, missing-metric) pair listing the affected panels.

None of these are exotic. All are detectable from Prometheus and Grafana HTTP APIs. The PromQL queries for each have been folklore for years; the value is in running them on a schedule and producing actionable findings instead of more PromQL to maintain.

Why a separate tool

The detection logic exists scattered across blog posts, gists, and pinned Slack messages. Per-backend tools (cortex-tool, vmctl, mimirtool) each cover a slice, usually for the backend their vendor sells. None of them:

Remetric covers all four patterns plus the Grafana side, with a read-only contract: it never writes to the target Prometheus or Grafana, and bounded concurrency (5 in-flight requests by default, configurable) keeps it from overwhelming the target during a scan.

Does this work for Grafana Cloud?

Yes. Grafana Cloud's Prometheus (hosted Mimir) and Grafana itself speak the same HTTP APIs as the self-hosted versions, so remetric runs against them with bearer-token auth:

The lowered concurrency keeps remetric under Grafana Cloud's per-tenant rate limits during a full scan.

Two of remetric's analyzers overlap with built-in Grafana Cloud features:

What Grafana Cloud's built-ins don't cover:

For Grafana Cloud users, remetric is most useful at those gaps: alert hygiene, broken panels, and getting label-by-label explanations of what's driving the bill, instead of just "Adaptive Metrics handled it."

What you get per finding

Each finding carries:

The fix snippet is the load-bearing part. Every finding answers "now what?" with copy-pasteable text, not "consider reducing cardinality" advice.

Running a scan

Five analyzers run in sequence (each logs when it starts and how long it took, so a hung target or a slow analyzer is visible).
The severity table at the top gives at-a-glance ranking; per-finding detail blocks below carry evidence and the fix.

For CI integration, swap terminal output for JSON and add a fail-on threshold:

Exit code 3 on any finding at or above the threshold; default behavior is exit 0 regardless of findings, so the tool wires into pipelines without surprise failures.

Known-noise patterns suppress with anchored regex flags: --ignore-metric "node_.*", --ignore-label "container_label_.*", --ignore-alert "TestAlert.*", --ignore-dashboard "Legacy .*". The flags are repeatable; the patterns wrap as ^(<pattern>)$ so substring matches don't accidentally suppress unrelated findings.

The report subcommand produces the same findings as scan but in self-contained HTML (--format html) or Markdown (--format markdown) for PR comments and review distribution.

What's next

v0.1 ships five analyzers. The post-v0.1 roadmap:

Further out: a plugin system for custom analyzers (so teams can codify their own anti-patterns) and a parallel continuous-monitoring SaaS layer with alerts on cardinality spikes, multi-cluster views, and historical trends.

Get started

Documentation at remetric.dev. Source and issues at github.com/remetric-dev/remetric.

Feedback on production scans is the most useful input for the v0.2 roadmap: what the tool caught, what it missed, what it false-positived. Open a GitHub issue with a redacted output snippet
and the rough shape of the stack.
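
As a footnote to the ignore flags described under Running a scan, here is a minimal Python sketch of why wrapping a pattern as ^(<pattern>)$ matters. It is an illustration only, not remetric's implementation: the helper names compile_ignore and is_ignored are hypothetical, and Python's re module stands in for whatever regex engine the tool actually uses.

```python
import re

def compile_ignore(patterns):
    """Wrap each user-supplied pattern as ^(<pattern>)$ (hypothetical helper).

    The anchors force a whole-name match; the parentheses keep an
    alternation such as "up|down" from escaping the anchors.
    """
    return [re.compile(f"^({p})$") for p in patterns]

def is_ignored(name, compiled):
    """Return True if any anchored pattern matches the entire name."""
    return any(rx.search(name) is not None for rx in compiled)

ignore_metrics = compile_ignore(["node_.*"])

# Whole-name match: suppressed.
print(is_ignored("node_cpu_seconds_total", ignore_metrics))

# "node_" appears only as a substring, so the anchored pattern does not match,
# even though an unanchored search for "node_.*" would.
print(is_ignored("kube_node_info", ignore_metrics))
print(re.search("node_.*", "kube_node_info") is not None)

# Without the group, "^up|down$" parses as (^up)|(down$) and would match
# "upstream_latency"; the wrapped form does not.
print(is_ignored("upstream_latency", compile_ignore(["up|down"])))
print(re.search("^up|down$", "upstream_latency") is not None)
```

The grouping is the easy part to overlook: anchoring without parentheses changes the meaning of any pattern that contains a top-level alternation.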