Remetric: find waste in self-hosted Prometheus, Grafana, and Loki

Self-hosted Prometheus stacks degrade in predictable ways: a label explosion that quietly doubles TSDB head size, a metric scraped by every node and queried by none, an alert rule that has not fired in nine months, a dashboard panel pointing at a metric that was renamed last quarter. The signals are all there in the existing APIs, but writing the queries, running them on a schedule, and turning the answers into actionable fixes is enough friction that nobody does it until something breaks.

Remetric is a single static binary that does this. v0.1 has already shipped.

The four waste patterns

Remetric ships five analyzers covering four categories of waste.

Cardinality explosion. A label with high uniqueness produces one series per value. A trace_id label on a request counter generates a new series for every request and never reuses any of them. Within hours a single metric can carry hundreds of thousands of dead series, consuming TSDB head memory and slowing every query that touches it. Remetric ranks labels by uniqueness ratio and series-count contribution, flags hot labels, and produces a fix snippet that drops the offending label via metric_relabel_configs.

Unused metrics. Exporters scrape thousands of metric names. Dashboards, alerting rules, and recording rules reference a subset. The leftover is dead weight in head series. Remetric walks every Grafana dashboard and every Prometheus rule expression, collects the set of referenced metric names, diffs against the ingested set, and emits one finding per unused metric. The fix is a metric_relabel_configs drop rule per metric.

Alert hygiene. Two failure modes: alerting rules that fire constantly (noise that everyone scrolls past) and rules that have never fired in the available retention window (broken? threshold wrong? failure mode no longer present?). Both need a human decision, but neither announces itself. Remetric queries rule history and surfaces rules in each state with lookback window and observed fire-count as evidence.

Broken panels. Panel queries that reference metrics no longer in head series or in recording-rule outputs.
The panel renders empty; nobody notices, because empty looks the same as fine. Remetric parses every PromQL target across every dashboard, diffs against the existence set (head series union recording-rule outputs), and emits one finding per (dashboard, missing-metric) pair listing the affected panels.

None of these are exotic. All are detectable from Prometheus and Grafana HTTP APIs. The PromQL queries for each have been folklore for years; the value is in running them on a schedule and producing actionable findings instead of more PromQL to maintain.

Why a separate tool

The detection logic exists scattered across blog posts, gists, and pinned Slack messages. Per-backend tools (cortex-tool, vmctl, mimirtool) each cover a slice, usually for the backend their vendor sells. None of them:

- cross over to Grafana to ask "does anything still use this metric?"
- check whether alert rules ever fire
- detect broken panels (which requires walking dashboards and querying head series in one pass)
- ship as a single static binary that runs in CI without a runtime install

Remetric covers all four patterns plus the Grafana side, with a read-only contract: it never writes to the target Prometheus or Grafana, and bounded concurrency (5 in-flight requests by default, configurable) keeps it from overwhelming the target during a scan.

Does this work for Grafana Cloud?

Yes. Grafana Cloud's Prometheus (hosted Mimir) and Grafana itself speak the same HTTP APIs as the self-hosted versions, so remetric runs against them with bearer-token auth:

```shell
remetric scan \
  --prometheus https://prometheus-prod-XX.grafana.net/api/prom \
  --prom-token "$GRAFANA_CLOUD_PROM_TOKEN" \
  --grafana https://YOUR-ORG.grafana.net \
  --grafana-token "$GRAFANA_CLOUD_GRAFANA_TOKEN" \
  --prom-max-in-flight 2 --grafana-max-in-flight 2
```

The lowered concurrency keeps remetric under Grafana Cloud's per-tenant rate limits during a full scan.

Two of remetric's analyzers overlap with built-in Grafana Cloud features:

- Cardinality Management (UI, available on all tiers) shows top metrics and top labels by series count. Same data remetric's cardinality analyzer surfaces, but bound to the UI: no CI integration, no paste-ready fix snippets, no programmatic consumption.
- Adaptive Metrics (paid tiers) automatically aggregates unused dimensions to reduce cardinality. Conceptually overlaps with remetric's unused-metrics analyzer, but operates as opaque automation: you don't see which labels were dropped or what dashboards rely on them.

What Grafana Cloud's built-ins don't cover:

- Alert hygiene (never-firing or always-firing rules).
- Broken panels (queries pointing at metrics that no longer exist).
- CI integration via --fail-on=critical and exit 3.
- Auditable, human-reviewed fixes you can land in a PR rather than delegate to an automation.

For Grafana Cloud users, remetric is most useful at those gaps: alert hygiene, broken panels, and getting label-by-label explanations of what's driving the bill, instead of just "Adaptive Metrics handled it."

What you get per finding

Each finding carries:

- A class slug: hot-label, unused-metric, never-firing-alert, always-firing-alert, label-pattern-overly-granular, broken-panel.
- A severity (critical / high / medium / low) derived from observed series counts, uniqueness ratios, and lookback windows.
- Evidence: sample values, series counts, affected panel titles.
- A paste-ready fix snippet: YAML for prometheus.yml when the fix is a scrape-config change, an instruction block when the fix is editing a Grafana dashboard.
- A documentation URL pointing at remetric.dev/findings/<class> with the canonical write-up: what the pattern is, why it matters, how remetric detects it, known false positives, how to fix.

The fix snippet is the load-bearing part. Every finding answers "now what?" with copy-pasteable text, not "consider reducing cardinality" advice.

Running a scan

```shell
remetric scan --prometheus http://localhost:9090 --grafana http://localhost:3000
```

Five analyzers run in sequence (each logs when it starts and how long it took, so a hung target or a slow analyzer is visible). The severity table at the top gives at-a-glance ranking; per-finding detail blocks below carry evidence and the fix.

For CI integration, swap terminal output for JSON and add a fail-on threshold:

```shell
remetric scan \
  --prometheus http://prom.internal:9090 \
  --grafana http://grafana.internal:3000 \
  --output json \
  --fail-on critical
```

Exit code 3 on any finding at or above the threshold; default behavior is exit 0 regardless of findings, so the tool wires into pipelines without surprise failures.

Known-noise patterns suppress with anchored regex flags: --ignore-metric "node_.*", --ignore-label "container_label_.*", --ignore-alert "TestAlert.*", --ignore-dashboard "Legacy .*". The flags are repeatable; the patterns wrap as ^(<pattern>)$ so substring matches don't accidentally suppress unrelated findings.

The report subcommand produces the same findings as scan but in self-contained HTML (--format html) or Markdown (--format markdown) for PR comments and review distribution.

What's next

v0.1 ships five analyzers. The post-v0.1 roadmap:

- Loki support. Logs cardinality is its own waste category; the API surface is parallel enough that existing client patterns transfer cleanly.
- Recording-rule suggestions. If three dashboards each compute the same expensive aggregation on every refresh, that aggregation is a recording rule waiting to be promoted. The analyzer would surface the candidates.
- Snapshot diff. remetric scan --baseline=last-week.json to surface regressions over time, not just absolute state. Cardinality drift is the obvious target.
- VictoriaMetrics, Mimir, Thanos extended support. Basic VM support already works (the Prometheus API path is shared); deeper integration would unlock backend-specific signals.

Further out: a plugin system for custom analyzers (so teams can codify their own anti-patterns) and a parallel continuous-monitoring SaaS layer with alerts on cardinality spikes, multi-cluster views, and historical trends.

Get started

```shell
# One-liner install (drops to $HOME/.local/bin)
curl -sSL https://remetric.dev/install.sh | sh

# Homebrew (macOS or Linux)
brew install remetric-dev/tap/remetric

# Docker (linux/amd64, linux/arm64)
docker run --rm ghcr.io/remetric-dev/remetric:latest \
  scan --prometheus http://host.docker.internal:9090
```

Documentation at remetric.dev. Source and issues at github.com/remetric-dev/remetric.

Feedback on production scans is the most useful input for the v0.2 roadmap: what the tool caught, what it missed, what it false-positived. Open a GitHub issue with a redacted output snippet and the rough shape of the stack.
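
As a rough illustration of the CI integration described earlier (`--output json`, `--fail-on critical`, exit code 3), a pipeline step could wrap the scan in a small shell function like the one below. This is a sketch under assumptions, not something remetric ships: the function name `remetric_gate` and the report filename are made up for the example, it assumes the JSON report is written to stdout, and it treats any exit code other than 0 or 3 as "the scan itself did not complete", which the text above does not specify.

```shell
# Hypothetical CI wrapper around `remetric scan`.
# Usage: remetric_gate <prometheus-url> <grafana-url> [report-file]
remetric_gate() {
    prom_url=$1
    grafana_url=$2
    report=${3:-remetric-report.json}   # illustrative default path

    # Assumption: with --output json the report goes to stdout, so it
    # can be redirected to a file and kept as a build artifact.
    remetric scan \
        --prometheus "$prom_url" \
        --grafana "$grafana_url" \
        --output json \
        --fail-on critical > "$report"
    rc=$?

    case "$rc" in
        0) echo "remetric: no findings at or above critical" ;;
        3) echo "remetric: critical findings, see $report" ;;
        # Assumption: anything else means the scan did not finish.
        *) echo "remetric: scan did not complete (exit $rc)" ;;
    esac
    return "$rc"
}
```

A pipeline would call it as `remetric_gate http://prom.internal:9090 http://grafana.internal:3000`. If the CI shell runs with `set -e`, calling it as `remetric_gate ... || exit $?` keeps the shell from aborting inside the function before the summary line is printed.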