Tools: Report: How to Evaluate an AI SRE Platform

Key Takeaways

Why do generic RFPs fail for AI SRE?

How do you measure AI SRE investigation quality?

Anchor: the RCAEval benchmark

Anchor: the NOFire signal-type ladder

What to measure during evaluation

How do you evaluate AI SRE trust and governance?

Anchor: the Rootly AI SRE Maturity Model

What to evaluate

What deployment sovereignty checks matter for an AI SRE?

What to evaluate

Anchor: cite the project's own documentation

How do you model AI SRE total cost of ownership?

Why we are not publishing competitor pricing numbers

How long should the evaluation take?

What do competitor evaluation frameworks miss?

What are the common AI SRE evaluation mistakes?

Appendix: How to evaluate an AI SRE platform in 21 days

Frequently Asked Questions

How do I benchmark an AI SRE platform's accuracy?

What are the four pillars of AI SRE evaluation?

What is the RCAEval benchmark?

What is the NOFire AI signal-type ladder?

What is Rootly's AI SRE Maturity Model?

How long does an AI SRE evaluation take?

Why do generic RFPs fail for AI SRE evaluation?

How do I evaluate the cost of an AI SRE?

Can I evaluate an AI SRE without buying first?

Why not publish a TCO comparison table?

This guide is the deep evaluation framework. For the brief two-week procurement plan, see the HowTo schema in our What is an AI SRE? glossary entry. For the five-capability rubric that filters the shortlist before evaluation starts, see the Five-Capability AI SRE Test in that same post. Everything below assumes the shortlist has already cleared that test.

We call the framework below the Four-Pillar AI SRE Evaluation Framework. Each pillar is a separate scoring axis and each is anchored to a primary source where one exists.

A typical SaaS procurement RFP covers uptime, security posture, pricing, support tiers, and integrations. None of these surface the failure modes that matter for an AI SRE.

The four pillars below replace the generic RFP with an AI-SRE-specific scoring sheet.

Investigation quality is the central question and the one most often left to vendor demos. The literature now provides enough scaffolding to measure it without a labelled production dataset.

RCAEval (Pham, Zhang, Ha, Salim, Zhang; December 2024; published at ACM Web Conference 2025 per the ACM Digital Library record) ships:

The authors describe the work as the first comprehensive benchmark for root-cause analysis of microservices and the gap it fills as "no standard benchmark that includes large-scale datasets and supports comprehensive evaluation environments." For a buyer, this gives a defensible neutral testbed: ask the vendor to run their agent against the RCAEval fault-injection set and report Top-1 accuracy.

NOFire AI's published benchmark uses the RCAEval dataset and reports two findings buyers should internalise.

Investigation quality without trust controls is a liability. The buyer's question is not "can the agent act" but "under what conditions, with what evidence, and with what rollback."

Rootly publishes a four-level maturity model that maps the trust ladder cleanly.

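
As a minimal sketch, the four Rootly levels and the rule "do not procure higher than you will operate" could be encoded as follows. The names `MaturityLevel` and `posture_fit` and the three result labels are hypothetical, chosen only for illustration; the level comments paraphrase this guide's summary of the Rootly model rather than Rootly's own wording.

```python
from enum import IntEnum


class MaturityLevel(IntEnum):
    """Rootly AI SRE Maturity Model levels, as summarised in this guide."""
    MANUAL = 0                # manual reliability operations, no AI assistance
    READ_ONLY = 1             # evidence-first copilot, executes no changes
    ASSISTED_ACTIONS = 2      # approved actions through a governed workflow
    GUARDRAILED_AUTONOMY = 3  # narrow, reversible, pre-approved failure modes


def posture_fit(tool_level: MaturityLevel, operating_level: MaturityLevel) -> str:
    """Compare the level a tool is mapped to with the level the team will operate.

    Hypothetical helper: the labels below are illustrative, not part of the
    Rootly model.
    """
    if tool_level > operating_level:
        return "over-procured"   # paying for capability that stays switched off
    if tool_level < operating_level:
        return "will-outgrow"    # tool cannot support the intended posture
    return "fit"


# A team that wants read-only operation, scored against a Level 3 tool.
print(posture_fit(MaturityLevel.GUARDRAILED_AUTONOMY, MaturityLevel.READ_ONLY))
```

Using an ordered enum keeps the comparison explicit: the check is about deployment posture, so a higher level is flagged as a mismatch instead of being rewarded as a higher score.
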
The Rootly framing is useful for a reason buyers often miss: the levels are not a feature ranking, they are a deployment posture. A tool that ships Level 3 features is not "better" than a tool that ships Level 1 features; it is appropriate for a different stage of the buyer's adoption arc.

For regulated industries, the deployment-sovereignty pillar is a filter, not a scoring axis. The shortlist either includes a deployment posture that matches the buyer's constraints or it does not.

For an open-source AI SRE, the source of truth is the project's GitHub repository and official docs site. For a commercial AI SRE, the source of truth is the vendor's data-processing addendum and the published deployment architecture. Avoid relying on the sales conversation for this pillar; the engineering documentation is where commitments are durable.

TCO for AI SRE breaks down into four layers, only one of which is the licence.

Most direct AI SRE competitors (Resolve.ai, Cleric, PagerDuty SRE Agent, Datadog Bits AI SRE) do not publish per-investigation or per-seat pricing on their public sites. Buyers must request a quote. Any TCO comparison we published with specific dollar figures would either be out of date by the time you read it or fabricated. The correct approach is to issue an RFP that asks for the same shape of cost data (licence floor, per-investigation rate, per-seat rate, included integrations) from every vendor on the shortlist and to model the rest yourself.

Our recommendation is a 21-day evaluation sprint, structured as below. This is longer than the 14-day plan in the What is an AI SRE? HowTo schema because the four-pillar framework explicitly measures investigation quality with synthetic fault injection rather than relying on demo-driven impressions. The 21-day sprint is documented in the appendix at the end of this post.

Several competitor frameworks exist; each has gaps the Four-Pillar Framework closes.

The Four-Pillar Framework is intentionally additive.
A buyer who has already adopted Rootly's maturity model and Komodor's testing methodology can use the framework above to fill the deployment-sovereignty and TCO gaps without throwing away their existing work.

A three-week evaluation sprint that applies the Four-Pillar AI SRE Evaluation Framework end-to-end. Each step produces a written deliverable a procurement reviewer can sign off.

Day 1 to 3: Filter the shortlist on the Five-Capability AI SRE Test. Score every shortlisted tool on multi-step investigation, infrastructure tool execution, dependency-graph awareness, knowledge-base RAG, and structured root-cause output. Drop any tool that scores below 6 out of 15. The deliverable is a one-page capability scorecard for each tool on the remaining shortlist.

Day 4 to 8: Measure investigation quality on synthetic fault injection. Replay a sample of RCAEval fault types (CPU throttling, memory leak, network latency, container crash, deployment error) against each tool. Measure Top-1 accuracy, time to first useful finding, and the coherence of the investigation trace. Run the NOFire signal-type ladder by configuring the same fault under metrics-only, metrics-plus-logs, and metrics-plus-logs-plus-traces and report the accuracy curve.

Day 9 to 11: Run the trust and governance walkthrough. Map each tool to a Rootly AI SRE Maturity Model level (0, 1, 2, or 3). Document the action classes the tool supports (read-only investigation, PR-based suggestion, sandboxed execution), the audit and rollback paths for each, and the hallucination guardrails the vendor publishes. Match the tool's maturity level to your adoption stage; do not procure higher than you will operate.

Day 12 to 14: Apply the deployment sovereignty filter. Document where the LLM call physically runs (SaaS, self-hosted, air-gapped), where telemetry resides, whether the tool supports BYO-LLM, and whether air-gapped operation is supported.
For regulated buyers this is a filter, not a scoring axis: a tool that fails residency or air-gapped requirements drops off the shortlist regardless of its score on the other pillars.

Day 15 to 17: Model total cost of ownership. Build four cost lines for each tool: licence or subscription, projected LLM inference at expected investigation volume, projected operating-surface increase (observability ingestion, ticket-system API load), and projected engineering time on guardrails and runbook ingestion. Model year one and year three. Add an open-source baseline (HolmesGPT, K8sGPT, or Aurora) to the cost table as a reference floor.

Day 18 to 20: Pilot the top two tools in read-only mode. Pick a single SRE team or product squad. Route a defined subset of alerts (one severity tier, one service domain) into each tool in read-only investigation mode. Capture the team's qualitative read on whether the investigation traces are trustable, faster, and surface evidence the team would have missed manually.

Day 21: Produce the four-pillar decision memo. Write a one-page memo with the per-tool scores on each pillar: Investigation Quality (Top-1 accuracy on RCAEval-style fault injection plus signal-type robustness), Trust and Governance (Rootly maturity level fit plus published guardrail story), Deployment Sovereignty (pass or fail on inference location and residency), and TCO (year-one total across the four cost lines). The decision is the highest scorer on the buyer's weighted pillar priorities, not the highest scorer overall.

Use the RCAEval public benchmark (arXiv 2412.17015), which provides 735 fault-injection cases across three microservice systems with 11 fault types. Ask the vendor to run their agent against the RCAEval set and report Top-1 accuracy. Replay a handful of fault types yourself and inspect the investigation trace for a coherent reasoning chain.
Cross-reference against the NOFire AI benchmark for the signal-type ladder, which shows Top-1 accuracy rising from 29 percent on metrics-only inputs to 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning.

Investigation quality, trust and governance, deployment sovereignty, and total cost of ownership. Investigation quality is anchored to the RCAEval benchmark and NOFire signal-type ladder. Trust and governance maps to Rootly's four-level AI SRE Maturity Model. Deployment sovereignty is a gating filter on inference location and data residency. TCO covers licence, LLM inference, operating-surface cost, and engineering time on guardrails.

RCAEval is the first comprehensive open benchmark for root-cause analysis of microservice systems, published by Pham et al. in December 2024 and presented at the ACM Web Conference 2025. It ships 735 failure cases across three microservice systems, 11 fault types observed in real-world failures, multi-source telemetry (metrics, logs, traces), and 15 reproducible baselines covering coarse-grained and fine-grained RCA. It is the closest thing the AI SRE category has to a neutral evaluation testbed.

The NOFire benchmark, built on the RCAEval dataset, reports Top-1 accuracy across four signal configurations: 29 percent on metrics-only inputs, 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning. The published implication is that signal type matters more than model choice; a buyer who sends only metrics to the agent should not expect more than a third of investigations to converge regardless of vendor.

A four-level trust ladder published by Rootly. Level 0 is manual reliability operations. Level 1 is read-only AI SRE that produces ranked hypotheses linked to evidence but executes nothing. Level 2 is assisted actions with approvals through a governed workflow engine.
Level 3 is guardrailed autonomy for a narrow set of pre-approved, reversible failure modes. The levels are a deployment posture, not a feature ranking; buyers should match the level to their adoption stage rather than always procuring at the top.

A focused buyer can complete the Four-Pillar evaluation in 21 days: three days on shortlisting against the Five-Capability AI SRE Test, five days on investigation-quality testing with synthetic fault injection, three days on the trust-and-governance walkthrough with security, three days on the deployment-sovereignty filter, three days on TCO modelling, and the remainder on pilot operation and the decision memo.

Standard SaaS procurement templates miss the failure modes specific to LLM agents: hallucinated root causes, signal-type sensitivity, trust-ladder mismatch, and inference-location ambiguity. RFPs that score on uptime and integrations without measuring investigation quality on synthetic fault injection misestimate accuracy by a factor of three (per the NOFire benchmark) and tend to push buyers toward over-procurement at higher maturity levels than they will operate.

Model four cost layers, not one. The licence or subscription line is the visible cost and is zero for open-source AI SREs. LLM inference is the live-burn cost, modelled against per-token rates from the chosen provider or against compute costs for self-hosted models. The operating-surface cost covers increased ingestion or rate-limit pressure on the systems the agent reads from (observability backends, ticket systems, source control). Engineering time on guardrails and runbook ingestion is the largest hidden cost in year one and frequently exceeds the licence line.

Yes for the open-source projects. HolmesGPT, K8sGPT, and Aurora can be installed with Docker Compose or a Helm chart in a single afternoon and run against the RCAEval fault-injection dataset with no commercial commitment.
Most commercial AI SREs offer a trial period; Datadog Bits AI SRE documents a 14-day free trial in its launch blog. Use the open-source baseline to calibrate expectations before scoring any commercial pitch.

Most direct AI SRE competitors (Resolve.ai, Cleric, PagerDuty SRE Agent, Datadog Bits AI SRE) do not publish per-investigation or per-seat pricing on their public sites. Any TCO comparison with specific dollar figures would either be out of date by the time it was read or fabricated. The honest answer is to issue an RFP that asks for the same shape of cost data (licence floor, per-investigation rate, per-seat rate, included integrations) from every vendor and to model the inference, operating-surface, and engineering layers yourself.

- Generic SaaS RFPs do not fit AI SRE. The category is younger than most procurement templates and the failure modes (hallucinated root causes, model drift, signal-type sensitivity) are not covered by traditional vendor checklists.

- Investigation quality is measurable. The RCAEval benchmark (Pham et al., December 2024, published at ACM Web Conference 2025 Companion Proceedings) provides 735 fault-injection cases across three microservice systems with 11 fault types and 15 reproducible baselines. The NOFire AI benchmark extends this with a signal-type ladder showing Top-1 accuracy rises from 29 percent on metrics-only inputs to 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning.
- Trust is a separate axis from capability. Rootly's AI SRE Maturity Model maps the trust ladder in four steps: Level 0 (manual), Level 1 (read-only copilot), Level 2 (assisted actions with approvals), Level 3 (guardrailed autonomy for narrow, reversible failure modes). Buyers should stage trust across that ladder, not buy at the top.
- Deployment sovereignty is a gating constraint, not a tiebreaker. Air-gapped, residency-bound, and BYO-LLM buyers must filter the shortlist on inference location before scoring anything else. See our Self-Hosted AI SRE guide.
- TCO is not just the licence line. AI SRE cost models include LLM inference, observability surface, runbook ingestion, and the engineering time spent on guardrails. Open-source projects shift the licence cost to zero and surface the operating cost transparently.

- Hallucinated root causes. An AI SRE that produces a confident, plausible, wrong root-cause analysis is worse than no AI SRE, because the on-call rotation will follow the suggestion before second-guessing it. RFPs that do not include investigation-quality measurement miss this.
- Signal-type sensitivity. The NOFire benchmark shows that the same agent moves from 29 percent Top-1 accuracy on metrics-only inputs to 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning (NOFire AI Benchmark).
Buyers who do not test the agent against their actual signal mix will misestimate accuracy by a factor of three.
- Trust ladder mismatches. Buyers often evaluate against the wrong trust level. A team that wants Level 1 read-only operation and scores a tool on its Level 3 closed-loop remediation features is grading the tool on capability they will never use.
- Inference-location ambiguity. Where the LLM call physically runs is a single-question filter that disqualifies half the shortlist for regulated buyers. Standard RFPs bury this in a "data residency" footnote rather than putting it at the top of the form.

- 735 failure cases collected from three microservice systems.
- 11 fault types observed in real-world failures, including CPU throttling, memory leaks, network latency, container crashes, deployment errors, resource exhaustion, and database connection failures.
- Multi-source telemetry (metrics, logs, and traces) supporting metric-based, trace-based, and multi-source RCA approaches.
- 15 reproducible baselines covering coarse-grained and fine-grained RCA.

- Signal type matters more than model choice. The benchmark reports Top-1 accuracy of 29 percent on metrics-only inputs, 77 percent when logs are added, 87 percent when traces are added, and 89 percent on full multi-modal telemetry with agentic reasoning. The implication: a buyer who only sends metrics to the agent should not expect more than a third of investigations to converge correctly, regardless of which vendor they pick.
- Production-context graphs beat plain LLMs. NOFire reports 89 percent Top-1 accuracy on the RCAEval set, versus 42 percent for the best academic baseline (described on the benchmark page as "Academic SOTA (GALA BARO M2)"). The published implication is that a structured representation of the production environment (services, dependencies, recent changes) gives the agent better grounding than narrative telemetry alone.
The vendor publishes this benchmark; treat the absolute numbers with the appropriate scepticism, but the relative ranking and the signal-type ladder are reproducible from the open dataset.

- Top-1 accuracy on RCAEval-style fault injection. Replay a handful of fault types and ask the vendor to walk through the investigation trace. The presence or absence of a coherent reasoning chain is the binary signal.
- Signal-type robustness. Run the same fault under three configurations: metrics only, metrics plus logs, metrics plus logs plus traces. The shape of the accuracy curve tells you whether the agent compensates for signal gaps or simply degrades.
- Time to first useful finding. Not MTTR (which depends on the human response loop), but the elapsed time from alert ingestion to the first piece of evidence a human SRE would have surfaced manually.
- Cross-system reasoning. Construct a synthetic incident that spans Kubernetes, a managed database, and a recent deploy. Measure whether the agent reasons across all three or fixates on the loudest signal.

- Level 0: Manual reliability operations. No AI assistance. Responders hunt across dashboards.
- Level 1: Read-only AI SRE, evidence-first copilot. The "trust-building stage." The AI accelerates context gathering and produces ranked hypotheses linked to evidence, but executes no changes.
- Level 2: Assisted actions with approvals. The AI can propose and run approved actions through a governed workflow engine with RBAC, audit logs, verification gates, and rollback readiness.
- Level 3: Guardrailed autonomy for narrow, reversible failure modes. The AI autonomously executes pre-approved runbooks for a small set of repeatable incidents, within strict bounds and with continuous verification.

- Match the maturity level to the buyer's current state. A team that has not yet shipped Level 1 should not be paying for Level 3 capability they will not turn on.
A team ready for Level 2 should not be buying a Level 1-only tool that they will outgrow in twelve months.
- Action class boundaries. Read-only investigation, PR-based suggestions, and sandboxed in-cluster execution are three different trust decisions. Document which classes the tool supports and which the buyer will enable.
- Audit and rollback. Every action the agent can take must have an audit log entry and a rollback path. Komodor's published benchmarking guide names this dimension as "Transparency: evidence, timelines, and change history alongside every recommendation" (Komodor: Beyond the Hype).
- Hallucination guardrails. Komodor's guide also calls out "rigorous testing cycles and closed feedback loops" as the path to "95% RCA precision." Ask the vendor what those feedback loops look like in practice; a tool with no published guardrail story should be downgraded.

- Inference location. Does the LLM call run on vendor-managed infrastructure (SaaS), on customer-managed infrastructure (self-hosted), or on a local model (air-gapped)? See our Self-Hosted AI SRE guide for the full deployment-tier framework.
- Data residency. Where does telemetry physically reside when sent to the agent? Buyers under GDPR, HIPAA, or sector-specific regimes need a written answer, not a marketing one.
- BYO-LLM support. Can the buyer point the agent at their own LLM endpoint? Open-source projects support this directly: HolmesGPT documents OpenAI-Compatible (LiteLLM proxy) and Ollama; K8sGPT registers IBM watsonx, Oracle OCI GenAI, and a generic Custom REST endpoint among others; Aurora supports local inference through Ollama. Most commercial AI SREs offer a smaller backend list.
- Air-gapped operation. Can the agent run with no outbound network calls? This is the strictest test and disqualifies most SaaS-only products.

- Licence or subscription cost. Open-source AI SREs are zero at this layer. Commercial AI SREs use per-seat, per-investigation, or platform-bundled pricing.
Datadog Bits AI SRE bundles into the broader Datadog platform per the product page. PagerDuty SRE Agent bundles into the PagerDuty platform per the product page. Resolve.ai and Traversal price by custom contract.
- LLM inference cost. This is the live-burn cost. Frontier-model API pricing changes monthly; buyers should model investigations per month against the published per-token rates of their chosen provider (Anthropic, OpenAI, Google). For BYO-LLM deployments running local models through Ollama or vLLM, the inference cost is reduced to the underlying compute.
- Operating surface cost. The agent reads from observability backends, ticket systems, and source control. Heavy use can increase the costs of the systems it reads from (Datadog ingestion, Splunk indexing, GitHub API rate-limit upgrades). Buyers should add a line item for this and ask their FinOps team to model it before purchase.
- Engineering time on guardrails and runbook ingestion. Every AI SRE needs runbooks ingested, integrations configured, and RBAC scoped. The first month of engineering time is the largest hidden TCO cost.

- Traversal's "How Should You Evaluate an AI SRE Product?" post focuses on selecting representative incidents, defining success metrics, and a multi-tier accuracy rubric for root cause analysis. It is strong on the testing methodology and quiet on TCO modelling and open-source alternatives.
- Komodor's "Beyond the Hype" benchmarking guide is strong on transparency and on the LLM-as-a-Judge testing methodology.
The detailed scoring rubric is gated behind an ebook download and the framework does not extend to deployment sovereignty.
- Rootly's maturity model is the cleanest published trust ladder, but does not address investigation quality measurement or signal-type sensitivity.
- The Traversal LLM benchmarking paper for incident root cause analysis is excellent on model-level evaluation and silent on the buyer-process question of how to map evaluation results onto a maturity-level decision.

- Scoring on demo polish. A demo is a curated success path. Evaluate on failure cases the vendor has not seen.
- Skipping the signal-type test. The NOFire ladder shows accuracy can vary by a factor of three depending on what telemetry the agent receives. Run the test on the buyer's actual signal mix.
- Buying remediation before trust. Most teams should buy at Level 1 (read-only investigation) and stage trust upward across six to twelve months, not procure at Level 3 in a single decision.
- Ignoring open-source baselines. A Five-Capability-passing open-source AI SRE deployed in a single afternoon is the fairest baseline against which to measure any commercial pitch. See our open-source three-way comparison.
- Treating TCO as the licence line. Inference, operating surface, and guardrail engineering frequently exceed the licence cost in year one.
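
The signal-type test named in the list above comes down to computing Top-1 accuracy separately for each telemetry configuration. A minimal sketch, assuming each replayed fault case is recorded with its ground-truth root cause and the agent's ranked candidates: the `CaseResult` record shape, the `top1_by_config` helper, and the service names are hypothetical and are not part of RCAEval or the NOFire benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CaseResult:
    """One replayed fault case (hypothetical record shape, not an RCAEval schema)."""
    signal_config: str                  # e.g. "metrics" or "metrics+logs+traces"
    true_root_cause: str                # ground-truth label of the injected fault
    ranked_candidates: Tuple[str, ...]  # agent's root-cause candidates, best first


def top1_by_config(results: Iterable[CaseResult]) -> Dict[str, float]:
    """Fraction of cases per signal configuration whose top candidate is correct."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.signal_config] += 1
        # An empty candidate list counts as a miss.
        if r.ranked_candidates and r.ranked_candidates[0] == r.true_root_cause:
            hits[r.signal_config] += 1
    return {cfg: hits[cfg] / totals[cfg] for cfg in totals}


# Illustrative data only: service names and outcomes are made up.
results = [
    CaseResult("metrics", "cartservice", ("frontend", "cartservice")),
    CaseResult("metrics", "checkout", ("checkout",)),
    CaseResult("metrics+logs+traces", "cartservice", ("cartservice", "frontend")),
    CaseResult("metrics+logs+traces", "checkout", ("checkout",)),
]
ladder = top1_by_config(results)
print(ladder)
```

Keeping the accuracy keyed by configuration, instead of pooling all cases, is what exposes the shape of the curve: a tool that only looks good with traces attached will show it here.
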
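
To avoid treating TCO as the licence line, the four cost layers can be modelled per tool and per year. The sketch below shows only the arithmetic. Every figure in it is a placeholder invented for illustration, not vendor or provider pricing, and the `AnnualCost` and `annual_inference_cost` names are hypothetical. Substitute quoted rates from the RFP responses and the buyer's own volume estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnualCost:
    """The four cost lines for one tool in one year (all values in USD)."""
    licence: float
    llm_inference: float
    operating_surface: float
    engineering_time: float

    def total(self) -> float:
        return (self.licence + self.llm_inference
                + self.operating_surface + self.engineering_time)


def annual_inference_cost(investigations_per_month: int,
                          tokens_per_investigation: int,
                          usd_per_million_tokens: float) -> float:
    """Yearly LLM spend from monthly volume and a per-token rate."""
    yearly_tokens = 12 * investigations_per_month * tokens_per_investigation
    return yearly_tokens * usd_per_million_tokens / 1_000_000


# Placeholder inputs, NOT vendor or provider pricing.
inference = annual_inference_cost(500, 200_000, 5.0)

# Open-source baseline: zero licence, heavier engineering time in year one.
year_one = AnnualCost(licence=0.0, llm_inference=inference,
                      operating_surface=3_000.0, engineering_time=40_000.0)
later_year = AnnualCost(licence=0.0, llm_inference=inference,
                        operating_surface=3_000.0, engineering_time=10_000.0)

three_year_total = year_one.total() + 2 * later_year.total()
print(year_one.total(), three_year_total)
```

Building the same four-line record for every shortlisted tool, including an open-source baseline, keeps the year-one and three-year totals comparable even when vendors quote in different shapes.
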