Tools: How We Built an AI SRE That Replaces Your Log Dashboard


In this post:

- The Problem: Log Dashboards Are Broken
- The Architecture
- Sending Logs (2 Lines of Code)
- Anomaly Detection: Signal-Based, Not Threshold-Based
- 5-Layer Trace Correlation
- Auto-Ticketing: From Anomaly to Jira in 90 Seconds
- The Cost Problem
- Try It in 5 Minutes
- What's on the Roadmap
- Get Involved

TL;DR: We built an open-source platform that ingests logs via OpenTelemetry, detects anomalies using statistical analysis, and auto-creates incident tickets with root cause analysis, all in about 90 seconds. It's called LogClaw. It's Apache 2.0 licensed, and you can run docker compose up -d and have a full stack in minutes.

The Problem: Log Dashboards Are Broken

The industry average Mean Time to Resolution (MTTR) is 174 minutes. Most of that isn't fixing the problem; it's finding it. Here's what a typical incident looks like:

1. PagerDuty fires at 3 AM (a threshold alert you set 6 months ago)
2. You open Datadog/Splunk/Grafana
3. You spend 45 minutes grepping through dashboards
4. You find the error, but not the cause
5. You spend another hour tracing across services
6. You open a Jira ticket manually and paste in log lines
7. You fix the bug

Steps 2-6 are waste. A machine should do them. That's what we built.

The Architecture

LogClaw is a Kubernetes-native log intelligence platform. Here's the data flow:

```
Your App (OTEL SDK)
    ↓ OTLP (gRPC :4317 or HTTP :4318)
OTel Collector (batching, tenant enrichment)
    ↓
Kafka (Strimzi, KRaft mode)
    ↓
Bridge (Python, 4 concurrent threads)
 ├── OTLP ETL (flatten JSON, normalize fields)
 ├── Anomaly Detection (z-score on error rate distributions)
 ├── OpenSearch Indexer (bulk index, ILM lifecycle)
 └── Trace Correlation (5-layer request lifecycle engine)
    ↓
OpenSearch (full-text search, analytics)
    + Ticketing Agent (RCA via LLM → Jira/ServiceNow/PagerDuty/Slack)
```

The key insight: the Bridge runs 4 threads concurrently for ETL normalization, signal-based anomaly detection, OpenSearch indexing, and trace correlation with blast radius computation. When the anomaly detector's composite score exceeds the threshold (combining 8 signal patterns, statistical z-score, blast radius, velocity, and recurrence signals), it triggers the Ticketing Agent, which pulls relevant log samples and correlated traces, sends them to an LLM for root cause analysis, and creates a deduplicated ticket across 6 platforms.

Sending Logs (2 Lines of Code)

LogClaw uses OpenTelemetry as its sole ingestion protocol. If your app already emits OTEL, you just point it at LogClaw.

Python:

```python
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

exporter = OTLPLogExporter(
    endpoint="https://otel.logclaw.ai/v1/logs",
    headers={"x-logclaw-api-key": "lc_proj_your_key"},
)
provider = LoggerProvider()
provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
```

Node.js:

```javascript
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-http');

const exporter = new OTLPLogExporter({
  url: 'https://otel.logclaw.ai/v1/logs',
  headers: { 'x-logclaw-api-key': 'lc_proj_your_key' },
});
```

Java (zero code changes):

```bash
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.exporter.otlp.endpoint=https://otel.logclaw.ai \
  -Dotel.exporter.otlp.headers=x-logclaw-api-key=lc_proj_your_key \
  -jar my-app.jar
```

Anomaly Detection: Signal-Based, Not Threshold-Based

Most monitoring tools require manual alert thresholds: "Alert me when error rate > 5%." That approach fails in three ways: it treats validation errors the same as OOM crashes, it can't detect failures before a 30-second window completes, and it misses services with constantly elevated error rates.

LogClaw instead uses a signal-based composite scoring system, not just a z-score. Every error log flows through three stages.

Stage 1: Signal Extraction. Eight language-agnostic pattern groups with weighted severity.

Stage 2: Composite Scoring. Six categories combine into a single score:

- Pattern matches (30%)
- Statistical z-score (25%)
- Contextual signals (15%)
- HTTP status (10%)
- Log severity (10%)
- Structural indicators (10%)

The contextual signals use 300-second sliding windows to compute:

- Blast radius: how many services are simultaneously erroring (5+ services = 0.90 weight)
- Velocity: error rate acceleration vs. the historical average (a 5x spike = 0.80 weight)
- Recurrence: novel error templates score higher than known patterns

Stage 3: Dual-Path Detection.

- Immediate path (<100 ms): OOM, crashes, and resource exhaustion fire instantly, with no waiting for time windows. Your payment service crashes at 3 AM, and there's a ticket before the process restarts.
- Windowed path (10-30 s): statistical anomalies are detected via z-score analysis on sliding windows.

The result: a 99.8% detection rate for critical failures, with near-zero false positives. Validation errors (400s) and 404s produce scores below the 0.4 threshold, so they never trigger incidents.

5-Layer Trace Correlation

When an anomaly fires, the Bridge's Request Lifecycle Engine constructs a complete request timeline using 5 correlation layers:

- Trace ID clustering: groups related logs across services
- Temporal proximity: associates logs within the same time window
- Service dependency mapping: maps caller → callee relationships
- Error propagation tracking: traces the cascade from root cause to symptoms
- Blast radius computation: identifies all affected downstream services

This is what turns "your payment service has errors" into "Redis connection pool exhausted in checkout handler → payment-api failing → order-service timing out → notification-service queue backing up."

Auto-Ticketing: From Anomaly to Jira in 90 Seconds

When the composite score exceeds the threshold, the Ticketing Agent:

- Pulls relevant log samples plus the correlated trace timeline from OpenSearch
- Sends them to your LLM (OpenAI, Claude, or Ollama for air-gapped deployments)
- Generates a root cause analysis with blast radius and a suggested fix
- Creates a deduplicated ticket on Jira, ServiceNow, PagerDuty, OpsGenie, Slack, or Zammad

Severity-based routing means critical incidents hit PagerDuty + Slack + Jira simultaneously, while medium severity goes to Jira only. Your team wakes up to a ticket that says:

"Payment service composite anomaly score 0.91 (critical) at 03:47 UTC. Signals: db:connection_pool (0.75), blast_radius:4_services (0.85), velocity:12x_baseline (0.90). Root cause: Redis connection pool exhaustion due to unclosed connections in the checkout handler. Affected services: payment-api, order-service, notification-service, email-service. Suggested fix: Add connection pool max_idle_time configuration and close connections in finally block."

The Cost Problem

Compare what 500 GB/day of logs costs across the incumbent vendors, and the contrast is stark: LogClaw Cloud charges $0.30/GB ingested. No per-seat fees. No per-host fees. No per-feature add-ons. The AI anomaly detection and auto-ticketing are included.

Try It in 5 Minutes

No Kubernetes is required for testing:

```bash
git clone https://github.com/logclaw/logclaw.git
cd logclaw
docker compose up -d
```

Open http://localhost:3000 for the full dashboard, anomaly detection, and ticketing.

For production, deploy on Kubernetes with Helm:

```bash
helm install logclaw charts/logclaw-tenant \
  --namespace logclaw \
  --create-namespace
```

A single command gives you the OTel Collector, Kafka, Flink, OpenSearch, the Bridge, the Ticketing Agent, and the Dashboard.

What's on the Roadmap

LogClaw is currently focused on logs. Here's what's coming:

- Metrics support: ingest OTEL metrics alongside logs
- Trace visualization: distributed trace rendering in the dashboard
- Deep learning anomaly models: beyond z-score, using autoencoder models for subtle drift detection
- Runbook automation: not just tickets, but auto-remediation scripts

Get Involved

LogClaw is Apache 2.0 licensed. The entire platform is open source. Star the repo if this is useful, open an issue if you find a bug, and PRs are welcome.

- GitHub: https://github.com/logclaw/logclaw
- Docs: https://docs.logclaw.ai
- Managed Cloud: https://console.logclaw.ai (1 GB/day free, no credit card)
- Book a Demo: https://calendly.com/robelkidin/logclaw
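The post doesn't list the eight pattern groups, so here is a minimal sketch of what the signal-extraction stage described above could look like. The group names, regexes, and weights below are illustrative assumptions, not LogClaw's actual tables:

```python
import re

# Hypothetical pattern groups with severity weights. LogClaw uses 8 groups;
# the ones shown here are examples, not the real set.
PATTERN_GROUPS = {
    "oom":        (re.compile(r"OutOfMemory|OOMKilled|Cannot allocate memory", re.I), 0.95),
    "crash":      (re.compile(r"panic:|segfault|core dumped|fatal error", re.I), 0.90),
    "db":         (re.compile(r"connection pool|deadlock|too many connections", re.I), 0.75),
    "timeout":    (re.compile(r"timed? ?out|deadline exceeded", re.I), 0.60),
    "auth":       (re.compile(r"unauthorized|forbidden|invalid token", re.I), 0.40),
    "validation": (re.compile(r"validation failed|bad request|missing field", re.I), 0.15),
}

def extract_signals(message: str) -> dict[str, float]:
    """Return the weight of every pattern group that matches the log line."""
    return {name: weight
            for name, (pattern, weight) in PATTERN_GROUPS.items()
            if pattern.search(message)}

print(extract_signals("redis: connection pool exhausted"))  # {'db': 0.75}
```

Because the groups key on message shape rather than language-specific stack traces, the same table works for logs from Python, Java, or Go services.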
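The composite scoring stage described above can be sketched using the published weights (pattern 30%, z-score 25%, contextual 15%, HTTP status 10%, log severity 10%, structural 10%) and the 0.4 threshold. Everything else in this snippet, including the example signal values, is an assumption:

```python
# Weights for the six scoring categories, as given in the post.
WEIGHTS = {
    "pattern": 0.30,
    "zscore": 0.25,
    "contextual": 0.15,
    "http_status": 0.10,
    "severity": 0.10,
    "structural": 0.10,
}
THRESHOLD = 0.4  # scores below this never trigger an incident

def composite_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-category signals, each clamped to [0, 1]."""
    return sum(WEIGHTS[k] * min(max(signals.get(k, 0.0), 0.0), 1.0)
               for k in WEIGHTS)

def is_anomaly(signals: dict[str, float]) -> bool:
    return composite_score(signals) >= THRESHOLD

# An OOM-style burst scores high across categories...
crash = {"pattern": 0.95, "zscore": 0.90, "contextual": 0.85,
         "http_status": 1.0, "severity": 1.0, "structural": 0.7}
# ...while a validation error stays far below the threshold.
validation = {"pattern": 0.15, "zscore": 0.10, "http_status": 0.2, "severity": 0.3}
```

This is why a flood of 400s never pages anyone: even a perfect z-score spike on validation errors is capped by the low pattern and severity contributions.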
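Ticket deduplication, mentioned earlier, is easy to illustrate: fingerprint each anomaly by service and normalized error template, and reuse the open ticket on repeats. The field names and fingerprint scheme here are hypothetical, not LogClaw's schema:

```python
import hashlib

# fingerprint -> ticket id of the currently open ticket for that anomaly
_open_tickets: dict[str, str] = {}

def fingerprint(service: str, error_template: str) -> str:
    """Stable hash of the (service, normalized error template) pair."""
    return hashlib.sha256(f"{service}:{error_template}".encode()).hexdigest()[:16]

def create_or_update_ticket(service, error_template, create_fn):
    """Create a ticket once per fingerprint; repeats return the existing one."""
    fp = fingerprint(service, error_template)
    if fp in _open_tickets:
        return _open_tickets[fp], False   # duplicate: reuse the open ticket
    ticket_id = create_fn(service, error_template)
    _open_tickets[fp] = ticket_id
    return ticket_id, True                # new incident: ticket created

tid, created = create_or_update_ticket(
    "payment-api", "redis pool exhausted", lambda s, t: "JIRA-101")
tid2, created2 = create_or_update_ticket(
    "payment-api", "redis pool exhausted", lambda s, t: "JIRA-102")
```

The second call never reaches the ticketing backend, which is what keeps a 90-second detection loop from opening dozens of tickets for one ongoing incident.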
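The windowed z-score detection path described earlier can be sketched as follows; the history length, warm-up size, and z threshold are illustrative parameters, not LogClaw's actual configuration:

```python
from collections import deque
from statistics import mean, pstdev

class ZScoreDetector:
    """Flag a window whose error count deviates sharply from recent history."""

    def __init__(self, history: int = 30, z_threshold: float = 3.0):
        self.windows = deque(maxlen=history)  # error counts of past windows
        self.z_threshold = z_threshold

    def observe(self, error_count: int) -> bool:
        """Record one window's error count; return True if it is anomalous."""
        if len(self.windows) >= 5:
            mu, sigma = mean(self.windows), pstdev(self.windows)
            z = (error_count - mu) / sigma if sigma > 0 else 0.0
            anomalous = z >= self.z_threshold
        else:
            anomalous = False  # not enough history to judge yet
        self.windows.append(error_count)
        return anomalous

det = ZScoreDetector()
baseline = [4, 5, 6, 5, 4, 5, 6, 5]          # normal error counts per window
flags = [det.observe(c) for c in baseline] + [det.observe(60)]  # then a spike
```

Note that this path alone would miss a service whose error rate is *always* high, which is exactly why the composite score layers pattern and contextual signals on top of it.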
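Trace ID clustering and blast radius computation, two of the correlation layers described earlier, reduce to a small grouping exercise over log records. The simplified record shape here is an assumption for illustration:

```python
from collections import defaultdict

def blast_radius(logs: list[dict]) -> set[str]:
    """Services touched by any trace that contains at least one error."""
    by_trace = defaultdict(list)
    for log in logs:
        by_trace[log["trace_id"]].append(log)
    affected = set()
    for records in by_trace.values():
        if any(r["level"] == "ERROR" for r in records):
            affected.update(r["service"] for r in records)
    return affected

logs = [
    {"trace_id": "t1", "service": "payment-api", "level": "ERROR"},
    {"trace_id": "t1", "service": "order-service", "level": "WARN"},
    {"trace_id": "t1", "service": "notification-service", "level": "INFO"},
    {"trace_id": "t2", "service": "search-api", "level": "INFO"},  # healthy trace
]
print(sorted(blast_radius(logs)))
# ['notification-service', 'order-service', 'payment-api']
```

Grouping by trace ID first is what keeps a healthy service (search-api above) out of the blast radius even when it logs during the same time window.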
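As a quick sanity check on the pricing mentioned earlier, $0.30/GB at a 500 GB/day volume works out as follows:

```python
# Back-of-the-envelope cost at the post's $0.30/GB price and 500 GB/day volume.
price_per_gb = 0.30
gb_per_day = 500

daily = price_per_gb * gb_per_day   # $150 per day
monthly = daily * 30                # $4,500 for a 30-day month
print(daily, monthly)
```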
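Severity-based routing, as described earlier, amounts to a small severity-to-channels table: critical incidents fan out to several channels at once, medium goes to Jira only. The channel names and tiers below are illustrative:

```python
# Hypothetical routing table; only "critical" and "medium" behavior is
# described in the post, the "high" tier is an assumed middle ground.
ROUTES = {
    "critical": ["pagerduty", "slack", "jira"],
    "high": ["slack", "jira"],
    "medium": ["jira"],
}

def route(severity: str) -> list[str]:
    """Channels to notify for a given incident severity."""
    return ROUTES.get(severity, [])

print(route("critical"))  # ['pagerduty', 'slack', 'jira']
```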