# Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis
2026-03-06
Modern cloud systems are complex distributed architectures where a single user journey may depend on dozens of services running across multiple infrastructure layers. When a Service Level Objective (SLO) breach occurs, identifying the root cause often requires navigating logs, metrics, traces, service dependencies, and infrastructure relationships. In many organizations, this investigation is still manual and time-consuming.

In a recent project, I explored how AI agents can automate incident investigation by combining:

- Observability data
- Service topology
- Kubernetes infrastructure context
- Historical incident knowledge
- Graph-based reasoning

This approach reduced investigation time from 20–30 minutes to under a minute for certain SLO breaches. This article introduces the concept of Topology-Aware AI Agents and shows how such a system can be implemented using AWS services and graph-based system modeling.

## The Problem: Traditional Incident Investigation

When an SLO breach occurs, SRE teams typically perform the following steps:

- Identify the impacted user journey
- Check monitoring dashboards
- Inspect logs and traces
- Identify impacted services
- Traverse upstream and downstream dependencies
- Correlate incidents with infrastructure problems

In large microservice environments, this investigation becomes difficult because:

- Logs lack system-wide context
- Metrics show symptoms but not relationships
- Service dependencies are hard to traverse quickly
- Infrastructure and application layers are often disconnected

Even with powerful observability tools, humans still perform most correlation tasks manually.
## Why Logs Alone Are Not Enough for AI

Many AI troubleshooting systems rely on RAG (Retrieval-Augmented Generation) over logs or documentation. However, logs alone do not capture system relationships. Consider this example log entry:

```
Payment API latency spike
```

Without topology context, an AI system cannot determine:

- Which upstream service triggered the issue
- Which downstream dependency failed
- Whether the issue originated from the infrastructure or application layer

To solve this, we need structural knowledge about the system architecture.

## Introducing Topology-Aware AI Agents

A Topology-Aware AI Agent combines three major sources of context:

```
Observability Data
        +
Service Topology
        +
Historical Incident Knowledge
```

The agent uses this combined knowledge to automatically:

- Identify impacted services
- Traverse dependency graphs
- Correlate incidents
- Suggest root causes

This transforms incident troubleshooting from log searching into graph-based reasoning.
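The combination above can be sketched as a simple context-assembly step. This is an illustrative sketch only, not the project's actual code; `build_agent_context` and the sample strings are hypothetical:

```python
def build_agent_context(signals, topology_facts, past_incidents):
    """Merge the three context sources into one structured prompt block
    that can be handed to an LLM for root cause reasoning."""
    sections = [
        "Recent signals:\n" + "\n".join(f"- {s}" for s in signals),
        "Topology:\n" + "\n".join(f"- {t}" for t in topology_facts),
        "Similar past incidents:\n" + "\n".join(f"- {p}" for p in past_incidents),
    ]
    return "\n\n".join(sections)

prompt = build_agent_context(
    signals=["Timeout errors in Payment Service"],
    topology_facts=["Checkout Service CALLS Payment Service"],
    past_incidents=["Database connection pool exhaustion"],
)
```

The point is that topology facts travel alongside raw signals, so the model reasons over relationships rather than isolated log lines.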
## Platform Context: Microservices Running on Amazon EKS

In this environment, the application platform was built on Kubernetes running on Amazon Elastic Kubernetes Service (EKS). Each user request travels across multiple layers:

```
User Request
    ↓
API Gateway / Entry Service
    ↓
Microservices running on Kubernetes
    ↓
Databases / external dependencies
```

Each microservice runs inside containers deployed on Kubernetes pods. To enable automated incident analysis, the system needed visibility into:

- Cloud infrastructure
- Kubernetes resources
- Application services
- Runtime service interactions
- Observability signals

These relationships were modeled in a graph database.

## Building the Service Relationship Graph

The system used Neo4j to build a knowledge graph representing the full platform topology. The graph captured relationships across multiple layers:

- Cloud infrastructure
- Kubernetes platform
- Application services
- Service interactions
- Historical incidents

This structure allowed the AI agent to reason about how failures propagate across the system.

## Modeling the Infrastructure Layer

The first layer of the graph represented the cloud infrastructure:

- Cloud Provider
- AWS Account
- Region
- Availability Zone
- Host (EC2)

Example relationships:
```
AWS Account
    │ DEPLOYS
    ▼
EKS Cluster
    │ RUNS_ON
    ▼
EC2 Worker Node
```

This enables the system to correlate incidents with infrastructure-level problems such as:

- Node failures
- CPU saturation
- Network issues
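In a graph store, these relationships are just typed edges between nodes. As a minimal in-memory stand-in for the Neo4j model (the relationship names follow the diagram above; the `targets` helper is hypothetical):

```python
# Illustrative stand-in for the Neo4j infrastructure layer: each edge is
# (source node, relationship type, target node).
edges = [
    ("AWS Account", "DEPLOYS", "EKS Cluster"),
    ("EKS Cluster", "RUNS_ON", "EC2 Worker Node"),
]

def targets(graph, source, rel):
    """Return nodes reached from `source` via relationship type `rel`."""
    return [dst for src, r, dst in graph if src == source and r == rel]

# Which infrastructure hosts does the cluster run on?
print(targets(edges, "EKS Cluster", "RUNS_ON"))  # ['EC2 Worker Node']
```

In the real system the same question would be a Cypher `MATCH` query; the shape of the answer is identical.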
## Modeling the Kubernetes Platform

The next layer represents Kubernetes resources running on the EKS cluster:

- EKS Cluster
- Namespace
- Pod
- Container
- Process Group

Example relationships:

```
EKS Cluster
    │ CONTAINS
    ▼
Namespace
    │ CONTAINS
    ▼
Pod
    │ RUNS
    ▼
Container
```

Each container instance is mapped to a process group representing a running microservice instance. This structure allows the graph to capture runtime relationships between services and infrastructure nodes.
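The containment chain above can be expanded transitively, which is how the agent enumerates every runtime object under a cluster. A hedged sketch (stand-in for a variable-length path query in Cypher; the `children` data and `descendants` helper are hypothetical):

```python
# Illustrative containment walk over CONTAINS/RUNS edges, listing every
# runtime object under a node.
children = {
    "EKS Cluster": ["Namespace"],
    "Namespace": ["Pod"],
    "Pod": ["Container"],
    "Container": ["Process Group"],
}

def descendants(node):
    """Depth-first expansion of everything contained under `node`."""
    result = []
    for child in children.get(node, []):
        result.append(child)
        result.extend(descendants(child))
    return result

print(descendants("EKS Cluster"))
# ['Namespace', 'Pod', 'Container', 'Process Group']
```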
## Modeling Application Services

At the application level, the graph represents each microservice as a service node:

- Service
- API
- Database
- External Dependency

Services are connected to the runtime processes executing them. Example relationship:

```
Checkout Service
    │ RUNS_AS
    ▼
Process Group
    │ HOSTED_ON
    ▼
Kubernetes Pod
```

This mapping enables the system to trace incidents from application failures down to infrastructure components.
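Tracing from an application failure down to infrastructure is then a fixed chain of relationship hops. A minimal sketch, assuming the edge data shown in the diagram above (the `trace_down` helper is hypothetical):

```python
# Illustrative downward trace: follow RUNS_AS and HOSTED_ON edges from a
# service toward its infrastructure. Edge data mirrors the diagram above.
edges = {
    ("Checkout Service", "RUNS_AS"): "Process Group",
    ("Process Group", "HOSTED_ON"): "Kubernetes Pod",
}

def trace_down(start, rels):
    """Walk a fixed chain of relationship types, collecting each hop."""
    path, node = [start], start
    for rel in rels:
        node = edges[(node, rel)]
        path.append(node)
    return path

print(trace_down("Checkout Service", ["RUNS_AS", "HOSTED_ON"]))
# ['Checkout Service', 'Process Group', 'Kubernetes Pod']
```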
## Modeling Caller–Callee Relationships

One of the most critical aspects of the topology graph is capturing service interaction flows. Microservices communicate through APIs, forming caller–callee relationships:

```
Checkout Service
    │ CALLS
    ▼
Payment Service
    │ CALLS
    ▼
Payment Database
```

These relationships represent the actual runtime communication paths between services. By modeling them, the AI agent can identify:

- Downstream dependencies
- Cascading failures
- Shared services impacting multiple user journeys
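Finding everything a breached service depends on is a breadth-first traversal over CALLS edges. An illustrative sketch (stand-in for a Cypher path query against the Neo4j graph; the `downstream` helper is hypothetical):

```python
from collections import deque

# Caller-callee edges, mirroring the diagram above.
calls = [
    ("Checkout Service", "Payment Service"),
    ("Payment Service", "Payment Database"),
]

def downstream(service):
    """Breadth-first traversal of everything `service` depends on."""
    found, queue = [], deque([service])
    while queue:
        current = queue.popleft()
        for caller, callee in calls:
            if caller == current and callee not in found:
                found.append(callee)
                queue.append(callee)
    return found

print(downstream("Checkout Service"))
# ['Payment Service', 'Payment Database']
```

Reversing the edge direction in the same traversal yields upstream callers, which is how shared services impacting multiple user journeys surface.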
## Linking Observability Data to the Graph

Observability signals such as logs and errors are attached to graph nodes:

```
Payment Service
    │ HAS_ERROR
    ▼
Timeout Exception
```

Infrastructure events can also be attached:

```
EC2 Worker Node
    │ HAS_EVENT
    ▼
CPU Spike
```

This allows the agent to correlate:

- Infrastructure issues
- Application errors
- Service dependencies

within a single reasoning model.
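Because errors hang off services and events hang off hosts, the topology is what joins the two. A hedged sketch of that correlation (the direct `runs_on` mapping and `correlate` helper are hypothetical simplifications of the multi-hop chain shown earlier):

```python
# Illustrative correlation: errors attach to service nodes, events attach
# to host nodes; the topology links a service to its underlying host.
service_errors = {"Payment Service": ["Timeout Exception"]}
host_events = {"EC2 Worker Node": ["CPU Spike"]}
runs_on = {"Payment Service": "EC2 Worker Node"}  # hypothetical shortcut

def correlate(service):
    """Pair a service's errors with events on its underlying host."""
    host = runs_on.get(service)
    return {
        "service": service,
        "errors": service_errors.get(service, []),
        "host": host,
        "host_events": host_events.get(host, []),
    }

print(correlate("Payment Service")["host_events"])  # ['CPU Spike']
```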
## Learning from Historical Incidents

Each investigated incident is also stored in the graph:

```
Incident
 ├── impacted service
 ├── root cause
 ├── infrastructure correlation
 └── resolution
```

Over time, this builds a knowledge graph of operational incidents. The AI agent can then detect patterns such as:

- Recurring failures
- Common dependency issues
- Infrastructure patterns impacting multiple services
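A new breach can then be matched against stored incidents on the impacted service. An illustrative sketch using the fields from the incident node above (the record values and `similar_incidents` helper are hypothetical):

```python
# Illustrative incident store: each record carries the fields shown in
# the Incident node above.
past_incidents = [
    {
        "impacted_service": "Payment Service",
        "root_cause": "Database connection pool exhaustion",
        "infrastructure_correlation": "CPU spike on worker node",
        "resolution": "Increased connection pool size",
    },
]

def similar_incidents(service):
    """Return past incidents that impacted the same service."""
    return [i for i in past_incidents if i["impacted_service"] == service]

for incident in similar_incidents("Payment Service"):
    print(incident["root_cause"])
```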
## Architecture Overview

A simplified architecture for this approach looks like this:

```
SLO Breach Alert
        │
        ▼
Event Trigger (Monitoring / EventBridge)
        │
        ▼
Incident AI Agent
        ├── Service Topology Graph (Neo4j)
        ├── Observability Data (Logs / Traces)
        └── Historical Incident Knowledge
        │
        ▼
LLM Reasoning
        │
        ▼
Root Cause Hypothesis
```

AWS services that can support this architecture include:

- Amazon EventBridge
- Amazon Bedrock
- Amazon OpenSearch
- Amazon Neptune (as a managed graph alternative)

## Agent Workflow

When a new SLO breach occurs, the AI agent performs the following steps.

## Step 1 — Detect SLO Breach

Monitoring tools trigger an alert event.

## Step 2 — Identify Impacted Services

The agent queries the service topology graph.

## Step 3 — Traverse Dependencies

The graph traversal identifies:

- Upstream services
- Downstream dependencies
- Infrastructure nodes

## Step 4 — Retrieve Observability Signals

Logs and errors are retrieved from observability platforms.

## Step 5 — LLM Reasoning

Structured context is sent to the LLM:

```
SLO breach detected in Checkout Service

Impacted services:
- Checkout Service
- Payment Service
- Payment Database

Recent errors:
- Timeout errors in Payment Service

Historical incident:
- Database connection pool exhaustion
```

The LLM then generates a root cause hypothesis.

## Results from the Prototype

In the prototype implementation, manual investigation took 20–30 minutes, while AI-assisted investigation completed in under a minute. For a specific platinum user journey SLO, the agent achieved ~52% correlation accuracy between SLO breaches and underlying service problems. While not perfect, this significantly accelerates incident triage.

## Why Graph-Based Observability Matters

Traditional observability focuses on individual signals: logs, metrics, and traces. However, modern systems also require relationship awareness. Graph-based models enable:

- Dependency reasoning
- Cross-service correlation
- Historical incident learning

Combining graph knowledge with LLM reasoning enables a new class of systems: AI-assisted incident response agents.

## Future Directions

This concept can evolve further with:

- Autonomous remediation agents
- Continuous incident learning
- Multi-agent observability systems
- Integration with CI/CD pipelines

As distributed architectures continue to grow in complexity, topology-aware AI agents may become an essential part of SRE operations.

## Final Thoughts

AI-powered incident investigation is still in its early stages. Combining:

- Observability data
- Service topology graphs
- Kubernetes infrastructure knowledge
- Historical incident intelligence
- LLM reasoning

creates a powerful approach to automated root cause analysis. Topology-aware AI agents represent a promising direction for improving SRE productivity and incident response time in modern cloud-native systems.

If you're exploring AI for SRE, observability, or incident automation, I would love to hear your thoughts or experiences.