Kronveil v0.3: Multi-Cluster Federation, Custom Collector SDK, and Automated Runbooks

From Single Cluster to Multi-Cluster Production

A couple of weeks ago, I shipped Kronveil v0.2 — a fully running AI infrastructure agent with a dashboard, gRPC transport, secret management, and local Docker deployment. If you missed the original launch post, here's where it all started.

v0.2 worked well for a single cluster. But production environments don't run on one cluster. Teams have us-east-prod, eu-west-prod, maybe a staging cluster in ap-south. They have GitHub Actions pipelines they need visibility into. They have Azure VMs and GCP instances alongside Kubernetes workloads.

v0.3 addresses all of that. Here's what changed.

What's New in v0.3

1. Multi-Cluster Federation

This is the biggest feature in v0.3. The federation manager sits on top of multiple Kubernetes collectors and aggregates their telemetry into a single event stream. Each cluster's events are tagged with cluster_name and cluster_region metadata before being forwarded.

The aggregator deduplicates events using SHA256 fingerprinting — if overlapping collectors emit the same event within a 30-second window, it's counted once.

The federation manager implements engine.Collector, so the rest of Kronveil — the intelligence pipeline, the API, the dashboard — doesn't need to know whether it's watching 1 cluster or 20. Aggregate metrics are computed automatically: total pods, total nodes, and total events across all clusters.

2. Custom Collector SDK

Writing a Kronveil collector used to mean implementing the full engine.Collector interface — managing goroutines, channels, health reporting, and lifecycle. Now you implement three methods, and the SDK's Builder handles the rest: the polling loop, a buffered event channel with backpressure, health reporting, and clean shutdown.

One interesting bug I caught during CI: the original Stop() held a mutex while waiting on a WaitGroup. The polling goroutine needed the same mutex to record errors. Classic deadlock — the goroutine couldn't finish because it couldn't acquire the lock, and Stop() couldn't return because it was waiting for the goroutine.
Fixed by releasing the lock before wg.Wait().

3. Automated Runbook Engine

When Kronveil detects an incident, it can now execute a predefined playbook instead of just alerting. The runbook engine ships with 4 default runbooks.

In v0.3, all action handlers run in dry-run mode — they log what they would do without executing. This lets you validate runbook logic before enabling live remediation. Live execution is on the v0.4 roadmap. You can also register custom runbooks.

4. Real Cloud Provider Integrations

v0.2 had stub implementations for cloud providers. v0.3 wires up real SDKs.

GitHub Actions (CI/CD Collector)

The CI/CD collector now polls the GitHub REST API for workflow runs across configured repositories. It tracks run status changes, emits events for new runs and state transitions, and maps GitHub conclusions to severity levels.

Kafka Throughput

The Kafka collector now dials brokers directly, reads partition offsets, and computes real messages/second throughput per topic. No more mock data.

5. WebSocket Real-Time Streaming

The dashboard no longer polls the REST API for updates. Events flow over WebSocket.

Backend

The Go server manages a WebSocket hub with a broadcaster that pushes engine status to all connected clients every 2 seconds.

Frontend

Two new React hooks power the live experience. The Overview page shows a green pulsing Live indicator when the WebSocket is connected. When disconnected, it falls back to mock data gracefully.

6. Runbooks Dashboard Page

There's a new /runbooks route in the dashboard, with summary cards at the top and one card per runbook. Same dark theme as the rest of the dashboard, built with the same Tailwind patterns.

7. OpenTelemetry & Observability

Kronveil exports traces and metrics via OpenTelemetry, fitting into your existing observability stack. Point your Jaeger, Tempo, or Datadog backend at these endpoints, or configure Prometheus to scrape Kronveil's metrics.

Local Deployment

Everything runs with Docker Compose — same as v0.2.

Production Deployment

Kronveil ships with a Helm chart for Kubernetes deployment, tested on AWS EKS with Rancher. The chart includes the deployment, RBAC (ClusterRole + Role), a ServiceAccount with an IRSA annotation, a NetworkPolicy, and Prometheus scrape annotations.
A full production deployment guide covering ECR, MSK, IRSA, ALB Ingress, and TLS is available in the repository.

CI Pipeline

All 7 CI jobs are passing.

If you've been following along, star the repo and try it out. PRs, issues, and feedback are always welcome.


Appendix: the configs, code, and diagrams for the sections above.

How It Works

```
┌─────────────────────────────────────────────────┐
│               Federation Manager                │
│           implements engine.Collector           │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────┐   ┌──────────────┐            │
│  │ us-east-prod │   │ eu-west-prod │  ...       │
│  │ K8s Collector│   │ K8s Collector│            │
│  └──────┬───────┘   └──────┬───────┘            │
│         │                  │                    │
│         └──────┬───────────┘                    │
│                ▼                                │
│  ┌──────────────────────────┐                   │
│  │        Aggregator        │                   │
│  │ SHA256 dedup (30s window)│                   │
│  │  Cross-cluster metrics   │                   │
│  └──────────────────────────┘                   │
│                                                 │
└─────────────────────────────────────────────────┘
```

Configuration

```yaml
collectors:
  kubernetes:
    clusters:
      - name: us-east-prod
        kubeconfig_path: ~/.kube/us-east
        context: prod-context
        namespaces: ["default", "payments", "auth"]
        poll_interval: 15s
      - name: eu-west-prod
        kubeconfig_path: ~/.kube/eu-west
        context: prod-context
        namespaces: ["default", "payments"]
        poll_interval: 15s
```

The SDK's Plugin interface — the three methods you implement:

```go
type Plugin interface {
    Name() string
    Collect(ctx context.Context) ([]*collector.Event, error)
    Healthcheck(ctx context.Context) error
}
```

Building a collector from a plugin:

```go
col := collector.NewBuilder(&myPlugin{}).
    WithPollInterval(10 * time.Second).
    WithBufferSize(128).
    WithLogger(slog.Default()).
    Build()

// col implements engine.Collector — register it like any built-in collector
registry.RegisterCollector(col)
```
Full Example: HTTP Health Checker

```go
type HTTPChecker struct {
    url string
}

func (h *HTTPChecker) Name() string { return "http-checker" }

func (h *HTTPChecker) Collect(ctx context.Context) ([]*collector.Event, error) {
    start := time.Now()
    // honor the collector's ctx so in-flight checks cancel on shutdown
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, h.url, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return []*collector.Event{{
            Type:     "http_check_failed",
            Severity: "high",
            Payload:  map[string]interface{}{"url": h.url, "error": err.Error()},
        }}, nil
    }
    defer resp.Body.Close()
    return []*collector.Event{{
        Type: "http_check",
        Payload: map[string]interface{}{
            "url":        h.url,
            "status":     resp.StatusCode,
            "latency_ms": time.Since(start).Milliseconds(),
        },
    }}, nil
}

func (h *HTTPChecker) Healthcheck(ctx context.Context) error { return nil }
```

How Execution Works

```
Incident Detected
      │
      ▼
FindRunbooks(incidentType)
      │
      ▼
For each matching runbook:
  ├── autoExecute=true  → Execute immediately
  └── autoExecute=false → Queue for approval
      │
      ▼
Execute each Step sequentially:
  ├── kubectl_scale  → Scale deployment replicas
  ├── restart_pod    → Delete pod for controller restart
  ├── notify_oncall  → Slack/PagerDuty notification
  ├── run_diagnostic → Execute diagnostic command
  └── custom_script  → Run remediation script
      │
      ▼
Record ExecutionResult (timing, step results, success/failure)
```
Registering a custom runbook:

```go
executor.RegisterRunbook(runbook.Runbook{
    ID:            "custom-db-failover",
    Name:          "Database Failover",
    IncidentTypes: []string{"DatabaseDown", "ReplicationLag"},
    AutoExecute:   false,
    Steps: []runbook.Step{
        {Name: "Check replication", Action: "run_diagnostic", Config: map[string]string{"command": "pg_stat_replication"}},
        {Name: "Promote standby", Action: "custom_script", Config: map[string]string{"script": "/opt/scripts/promote-standby.sh"}},
        {Name: "Notify DBA team", Action: "notify_oncall", Config: map[string]string{"channel": "#dba-oncall"}},
    },
})
```

CI/CD collector configuration:

```yaml
collectors:
  cicd:
    github_token: "ghp_..."
    repo_filters:
      - "your-org/your-repo"
      - "your-org/another-repo"
    poll_interval: 60s
```

WebSocket hub lifecycle:

```
Client connects    → wsHub.add(conn)
                        │
Broadcaster (2s) ───────┼──→ JSON to all clients
                        │
Client disconnects → wsHub.remove(conn)
```

Full Architecture (v0.3)

```
┌───────────────────────────────────────────────────────┐
│                  Dashboard (React)                    │
│         WebSocket <──  REST API (Go)  ──>             │
├───────────────────────────────────────────────────────┤
│                     Engine Core                       │
│  ┌────────────┐  ┌──────────┐  ┌────────────────┐     │
│  │ Federation │  │ Runbook  │  │ AI Intelligence│     │
│  │  Manager   │  │ Executor │  │ (AWS Bedrock)  │     │
│  └────────────┘  └──────────┘  └────────────────┘     │
├───────────────────────────────────────────────────────┤
│                     Collectors                        │
│ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌───┐ ┌────────┐     │
│ │ K8s │ │Kafka│ │CI/CD │ │Azure│ │GCP│ │Custom  │     │
│ │     │ │     │ │GitHub│ │     │ │   │ │ (SDK)  │     │
│ └─────┘ └─────┘ └──────┘ └─────┘ └───┘ └────────┘     │
├───────────────────────────────────────────────────────┤
│               Integrations & Export                   │
│   Slack · PagerDuty · Vault · Prometheus · OTel       │
└───────────────────────────────────────────────────────┘
```
Local deployment:

```shell
git clone https://github.com/kronveil/kronveil.git
cd kronveil
docker compose up --build
```

Verify it's running:

```shell
curl http://localhost:8080/api/v1/health | jq .
```
```json
{
  "status": "healthy",
  "components": [
    {"name": "kubernetes", "status": "healthy"},
    {"name": "kafka", "status": "healthy"},
    {"name": "cicd-collector", "status": "healthy"},
    {"name": "cloud-aws", "status": "healthy"}
  ],
  "uptime": "2m30s"
}
```

Production deployment to EKS:

```shell
# Build and push to ECR
docker build -f deploy/Dockerfile.agent -t <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3 .
docker push <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent:v0.3

# Deploy with Helm
helm install kronveil helm/kronveil/ \
  -n kronveil --create-namespace \
  -f values-prod.yaml
```
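The values-prod.yaml file isn't reproduced in this post. A hypothetical minimal example, matching the chart features described above — every key name here is an assumption about the chart's schema, so check the chart's own values.yaml:

```yaml
# Hypothetical values-prod.yaml — key names are assumptions, not the chart's actual schema
image:
  repository: <account>.dkr.ecr.<region>.amazonaws.com/kronveil/agent
  tag: v0.3

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<account>:role/kronveil-agent  # IRSA

networkPolicy:
  enabled: true

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
```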
What the Adapter Handles For You

- Polling loop at your configured interval
- Immediate first collect (no waiting for the first tick)
- Buffered channel with drop + warn when full
- Health status combining Healthcheck() result with recent collect errors
- Thread-safe start/stop lifecycle

Azure collector:

- Auth: azidentity.DefaultAzureCredential — supports managed identity, CLI, environment variables
- Metrics: Azure Monitor azquery.MetricsClient queries CPU, memory, disk, and network
- Resources: ARM armresources.Client with full pagination support
- Config: set AZURE_SUBSCRIPTION_ID and standard Azure credentials

GCP collector:

- Auth: Application Default Credentials (ADC)
- Metrics: Cloud Monitoring ListTimeSeries with 5-minute lookback window
- Resources: Cloud Asset SearchAllResources for inventory
- Config: set GCP_PROJECT_ID or GOOGLE_CLOUD_PROJECT

Frontend hooks:

- useWebSocket — generic hook with auto-reconnect and exponential backoff (1s to 30s)
- useEventStream — wraps WebSocket for the events endpoint, maintains a 100-event rolling buffer, provides memoized filtered views for incidents, anomalies, and all events

Runbooks page summary cards:

- Total runbooks
- Auto-execute count
- Executions in last 24 hours
- Average success rate

Each runbook card shows:

- Name and description
- Auto/manual execution badge (green dot for auto, gray for manual)
- Incident type tags
- Step count, last run time, success rate
- Recent run indicators — green and red dots for the last 3 executions

What's Next (v0.4)

- Live runbook execution — move from dry-run to real kubectl and script execution with approval gates
- Collector marketplace — share and install community-built collectors via the SDK
- Cross-cluster incident correlation — AI-powered correlation across federated clusters
- Dashboard runbook triggers — execute runbooks directly from the UI
- Grafana plugin — embed Kronveil panels in existing Grafana dashboards

Links:

- GitHub: github.com/kronveil/kronveil
- v0.1 post: I Built an AI-Powered Infrastructure Observability Agent from Scratch
- v0.2 post: Kronveil v0.2: Dashboard, gRPC, Secret Management, and Local Deployment