Tools: AI Agents Mapped My Legacy Production Environment in One Hour. It Cost $0. - Full Analysis

I inherited a black box. Three VMs. A hundred-something microservices. Redis, ClickHouse, MySQL, some homegrown database nobody could name. Kafka and ZooKeeper thrown in because of course they were.

Nobody knew how the services connected. The original team was gone. The architecture lived entirely in oral tradition, and the last person who could recite it had left six months ago.

This is not a metaphor. This is Tuesday for anyone who's done SRE work long enough.

Setup: 30 seconds, zero footprint

I already had Teleport for daily ops. SSH access, session recording. It worked, and I didn't want to break it.

- Installed knoxd on my Teleport proxy (not on the servers)
- AI agent team auto-configured a Teleport connector

That's it. Nothing new on my production machines. The agents ride the Teleport session I already had, with the permissions I'd already defined.

Non-invasive — not in the "we promise it's lightweight" sense. In the "there is literally nothing new running on your production machines" sense.

How it actually works

The agents SSH in through Teleport. Plain SSH commands, the same ones you'd type yourself.

What makes this safe rather than terrifying:

The sandbox: strict AST parsing + default-deny whitelist. The agents can look at everything but touch nothing without asking.

What the agents discovered

Step 1: OS inventory — kernel, distro, packages. All 3 VMs in parallel.

Step 2: Process mapping — ps aux, parsed. Hundreds of processes tagged with binary path, resource footprint, and parent-child relationships.

Step 3: Process → Service resolution

- Check name service first
- If unregistered (most weren't — legacy system), infer from install path
- Flag for human confirmation before writing anything back

The AI doesn't hallucinate service names into your architecture map. It asks.

Step 4: Service → Business Island grouping

A business island = logical grouping by business function (billing, user auth, order processing). The thing that exists in every architect's head but never in any document.

Step 5: Connection mapping — four evidence sources, cross-referenced:

Cross-reference. Resolve conflicts. Draw edges.

What I got

Architecture diagrams — topology maps of each business island, services as nodes, dependencies as edges, data flows labeled. The kind of diagram you'd pay a consultant a week to produce.

- Single points of failure
- Circular dependencies
- Kafka topics with no visible consumer group
- One Redis instance holding session state for 6 business islands, zero isolation

Things I needed to know. Things dashboards would never show me.

The cost

Knox gives free credits on signup. Enough for a small cluster for a long time. No credit card. No trial-that-converts-to-paid. One binary on a jump host.

Why this matters

Most AIOps tools treat metrics as the final answer. They're not. They're the starting point.

Real outages hide in blind spots:

- System logs nobody tails
- Manual changes nobody tracked
- Config drift APM tools don't see

To find the root cause, you have to log into machines and build an evidence chain. That's what humans do. That's what these agents do.

Monitoring tells you a metric crossed a threshold. It doesn't tell you:

- Services X and Y form a circular dependency that will cascade
- Your session store is a single point of failure for half the platform

Those aren't metric problems. They're structure problems. LLMs are uniquely good at structure — if you give them a way to see it without breaking anything.

Safety model

Letting AI touch production should sound terrifying. That's why:

- AST-parsed command validation — not string matching, actual syntax tree analysis
- Default-deny whitelist — everything blocked unless explicitly allowed
- Human-in-the-loop — any destructive action requires a plan + approval
- Connector model — agents use paths you already trust (Teleport, SSH, AWS, Prometheus)

The agents never need their own access path. They never open a new hole in your security posture. That's the difference between an agent you'd let near production and one you wouldn't.

What I'm building

It's called KnoxOps. Core idea: infrastructure is an object graph, not a flat list of resources. Model it that way and LLMs can reason like a senior SRE — tracing dependencies, calculating blast radius, finding what dashboards miss.

The goal: delegate routine SRE toil so developers can focus on building.

More connectors coming. The principle stays the same: use the access paths you already trust.

If you've inherited a system nobody understands — I'd like to hear from you.

I'm the founder of KnoxOps. Currently in open beta — use code DEVTO26 for 10,000 free credits on signup.
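The post describes the agent sandbox only as "strict AST parsing + default-deny whitelist". Python's standard library has no parser for shell grammar. This minimal sketch therefore uses `shlex` tokenization with operator detection as a stand-in. That is a tokenizer, not a real syntax tree, but it is enough to illustrate the default-deny idea. A command passes only if every pipeline stage starts with an allowlisted program and contains no separators, redirections, or expansions. `ALLOWED_COMMANDS`, `shell_tokens`, and `is_allowed` are hypothetical names, not KnoxOps APIs.

```python
import shlex

# Hypothetical read-only allowlist: anything not listed is denied by default.
ALLOWED_COMMANDS = {"ps", "cat", "ls", "grep", "head", "uname", "df"}

# The characters shlex treats as shell operators when punctuation_chars=True.
OPERATOR_CHARS = set("();<>|&")


def shell_tokens(command: str) -> list[str]:
    """Split a command into shell-like tokens; unquoted operators become their own tokens."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.commenters = ""  # don't silently drop everything after '#'
    return list(lexer)


def is_allowed(command: str) -> bool:
    """Default-deny check: True only if every pipeline stage is an allowlisted program."""
    # shlex treats newlines as whitespace, but the shell treats them as
    # command separators, so reject them before tokenizing.
    if "\n" in command or "\r" in command:
        return False
    try:
        tokens = shell_tokens(command)
    except ValueError:  # unbalanced quotes or a trailing backslash
        return False

    stages: list[list[str]] = [[]]
    for tok in tokens:
        if tok == "|":
            stages.append([])  # a plain pipe starts the next stage
        elif tok and set(tok) <= OPERATOR_CHARS:
            return False  # ;  &&  ||  >  <  (  )  & ... are all denied
        elif "$" in tok or "`" in tok:
            return False  # variable expansion or command substitution
        else:
            stages[-1].append(tok)

    # Every stage must be non-empty and start with an allowlisted program.
    return all(stage and stage[0] in ALLOWED_COMMANDS for stage in stages)
```

A substring check for `;` would reject `ps aux | grep "a;b"`. The tokenizer accepts it because the semicolon sits inside a quoted argument. The check still errs toward denial: a quoted `";"` on its own is rejected, and so is anything containing `$`. A real deployment would also need per-command flag rules, because some otherwise read-only tools have write or kill options.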
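Step 2 mentions `ps aux`, but `ps aux` prints no parent-PID column. Its columns are USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND, so parent-child relationships need a different output format. This sketch assumes `ps -eo pid,ppid,rss,pcpu,args` output and builds a parent → children map from it. The sample text is made up for illustration, and `Proc`, `parse_ps`, and `children_of` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    ppid: int
    rss_kb: int    # resident memory in KiB, as reported by ps
    cpu_pct: float
    binary: str    # first word of the command line
    args: str


def parse_ps(output: str) -> dict[int, Proc]:
    """Parse `ps -eo pid,ppid,rss,pcpu,args` output into Proc records keyed by PID."""
    procs = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, rss, cpu, args = line.split(None, 4)
        procs[int(pid)] = Proc(int(pid), int(ppid), int(rss), float(cpu),
                               args.split()[0], args)
    return procs


def children_of(procs: dict[int, Proc]) -> dict[int, list[int]]:
    """Build the parent -> children map from each process's PPID."""
    tree = defaultdict(list)
    for proc in procs.values():
        tree[proc.ppid].append(proc.pid)
    return dict(tree)


# Illustrative sample output (not from the author's environment).
SAMPLE = """\
  PID  PPID   RSS %CPU COMMAND
    1     0 11800  0.0 /sbin/init
  812     1 52000  1.2 /usr/sbin/mysqld --datadir=/var/lib/mysql
  900     1 20480  0.3 /opt/billing-api/bin/billing-api --port 8080
  901   900  8192  0.1 /opt/billing-api/bin/billing-worker
"""

procs = parse_ps(SAMPLE)
tree = children_of(procs)
print(procs[812].binary)  # /usr/sbin/mysqld
print(tree[1])            # [812, 900]
```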
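Step 3's resolution order maps onto a small lookup function. The name service is checked first, the install path is the fallback, and anything guessed gets flagged for a human. In this sketch the "name service" is a plain dict keyed by binary path. The install-path conventions (`/opt/<name>/`, `/srv/<name>/`) are assumptions; a real setup would query whatever registry exists. `Resolution` and `resolve_service` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical install-path conventions used as a fallback naming signal.
INSTALL_PATH_PATTERNS = [
    re.compile(r"^/opt/([^/]+)/"),
    re.compile(r"^/srv/([^/]+)/"),
]


@dataclass
class Resolution:
    service: Optional[str]
    source: str               # "registry", "install_path", or "unknown"
    needs_confirmation: bool  # True whenever the name is a guess


def resolve_service(binary: str, registry: dict[str, str]) -> Resolution:
    """Map a process binary to a service name without inventing one."""
    # 1. Name service first: an explicit registration is trusted as-is.
    if binary in registry:
        return Resolution(registry[binary], "registry", False)
    # 2. Unregistered: infer a candidate name from the install path.
    for pattern in INSTALL_PATH_PATTERNS:
        match = pattern.match(binary)
        if match:
            return Resolution(match.group(1), "install_path", True)
    # 3. No evidence at all: leave it unnamed and ask a human.
    return Resolution(None, "unknown", True)
```

The `needs_confirmation` flag is the "ask, don't hallucinate" step. Nothing with that flag set would be written back until a person approves it.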
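The post does not list Step 5's four evidence sources, and it doesn't say how conflicts are resolved. This sketch shows one simple way to cross-reference evidence, using two hypothetical sources (`live_sockets`, `config_files`). It counts how many independent sources report each edge, draws the edges that are corroborated, and sets aside the rest for review. The threshold `min_sources=2` is an assumption.

```python
from collections import defaultdict

Edge = tuple[str, str]  # (calling service, called service)


def merge_evidence(sources: dict[str, set[Edge]],
                   min_sources: int = 2) -> tuple[dict[Edge, set[str]], dict[Edge, set[str]]]:
    """Split edges into (corroborated, needs_review) by how many sources report them."""
    support: dict[Edge, set[str]] = defaultdict(set)
    for source_name, edges in sources.items():
        for edge in edges:
            support[edge].add(source_name)

    corroborated = {e: s for e, s in support.items() if len(s) >= min_sources}
    needs_review = {e: s for e, s in support.items() if len(s) < min_sources}
    return corroborated, needs_review


# Hypothetical evidence: each source reports the edges it can see.
sources = {
    "live_sockets": {("billing-api", "mysql"), ("billing-api", "redis")},
    "config_files": {("billing-api", "mysql"), ("auth", "redis")},
}
edges, review = merge_evidence(sources)
```

Keeping the set of supporting sources on each edge, rather than a bare count, lets a reviewer see why an edge was drawn or held back.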
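Circular dependencies, one of the findings listed above, can be detected with a standard depth-first search. The search marks each node "in progress" while it is on the current path, and reaching an in-progress node again means a cycle. This is a minimal sketch over a dependency map (service → services it calls); the service names are made up.

```python
from typing import Optional


def find_cycle(depends_on: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a path like [a, b, a], or None if acyclic."""
    IN_PROGRESS, DONE = 1, 2
    state: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in depends_on.get(node, []):
            if state.get(dep) == IN_PROGRESS:
                # dep is already on the current DFS path: that slice is a cycle
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in depends_on:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle({"svc-x": ["svc-y"], "svc-y": ["svc-x", "redis"], "redis": []}))
# ['svc-x', 'svc-y', 'svc-x']
```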
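The Redis finding (one instance serving 6 business islands with zero isolation) combines Step 4's island grouping with the dependency graph. A simple heuristic flags any component that services from several distinct islands depend on. It does not check replication or failover, so it surfaces candidates for review, not confirmed single points of failure. Names and data are hypothetical.

```python
from collections import defaultdict


def shared_dependencies(depends_on: dict[str, list[str]],
                        island_of: dict[str, str],
                        min_islands: int = 2) -> dict[str, set[str]]:
    """Return each dependency used by services from at least `min_islands` islands."""
    users: dict[str, set[str]] = defaultdict(set)
    for svc, deps in depends_on.items():
        for dep in deps:
            users[dep].add(island_of.get(svc, "unassigned"))
    return {dep: islands for dep, islands in users.items()
            if len(islands) >= min_islands}


depends_on = {
    "checkout": ["redis-sessions", "mysql-orders"],
    "order-worker": ["mysql-orders"],
    "login": ["redis-sessions"],
    "invoicer": ["redis-sessions", "mysql-billing"],
}
island_of = {
    "checkout": "order processing",
    "order-worker": "order processing",
    "login": "user auth",
    "invoicer": "billing",
}
shared = shared_dependencies(depends_on, island_of)
```

Here `mysql-orders` is not flagged because both of its callers belong to the same island, while `redis-sessions` is flagged for crossing three.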
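Modeling infrastructure as an object graph turns "calculating blast radius" into a graph query. Invert the dependency edges, then walk them breadth-first from the failed component to find everything that directly or transitively depends on it. This is a minimal sketch with hypothetical names, not the KnoxOps data model.

```python
from collections import defaultdict, deque


def blast_radius(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each component, which services call it?
    dependents: dict[str, list[str]] = defaultdict(list)
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(svc)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in dependents[node]:
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    affected.discard(failed)  # in a cycle the failed node can reach itself
    return affected


graph = {
    "web": ["auth", "billing"],
    "auth": ["redis"],
    "billing": ["redis", "mysql"],
    "reports": ["mysql"],
}
print(sorted(blast_radius(graph, "redis")))  # ['auth', 'billing', 'web']
```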