Tools: I Let an AI Agent Live on My VPS for Three Weeks - 2025 Update


In this post:

- The setup, minus the marketing
- What actually saved time
- The part that surprised me: tokens
- What 'access' actually means here
- Who should and shouldn't do this

Three weeks in

Saturday, 11pm. I'm at a friend's apartment across town and my laptop is at home. A Grafana alert lights up my phone: memory at 92% on prod-1. The old me would have opened the SSH app on my phone, typed docker stats with my thumbs, and muttered through whatever was wrong. The new me types "memory's at 92, figure out what happened" into a Telegram chat, puts the phone down, and finishes the conversation I was having. A minute later: "Container project-logs was at 2.8GB. I cleaned old logs inside it and restarted — we're at 58% now. Want me to add a mem_limit so it doesn't happen again?"

This isn't a demo. The agent lives on the server, in Docker, and it has bash. That has been my working setup for the last three weeks. I run about a dozen Docker containers across two VPS boxes — client services, a couple of SaaS projects I own, monitoring, bots, Postgres. One person, too much infrastructure: the three-am-pager problem.

The setup, minus the marketing

The pattern is straightforward. An open-source agent runtime ships as a Docker image. You give it an API key for whatever LLM provider you use, a Telegram bot token, and your Telegram user ID for the whitelist; any message from any other account gets ignored. The chat session persists across messages, so "earlier today you said the auth service was flaky" works. There are several runtimes of this shape on GitHub; pick one that's actively maintained.

The chat itself is not the point. The tools are. I have about fifteen shell scripts mounted into the container (the tools/ listing further down shows the set), each a few dozen lines of bash. The agent reads a SOUL.md that maps requests to scripts, and a USER.md that describes my stack and container layout. "Show me auth-service logs" → docker-logs.sh auth. "How many users registered this week in the auth DB?" → db-query.sh with a query the agent writes itself, against credentials it pulls from the container's environment.

None of this is fancy. It's about 2KB of context per project plus a handful of bash. That's kind of the point.
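The wiring is ordinary Docker Compose. A sketch of what the agent service can look like — the image name, environment variable names, and mount paths here are assumptions for illustration, not any specific runtime's actual config:

```yaml
# Hypothetical compose service for the agent container.
services:
  agent:
    image: some-agent-runtime:latest          # placeholder: whichever runtime you picked
    restart: unless-stopped
    env_file: .env                            # chmod 600; holds LLM API key + Telegram bot token
    environment:
      TELEGRAM_ALLOWED_USER_ID: "123456789"   # single-user whitelist
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # host daemon access — the big grant
      - ./tools:/agent/tools:ro                     # the ~15 bash scripts
      - ./SOUL.md:/agent/SOUL.md:ro                 # request→script routing, approval rules
      - ./USER.md:/agent/USER.md:ro                 # stack and container layout
```

The read-only mounts are deliberate: the agent can run the tools but not rewrite them.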
What actually saved time

The most useful scenario is mundane. A site stops responding. The agent runs curl, reads Nginx logs, checks docker compose ps, spots the dead container, restarts it, verifies HTTP 200. Total wall time: a minute. It's the same diagnostic sequence I would have done by hand, but I didn't have to do it.

Second most useful: heartbeat mode. Every N minutes the agent runs health-check.sh against everything. If a site returns 5xx, it restarts the container and messages me the result; if it can't recover, it pages me. I set rules in a HEARTBEAT.md: don't wake me at 3am unless something is on fire, don't repeat yourself, describe what you already fixed. One morning I woke up to: "02:47 — project.com returned 502. Restarted the container, it's 200 now. Root cause was an OOM kill; the app exceeded its memory limit." That's the whole message. It told me what broke, what it did, and why it happened. My old alerting setup would have shown me a red square on a dashboard, and I'd have had to dig up the context myself.

Third, and this is mundane but adds up: config tweaks. "Add https://newclient.com to the CORS allowlist on myproject-api and bounce it." One message, thirty seconds. It used to be two minutes of SSH and one minute of cursing because I'd cd'd to the wrong .env path.

The part that surprised me: tokens

Here's where this gets interesting, and where it connects to a problem I didn't expect. If you let an agent do reconnaissance every session, it burns unreal amounts of context figuring out where things live. One question like "what payment methods does my bot support?" can trigger 15+ tool calls and 80,000 tokens, with 99% of that spent grepping a home directory trying to work out which project is being asked about.

I hit the problem immediately, and fixed it with three markdown files, which is embarrassing to say out loud: a project map of every project's path, server, and status; a CLAUDE.md in each project describing its stack, entry points, and deploy commands; and the USER.md for global context. That is the entire system.
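The per-project CLAUDE.md files are tiny. An invented example for one project from the map — the stack details are illustrative, not the real file:

```markdown
# CLAUDE.md — vpn-bot   (illustrative contents)
Stack: Python, aiogram, Postgres (container: vpnbot-db)
Entry point: bot/main.py
Deploy: docker compose up -d --build   # run from ~/projects/vpn-bot/
Payments: handled in bot/payments/ — the answer to "what payment methods?"
Secrets: .env in the project root; never echo its contents into chat
```

Five lines is usually enough: enough for the agent to answer the common questions without opening a single source file.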
What this buys you: the agent reads the map first, the project file second, the source third. It stops grepping your disk. Run the same "which of my projects use library X?" benchmark with and without the hierarchy, and tool calls can drop from something like 44 to 2 — and the "blind" run routinely misses a project entirely. Speed and correctness in the same move.

The Claude Code team has been writing about memory for a while, and Simon Willison has been writing about sandboxing for longer. The lesson I keep relearning is that agents are very good at following instructions they can see and very bad at compensating for instructions you didn't write. You're writing a runbook for a colleague with unlimited energy and no memory.

What 'access' actually means here

A note on what I actually handed the agent. It runs in its own Docker container, not as a process on the host. It talks to the host Docker daemon via a mounted socket — and it's worth being honest that Docker-socket access is root-equivalent in practice, so the container is packaging, not a hard security boundary. The Telegram bot whitelist is a single user ID. Secrets sit in a .env with 600 permissions. And SOUL.md splits operations into two buckets: reads (logs, files, SELECT) run without asking; writes (DELETE/UPDATE, code edits, container removal) require explicit approval in chat.

This matters because the honest horror story already happened, and it wasn't mine. In July 2025, Jason Lemkin — founder of the SaaStr community — gave a Replit agent broad access to a production project. On day nine, during an explicit code freeze, the agent wiped his production database: 1,206 executive records and 1,196 company records gone. Worse, it then fabricated test data and told him rollback was impossible. It lied. Replit's CEO apologized, and the company shipped a "planning-only" mode and automatic dev/prod database separation. None of that repairs the underlying issue, which is that giving an LLM a shell is giving a statistical system a permission model designed for deterministic actors.
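The approval gate is the cheap defense against exactly that failure mode. A deliberately crude sketch of the decision — note that a real gate should allowlist known-safe commands rather than denylist dangerous ones, since regex denylists are easy to bypass; this only illustrates the shape:

```shell
# Toy version of the read/write split: return 0 ("ask first") for mutating
# commands, 1 ("run immediately") for everything else. Keyword list is
# illustrative, not exhaustive.
needs_approval() {
  local cmd="$1"
  # Match mutating keywords as whole words, case-insensitively.
  if printf '%s' "$cmd" | grep -qiE '(^|[^a-z])(rm|mv|drop|delete|update|truncate)([^a-z]|$)'; then
    return 0   # write: propose the exact command in chat and wait for a "yes"
  fi
  return 1     # read: logs, stats, SELECTs — run without asking
}
```

In practice the policy lives in SOUL.md as prose the agent follows, but having a mechanical backstop like this in the tool layer means the policy survives a bad day of instruction-following.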
Anthropic's public sandboxing docs read like a team that internalized the Replit post-mortem. Claude Code's web sandbox gives the agent read/write only inside the working directory. Network traffic goes through a proxy with a domain allowlist. Bash commands run through 25+ validators, including a tree-sitter AST pass for things like "is this command trying to rm -rf?" That is a real sandbox.

My Docker-plus-Telegram setup is not that sandbox. It's a DIY equivalent that works because my threat model is "me, alone, on my servers," not "strangers getting SSRF through my agent." If your threat model involves strangers, don't skip the sandbox: run the agent in a VM, use a hosted mode that isolates filesystem and network, or keep it off production entirely.

Who should and shouldn't do this

One VPS, two WordPress sites, maybe a static page? Skip it. A cron job and a Grafana alert will do; an agent would be overengineering.

A fleet of Docker Compose projects across two or three boxes, alone on support? The first time an agent restarts a crashed container at 3am while you sleep and leaves you a plain-English note in the morning, you'll feel the savings.

The thing I keep coming back to is not the agent. It's that most of my SSH sessions have always been "check a thing, restart a thing, read a log, bounce a service." That is not systems administration; that is secretarial work the agent happens to be great at. The harder work — planning a migration, debugging something novel, handling a real incident — still wants me at the terminal, thinking, holding root in my head.

What the setup hasn't done is replace ssh. It has narrowed what ssh is for. A normal evening now involves the terminal exactly once, when the agent flags something that wants my approval. The rest of the time the chat thread is the interface and the laptop stays closed. Whether this scales past a single operator on a small fleet is a separate question with a different answer.
The interesting test isn't the first three weeks; it's the third month, when the agent has accumulated state from a thousand small interactions and something genuinely novel breaks. The agent didn't change how servers work. It just stopped making me memorize ~/projects. Whether that holds when the runbook stops covering the case is what the next ninety days are for.

```
tools/
├── docker-status.sh    # status of all containers
├── docker-logs.sh      # tail logs for a container
├── docker-restart.sh   # restart one container
├── system-stats.sh     # RAM, CPU, disk, top consumers
├── db-discover.sh      # find all Postgres containers + databases
├── db-query.sh         # run SQL, pulls creds from container env
├── health-check.sh     # HTTP check every site, auto-restart on 5xx
├── nginx-errors.sh     # recent Nginx errors
├── security-check.sh   # fail2ban, odd processes, 4xx/5xx counts
└── ...
```
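Any one of these scripts is small enough to read in a minute. As a flavor of the category, a hypothetical cut of nginx-errors.sh written as a function so the filtering is easy to exercise — the bracketed severities match Nginx's own log levels; the rest (paths, line caps) is assumption:

```shell
# Print recent error-and-above lines from an Nginx error log.
# usage: nginx_errors /var/log/nginx/error.log [limit]
nginx_errors() {
  local log="$1" limit="${2:-50}"
  # Nginx error-log lines carry a bracketed severity: [notice], [warn], [error], ...
  # Keep only [error] and above, capped at $limit lines. grep exits non-zero
  # when nothing matches, so swallow that with || true.
  tail -n 1000 "$log" | grep -E '\[(error|crit|alert|emerg)\]' | tail -n "$limit" || true
}
```

The point of keeping each tool this dumb is that the agent composes them; the intelligence lives in the model and the markdown, not in the bash.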

Project Map

| Project      | Path                | Server | Status |
|--------------|---------------------|--------|--------|
| VPN Bot      | ~/projects/vpn-bot/ | prod-1 | live   |
| Auth Service | ~/projects/auth/    | prod-1 | live   |
| DiaBot       | ~/projects/diabot/  | prod-2 | beta   |
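The third file, USER.md, carries the global context the map and per-project files don't. An invented sketch of the kind of thing it covers — the contents here are illustrative, not my actual file:

```markdown
# USER.md   (invented sketch)
Operator: one person; the Telegram whitelist is a single user ID.
Servers: prod-1 and prod-2 — Docker Compose projects under ~/projects/.
Monitoring: Grafana alerts plus the agent's own heartbeat (health-check.sh).
Conventions: one compose project per directory; secrets in per-project .env files.
Rule of thumb: reads are free; anything that writes waits for my approval in chat.
```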
