Tools: Agents Don't Fail at AI — They Fail at DevOps
When people imagine AI agents failing, they picture the wrong things. They imagine hallucinations, confused responses, bad reasoning. Those happen, but they are not where production systems actually break.

Production agents fail at DevOps. Orphaned processes nobody is watching. Context windows that hit the ceiling mid-task. Auth tokens that expire silently and cause agents to fail for hours before anyone notices. Logs that fill disks. Services that restart into broken state and stay there.

I have been running 23 agents in production across five businesses for six months. The model has almost never been the problem. The ops layer has been the problem, repeatedly, in ways that were entirely preventable. Here is what broke and how I fixed it.

## Failure 1: Orphaned Processes

The first major incident was an agent that got stuck in a loop. The session appeared active and was consuming API credits, but it was not producing useful output. Nobody noticed for almost four hours because there was no alerting.

When I finally investigated, the process was alive but the session state was corrupted. Killing the process required manually tracking down the PID, because the service had spawned a child process that the parent could not clean up on its own.

The fix was two things:

**KillMode=control-group in systemd.** `KillMode=control-group` tells systemd to kill the entire cgroup when a service stops: not just the main process, but every child it spawned. It is systemd's default, but setting it explicitly guards against a unit file or drop-in that overrides it, for example with `KillMode=process`. Before this, stopping a service would leave orphaned child processes running. After this, stop means stop. The unit settings are listed at the end of this post.

**Session health monitoring.** I added a simple heartbeat check. Each agent is expected to respond to a poll message every 30 minutes during active hours. If it does not respond, the monitor sends an alert. This turned a 4-hour incident into a 35-minute one.
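The heartbeat check described above could be sketched roughly like this. It is a minimal illustration, not the monitor actually running in production: the names `HeartbeatMonitor`, `record_response`, and `stale_agents` are hypothetical, the 30-minute window comes from the text, and both the active-hours gating and the alert delivery are left to the caller.

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Tracks the last poll response per agent and reports agents that went quiet.

    Hypothetical helper for illustration; the 30-minute default mirrors the
    polling interval mentioned in the text.
    """

    def __init__(self, max_silence: timedelta = timedelta(minutes=30)):
        self.max_silence = max_silence
        self.last_seen: dict[str, datetime] = {}

    def record_response(self, agent: str, when: datetime) -> None:
        # Called whenever an agent answers a poll message.
        self.last_seen[agent] = when

    def stale_agents(self, now: datetime) -> list[str]:
        # An agent is stale once it has been silent for longer than max_silence.
        return sorted(
            agent
            for agent, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )
```

A scheduler would call `stale_agents(datetime.now())` on each tick and send one alert per returned name. Keeping the time as a parameter makes the check easy to test without waiting half an hour.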
## Failure 2: Context Window Exhaustion

This one is sneaky because it does not look like a failure at first. The agent keeps working and keeps responding, but it is operating on a truncated view of its own context. Recent memory files do not fit. Earlier instructions get dropped. The agent starts making decisions based on incomplete state.

I had an agent handling a multi-step legal document review hit the context ceiling on step 7 of 9. The last two steps were completed, but without the context from steps 3-6. The output looked fine. It was not fine.

**Pre-compaction memory flush.** Before context usage hits 90%, agents now write their working state to a checkpoint file, WORKSTATE.md (an example is listed at the end of this post). On fresh start after compaction, the agent reads WORKSTATE.md first, before anything else. Continuity restored.

**Chunked task execution.** Long multi-step tasks now get broken into phases, with a checkpoint written between phases. Phase 1 completes and writes output. Phase 2 starts fresh and reads the Phase 1 output. No single session tries to hold the entire context of a long job.

## Failure 3: Silent Auth Token Expiry

OAuth tokens expire. This is known. What is not obvious is how agents fail when they do. They do not fail loudly with a clear error. They fail softly: the tool call returns an auth error, the agent interprets it as a temporary issue, retries a few times, then either gives up silently or reports a vague failure. Meanwhile, every task requiring that auth is broken until someone notices and rotates the token.

I had an email management agent run for six hours thinking it was processing emails while actually hitting 401s on every IMAP call. Six hours of missed emails, zero alerts.

**Explicit auth validation at session start.** Every agent that uses external auth now runs a quick validation check at the start of each session (scripts/validate-auth.sh, listed at the end of this post). If auth fails at startup, the agent reports the failure immediately instead of silently degrading.

**Token rotation schedule.** OAuth tokens for Microsoft Graph now rotate every 45 days on a cron, before they hit the 90-day expiry.
The rotation script runs on the server and sends a Telegram confirmation when done. No more surprise expirations.

**Failure mode documentation.** Every agent config now has an explicit section on what to do when specific tools fail (see the Failure Modes listing at the end of this post).

## Failure 4: Log Disk Fill

Agents are chatty. journald captures everything. On a busy day with nine agents running, log volume can hit several gigabytes. I discovered this the hard way when the VPS ran out of disk space and three agents failed simultaneously with cryptic errors.

The fix is boring but important: log rotation capped at 2GB total, with 1-week retention (the journald.conf settings are listed at the end of this post). Agents still log verbosely, which is useful for debugging, but the disk never fills.

## The Pattern Across All These Failures

Every failure had the same root cause: the system did not know it was broken.

- Orphaned processes were running, but nobody was checking.
- Auth tokens were expiring, but agents were not validating.
- Context was filling, but there was no checkpoint.
- Logs were growing, but there was no cap.

The fix in every case was the same thing: explicit monitoring, explicit validation, explicit failure modes. Never assume a system is healthy just because it has not reported a problem. Make it report problems.

## The Ops Checklist

After six months of production incidents, the Agent Ops Checklist at the end of this post is what I check for on every new agent deployment.

None of this is AI. All of it is ops. And all of it matters more than prompt engineering for keeping a production agent fleet reliable.

The models are getting better fast. The ops fundamentals are not going to change. If you want agents that run reliably in production, invest in the boring stuff. The KillMode config that nobody talks about matters more than the latest model benchmark.
systemd unit settings (Failure 1):

```ini
[Service]
KillMode=control-group
KillSignal=SIGTERM
TimeoutStopSec=30
```
Example WORKSTATE.md checkpoint (Failure 2):

```markdown
# WORKSTATE.md - 2026-03-07 14:32 SAST

## Active Tasks
- Legal review contract-2847: steps 1-6 complete, step 7 pending
  - Key finding so far: clause 14 is non-standard, flag for review
  - Next: check termination provisions

## Critical Context
- Client is government entity: standard commercial terms do not apply
- Review scope: liability clauses and termination only (per brief)
```
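The pre-compaction flush from Failure 2 could be sketched roughly as follows. This assumes the agent runtime can report its current token usage and context limit, which the post does not specify; `should_checkpoint` and `write_workstate` are hypothetical names, and the 90% figure is the one quoted in the text (in practice a lower trigger may be safer so the write itself still fits).

```python
from pathlib import Path

# From the text: flush working state before usage hits 90% of the window.
CHECKPOINT_THRESHOLD_PCT = 90


def should_checkpoint(tokens_used: int, context_limit: int,
                      threshold_pct: int = CHECKPOINT_THRESHOLD_PCT) -> bool:
    """Return True once usage reaches threshold_pct percent of the context limit.

    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    return tokens_used * 100 >= threshold_pct * context_limit


def write_workstate(path: Path, active_tasks: list[str],
                    critical_context: list[str], timestamp: str) -> None:
    """Write a checkpoint file using the WORKSTATE.md section layout."""
    lines = [f"# WORKSTATE.md - {timestamp}", "", "## Active Tasks"]
    lines += [f"- {task}" for task in active_tasks]
    lines += ["", "## Critical Context"]
    lines += [f"- {item}" for item in critical_context]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

The same `write_workstate` call can serve as the between-phase checkpoint for chunked tasks: phase 1 writes it on completion, and phase 2 reads the file before doing anything else.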
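Auth validation script (Failure 3):

```bash
#!/bin/bash
# scripts/validate-auth.sh
# Exit non-zero at session start if the mail account is not authenticated.
# The account address is redacted in the source; substitute the real one.
mog auth list | grep -qF "[email protected]" || {
  echo "AUTH_FAILED: MOG token expired"
  exit 1
}
echo "AUTH_OK"
```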
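One way an agent launcher might gate a session on that check is sketched below. This is an assumption about how the script gets wired in, not the author's launcher: `run_auth_check` and `AuthValidationError` are hypothetical names, and the command to run (for example `["scripts/validate-auth.sh"]`) is supplied by the caller.

```python
import subprocess


class AuthValidationError(RuntimeError):
    """Raised when the session-start auth check exits non-zero."""


def run_auth_check(command: list[str], timeout_s: float = 30.0) -> str:
    """Run the validation command and return its stdout.

    Raises AuthValidationError carrying the command's output if it exits
    non-zero, so the session fails loudly before any task starts.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s
    )
    if result.returncode != 0:
        detail = (result.stdout or result.stderr).strip()
        raise AuthValidationError(detail or f"exit code {result.returncode}")
    return result.stdout.strip()
```

The launcher would catch `AuthValidationError`, send the alert, and refuse to start the session rather than letting the agent discover the expired token mid-task.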
Failure-mode section from an agent config (Failure 3):

```markdown
## Failure Modes
- Email auth fails: Send Telegram alert to owner, do not retry silently
- API rate limit hit: Wait 60 seconds, retry once, then alert
- File not found: Log and skip, do not halt entire task
```
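In the post these rules are instructions the agent reads, but the same policy can be enforced in a tool wrapper. The sketch below is illustrative only: `AuthError`, `RateLimitError`, and `call_with_policy` are hypothetical names, and the `alert` callback stands in for whatever sends the Telegram message.

```python
import logging
import time
from typing import Callable, Optional


class AuthError(Exception):
    """Hypothetical: the tool reported an authentication failure."""


class RateLimitError(Exception):
    """Hypothetical: the tool reported a rate limit."""


def call_with_policy(tool: Callable[[], str],
                     alert: Callable[[str], None],
                     sleep: Callable[[float], None] = time.sleep,
                     rate_limit_wait_s: float = 60.0) -> Optional[str]:
    """Run a tool call under the three documented failure modes.

    Returns the tool's result, or None when the call was skipped or abandoned.
    """
    try:
        return tool()
    except AuthError as exc:
        # Auth failures are never retried silently: alert immediately.
        alert(f"auth failed: {exc}")
        return None
    except RateLimitError:
        # Wait, retry exactly once, then alert if still limited.
        sleep(rate_limit_wait_s)
        try:
            return tool()
        except RateLimitError as exc:
            alert(f"rate limit persisted after one retry: {exc}")
            return None
    except FileNotFoundError as exc:
        # Log and skip so one missing file does not halt the whole task.
        logging.warning("skipping missing file: %s", exc)
        return None
```

Any other exception, including an auth failure raised during the single retry, propagates to the caller instead of being swallowed.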
journald retention settings (Failure 4):

```ini
# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=2G
SystemKeepFree=500M
MaxRetentionSec=1week
```
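The journald cap keeps logs from filling the disk, but in keeping with "make it report problems", a monitor could also watch free space directly. The post does not describe such a check; this sketch is an assumption, with hypothetical names `free_bytes` and `is_below_floor`, and a floor that simply mirrors the `SystemKeepFree=500M` value.

```python
import shutil

# Mirrors SystemKeepFree=500M; journald size suffixes are powers of 1024.
MIN_FREE_BYTES = 500 * 1024**2


def free_bytes(path: str = "/") -> int:
    """Return the free space, in bytes, on the filesystem containing path."""
    return shutil.disk_usage(path).free


def is_below_floor(free: int, floor: int = MIN_FREE_BYTES) -> bool:
    """True when free space has dropped under the configured floor."""
    return free < floor
```

A periodic job would call `is_below_floor(free_bytes("/var/log"))` and alert on True; the path here is only an example and depends on where the journal lives on a given host.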
Deployment checklist:

```markdown
## Agent Ops Checklist
- [ ] systemd service with KillMode=control-group
- [ ] Restart=on-failure with RestartSec=10
- [ ] Auth validation at session start
- [ ] Token rotation schedule documented
- [ ] Heartbeat/health monitoring configured
- [ ] Log retention capped
- [ ] WORKSTATE.md checkpoint protocol in AGENTS.md
- [ ] Failure modes documented per tool
- [ ] Alert path for silent failures
```