Tools: Building a crash-tolerant AI agent with launchd watchdog (2026)

2026-04-15 0 views admin

Building a Crash-Tolerant AI Agent with launchd Watchdog

We run 13 AI agents simultaneously. When one crashes at 2am, nobody is awake to restart it. Here is how we solved that.

The Problem

Our Atlas orchestrator spawns god agents: long-running Claude Code processes that execute multi-hour task waves. A memory spike, a network timeout, or a bad tool call can kill any of them. Without recovery, the whole Pantheon stalls.

What recovery had to cover:

- Auto-restart on crash (any exit, including non-zero exit codes)
- OOM protection before the kernel kills the process
- Dashboard recovery so the agent knows where it left off
- Zero human intervention overnight

The Solution: launchd + State Files

On macOS, launchd is the system process supervisor. It is more reliable than a shell-script loop, and the job does not depend on an open terminal session.

1. Create the plist

Save this to ~/Library/LaunchAgents/com.whoffagents.atlas.plist:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/zsh</string>
    <string>-c</string>
    <string>/Users/you/Desktop/Agents/Bootstrap/start-atlas.sh</string>
  </array>
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <integer>10</integer>
  <key>StandardOutPath</key>
  <string>/tmp/atlas.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/atlas-err.log</string>
</dict>
</plist>
```

- KeepAlive: true makes launchd restart the job on any exit.
- ThrottleInterval: 10 makes launchd wait at least 10 seconds between launches, which prevents tight restart loops.

2. Load it

```shell
launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas.plist
launchctl start com.whoffagents.atlas
```

3. Startup script handles recovery

The agent needs to know it was restarted. Use a state file.

```shell
#!/bin/zsh
# start-atlas.sh
STATE_FILE="$HOME/Desktop/Agents/Bootstrap/atlas-state.json"
CRASH_LOG="$HOME/Desktop/Agents/Bootstrap/crash-history.log"

# Record every (re)start with a UTC timestamp so there is a paper trail.
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) Atlas restarting" >> "$CRASH_LOG"

# If a checkpoint exists, report the wave to resume from.
if [[ -f "$STATE_FILE" ]]; then
  LAST_WAVE=$(jq -r '.last_wave // "unknown"' "$STATE_FILE")
  echo "Resuming from wave $LAST_WAVE"
fi

# Create the detached tmux session if it does not exist yet, then start the agent in it.
tmux new-session -d -s atlas 2>/dev/null || true
tmux send-keys -t atlas "claude --dangerously-skip-permissions" Enter
```

4. OOM prevention

```shell
#!/bin/zsh
# memory-guard.sh: run from cron every 2 minutes
MEM_LIMIT_MB=3500

# Find the Atlas process; exit quietly if it is not running.
ATLAS_PID=$(pgrep -f "claude.*atlas" | head -1)
if [[ -z "$ATLAS_PID" ]]; then exit 0; fi

# ps reports RSS in kilobytes; convert to megabytes.
MEM_MB=$(ps -o rss= -p "$ATLAS_PID" | awk '{print int($1/1024)}')

if (( MEM_MB > MEM_LIMIT_MB )); then
  echo "$(date) OOM guard: Atlas ${MEM_MB}MB, triggering restart" >> /tmp/atlas-oom.log
  kill -TERM "$ATLAS_PID"
  # launchd restarts automatically
fi
```

Wire it into cron:

```
*/2 * * * * /path/to/memory-guard.sh
```

State Persistence

At every wave completion, Atlas writes:

```json
{
  "last_wave": 19,
  "active_gods": ["apollo", "hermes", "peitho"],
  "last_checkpoint": "2026-04-14T13:45:00Z",
  "tasks_completed": 47
}
```

On restart, the orchestrator reads this and resumes from the last known good state instead of starting over.

Results

Built and tested today:

- 3 simulated crash tests: all recovered in under 15 seconds
- Memory guard fired once during a large codegen task
- Dashboard resumed correctly from wave 14 after forced kill

The Pattern

- launchd handles restart: do not write your own loop
- State files handle recovery: agents must be resumable
- Memory guard handles OOM: proactive beats reactive
- Crash log handles observability: you need a paper trail

The entire setup is ~100 lines of shell. No Docker, no Kubernetes, no external services.

We run the Pantheon, a 13-agent AI workforce, at whoffagents.com. Atlas orchestrates everything. This is how it stays up.
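The state file only helps recovery if a crash in the middle of a write cannot leave it half-written. Below is a minimal sketch of one way to guard against that. It assumes a helper named write_state, which is hypothetical and not part of the original setup. The helper writes the checkpoint to a temporary file in the same directory and then renames it into place. The active_gods field is left out to keep the example short.

```shell
# write_state: hypothetical helper (an assumption, not the author's code).
# Usage: write_state <state-file> <last-wave> <tasks-completed>
# Works in bash and zsh.
write_state() {
  local state_file=$1 wave=$2 tasks=$3
  local tmp

  # Create the temp file next to the target so the final rename stays on
  # one filesystem, where it replaces the old file in a single step.
  tmp=$(mktemp "${state_file}.XXXXXX") || return 1

  # Write the full checkpoint to the temp file first.
  if ! printf '{ "last_wave": %d, "last_checkpoint": "%s", "tasks_completed": %d }\n' \
      "$wave" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$tasks" > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi

  # Swap the new checkpoint in; readers see the old file or the new one.
  mv -f "$tmp" "$state_file"
}

# Example call (the wave and task numbers are illustrative):
#   write_state "$HOME/Desktop/Agents/Bootstrap/atlas-state.json" 20 51
```

With this shape, a kill that lands during the write leaves at most a stray temporary file behind, and the previous checkpoint stays readable by start-atlas.sh.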

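The crash log becomes more useful once something reads it. Below is a small sketch that assumes the log line format written by start-atlas.sh and a hypothetical helper named restarts_on. It counts restarts for a given UTC date so that a restart loop stands out.

```shell
# restarts_on: hypothetical helper (an assumption, not the author's code).
# Usage: restarts_on <crash-log> <YYYY-MM-DD>
# Prints how many "Atlas restarting" entries carry that UTC date.
restarts_on() {
  local crash_log=$1 day=$2

  # A missing log means no recorded restarts.
  [ -f "$crash_log" ] || { echo 0; return 0; }

  # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
  grep -c "^${day}T.* Atlas restarting" "$crash_log" || true
}

# Example call:
#   restarts_on "$HOME/Desktop/Agents/Bootstrap/crash-history.log" "$(date -u +%Y-%m-%d)"
```

A count far above the usual handful for one day suggests the agent is dying shortly after each launch, which a single restart timestamp would not reveal.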