Tools: Building a crash-tolerant AI agent with launchd watchdog (2026)

Building a Crash-Tolerant AI Agent with launchd Watchdog

We run 13 AI agents simultaneously. When one crashes at 2am, nobody is awake to restart it. Here is how we solved that.

The Problem

Our Atlas orchestrator spawns god agents: long-running Claude Code processes that execute multi-hour task waves. A memory spike, a network timeout, or a bad tool call can kill any of them. Without recovery, the whole Pantheon stalls.

The Solution: launchd + State Files

On macOS, launchd is the system process supervisor. It is more reliable than a shell script loop, and the job keeps running after the terminal that loaded it is closed. Note that a plist in ~/Library/LaunchAgents is a per-user agent tied to that user's login session; a job that must run with nobody logged in belongs in /Library/LaunchDaemons instead.

1. Create the plist

Save to ~/Library/LaunchAgents/com.whoffagents.atlas.plist.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whoffagents.atlas</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/zsh</string>
    <string>-c</string>
    <string>/Users/you/Desktop/Agents/Bootstrap/start-atlas.sh</string>
  </array>
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <integer>10</integer>
  <key>StandardOutPath</key>
  <string>/tmp/atlas.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/atlas-err.log</string>
</dict>
</plist>
```

- KeepAlive: true makes launchd restart the job on any exit.
- ThrottleInterval: 10 makes launchd wait at least 10 seconds between launches, which prevents tight restart loops.

2. Load it

```shell
launchctl load ~/Library/LaunchAgents/com.whoffagents.atlas.plist
launchctl start com.whoffagents.atlas
```

3. Startup script handles recovery

The agent needs to know it was restarted. Use a state file.

```zsh
#!/bin/zsh
# start-atlas.sh
STATE_FILE="$HOME/Desktop/Agents/Bootstrap/atlas-state.json"
CRASH_LOG="$HOME/Desktop/Agents/Bootstrap/crash-history.log"

# Record every (re)start so there is a paper trail
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) Atlas restarting" >> "$CRASH_LOG"

# Report the last checkpointed wave, if a state file exists
if [[ -f "$STATE_FILE" ]]; then
  LAST_WAVE=$(jq -r '.last_wave // "unknown"' "$STATE_FILE")
  echo "Resuming from wave $LAST_WAVE"
fi

# Create the tmux session if it does not exist yet, then launch the agent in it
tmux new-session -d -s atlas 2>/dev/null || true
tmux send-keys -t atlas "claude --dangerously-skip-permissions" Enter
```

4. OOM prevention

```zsh
#!/bin/zsh
# memory-guard.sh: run from cron every 2 minutes
MEM_LIMIT_MB=3500
ATLAS_PID=$(pgrep -f "claude.*atlas" | head -1)

# Nothing to do if the agent is not running
if [[ -z "$ATLAS_PID" ]]; then exit 0; fi

# ps reports RSS in kilobytes; convert to megabytes
MEM_MB=$(ps -o rss= -p "$ATLAS_PID" | awk '{print int($1/1024)}')
if (( MEM_MB > MEM_LIMIT_MB )); then
  echo "$(date) OOM guard: Atlas ${MEM_MB}MB, triggering restart" >> /tmp/atlas-oom.log
  kill -TERM "$ATLAS_PID"  # launchd restarts automatically
fi
```

Wire into cron: */2 * * * * /path/to/memory-guard.sh

State Persistence

At every wave completion, Atlas writes:

```json
{
  "last_wave": 19,
  "active_gods": ["apollo", "hermes", "peitho"],
  "last_checkpoint": "2026-04-14T13:45:00Z",
  "tasks_completed": 47
}
```

On restart, the orchestrator reads this and resumes from the last known good state instead of starting over.

Results

- Auto-restart on crash (exit code != 0)
- OOM protection before the kernel kills the process
- Dashboard recovery so the agent knows where it left off
- Zero human intervention overnight

Built and tested today:

- 3 simulated crash tests: all recovered in under 15 seconds
- Memory guard fired once during a large codegen task
- Dashboard resumed correctly from wave 14 after forced kill

The Pattern

- launchd handles restart: do not write your own loop
- State files handle recovery: agents must be resumable
- Memory guard handles OOM: proactive beats reactive
- Crash log handles observability: you need a paper trail

The entire setup is ~100 lines of shell. No Docker, no Kubernetes, no external services.

We run the Pantheon, a 13-agent AI workforce, at whoffagents.com. Atlas orchestrates everything. This is how it stays up.
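The article shows the contents of atlas-state.json but not how the file gets written. A crash in the middle of a write could leave a truncated file, which would defeat the recovery step. The sketch below shows one possible approach, not necessarily what Atlas does: write the checkpoint to a temporary file in the same directory, then rename it into place, so a reader sees either the old checkpoint or the new one. The function names `write_checkpoint` and `read_last_wave` are hypothetical, and the reader uses sed instead of jq only so the example has no dependencies.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from the original setup.

# write_checkpoint FILE WAVE TASKS_COMPLETED [GOD...]
# Writes the checkpoint to a temp file next to FILE, then renames it into place.
# A rename within one filesystem is atomic, so FILE is never seen half-written.
write_checkpoint() {
  state_file=$1; wave=$2; tasks=$3
  shift 3

  # Build the JSON array body: "apollo", "hermes", ...
  gods=""
  for god in "$@"; do
    gods="${gods:+$gods, }\"$god\""
  done

  tmp=$(mktemp "${state_file}.XXXXXX") || return 1
  printf '{ "last_wave": %s, "active_gods": [%s], "last_checkpoint": "%s", "tasks_completed": %s }\n' \
    "$wave" "$gods" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$tasks" > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  mv -f "$tmp" "$state_file"
}

# read_last_wave FILE
# Prints the numeric last_wave, or "unknown" if the file is missing or unreadable.
read_last_wave() {
  [ -f "$1" ] || { echo unknown; return 0; }
  wave=$(sed -n 's/.*"last_wave": *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
  echo "${wave:-unknown}"
}
```

A wave runner could call `write_checkpoint "$STATE_FILE" 19 47 apollo hermes peitho` after each wave. The temp file sits next to the state file on purpose, because a rename across filesystems is not atomic.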
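The crash log is described as the paper trail, and a paper trail is most useful when something reads it. As a rough sketch, the helpers below count the "Atlas restarting" lines that start-atlas.sh appends for a given UTC day and flag a day with more restarts than a chosen limit. The function names `restarts_on_day` and `too_many_restarts` are hypothetical, as is any threshold you pass in; the only thing taken from the article is the log line format.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from the original setup.

# restarts_on_day LOG [DAY]
# Prints how many restart lines LOG holds for DAY (YYYY-MM-DD, UTC; default today).
restarts_on_day() {
  log=$1
  day=${2:-$(date -u +%Y-%m-%d)}
  [ -f "$log" ] || { echo 0; return 0; }
  # grep -c prints 0 but exits non-zero when nothing matches; keep status clean
  grep -c "^${day}T.* Atlas restarting\$" "$log" || true
}

# too_many_restarts LOG MAX [DAY]
# Succeeds (exit 0) when the restart count for DAY is greater than MAX.
too_many_restarts() {
  [ "$(restarts_on_day "$1" "${3:-}")" -gt "$2" ]
}
```

For example, `too_many_restarts "$CRASH_LOG" 5 && echo "Atlas is crash-looping"` could run from the same cron schedule as the memory guard. Matching on the date prefix avoids date arithmetic, which differs between the BSD and GNU versions of date.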