Tools: What If LLM Agents Coordinated Through the Filesystem Instead of HTTP?

The Frustration That Started This

An architecture I've been thinking through — and why I think it might actually work.

Every time I look at multi-agent AI frameworks, I see the same pattern:

- Install LangChain / CrewAI / AutoGen
- Set up 4 different API keys
- Configure a message broker or HTTP server for agent communication
- Handle serialization, retries, timeouts, and routing
- Debug a system where the state lives... somewhere inside a Python object in memory

For what? To have two LLM processes pass text to each other.

I'm a systems programmer. My instinct when I see this is: this is too much infrastructure for the actual problem. So I started asking a simpler question: what's the minimum coordination layer two agents actually need? The answer I keep coming back to has been sitting in Unix since the 1970s.

The Insight I Can't Stop Thinking About

LLM agents have a property that's easy to overlook: they are extraordinarily slow workers. A single inference call takes 2–10 seconds. Your agents are not going to saturate a network pipe. They're not going to race-condition your shared memory. The bottleneck is never the IPC layer — it's always the model.

This changes the calculus completely. Every classical argument against filesystem-based IPC — performance, latency, throughput — evaporates when your workers operate in seconds, not microseconds. What's left are the advantages:

- State is just files. Human-readable, inspectable, grep-able.
- Crash recovery is free. If an agent dies mid-task, the file is still there.
- No serialization protocol. Agents write markdown. Other agents read markdown.
- Debuggability is trivial. Your logs ARE your state. `ls tasks/` is your dashboard.
- Zero dependencies. No broker. No database. No framework.

This is the Unix philosophy applied to LLM agents: write programs that communicate through text streams, because that is a universal interface.

The Architecture I'm Designing

Here's the directory structure I'm thinking through:

```
workspace/
├── manifest.json        ← coordinator's index, tracks all tasks
├── tasks/
│   ├── task_001_pending.md
│   ├── task_002_inprogress_agent_a.md
│   └── task_003_done.md
├── agents/
│   ├── orchestrator.md  ← system prompt / role definition
│   ├── agent_a.md
│   └── agent_b.md
└── outputs/
    └── task_003_result.md
```

Two things coordinate everything:

- Filename encodes state. task_001_pending.md → task_001_inprogress_agent_a.md → task_001_done.md
- manifest.json is the coordinator's index. Tracks all tasks, ownership, timestamps.

No agent talks directly to another agent. They communicate by mutating the filesystem.

Approach 1: Versioned File Naming

The simpler, more crash-safe approach I want to explore first.

The state machine lives in the filename

- task_001_pending.md → available for any agent to pick up
- task_001_inprogress_agent_a.md → agent renames it (atomic claim)
- task_001_done.md → agent renames on completion
- task_001_failed.md → unrecoverable error

Renaming a file is atomic on POSIX filesystems. That's your concurrency primitive — no explicit locks needed.

What the manifest.json would look like

```json
{
  "tasks": [
    {
      "id": "task_001",
      "status": "done",
      "owner": "agent_a",
      "created_at": "2025-03-13T10:00:00Z",
      "completed_at": "2025-03-13T10:00:42Z",
      "output": "outputs/task_001_result.md"
    }
  ]
}
```

Conceptual agent loop

```python
while True:
    pending = glob("tasks/*_pending.md")
    if not pending:
        sleep(2)
        continue

    task_file = pending[0]
    claimed = claim_task(task_file)  # atomic rename
    if not claimed:
        continue  # another agent got it first

    task_content = read(claimed)
    result = call_llm(task_content)

    write(f"outputs/{task_id}_result.md", result)
    rename(claimed, f"tasks/{task_id}_done.md")
    update_manifest(task_id, status="done")
```

Crash recovery — the part I find most compelling

If an agent dies mid-task, the file sits at task_001_inprogress_agent_a.md. The orchestrator can detect stale inprogress files older than a threshold and reset them to pending. No data lost. No complex recovery logic. The filesystem is your persistent state.

Approach 2: Named Pipes (FIFOs)

For use cases where you want real-time streaming handoff between two agents rather than polling.

```bash
mkfifo workspace/pipes/orchestrator_to_agent_a
mkfifo workspace/pipes/agent_a_to_orchestrator
```

```python
# Orchestrator sends task downstream
with open("pipes/orchestrator_to_agent_a", "w") as pipe:
    pipe.write(task_content)

# Agent reads, processes, responds back
with open("pipes/orchestrator_to_agent_a", "r") as pipe:
    task = pipe.read()

result = call_llm(task)

with open("pipes/agent_a_to_orchestrator", "w") as pipe:
    pipe.write(result)
```

When I'd reach for pipes vs versioned files

My current thinking: start with versioned files. Atomic, visible at a glance, no extra tooling. Pipes are compelling for strict sequential pipelines but add blocking complexity that's hard to debug.

State Management: Two-Layer Design

Layer 1 — Filename (per-task state)

Each task's lifecycle state is encoded directly in its filename: task_001_pending.md → task_001_inprogress_agent_a.md → task_001_done.md.

Layer 2 — manifest.json (system state)

Single source of truth. Write atomically: write to manifest.tmp.json, then rename to manifest.json. POSIX rename is atomic — safe for concurrent readers.

What This Would and Wouldn't Solve

It would solve:

- Framework fatigue — zero dependencies
- Debuggability — state is always inspectable on disk
- Crash recovery — filesystem is persistent by default
- Context isolation — each agent is an independent CLI process
- Observability — watch -n1 ls tasks/ is literally your dashboard

It wouldn't solve:

- Distributed systems — agents must share a filesystem (same machine or NFS)
- High throughput — if you need 1000 tasks/sec, use a proper queue
- Real-time streaming — versioned files add ~2s polling latency

Why I Think This Direction Is Worth Exploring

This is deliberately a single-machine, low-throughput, high-debuggability architecture. I think that matches the actual deployment reality of most local LLM agent use cases — which are rarely distributed, rarely high-throughput, but almost always painful to debug.

Current agent frameworks are built assuming agents are fast, distributed, and network-native. They inherit the full complexity of distributed systems design. LLM agents are none of those things. They're slow, usually running locally or single-tenant, and their output is human-readable text. The filesystem is a better fit for these actual properties. The 1970s Unix designers had this right for a different reason — and it might be accidentally correct again for a new one.

I'm working on a reference implementation and plan to share it once it's solid enough to be useful. Would be genuinely interested in whether others have tried this approach or hit the walls I'm anticipating — especially around the atomic rename behavior on non-POSIX filesystems (Windows, NFS edge cases).

Harshad Biradar — Systems programmer, building things from first principles.