# Tools: OpenHands – Deep Dive & Build-Your-Own Guide
## Table of Contents

- TL;DR – what OpenHands is, in one paragraph
- 1. The mental model – Agent / Conversation / Workspace / Event Stream
  - Why this shape works
  - The four V1 design principles (steal these)
- 2. The agent loop – the canonical 30 lines
  - Worked example: one task end-to-end through the loop
- 3. Actions and Observations – the universal protocol
  - CodeAct – why "code" is the action language
  - The four-phase methodology baked into the system prompt
  - Build-your-own action set
- 4. The event stream – the append-only EventLog
- 5. Sandboxing – Workspace + Action Execution Server
  - Build-your-own sandbox
- 6. Memory – the Condenser
  - Build-your-own condenser
- 7. Microagents / Skills – knowledge that auto-loads
  - Skill + MCP – the dynamic-tool pattern
  - Build-your-own skills
- 8. Sub-agent delegation – parallel agents on a shared workspace
  - Build-your-own delegation
- 9. Stuck detection – the "agent has lost the plot" alarm
- 10. Security – confirmation policy + risk analyzer
- 11. The LLM layer – LiteLLM, Router, and prompt caching
  - Concrete cost economics
  - Build-your-own
- 12. What makes OpenHands highly autonomous – synthesis
- 13. Building your own – minimum viable autonomous agent
  - Skeleton
  - Build order (each step takes ~half a day)
  - Build-order rules of thumb
  - What to skip in v1
- 14. Production features OpenHands ships that you'll want eventually
- 15. Honest pitfalls and gotchas
- 16. Reading list (curated, in order)
- 17. Closing – the mental model that makes everything click

A practical, technical walkthrough of how OpenHands (formerly OpenDevin) actually works, what makes it highly autonomous, and how you can build a similar agent from first principles. Written April 2026. Based on the V1 SDK paper (arXiv 2511.03690), the original OpenDevin paper (arXiv 2407.16741), the OpenHands docs, and the source of All-Hands-AI/OpenHands and OpenHands/software-agent-sdk.

## TL;DR – what OpenHands is, in one paragraph

OpenHands is an open-source autonomous software-engineering agent. It scores ~77% on SWE-Bench Verified with Claude Sonnet 4.5, opens GitHub PRs without supervision, and ships under MIT. Architecturally it is a tiny core: a stateless Agent that emits Actions, a Conversation that runs the loop and stores an append-only EventLog, a Workspace (local process or Docker container) that executes Actions and returns Observations, and an LLM wrapped by LiteLLM for provider portability. Everything else – memory compression, microagent knowledge, sub-agent delegation, security review, stuck detection – is a small auxiliary service hanging off the event stream. The result is a system where a few thousand lines of Python turn an LLM into something that can run for hours, recover from its own errors, and finish real engineering tasks.

If you only remember three slogans, the first is **four roles, sharp boundaries**.

## 1. The mental model – Agent / Conversation / Workspace / Event Stream

This is the V1 architecture (Nov 2025 onward). The original V0 had an explicit AgentController class plus a Runtime abstraction; V1 collapsed that into Conversation + Workspace because the controller didn't earn its keep.

### The four V1 design principles (steal these)

The V1 paper opens with four explicit principles. They explain why the architecture looks the way it does, and they're worth lifting wholesale into your own design doc.

The cautionary tale: V0 reportedly had 140+ config fields, 15 config classes, and 2.8K LOC just for configuration before the V1 rewrite. If your config grows faster than your features, that's a smell.

## 2. The agent loop – the canonical 30 lines

Here is Agent.step() from the V1 SDK (openhands-sdk/openhands/sdk/agent/agent.py), distilled. This is the single most important function in the project. Read it twice.
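The loop below is a reconstruction of that five-phase shape, not the SDK's verbatim code – the `Event` model and the injected `llm` / `tools` callables are illustrative assumptions:

```python
# A minimal, self-contained sketch of the five-phase step shape.
# Illustrative only: the Event model and the injected llm/tools callables
# are assumptions, not the OpenHands SDK's actual API.
from dataclasses import dataclass, field

@dataclass
class Event:
    source: str        # "agent" | "user" | "environment"
    kind: str          # "message" | "action" | "observation" | "finish"
    body: dict = field(default_factory=dict)

class Agent:
    def __init__(self, llm, tools):
        self.llm, self.tools = llm, tools   # stateless: all state lives in events

    def step(self, events, emit):
        # Phase 1: condense -- compress old history when it gets long
        visible = events if len(events) <= 80 else events[:4] + events[-40:]
        # Phase 2: render events into LLM messages
        messages = [{"role": "user", "content": f"{e.source} {e.kind}: {e.body}"}
                    for e in visible]
        # Phase 3: one LLM call, tool schemas attached
        response = self.llm(messages, list(self.tools))
        # Phase 4: classify the response explicitly
        if response.get("tool"):                       # tool call -> act, observe
            emit(Event("agent", "action", response))
            result = self.tools[response["tool"]](response.get("args", {}))
            emit(Event("environment", "observation", {"result": result}))
            return False
        if response.get("finish"):                     # agent declares done
            emit(Event("agent", "finish", response))
            return True
        # Phase 5: plain text, no tool call -> message for the user
        emit(Event("agent", "message", response))
        return False
```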
That's it. The Conversation calls step() in a `while not finished:` loop. Everything else – memory compression, microagent injection, security review – happens inside one of those five phases, as a hook or as another event being emitted. The phases worth memorizing, per the sketch above: condense, render, call, classify, emit.

Build-your-own: when you write your own version, copy this 5-phase shape exactly. The hardest bug in agent loops is "the LLM responded but my code didn't know what to do with it" – explicit response classification kills that bug.

### Worked example: one task end-to-end through the loop

A concrete trace makes the abstraction click. Imagine the user says: "Find the failing test in this repo and fix it." What happens, event by event, is roughly: the MessageAction lands in the log; the agent emits a bash action to run the test suite; the workspace returns an Observation with the failure output; the agent reads the implicated file, emits an edit action, and re-runs the tests; when they pass, it emits AgentFinishAction. Three things to notice: every hop is an event in the log; the agent only ever sees the world through Observations; and verification (the re-run) happens before finish.

## 3. Actions and Observations – the universal protocol

Every interaction with the world is either an Action (something the agent decided to do) or an Observation (what happened as a result). Both are typed Pydantic models, and the set of types is short and stable.

### CodeAct – why "code" is the action language

The flagship CodeActAgent (openhands/agenthub/codeact_agent/) is built around one observation from the original paper: instead of giving the LLM 20 bespoke tools, each with its own JSON schema, give it bash, Python, and a browser DSL, and let it express anything as code. Empirically this generalizes far better and dramatically reduces parsing errors. The trade-off: a giant unified action space relies on the LLM being a strong code generator. With weaker models you may need narrower, more guided tools. With Claude Sonnet 4.5 / GPT-5, "give it a shell" is the strongest baseline.

### The four-phase methodology baked into the system prompt

The CodeActAgent prompt does more than list tools – it imposes a methodology that drives autonomous behavior: roughly, understand the task, plan, implement, verify.

The system prompt also encodes etiquette: configure git user.name=openhands and [email protected] if missing, prefer str_replace_editor over rewriting whole files, and ask the user (MessageAction(wait_for_response=True)) if truly blocked instead of guessing.

Why this matters for autonomy: the verification loop is the difference between an agent that hallucinates "done" and one that actually finishes. If you take only one prompting lesson from OpenHands, take this: make your agent re-run the test suite as the last action before finish. The whole "ran for 30 minutes and didn't break anything" story falls apart without it.

The actual prompt template lives in openhands/agenthub/codeact_agent/prompts/system_prompt.j2. Read it directly when designing your own – it's a cheat sheet for what works.

### Build-your-own action set

If you're building your own agent, you can ship a useful prototype with three actions: run a shell command, edit a file, and finish (the bash/edit/finish trio that reappears in §13's skeleton). Add Browse and RunPython later. Keep observations boringly literal: stdout + stderr + exit code, file diff, page text. Don't pre-summarize – let the LLM see the raw world.

## 4. The event stream – the append-only EventLog

Every Action and Observation is wrapped as an Event with an id, a source ∈ {agent, user, environment}, and a timestamp, then appended to the EventLog. The log is append-only. The auxiliary services in V1 – Persistence, Stuck Detection, Visualization, Secret Registry – all read from the event log and never mutate state directly. State mutation only happens by appending a new event. That single rule is what makes the system replayable and gives you free time-travel debugging.

Build-your-own: write your event log as events.jsonl plus a state.json for cached materialized state. Don't get fancy – it's a list of dicts. The discipline of "all state changes are events" pays for itself the first time you have to debug why an agent did something weird at minute 47.
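A minimal sketch of that layout; the field names are assumptions, not OpenHands' actual event schema:

```python
# Append-only event log: events.jsonl is the source of truth,
# state.json is a disposable cache of materialized state.
import json, time, uuid
from pathlib import Path

class EventLog:
    def __init__(self, root: Path):
        root.mkdir(parents=True, exist_ok=True)
        self.path = root / "events.jsonl"
        self.state_path = root / "state.json"

    def append(self, source: str, kind: str, body: dict) -> dict:
        event = {"id": str(uuid.uuid4()), "ts": time.time(),
                 "source": source, "kind": kind, "body": body}
        with self.path.open("a") as f:
            f.write(json.dumps(event) + "\n")   # only ever append, never rewrite
        return event

    def replay(self) -> list:
        # Time-travel debugging: rebuild any past state by folding over events.
        if not self.path.exists():
            return []
        with self.path.open() as f:
            return [json.loads(line) for line in f]

    def checkpoint(self, state: dict) -> None:
        # Cached materialized state; always reproducible from events.jsonl.
        self.state_path.write_text(json.dumps(state, indent=2))
```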
## 5. Sandboxing – Workspace + Action Execution Server

The Workspace is where Actions get executed. Three implementations ship; the two you'll reach for first are the local process and the Docker container.

The clever part: the Docker container runs a small FastAPI server inside it, the Action Execution Server. The agent process on the host sends actions to it as REST (POST /execute_action), and the server runs them against a persistent bash session, a Python interpreter, and a browser – the same trio CodeAct targets. Plus it ships VSCode Server on a sibling port so a human can attach. The agent talks to the box exactly the way a remote developer would.

### Build-your-own sandbox

Run a small HTTP server inside a container and POST actions to it. That's ~300 LOC and gives you 80% of what DockerRuntime does. Don't try to be clever about networking/cgroups until you need to.
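A sketch of that server, assuming FastAPI and uvicorn are installed in the container image; the endpoint name mirrors the description above, the payload shape is illustrative:

```python
# Minimal action execution server to run INSIDE the sandbox container.
# Not the real OpenHands server -- same shape, bash-only.
import subprocess
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Action(BaseModel):
    kind: str              # this sketch only handles "bash"
    command: str
    timeout_s: int = 120

@app.post("/execute_action")
def execute_action(action: Action) -> dict:
    if action.kind != "bash":
        return {"error": f"unsupported action kind: {action.kind}"}
    try:
        proc = subprocess.run(
            ["bash", "-lc", action.command],
            capture_output=True, text=True, timeout=action.timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"error": "timeout", "exit_code": -1}
    # Boringly literal observation: stdout + stderr + exit code.
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode}
```

Start it in the image with `uvicorn server:app --host 0.0.0.0 --port 8000`, publish the port, and the host-side Workspace becomes little more than an HTTP POST per action.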
## 6. Memory – the Condenser

LLM context windows are finite. Long agent runs blow through them. OpenHands handles this with condensers: plug-in objects that decide whether to compress history before each LLM call.

The default policy (get_default_condenser), in plain terms: when the visible event count exceeds 80, ask a (cheap) LLM to summarize all events except the first 4 (which usually contain the system prompt and original task) and the last few (recent work), then replace the middle with that summary. The V1 paper claims this reduces API spend by ~2× with no quality loss on benchmarks; in practice it depends heavily on your task length, but it's the difference between "agent stops at hour 1 with a context error" and "agent runs for 8 hours".

### Build-your-own condenser

Don't summarize on every step – only when over a threshold. Cache aggressively. The cheapest thing you can do is just truncate with a small head + recent tail; LLM summarization is the upgrade. This is one of the biggest autonomy multipliers and the easiest to underrate.

## 7. Microagents / Skills – knowledge that auto-loads

The problem: the system prompt is finite. You can't cram every framework's conventions, every project's quirks, and every secret-handling rule into one giant blob – it would burn tokens and confuse the model. The solution: Skills (formerly "microagents") – Markdown files with YAML frontmatter, organized by trigger: always-on skills load unconditionally, keyword-triggered skills load when the latest user message matches. This is how an OpenHands agent dropped into your repo "knows" your conventions without you doing anything: the always-on repo skills get glued onto the system prompt at conversation start.

### Skill + MCP – the dynamic-tool pattern

This is one of the more under-discussed power moves. A skill can ship its own MCP server and tools, only when activated:

```markdown
---
name: postgres-readonly
trigger:
  type: keyword
  keywords: ["database", "query", "sql", "postgres"]
mcp_tools:
  mcpServers:
    pg:
      command: "uvx"
      args: ["mcp-server-postgres", "--readonly"]
      env:
        DATABASE_URL: "$DATABASE_URL"
---

Postgres read-only access.

You have read-only DB access via the pg MCP server. Schema:

!`psql -c "\dt" $DATABASE_URL`
```

When the user mentions "database", the skill activates: the MCP server is spawned, its tools (pg.query, pg.describe) are registered into agent.tools_map, and the rendered schema is injected into the system prompt. No tools at all when the skill isn't active – token-cheap, attack-surface-light, and self-documenting. Build this pattern and you stop bloating your global tool list.

### Build-your-own skills

```python
def load_skills(repo_path, latest_user_message):
    skills = []
    # always-on: no trigger in the frontmatter
    for f in (repo_path / ".agents" / "skills").glob("*.md"):
        meta, body = parse_frontmatter(f)
        if not meta.get("trigger"):
            skills.append(render(body))
    # keyword-triggered: activate on a match in the latest user message
    for f in (repo_path / ".agents" / "skills").glob("*.md"):
        meta, body = parse_frontmatter(f)
        kw = (meta.get("trigger") or {}).get("keywords", [])
        if any(k.lower() in latest_user_message.lower() for k in kw):
            skills.append(render(body))
    return "\n\n".join(skills)
```

render() does the !`...` substitution. Cap output at 50KB to prevent prompt-injection-via-huge-files.

## 8. Sub-agent delegation – parallel agents on a shared workspace

OpenHands V1 treats delegation as just another tool, not a special core mechanism. The delegation tool lets you register named sub-agents:

```python
agent.register_subagent("bash", custom_prompt="...")
agent.register_subagent("explore", tools=[GlobTool, GrepTool, FileReadTool])
```

Then the LLM can call the delegate tool with multiple targets:

```python
delegate(targets=[
    {"agent": "explore", "task": "Find all usages of authMiddleware"},
    {"agent": "bash", "task": "Run the failing test and capture stderr"},
])
```

Why this matters for autonomy: parallel exploration kills the latency tax on long tasks. While the parent agent is reasoning, two sub-agents are simultaneously grepping and running tests. The parent gets back a summary, not an essay of grep output. Independent context is the second insight: sub-agents don't pollute the parent's window. The parent never sees the 200 lines of grep output – only the sub-agent's distilled answer.

### Build-your-own delegation

This is just concurrent.futures.ThreadPoolExecutor with a tool that takes a list of {agent_name, task} dicts. Each thread instantiates a child Conversation against the same workspace, runs to completion, and returns its AgentFinishAction.outputs. Aggregate, return as one observation. The main rule: sub-agents share the workspace but not the conversation. Critical for keeping context clean.
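A sketch of that fan-out; `Conversation` and `make_agent` stand in for pieces like the ones in §13's skeleton and are assumptions, not OpenHands APIs:

```python
# Delegation tool sketch: run each target in its own thread and its own
# child conversation, against the SAME shared workspace.
from concurrent.futures import ThreadPoolExecutor

def delegate(targets, workspace, make_agent, max_workers=4):
    """targets: list of {"agent": name, "task": description} dicts."""
    def run_one(target):
        # Fresh Conversation per child: shared workspace, independent context.
        conversation = Conversation(agent=make_agent(target["agent"]),
                                    workspace=workspace)
        final = conversation.run(task=target["task"])   # blocks until finish
        return {"agent": target["agent"], "outputs": final.outputs}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, targets))

    # The parent only ever sees these distilled summaries,
    # never the sub-agents' full transcripts.
    return {"kind": "observation", "results": results}
```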
## 9. Stuck detection – the "agent has lost the plot" alarm

Without this, agents burn money in loops. OpenHands runs a StuckDetector (see the docs) on the event log every step. It flags five repeat patterns. Comparison is semantic, not object identity: actions are matched by tool name + content (timestamps and metrics ignored). When stuck, the agent transitions to ERROR or emits a LoopRecoveryAction for the user to handle.

Build-your-own: trivial. Maintain a sliding window of the last N events. Hash (action.tool, action.body, observation.body) tuples and count repeats. When the count exceeds a threshold, abort or notify. This single 100-LOC detector saves more money than any other optimization.
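A sketch of that detector; the window size and threshold here are arbitrary choices, not OpenHands' tuned defaults:

```python
# Sliding-window stuck detector: hash (tool, action body, observation body)
# and alarm when the same tuple keeps coming back.
from collections import Counter, deque

class StuckDetector:
    def __init__(self, window: int = 12, threshold: int = 3):
        self.recent = deque(maxlen=window)   # last N action/observation pairs
        self.threshold = threshold

    def record(self, action: dict, observation: dict) -> None:
        # Semantic identity: tool + content only; ignore ids, timestamps, metrics.
        key = (action.get("tool"), str(action.get("body")),
               str(observation.get("body")))
        self.recent.append(hash(key))

    def is_stuck(self) -> bool:
        counts = Counter(self.recent)
        return any(n >= self.threshold for n in counts.values())
```

Call record() after every action/observation pair and check is_stuck() each step; abort or notify the first time it fires.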
## 10. Security – confirmation policy + risk analyzer

OpenHands has two layers here: the confirmation policy, which decides which actions need human sign-off, and the risk analyzer, which scores actions before they run. Plus a Secret Registry that keeps secret values masked out of the event stream. Headless mode hard-disables confirmation (it's NeverConfirm, always). That means headless mode's blast radius is whatever the workspace allows – which is exactly why headless mode wants Docker.

## 11. The LLM layer – LiteLLM, Router, and prompt caching

OpenHands wraps everything through LiteLLM, so users get 100+ providers (OpenAI, Anthropic, Bedrock, Azure, Google, local Ollama) for free.

Notable layer features:

### Concrete cost economics

Two data points worth knowing:

For your own builds, expect order-of-magnitude:

Cost ceilings to set on day one: MAX_ITERATIONS (default ~100 in OpenHands), LLM_NUM_RETRIES (default 8), and a hard accumulated-cost cutoff that aborts the conversation. Don't ship a headless agent without all three.

### Build-your-own

Don't write the LLM client yourself – depend on LiteLLM. Add three things on top: retries, cost tracking against a hard ceiling, and a router (the llm.py "LiteLLM wrapper + RouterLLM" in the skeleton below).

## 12. What makes OpenHands highly autonomous – synthesis

Twelve concrete mechanisms run through this guide. If you want your agent to be autonomous, you need most of them. The pattern: autonomy is not a single feature. It's the union of "can keep going" (memory, budget), "can recover" (observations, stuck detection), "knows what to do" (skills), and "won't blow up the world" (sandbox, confirmation policy). Skip any of these and the agent is fragile.

## 13. Building your own – minimum viable autonomous agent

Here's a concrete, achievable plan to build a clone with the same shape. Roughly 2,000 LOC of Python.

### Skeleton

```plaintext
your_agent/
  agent.py          # Agent class with .step()
  conversation.py   # Conversation runner + EventLog
  events.py         # Event/Action/Observation Pydantic models
  tools/
    bash.py
    edit.py
    finish.py
  workspace/
    local.py
    docker.py
  llm.py            # LiteLLM wrapper + RouterLLM
  condenser.py      # Threshold-based summarizer
  skills.py         # Markdown skill loader
  stuck.py          # Sliding-window detector
```

### Build order (each step takes ~half a day)

Stop when you have steps 1–7 working end-to-end on a real task. That's already a usable agent. Steps 8–14 are the "make it autonomous for hours" upgrades.

## 14. Production features OpenHands ships that you'll want eventually

A reasonable path: ship CLI + headless first, add the GUI when the team complains, and add the resolver-style "one-shot from a ticket" mode last, because it requires high trust.

## 15. Honest pitfalls and gotchas

A guide that only lists strengths is a brochure. Things to know:

## 16. Reading list (curated, in order)

## 17. Closing – the mental model that makes everything click

The whole project rests on one idea: an autonomous agent is a function from event history to next event, run in a loop. Every architectural choice in OpenHands is downstream of that. Everything else – condensers, skills, sub-agents, security analyzers – is a hook into that one loop. There is no big design. There is one tight kernel and a lot of small components hanging off it.

Build the kernel first. Make sure it actually closes the loop on observations. Then earn each of the 12 autonomy features by removing a class of failure you observed in practice. That's the path.
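If it helps to see that idea as code, here is the whole kernel in one loop – an illustrative sketch reusing the EventLog shape from §4, with `agent.next_event` and `workspace.execute` as assumed interfaces:

```python
# The kernel: an agent is a function from event history to the next event,
# run in a loop. Everything else in this guide hooks into this.
def run(agent, workspace, log, task, max_iterations=100):
    log.append("user", "message", {"content": task})
    for _ in range(max_iterations):             # budget ceiling from day one
        event = agent.next_event(log.replay())  # history in, one event out
        log.append(event["source"], event["kind"], event["body"])
        if event["kind"] == "finish":
            return event
        if event["kind"] == "action":           # close the loop on observations
            observation = workspace.execute(event)
            log.append("environment", "observation", observation)
    raise RuntimeError("hit max_iterations without finishing")
```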