Tools: 🙌 OpenHands — Deep Dive & Build-Your-Own Guide 📘


Table of Contents

💡 TL;DR — what OpenHands is, in one paragraph

1. 🧠 The mental model — Agent / Conversation / Workspace / Event Stream

Why this shape works

The four V1 design principles (steal these)

2. ⚙️ The agent loop — the canonical 30 lines

Worked example: one task end-to-end through the loop

3. 🔄 Actions and Observations — the universal protocol

CodeAct — why "code" is the action language

The four-phase methodology baked into the system prompt

Build-your-own action set

4. 📡 The Event Stream — single source of truth

5. 🐳 Sandboxing — Workspace + Action Execution Server

Build-your-own sandbox

6. 🗜️ Memory — the Condenser

Build-your-own condenser

7. 🔌 Microagents / Skills — knowledge that auto-loads

Skill + MCP — the dynamic-tool pattern


Build-your-own skills

8. 🤖 Sub-agent delegation — parallel agents on a shared workspace

Then the LLM can call the delegate tool with multiple targets:

```python
delegate(targets=[
    {"agent": "explore", "task": "Find all usages of authMiddleware"},
    {"agent": "bash", "task": "Run the failing test and capture stderr"}])
```

Build-your-own delegation

9. 🚨 Stuck detection — the "agent has lost the plot" alarm

10. 🔒 Security — confirmation policy + risk analyzer

11. 🌐 The LLM layer — LiteLLM, Router, and prompt caching

Concrete cost economics

Build-your-own

12. 🚀 What makes OpenHands highly autonomous — synthesis

13. 🏗️ Building your own — minimum viable autonomous agent

Skeleton

Build order (each step takes ~half a day)

Build-order rules of thumb

What to skip in v1

14. 🏭 Production features OpenHands ships that you'll want eventually

15. ⚠️ Honest pitfalls and gotchas

16. 📚 Reading list (curated, in order)

17. 🎯 Closing — the mental model that makes everything click

A practical, technical walkthrough of how OpenHands (formerly OpenDevin) actually works, what makes it highly autonomous, and how you can build a similar agent from first principles. Written April 2026. Based on the V1 SDK paper (arXiv 2511.03690), the original OpenDevin paper (arXiv 2407.16741), the OpenHands docs, and the source of All-Hands-AI/OpenHands and OpenHands/software-agent-sdk.

## 💡 TL;DR — what OpenHands is, in one paragraph

OpenHands is an open-source autonomous software-engineering agent. It scores ~77% on SWE-Bench Verified with Claude Sonnet 4.5, opens GitHub PRs without supervision, and ships under MIT. Architecturally it is a tiny core: a stateless Agent that emits Actions, a Conversation that runs the loop and stores an append-only EventLog, a Workspace (local process or Docker container) that executes Actions and returns Observations, and an LLM wrapped by LiteLLM for provider portability. Everything else — memory compression, microagent knowledge, sub-agent delegation, security review, stuck detection — is a small auxiliary service hanging off the event stream. The result is a system where a few thousand lines of Python turn an LLM into something that can run for hours, recover from its own errors, and finish real engineering tasks.

## 1. 🧠 The mental model — Agent / Conversation / Workspace / Event Stream

If you only remember three slogans:

### Why this shape works

Four roles, sharp boundaries. This is the V1 architecture (Nov 2025 onward). The original V0 had an explicit AgentController class plus a Runtime abstraction; V1 collapsed that into Conversation + Workspace because the controller didn't earn its keep.

### The four V1 design principles (steal these)

The V1 paper opens with four explicit principles. They explain why the architecture looks the way it does, and they're worth lifting wholesale into your own design doc:

The cautionary tale: V0 reportedly had 140+ config fields, 15 config classes, and 2.8K LOC just for configuration before the V1 rewrite. If your config grows faster than your features, that's a smell.
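The four-role split can be sketched in a few lines. A minimal sketch using stdlib dataclasses in place of Pydantic (class and field names here are illustrative, not the SDK's actual API): the Agent is a pure function of history, and the only mutation in the system is the Conversation appending to its event list.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    source: str  # "agent" | "user" | "environment"
    kind: str    # "message" | "action" | "observation"
    body: str

@dataclass(frozen=True)
class Agent:
    """Stateless: the whole event history in, the next event out."""
    def step(self, events: list[Event]) -> Event:
        # A real agent would build an LLM prompt from `events` here;
        # this stub immediately decides to finish.
        return Event(source="agent", kind="action", body="finish")

@dataclass
class Conversation:
    """The only mutable thing in the system: an append-only event list."""
    agent: Agent
    events: list[Event] = field(default_factory=list)

    def run(self) -> None:
        while not (self.events and self.events[-1].body == "finish"):
            self.events.append(self.agent.step(self.events))

conv = Conversation(agent=Agent())
conv.events.append(Event(source="user", kind="message", body="do a task"))
conv.run()
```

Swapping in a real Workspace means the Conversation also executes actions and appends the resulting observation events; the Agent itself never changes.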
## 2. ⚙️ The agent loop — the canonical 30 lines

Here is the actual Agent.step() from the V1 SDK (openhands-sdk/openhands/sdk/agent/agent.py), distilled. This is the single most important function in the project. Read it twice.

That's it. The Conversation calls step() in a while not finished: loop. Everything else — memory compression, microagent injection, security review — happens inside one of those 5 phases, as a hook or as another event being emitted.

Phases worth memorizing:

Build-your-own: when you write your own version, copy this 5-phase shape exactly. The hardest bug in agent loops is "the LLM responded but my code didn't know what to do with it" — explicit response classification kills that bug.

### Worked example: one task end-to-end through the loop

A concrete trace makes the abstraction click. Imagine the user says: "Find the failing test in this repo and fix it." What happens, event by event:

Three things to notice:

## 3. 🔄 Actions and Observations — the universal protocol

Every interaction with the world is either an Action (something the agent decided to do) or an Observation (what happened as a result). Both are typed Pydantic models. The list is short and stable:

### CodeAct — why "code" is the action language

The flagship CodeActAgent (openhands/agenthub/codeact_agent/) is built around one observation from the original paper: instead of giving the LLM 20 bespoke tools, each with its own JSON schema, give it bash, Python, and a browser DSL, and let it express anything as code. Empirically this generalizes far better and dramatically reduces parsing errors.

The trade-off: a giant unified action space relies on the LLM being a strong code generator. With weaker models you may need narrower, more guided tools. With Claude Sonnet 4.5 / GPT-5, "give it a shell" is the strongest baseline.

### The four-phase methodology baked into the system prompt

The CodeActAgent prompt is doing more than just listing tools — it imposes a methodology that drives autonomous behavior.
Roughly:

The system prompt also encodes etiquette: configure git user.name=openhands and [email protected] if missing, prefer str_replace_editor over rewriting whole files, and ask the user (MessageAction(wait_for_response=True)) if truly blocked instead of guessing.

Why this matters for autonomy: the verification loop is the difference between an agent that hallucinates "done" and one that actually finishes. If you take only one prompting lesson from OpenHands, take this: make your agent re-run the test suite as the last action before finish. The whole "ran for 30 minutes and didn't break anything" story falls apart without it.

The actual prompt template lives in openhands/agenthub/codeact_agent/prompts/system_prompt.j2. Read it directly when designing your own — it's a cheat sheet for what works.

### Build-your-own action set

If you're building your own agent, you can ship a useful prototype with three actions:

Add Browse and RunPython later. Keep observations boringly literal: stdout + stderr + exit code, file diff, page text. Don't pre-summarize — let the LLM see the raw world.

## 4. 📡 The Event Stream — single source of truth

Every Action and Observation is wrapped as an Event with an id, source ∈ {agent, user, environment}, and a timestamp, then appended to the EventLog. The log is:

The auxiliary services in V1 — Persistence, Stuck Detection, Visualization, Secret Registry — all read from the event log and never mutate state directly. State mutation only happens by appending a new event. That single rule is what makes the system replayable and gives you free time-travel debugging.

Build-your-own: write your event log as events.jsonl plus a state.json for cached materialized state. Don't get fancy — it's a list of dicts. The discipline of "all state changes are events" pays for itself the first time you have to debug why an agent did something weird at minute 47.

## 5. 🐳 Sandboxing — Workspace + Action Execution Server

The Workspace is where Actions get executed. Three implementations:

The clever part: the Docker container runs a small FastAPI server inside it, the Action Execution Server.
The agent process on the host sends actions to it as REST POST /execute_action, and the server runs them against:

Plus it ships VSCode Server on a sibling port so a human can attach. The agent talks to the box exactly the way a remote developer would.

### Build-your-own sandbox

That's ~300 LOC and gives you 80% of what DockerRuntime does. Don't try to be clever about networking/cgroups until you need to.

## 6. 🗜️ Memory — the Condenser

LLM context windows are finite. Long agent runs blow through them. OpenHands handles this with condensers: plug-in objects that decide whether to compress history before each LLM call.

Default policy (get_default_condenser):

Translation: when the visible event count exceeds 80, ask a (cheap) LLM to summarize all events except the first 4 (which usually contain the system prompt and original task) and the last few (recent work). Replace the middle with that summary.

The V1 paper claims this reduces API spend by ~2× with no quality loss on benchmarks; in practice it depends heavily on your task length, but it's the difference between "agent stops at hour 1 with a context error" and "agent runs for 8 hours."

### Build-your-own condenser

Don't summarize on every step — only when over a threshold. Cache aggressively. The cheapest thing you can do is just truncate with a small head + a recent tail; LLM summarization is the upgrade.

## 7. 🔌 Microagents / Skills — knowledge that auto-loads

This is one of the biggest autonomy multipliers and the easiest to underrate.

The problem: the system prompt is finite. You can't cram every framework's conventions, every project's quirks, and every secret-handling rule into one giant blob — it would burn tokens and confuse the model.

The solution: Skills (formerly "microagents"). Markdown files with YAML frontmatter, organized by trigger:

This is how an OpenHands agent dropped into your repo "knows" your conventions without you doing anything: the always-on repo skills get glued onto the system prompt at conversation start.

### Skill + MCP — the dynamic-tool pattern

This is one of the more under-discussed power moves.
A skill can ship its own MCP server and tools, loaded only when activated:

```markdown
---
name: postgres-readonly
trigger:
  type: keyword
  keywords: ["database", "query", "sql", "postgres"]
mcp_tools:
  mcpServers:
    pg:
      command: "uvx"
      args: ["mcp-server-postgres", "--readonly"]
      env:
        DATABASE_URL: "$DATABASE_URL"
---

# Postgres read-only access

You have read-only DB access via the pg MCP server. Schema:

!`psql -c "\dt" $DATABASE_URL`
```

When the user mentions "database", the skill activates: the MCP server is spawned, its tools (pg.query, pg.describe) are registered into agent.tools_map, and the rendered schema is injected into the system prompt. No tools at all when the skill isn't active — token-cheap, attack-surface-light, and self-documenting. Build this pattern and you stop bloating your global tool list.

### Build-your-own skills

```python
def load_skills(repo_path, latest_user_message):
    skills = []
    # always-on
    for f in (repo_path / ".agents" / "skills").glob("*.md"):
        meta, body = parse_frontmatter(f)
        if not meta.get("trigger"):
            skills.append(render(body))
    # keyword-triggered
    for f in (repo_path / ".agents" / "skills").glob("*.md"):
        meta, body = parse_frontmatter(f)
        kw = (meta.get("trigger") or {}).get("keywords", [])
        if any(k.lower() in latest_user_message.lower() for k in kw):
            skills.append(render(body))
    return "\n\n".join(skills)
```

render() does the !`...` substitution. Cap output at 50KB to prevent prompt-injection-via-huge-files.

## 8. 🤖 Sub-agent delegation — parallel agents on a shared workspace

OpenHands V1 treats delegation as just another tool, not a special core mechanism. The delegation tool offers:

```python
agent.register_subagent("bash", custom_prompt="...")
agent.register_subagent("explore", tools=[GlobTool, GrepTool, FileReadTool])
```

Why this matters for autonomy: parallel exploration kills the latency tax on long tasks. While the parent agent is reasoning, two sub-agents are simultaneously grepping and running tests. The parent gets back a summary, not an essay of grep output.

Independent context is the second insight: sub-agents don't pollute the parent's window. The parent never sees the 200 lines of grep output — only the sub-agent's distilled answer.
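A minimal sketch of such a delegate tool (run_subagent is a hypothetical stand-in for spawning a child Conversation on the shared workspace and returning its distilled summary):

```python
from concurrent.futures import ThreadPoolExecutor

def delegate(targets, run_subagent):
    """Run each {"agent": ..., "task": ...} target in its own thread,
    block until all finish, and return one consolidated observation.
    run_subagent(agent_name, task) -> str is assumed, not a real API."""
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = [pool.submit(run_subagent, t["agent"], t["task"])
                   for t in targets]
        results = [f.result() for f in futures]  # preserves target order
    return "\n---\n".join(f"[{t['agent']}] {r}"
                          for t, r in zip(targets, results))
```

The parent sees only the joined summaries as a single observation, which is exactly what keeps its context window clean.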
### Build-your-own delegation

This is just concurrent.futures.ThreadPoolExecutor with a tool that takes a list of {agent_name, task} dicts. Each thread instantiates a child Conversation against the same workspace, runs it to completion, and returns its AgentFinishAction.outputs. Aggregate, and return as one observation.

The main rule: sub-agents share the workspace but not the conversation. Critical for keeping context clean.

## 9. 🚨 Stuck detection — the "agent has lost the plot" alarm

Without this, agents burn money in loops. OpenHands runs a StuckDetector (docs) on the event log every step. It flags five patterns:

Comparison is semantic, not object identity: actions are matched by tool name + content (timestamps and metrics ignored). When stuck, the agent transitions to ERROR or emits a LoopRecoveryAction for the user to handle.

Build-your-own: trivial. Maintain a sliding window of the last N events. Hash (action.tool, action.body, observation.body) tuples and count repeats. When the count exceeds a threshold, abort or notify. This single 100-LOC detector saves more money than any other optimization.

## 10. 🔒 Security — confirmation policy + risk analyzer

OpenHands has two layers:

Plus a Secret Registry that:

Headless mode hard-disables confirmation (it's NeverConfirm always). That means headless mode's blast radius is whatever the workspace allows — which is exactly why headless mode wants Docker.

## 11. 🌐 The LLM layer — LiteLLM, Router, and prompt caching

OpenHands wraps everything through LiteLLM, so users get 100+ providers (OpenAI, Anthropic, Bedrock, Azure, Google, local Ollama) for free. Notable layer features:

### Concrete cost economics

Two data points worth knowing:

For your own builds, expect order-of-magnitude costs like:

Cost ceilings to set on day one: MAX_ITERATIONS (default ~100 in OpenHands), LLM_NUM_RETRIES (default 8), and a hard accumulated-cost cutoff that aborts the conversation. Don't ship a headless agent without all three.

### Build-your-own

Don't write the LLM client yourself — depend on LiteLLM. Add three things on top:

## 12. 🚀 What makes OpenHands highly autonomous — synthesis

Twelve concrete mechanisms. If you want your agent to be autonomous, you need most of these:

The pattern: autonomy is not a single feature.
It's the union of "can keep going" (memory, budget), "can recover" (observations, stuck detection), "knows what to do" (skills), and "won't blow up the world" (sandbox, confirmation policy). Skip any of these and the agent is fragile.

## 13. 🏗️ Building your own — minimum viable autonomous agent

Here's a concrete, achievable plan for building a clone with the same shape. Roughly 2,000 LOC of Python.

### Skeleton

```plaintext
your_agent/
  agent.py          # Agent class with .step()
  conversation.py   # Conversation runner + EventLog
  events.py         # Event/Action/Observation Pydantic models
  tools/
    bash.py
    edit.py
    finish.py
  workspace/
    local.py
    docker.py
  llm.py            # LiteLLM wrapper + RouterLLM
  condenser.py      # Threshold-based summarizer
  skills.py         # Markdown skill loader
  stuck.py          # Sliding-window detector
```

### Build order (each step takes ~half a day)

Stop when you have steps 1–7 working end-to-end on a real task. That's already a usable agent. Steps 8–14 are the "make it autonomous for hours" upgrades.

## 14. 🏭 Production features OpenHands ships that you'll want eventually

A reasonable path: ship CLI + headless first, add the GUI when the team complains, and add the resolver-style "one-shot from a ticket" mode last, because it requires high trust.

## 15. ⚠️ Honest pitfalls and gotchas

A guide that only lists strengths is a brochure. Things to know:

## 17. 🎯 Closing — the mental model that makes everything click

The whole project rests on one idea: an autonomous agent is a function from event history to the next event, run in a loop. Every architectural choice in OpenHands is downstream of that:

Everything else — condensers, skills, sub-agents, security analyzers — is a hook into that one loop. There is no big design. There is one tight kernel and a lot of small components hanging off it.

Build the kernel first. Make sure it actually closes the loop on observations. Then earn each of the 12 autonomy features by removing a class of failure you observed in practice. That's the path.
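As one example of earning a feature: the §9 sliding-window stuck detector is a handful of lines. A sketch, assuming events are plain dicts rather than the SDK's typed models:

```python
from collections import Counter

def is_stuck(events, window=10, threshold=3):
    """Semantic repetition check over the recent window: the same
    (tool, action body, observation body) tuple recurring means the
    agent is looping. Timestamps and metrics are deliberately ignored."""
    recent = events[-window:]
    signatures = Counter(
        (e.get("tool"), e.get("action"), e.get("observation"))
        for e in recent)
    return any(count >= threshold for count in signatures.values())
```

Call it after every appended observation; if it fires, abort the run or surface a recovery prompt to the user instead of paying for another identical LLM round-trip.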


The architecture diagram and code listings referenced in the sections above:

§1 architecture:

```plaintext
+--------------+      +-----------------+      +---------------+
|    Agent     |<-----|  Conversation   |----->|   Workspace   |
| (stateless,  |      |  (loop runner,  |      | (Local/Docker |
|  Pydantic)   |      | state, EventLog)|      |   /Remote)    |
+------^-------+      +--------^--------+      +-------^-------+
       |                       |                       |
       | uses                  | persists / streams    | executes
       v                       v                       v
   +-------+            +----------+           +---------------+
   |  LLM  |            | EventLog |           | bash, python, |
   |+Cond. |            +----------+           |   jupyter,    |
   +-------+                                   |  browser, FS  |
                                               +---------------+
```

§2, Agent.step():

```python
def step(self, conversation, on_event, on_token=None) -> None:
    state = conversation.state

    # 1. Drain confirmed actions waiting to execute.
    pending = ConversationState.get_unmatched_actions(state.events)
    if pending:
        self._execute_actions(conversation, pending, on_event)
        return

    # 2. Honor any UserPromptSubmit hook that wants to block the message.
    if state.last_user_message_id is not None:
        reason = state.pop_blocked_message(state.last_user_message_id)
        if reason is not None:
            state.execution_status = ConversationExecutionStatus.FINISHED
            return

    # 3. Build the LLM prompt — may return a Condensation event instead.
    msgs_or_cond = prepare_llm_messages(
        state.events, condenser=self.condenser, llm=self.llm)
    if isinstance(msgs_or_cond, Condensation):
        on_event(msgs_or_cond)
        return

    # 4. Call the LLM with retry.
    try:
        response = make_llm_completion(
            self.llm, msgs_or_cond,
            tools=list(self.tools_map.values()), on_token=on_token)
    except LLMContextWindowExceedError:
        if self.condenser and self.condenser.handles_condensation_requests():
            on_event(CondensationRequest())
            return
        raise

    # 5. Classify and dispatch.
    match classify_response(response.message):
        case LLMResponseType.TOOL_CALLS:
            self._handle_tool_calls(...)
        case LLMResponseType.CONTENT:
            self._handle_content_response(...)
        case LLMResponseType.REASONING_ONLY | LLMResponseType.EMPTY:
            self._handle_no_content_response(...)
```

§5, swapping in isolation:

```python
# Drop in DockerWorkspace, no code change anywhere else.
from openhands.workspace import DockerWorkspace

with DockerWorkspace(host_port=8010, extra_ports=True) as ws:
    conversation = Conversation(agent=agent, workspace=ws)
    conversation.send_message("Refactor the auth module.")
    conversation.run()
```

§6, the default condenser policy:

```python
LLMSummarizingCondenser(llm=summarizer_llm, max_size=80, keep_first=4)
```

§6, build-your-own condenser:

```python
def maybe_condense(events, summarizer, max_size=80, keep_first=4):
    if len(events) <= max_size:
        return events
    head = events[:keep_first]
    tail = events[-(max_size // 2):]
    middle = events[keep_first:-(max_size // 2)]
    summary = summarizer.complete(
        "Summarize the following agent history concisely, preserving "
        "decisions, findings, and current state:\n" + dump(middle))
    return head + [SummaryEvent(text=summary)] + tail
```

§7, an example keyword-triggered skill:

```markdown
---
name: kubernetes
trigger:
  type: keyword
  keywords: ["kubernetes", "k8s", "kubectl"]
---
# Kubernetes guidance
- Always use `kubectl --context=<ctx>` explicitly.
- Current cluster: !`kubectl config current-context`
- Common namespaces: !`kubectl get ns -o name | head -10`
```

- 💻 Code is the universal action. Don't design 20 bespoke tools.
Give the agent bash + Python + a file editor + a browser, then let it write code. - ๐Ÿ“ฆ State lives in one place. All components are immutable Pydantic models. The only mutable thing is ConversationState. This makes the system replayable, debuggable, and safe to parallelize. - ๐Ÿ”„ Observations close the loop. Every error, stderr, exit code, and HTTP response goes back into the next prompt. Self-correction is not a feature โ€” it's a side effect of letting the LLM see its own consequences. - Agent โ€” pure function from history โ†’ next Action. No state of its own. Configured by LLM, a list of Tools, a Condenser, optional MCP config, and system_prompt_kwargs. - Conversation โ€” owns ConversationState, drives the loop, persists the EventLog, and is the only mutable thing in the system. - Workspace โ€” knows how to execute commands and shuttle files. Three implementations: in-process (LocalWorkspace), container (DockerWorkspace), or HTTP (RemoteAPIWorkspace). Same agent code; just swap the workspace. - Event โ€” every interaction is an event: MessageEvent, ActionEvent, ObservationEvent, AgentErrorEvent, Condensation, etc. The event log is append-only and the single source of truth โ€” replaying it reconstructs the entire conversation. - Optional isolation, not mandatory sandboxing. The agent runs in-process by default; you swap LocalWorkspace โ†’ DockerWorkspace for isolation without changing any agent code. Don't make sandboxing a build-time decision. - Stateless components, single source of truth. Agent, Tool, LLM, and Condenser are immutable Pydantic models. The only mutable thing in the entire system is ConversationState. State changes happen by appending events โ€” never by mutating objects. - Strict separation of concerns. The SDK never imports applications. The CLI, GUI, GitHub resolver, and your custom integration all consume the SDK as a library. This sounds obvious; it is not what V0 did. - Two-layer composability. 
Compose at package level (swap workspaces, swap servers) and at component level (swap tools, prompts, condensers, LLMs). Both layers exist intentionally. - Drain pending actions (confirmation flow). - Block if a hook rejected the user message. - Prepare prompt โ€” condenser may decide to summarize first. - Call the LLM, with explicit handling for context-window overflow. - Dispatch the response: tool call โ†’ execute, plain text โ†’ emit message, empty โ†’ ask the LLM to try again. - The LLM never "remembers" what it did โ€” it sees the entire event log every step. That's why the EventLog has to be cheap to materialize. - The error in step 3 is what triggered the next action. The agent didn't have to be told "if a test fails, read it" โ€” that came from the LLM reasoning over the observation. - Steps 8โ€“9 are verification, not optimism. A well-prompted agent re-runs tests after editing. That's what stops it from declaring victory on broken code (more on this in ยง3). - Exploration โ€” read the repo, find relevant files, understand the surface area before doing anything. (grep, find, cat, ls.) - Analysis โ€” form a hypothesis about what to change and why. The ThinkTool exists specifically for this โ€” it produces no observation, it just gives the model a slot to reason without committing to an action. - Implementation โ€” make the smallest change that addresses the analysis. Prefer editing existing files over creating duplicates. Don't write README.md unless asked. Don't commit secrets. - Verification โ€” re-run the tests, lints, build. Loop back to analysis if it fails. Only call finish when verification passes. - RunCommand(command: str) โ€” bash via subprocess or docker exec. - EditFile(path: str, old: str, new: str) โ€” string-replace editor (much more reliable for LLMs than full-file rewrites). - Finish(summary: str) โ€” terminate. - Append-only โ€” events are never edited, only superseded by a Condensation event that marks ranges as "forgotten." 
- Persisted incrementally โ€” each event is one JSON file; full state is rebuildable from disk. - Pub/sub for V0 (EventStream.subscribe(...)) or read by auxiliary services for V1. - a persistent tmux bash session (so cd and shell history survive across actions), - a persistent IPython kernel (%pip install once, use forever), - a Playwright Chromium browser, - a str-replace file editor with undo. - Build a Docker image with bash, python, your project deps, and a tiny FastAPI server. - The server has one endpoint: POST /exec taking {kind: "bash"|"python", body: "..."}. - Use tmux for bash persistence (or just keep a subprocess.Popen open and write into its stdin). - Mount the workspace dir as a volume. - Stream output back chunked so the agent can show progress. - Add a watchdog: kill anything running over N seconds. - Proactive โ€” View.from_events() checks size on each step. - Reactive โ€” when the LLM raises LLMContextWindowExceedError, the agent emits a CondensationRequest event and tries again next step. - !`shell command` โ€” run a command at activation time and inline the output (e.g. !git branch --show-current`` to inject current branch into the prompt). - mcp_tools: block in the YAML โ€” spin up an MCP server when this skill activates, register its tools dynamically. - Repository skills auto-discover AGENTS.md / CLAUDE.md / GEMINI.md in the repo root. - Spawn โ€” register sub-agents by name, optionally with custom prompts and tool subsets. Each sub-agent inherits the parent's LLM and shares the workspace, but has its own independent EventLog. - Delegate โ€” dispatch one or more named sub-agents in parallel threads. Block until all complete. Return a consolidated result. - Risk analyzer (openhands.sdk.security) โ€” every Action gets a SecurityRisk โˆˆ {LOW, MEDIUM, HIGH, UNKNOWN} score. The default LLMSecurityAnalyzer adds a security_risk field to every tool's JSON schema, so the LLM scores its own action inline with no extra call. 
The MCP tool annotations (readOnlyHint, destructiveHint, etc.) feed in. - Confirmation policy โ€” AlwaysConfirm, NeverConfirm, or ConfirmRisky(threshold=HIGH). With ConfirmRisky, low-risk actions auto-execute; risky ones pause the conversation in WAITING_FOR_CONFIRMATION until the user approves. - Stores secrets per-session, late-bound (resolved only at exec time). - Masks them in stdout/stderr (<secret-hidden>). - Encrypts at rest, supports rotation, supports callable resolvers (refresh tokens, etc.). - The TerminalTool scans commands for known secret keys, exports them as env vars, and replaces matches in output. - Two completion modes: classic Chat Completions (function calling) and OpenAI's Responses API (for GPT-5 reasoning models). Auto-detected per-model from a model_features.py registry. - Reasoning/thinking blocks are first-class. Anthropic extended thinking is captured as ThinkingBlocks; OpenAI reasoning items as ReasoningItemModel. The agent persists these on ActionEvent/MessageEvent (reasoning_content, thinking_blocks) so they're replayable and can be fed back to the model on the next turn โ€” required to maintain reasoning continuity for o-series and Sonnet thinking-mode. - NonNativeToolCallingMixin โ€” for models without native function calling, it serializes tools into a structured prompt and parses responses with regex. Lets even small open-source models drive the agent loop. The pattern: detect, then either call native function-calling or fall back to prompt-and-parse โ€” same agent code path. - RouterLLM โ€” abstract base; subclass with select_llm(messages) -> str. Real example: route image-containing messages to a vision model and text-only to a cheap model. Composes recursively (a router can route to a router), so you can build cost-optimization trees. - Prompt caching โ€” Anthropic cache_control breakpoints inserted at stable prefix points (system prompt, tool defs, condensed history). Big savings on long conversations. 
(V0 had a known caching bug; V1 fixed it — verify in your own implementation that your cache hit rate is what you expect.)
- Telemetry — every call records tokens in/out, computed cost, latency, error counts. Cost shows up at `conversation.state.stats.accumulated_cost`.
- Retries with exponential backoff baked in.

- Original OpenDevin paper: CodeActAgent v1.8 on Claude 3.5 Sonnet hit 26% on SWE-Bench Lite at $1.10 per instance. That's the cost-per-task baseline for a previous-generation model on a hard benchmark.
- V1 paper: the default condenser is claimed to cut API spend by ~2× on long sessions with no measurable quality loss.
- Trivial task (a few file edits, no tests): $0.05–$0.30 per run on a frontier model.
- SWE-Bench-style real fix (explore + analyze + edit + verify): $0.50–$3 per task.
- Multi-hour autonomous run (resolver mode on a complex issue): $5–$30, easily more without a condenser.

- A retry wrapper.
- A cost tracker (pull `prompt_tokens`/`completion_tokens` off the response, multiply by your rate card; if the model returns reasoning tokens, account for them separately — they're often billed differently).
- A response classifier: did the model call a tool, return text, return reasoning only, or return nothing? Branch explicitly.

- A loop with no human in the middle by default. `Conversation.run()` doesn't ask permission — it runs until `FinishAction`, stuck detection, budget exhaustion, or explicit pause. Headless mode hard-codes this.
- Self-correction via observations. Every error becomes an Observation in the next prompt. The LLM literally sees its own stderr and adjusts.
- Long-horizon memory. The condenser lets sessions exceed the context window indefinitely. A persistent EventLog means full replay even after compression.
- Tool diversity = "anything a developer can do." Bash + Python + browser + file edit + MCP. The agent isn't shoehorned into 5 narrow operations.
- Microagents/Skills.
Conventions and project knowledge load automatically when triggered. The agent "knows" your project the moment it lands in the repo.
- Sub-agent delegation. Parallel exploration with isolated contexts. Big tasks decompose without the parent context blowing up.
- Stuck detection. Five semantic patterns, checked every step. Pathological loops die early.
- Budget controls. Max iterations, max retries, accumulated-cost tracking. Hard ceilings on runaway spend.
- Risk-aware confirmation policy. `ConfirmRisky` lets the agent fly through routine work and pause only on destructive ops.
- Replayable event log. When the agent screws up, you can rewind to any event and try a different model or prompt. Debug loops are short.
- Same code, multiple isolation levels. `LocalWorkspace` for dev, `DockerWorkspace` for prod. No code changes — meaning you actually use isolation in prod, instead of half-disabling it for "dev convenience."
- End-to-end resolver mode. The GitHub Action wires GitHub issue → sandbox → CodeActAgent → PR. No human in the loop. This is the maximum-autonomy configuration in production.

- Events + EventLog. Define `Event`, `ActionEvent`, `ObservationEvent`, `MessageEvent`. Append-only list, JSON-serializable, persistable.
- Tools: bash + finish. Two Pydantic action models, two executor classes. Use `subprocess.Popen` for bash; keep stdin open for persistence.
- LLM wrapper. LiteLLM call + retry + cost tracking. Function-calling tool format.
- `Agent.step()`. Build messages from events, call the LLM, classify the response, dispatch. Copy the 5-phase shape from §2.
- `Conversation.run()`. `while not finished: agent.step()`.
- `LocalWorkspace`. Dirt simple: cwd plus a tool registry.
- First end-to-end test. Give it "create a hello world Python script and run it". It should succeed.
- File edit tool. `str_replace_editor` semantics — read, then replace the exact string. Refuse if the string doesn't appear or appears more than once.
- `DockerWorkspace`. Build a small image.
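The exec server inside that image can be tiny. A minimal sketch follows — the guide's recipe suggests FastAPI, but plain `http.server` is used here so the sketch has no dependencies; all names are illustrative:

```python
# Single-endpoint action-execution server: POST /exec with
# {"kind": "bash"|"python", "body": "..."} returns exit code + output.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

TIMEOUT_SECONDS = 30  # watchdog: kill anything running over N seconds

def run_exec(kind: str, body: str) -> dict:
    cmd = ["bash", "-lc", body] if kind == "bash" else ["python3", "-c", body]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=TIMEOUT_SECONDS)
        return {"exit_code": proc.returncode,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"exit_code": -1, "stdout": "", "stderr": "killed: timeout"}

class ExecHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/exec":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        req = json.loads(self.rfile.read(length))
        result = run_exec(req["kind"], req["body"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve inside the container:
# HTTPServer(("0.0.0.0", 8000), ExecHandler).serve_forever()
```

Note this sketch spawns a fresh process per call; for `cd` and shell history to survive across actions you'd route bash commands into a persistent tmux or `subprocess.Popen` session, as the recipe recommends.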
Run a FastAPI server inside it that exposes `POST /exec`. Forward bash and file-edit actions over HTTP.
- Condenser. Threshold check + LLM-summarize-the-middle. Cache summaries.
- Skills loader. Parse `.agents/skills/*.md`, evaluate triggers, inject into the system prompt.
- Stuck detector. Sliding window + hash compare. Halt on repeats.
- Security gate. Add a risk field to tool schemas; pause when risk ≥ HIGH unless the user pre-approved.
- Sub-agent delegation tool. `ThreadPoolExecutor` over child Conversations sharing the workspace.

- Don't build a UI first. A CLI that prints events as they happen is enough to develop against. Headless mode is the actual product anyway.
- Don't write a controller class. It's just `while not done: agent.step()`. Adding ceremony hurts.
- Don't build a mock workspace. Use the real one (subprocess + cwd) from day 1. Mocks lie.
- Do log every prompt and response to disk. When the agent does something weird, you'll need to see exactly what it saw.
- Do use Pydantic for every event. Schemas catch 80% of bugs at the boundary.
- Do measure tokens and cost from step 3. Otherwise you'll get a $400 bill the first time you run an overnight loop.

- Browser automation (real browsing is hard; cover 90% of cases with curl/Python requests from bash).
- MCP integration (YAGNI until your tools outgrow the built-ins).
- Multi-agent delegation (a single agent + a good condenser handles surprisingly long tasks).
- Streaming. (Token streaming is a UI feature, not an autonomy feature.)

- The default condenser is aggressive (`max_size=80`, `keep_first=4`). Long sessions trigger it often. Real cost savings vary by workload.
- Stuck-detection thresholds are conservative (4+ identical pairs). An agent can burn meaningful tokens in a near-loop before being killed. Tune thresholds for your tolerance.
- Headless mode = always-approve = blast radius is whatever the workspace allows. Always use Docker in headless mode. Don't mount more than the working directory.
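The sliding-window hash check behind that threshold fits in a dozen lines. A simplified sketch of just the identical-pairs pattern (the real detector covers five patterns; names here are illustrative):

```python
# Flag the agent as stuck when the last N (action, observation) pairs
# are identical. N = 4 matches the conservative default described above.
import hashlib

def is_stuck(events: list[tuple[str, str]], threshold: int = 4) -> bool:
    """events: (action_repr, observation_repr) pairs, oldest first."""
    if len(events) < threshold:
        return False
    window = events[-threshold:]
    digests = {hashlib.sha256(repr(pair).encode()).hexdigest()
               for pair in window}
    return len(digests) == 1  # every recent pair hashed identically

# The conversation loop calls is_stuck(...) after each step and halts
# instead of burning tokens on another identical retry.
```

Lowering `threshold` kills loops sooner at the cost of more false positives; near-loops (slightly varying commands with the same error) need fuzzier matching than an exact hash.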
- 77% on SWE-Bench Verified is Claude Sonnet 4.5-dependent. Cheaper models drop hard (Qwen3 Coder 480B was 65%; smaller models do worse). The architecture isn't magic.
- The V0 → V1 split (Nov 2025) means a lot of public material describes a different codebase. The original arXiv paper describes V0; the new V1 paper and the SDK repo describe the architecture this guide focuses on. When you read OpenHands content, check the date.
- MCP integration is powerful but adds attack surface. External MCP servers run with the agent's privileges. Treat them like dependencies — pin and audit.
- Browsing is the flakiest tool. Site changes, JS-heavy pages, and bot detection make it unreliable. Reach for curl or library-level integrations whenever you can.

- OpenHands V1 SDK paper (arXiv 2511.03690) — the canonical architecture writeup. Read this first.
- Original OpenDevin paper (arXiv 2407.16741) — context on CodeAct and the original design.
- `Agent.step()` source in `openhands-sdk/openhands/sdk/agent/agent.py` — the loop, in 100 lines.
- Architecture overview docs — how the pieces fit.
- Skill format — the autonomy multiplier.
- Stuck detector guide — the loop-prevention patterns.
- Sub-agent delegation guide — parallel agents.
- CodeActAgent system prompt — the actual prompt text.
- Headless mode docs — the autonomous configuration.
- GitHub Action / Resolver — the issue → PR pipeline.

- "Function" → stateless Agent.
- "Event history" → append-only EventLog.
- "Next event" → Action, executed by the Workspace, producing an Observation.
- "Run in a loop" → Conversation, until Finish or stuck.
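That four-line mapping can be made concrete in a page of Python. This is a sketch of the shape, not the SDK's API; the LLM call is stubbed with a function that finishes immediately so the loop structure is visible:

```python
# The closing mental model as code: a stateless step function over an
# append-only event list, run in a loop until a finish event appears.
import subprocess
from dataclasses import dataclass

@dataclass
class Event:
    kind: str   # "message" | "action" | "observation" | "finish"
    body: str

def execute(action: str) -> str:
    """The 'Workspace': run the action, return an Observation."""
    proc = subprocess.run(["bash", "-lc", action],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout + proc.stderr

def step(events: list[Event]) -> Event:
    """The stateless 'Agent': map event history to the next event.
    A real agent builds a prompt from events and calls an LLM here;
    this stub just finishes so the loop shape is runnable."""
    return Event("finish", "done")

def run(task: str) -> list[Event]:
    """The 'Conversation': loop until Finish, appending every event."""
    events = [Event("message", task)]
    while events[-1].kind != "finish":
        nxt = step(events)
        events.append(nxt)
        if nxt.kind == "action":
            events.append(Event("observation", execute(nxt.body)))
    return events
```

Swap the stub in `step` for an LLM call that emits `action` events, and everything else in this guide — condenser, skills, stuck detection, security gate — hangs off that `events` list.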