How Deep Agents Actually Work: A Browsr Architecture Walkthrough

Source: Dev.to

## What Makes an Agent “Deep”

Deep agents don't fail loudly; they drift. Once an agent runs for 10–50 steps, debugging becomes guesswork. You don't know which tool call caused the issue, why the plan changed, or where cost and context exploded. In this post, we'll break down how a real deep agent works under the hood by walking through the architecture of Browsr, a browser-based deep agent, and observing its execution step by step.

A deep agent is one that:

- Keeps a running plan / TODO list of what still needs to be done.
- Uses tools (like a browser, shell, or APIs) to act in the world step by step.
- Stores persistent memory (artifacts, notes, intermediate results) so it doesn't forget earlier work.
- Regularly evaluates its own progress, adjusts the plan, and retries when something fails.

Because it can plan, remember, and correct itself, a deep agent can run for a long time, tens or even hundreds of steps, without losing the thread of the task. Let's debug and observe Browsr using vLLora (a tool for agent observability) and see what happens under the hood.

## Browsr

Browsr is a headless browser agent that lets you create sequences using a deep agent pattern and then hands you the payloads to run over APIs at scale. It also exports website data as structured or LLM-friendly markdown. At a high level, Browsr is a deep agent that:

- Plans its next action explicitly
- Executes browser commands in controlled steps
- Persists state between iterations
- Evaluates progress before continuing

You can explore the definition and related configurations in this repo. Note: always respect the copyright rules and terms of the sites you scrape.

## Debugging with vLLora

To make the execution observable, we'll inspect the agent using request-level traces and timelines captured during execution. vLLora lets you debug and observe your agents locally. It can help us better understand the architecture, inspect tool calls, and follow the full agent timeline, and it works with all popular models.

Browsr iterates in bursts of 1–3 commands per step, saves context to artifacts, and completes the task with a final tool call. Its architecture comes down to four ideas:

- Driver: browser_step is the main executor; every turn runs 1–3 browser commands with explicit thinking, evaluation_previous_goal, memory, and next_goal.
- Context control: Large tool outputs are written to disk so the model can drop token-heavy responses and reload them on demand.
- Stateful loop: Up to eight iterations, each grounded in the latest observation block (DOM + screenshot) to avoid hallucinating.
- Strict tool contract: Exactly one tool call per reply (no free text), keeping the agent deterministic and debuggable.

Let's examine the tool definitions further. browser_step is the driver between steps. The system prompt forces the model to read the latest DOM and screenshot, report the current state, and then decide what to do next. Each turn must include:

- thinking: Reasoning about the current state.
- evaluation_previous_goal: Verdict on the last step.
- next_goal: Next immediate goal, in one sentence.
- commands: Array of commands to be executed.

You can check out the full agent definition here.
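To make that contract concrete, here is a minimal sketch of what a single browser_step turn payload could look like. The field names follow the contract above; the command types ("click", "evaluate"), selectors, and values are illustrative assumptions for this post, not Browsr's actual schema, which lives in the repo linked above.

```python
# Illustrative sketch of a single browser_step turn payload, assembled
# from the contract described above. The command types, selectors, and
# values are assumptions for this example, not Browsr's actual schema.
import json

turn = {
    "thinking": "The results page has loaded and the first link points to "
                "the page I need to extract data from.",
    "evaluation_previous_goal": "Success: the previous navigation rendered the results.",
    "memory": "Step 1 navigated to the search page; no data extracted yet.",
    "next_goal": "Open the first result and return its headings as structured data.",
    "commands": [
        {"type": "click", "selector": "a.result:nth-of-type(1)"},
        {"type": "evaluate",
         "script": "JSON.stringify([...document.querySelectorAll('h2')]"
                   ".map(h => h.textContent))"},
    ],
}

# Strict tool contract: the whole turn is the argument of exactly one
# browser_step tool call, with no free text around it.
print(json.dumps(turn, indent=2))
```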
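The context-control behavior, writing large tool outputs to disk so the model can drop token-heavy responses and reload them on demand, can be sketched roughly as follows. The artifact directory, size threshold, and helper names here are hypothetical and only illustrate the pattern, not Browsr's actual implementation.

```python
# Rough sketch of context offloading: large tool outputs are persisted
# as artifacts on disk, and only a short pointer stays in the model's
# context. Threshold, paths, and helper names are illustrative.
import hashlib
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")
MAX_INLINE_CHARS = 2_000  # hypothetical cutoff for keeping output inline

def offload_if_large(tool_output: str) -> str:
    """Return what goes into context: the raw output or a pointer to an artifact."""
    if len(tool_output) <= MAX_INLINE_CHARS:
        return tool_output
    ARTIFACT_DIR.mkdir(exist_ok=True)
    name = hashlib.sha256(tool_output.encode()).hexdigest()[:12] + ".md"
    path = ARTIFACT_DIR / name
    path.write_text(tool_output, encoding="utf-8")
    # The model sees only this stub; a read-artifact style tool can reload it later.
    return f"[output saved to {path} ({len(tool_output)} chars); load it if needed]"

def load_artifact(pointer_path: str) -> str:
    """Reload a previously offloaded artifact when the agent asks for it."""
    return Path(pointer_path).read_text(encoding="utf-8")
```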
## Sample Traces

In one representative run, Browsr used the available context to navigate in step one, click in step two, and then run a JS evaluation in step three to return structured data from the page.

## Average cost and no. of steps using gpt-4.1-mini

- Average cost per trace ≈ $0.0303 per run
- Average steps ≈ 10.5 per run

## Why Observability Is Critical for Deep Agents

Once agents move beyond single-shot prompts, debugging stops being straightforward. Engineers often find themselves tweaking system prompts, stepping through tool calls, and guessing what went wrong somewhere in the middle of a long run. When an agent executes 50+ steps and makes hundreds of decisions, failures rarely have a single obvious cause. This is where observability becomes essential.

- Drift over time: An agent may start out doing exactly what you expect, then gradually veer off course due to noisy context, misinterpreted instructions, or a small mistake early on that compounds across later steps.
- Cost and context visibility: Without traces, it's hard to see where tokens spike, context balloons, or expensive branches are triggered, especially when comparing behavior across different models.
- Traceable decisions: Lining up what the agent read, decided, and executed at each step makes cause and effect visible instead of speculative.
- End-to-end execution clarity: Long-running agents blur where time and money are spent: planning, tool execution, retries, or extraction. Observability provides the full picture.

Tools like vLLora make this practical by exposing request-level traces and timelines, allowing you to see what a deep agent is actually doing across an entire run, not just the final output.

## Key Takeaways

- Deep agents fail gradually, not catastrophically
- Observability turns debugging from guesswork into inspection
- Cost, context, and behavior are architectural concerns
- Deterministic tool execution makes long runs understandable

As deep agents become more common, observability isn't optional; it's the difference between hoping an agent works and knowing why it does. If you want to discuss observability patterns, agent anatomy, or agent tooling in more detail, join the vLLora Slack community to connect with other developers.