Automating a Browser with Anthropic’s Computer Use to Play Tic-Tac-Toe

2025-12-16 · admin

Source: Dev.to

For years, the “agent” story was mostly text → API calls → text. That works when software exposes clean endpoints, but the real world is full of:

- Legacy UIs with no API
- SaaS products where the API is incomplete or locked down
- Workflows that span apps (browser + spreadsheet + admin UI)
- Tasks where the UI is the source of truth (what’s visible, what’s enabled, what error banners appear)

Provider-native computer use tools are a response to that gap: they let a model operate software the same way a human does, by seeing the screen and performing input actions. OpenAI frames this as a “Computer-Using Agent” capability aimed at controlling real interfaces and measuring progress on benchmarks like OSWorld (a sign they’re treating UI control as a first-class modality, not a hack) (OpenAI: Computer-Using Agent). Anthropic positions “computer use” as enabling Claude to interact with existing interfaces directly while highlighting operational safety concerns (e.g., isolate execution in a dedicated environment) (Anthropic computer use docs, Anthropic announcement).

Under the hood, the important idea is standardization:

- Providers define a tool schema (action types, fields, image formats).
- They train (and safety-tune) models to reliably emit that schema.
- They enforce constraints (environment type, context handling) that make the loop workable in production.

That’s why these tools matter: you’re not just “running Selenium with an LLM”; you’re using a model/tool pair designed together as a control system.

## What “computer use” enables at a technical level

Provider computer-use is basically a minimal OS/UI control API with three properties:

## 1) A perception channel grounded in pixels

The model can request a screenshot and interpret UI state: text, layout, icons, highlights, banners, disabled buttons, etc. This is the “state observation” step in a control loop.

## 2) A constrained action vocabulary

Instead of arbitrary code execution, the model emits actions like:

- click / move / drag
- type / keypress
- screenshot (again)

This constraint is good: fewer degrees of freedom means fewer unsafe/irreversible actions and more predictable orchestration.

## 3) Closed-loop autonomy

The model can iterate: observe → act → observe, handling uncertainty and recovery:

- “Did my click land?”
- “Did the UI change?”
- “Do I need to wait for the next state?”

This is what makes “computer use” different from one-shot vision: it’s not just recognition; it’s interactive control.

## How your Tic-Tac-Toe project leverages these tools (and what it demonstrates)

This demo is valuable because it isolates the core computer-use loop without lots of app complexity, and still exposes the hard parts.

## 1) The UI becomes the “API surface”

Your agent does not get a structured board array. It must infer the board from screenshots and interact via clicks. That’s the entire point of computer-use: operate systems where the UI is the interface.

To make that reliable, the project adds an important “agent affordance”: cell labels (TOP-LEFT, CENTER, …). This is a general pattern: if you want robust UI control, you design UI elements that are easy for vision models to anchor on (stable text, consistent placement, clear state cues).

## 2) You turn the model into a controller, not a narrator

The implementation forces an explicit loop:

- Take screenshot
- Choose a move
- Take screenshot to verify
- Wait for opponent

That “verify after action” step is the difference between a demo that “usually works” and one that can recover from inevitable UI mistakes.

## 3) You anchor termination to UI truth (critical for reliability)

Both prompts insist the agent must only end the game when it sees the on-screen banner (“Player X wins!”, “It’s a draw!”), not when it believes it has three in a row.

This is a broadly applicable safety/reliability pattern for computer-use:

- Never end (or submit, pay, delete, send) based on internal inference alone
- Require screen evidence for critical transitions

It reduces hallucinated “success” and makes runs auditable.

## 4) You surface real provider constraints (OpenAI truncation, Anthropic context bloat)

Provider-native tools come with operational requirements that show up immediately in multi-step UI loops:

- OpenAI: your agent sets truncation: "auto" because OpenAI’s computer-use flow expects automatic truncation to keep long interactive sessions viable (OpenAI computer use guide). This is a concrete example of “provider tool != generic LLM call”; there are mode-specific runtime contracts.
- Anthropic: your agent uses middleware to clear old tool uses (screenshots). That’s essentially context garbage collection, and it’s not optional in screenshot-heavy loops. Without pruning, you hit context limits or degrade performance as stale observations pile up.

This is one of the biggest “why computer-use is hard” lessons: the environment is unstructured, and the data (images) is heavy.

## 5) You demonstrate why providers add more than computer control: persistent memory

The Anthropic player adds a native memory tool and stores learnings as markdown (strategy, opponent patterns, mistakes). In practice, this turns a single-session controller into something that can:

- review prior outcomes before starting
- encode opponent-specific openings
- avoid repeating mistakes across games

The demo’s memory files show exactly the value proposition: the agent loses once due to a missed threat, then blocks the same pattern next game. That’s a minimal but real example of “agent improvement” that’s hard to get from prompts alone.

## Why this matters beyond Tic-Tac-Toe

This project is a good representation of where computer-use shines and where it bites:

- Shines when you need to automate UI-only workflows quickly, without building bespoke integrations.
- Bites because reliability depends on:
  - UI stability and “readability”
  - verification loops
  - context management
  - isolation/sandboxing (providers explicitly recommend this for safety) (Anthropic computer use docs)

In other words: computer-use is best understood as a systems discipline, a control loop combining model behavior, tool constraints, UI design, and runtime safeguards.
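To make the "screenshot, choose a move, screenshot to verify" loop and the "end only on the on-screen banner" rule concrete, here is a minimal sketch in Python. It is not the project's code and uses no provider API. `FakeBoardUI`, `choose_move`, and `play` are hypothetical names, the cell labels beyond TOP-LEFT and CENTER are assumed, and the "screenshot" is a plain dict standing in for pixels. For brevity one placeholder policy plays both marks instead of waiting for a real opponent.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed label set; the article only names TOP-LEFT and CENTER.
CELLS = ["TOP-LEFT", "TOP-CENTER", "TOP-RIGHT",
         "MIDDLE-LEFT", "CENTER", "MIDDLE-RIGHT",
         "BOTTOM-LEFT", "BOTTOM-CENTER", "BOTTOM-RIGHT"]
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


@dataclass
class FakeBoardUI:
    """Hypothetical stand-in for the browser page."""
    cells: dict = field(default_factory=lambda: {c: "" for c in CELLS})
    banner: Optional[str] = None

    def screenshot(self) -> dict:
        # The only channel through which the controller learns UI state.
        return {"cells": dict(self.cells), "banner": self.banner}

    def click(self, label: str, mark: str) -> None:
        # Clicks on occupied cells or finished games are silently ignored,
        # so the caller cannot assume a click succeeded.
        if self.banner is None and self.cells[label] == "":
            self.cells[label] = mark
            self._update_banner()

    def _update_banner(self) -> None:
        for a, b, c in LINES:
            m = self.cells[CELLS[a]]
            if m and m == self.cells[CELLS[b]] == self.cells[CELLS[c]]:
                self.banner = f"Player {m} wins!"
                return
        if all(self.cells.values()):
            self.banner = "It's a draw!"


def choose_move(observation: dict) -> str:
    # Placeholder policy (first empty cell); a model would decide here.
    return next(c for c in CELLS if observation["cells"][c] == "")


def play(ui: FakeBoardUI, max_steps: int = 20) -> str:
    mark = "X"
    for _ in range(max_steps):
        before = ui.screenshot()            # observe
        if before["banner"]:                # terminate only on screen evidence
            return before["banner"]
        target = choose_move(before)
        ui.click(target, mark)              # act
        after = ui.screenshot()             # verify the click landed
        if after["cells"][target] != mark:
            continue                        # retry from a fresh observation
        mark = "O" if mark == "X" else "X"
    raise RuntimeError("no banner seen within the step budget")


if __name__ == "__main__":
    print(play(FakeBoardUI()))  # Player X wins!
```

The controller never computes the winner itself. It returns only the banner text it observed, which is what keeps a run auditable.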
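The "context garbage collection" idea from the provider-constraints section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the middleware the project uses and not any provider's API. The message shape (`kind`, `content`), the `prune_screenshots` function, and the placeholder text are all hypothetical.

```python
from typing import Any

PLACEHOLDER = "[screenshot removed to save context]"


def prune_screenshots(messages: list[dict[str, Any]],
                      keep_last: int = 2) -> list[dict[str, Any]]:
    """Return a new history in which all but the newest `keep_last`
    screenshots are replaced by a short text stub."""
    shot_positions = [i for i, m in enumerate(messages)
                      if m.get("kind") == "screenshot"]
    if keep_last > 0:
        stale = set(shot_positions[:-keep_last])
    else:
        stale = set(shot_positions)
    pruned = []
    for i, message in enumerate(messages):
        if i in stale:
            # Keep the turn in place so the conversation order is preserved.
            pruned.append({"kind": "text", "content": PLACEHOLDER})
        else:
            pruned.append(message)
    return pruned


if __name__ == "__main__":
    history = [
        {"kind": "text", "content": "start the game"},
        {"kind": "screenshot", "content": b"frame-1"},
        {"kind": "screenshot", "content": b"frame-2"},
        {"kind": "screenshot", "content": b"frame-3"},
    ]
    for message in prune_screenshots(history, keep_last=2):
        print(message["kind"])
```

Replacing stale frames with a stub, instead of deleting the turn, keeps the conversation structure intact while dropping the heavy image payload. Whether a given provider's tooling takes the same approach is not something this sketch claims.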

🏷️ Tags

how-to, tutorial, guide, dev.to, ai, openai, llm