
Computer Use Expert

A skill for building AI agents that interact with computers through vision and action. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · AI research · v1.0.0 · MIT

Computer Use Expert

Overview

Computer Use Expert covers the architecture, implementation, and deployment patterns for building AI agents that interact with computer interfaces through vision and action. These agents perceive screen state through screenshots, reason about what to do next using vision-language models, and execute mouse clicks, keyboard inputs, and shell commands to accomplish tasks on a desktop or in a browser.

Computer use agents represent a fundamentally different paradigm from API-based tool calling. Instead of structured data in and out, the agent sees pixels and must translate visual understanding into precise coordinate-based actions. This makes them capable of automating any software that has a GUI -- including legacy applications with no API -- but also introduces unique challenges around precision, safety, and cost.

Anthropic's Claude was the first frontier model to offer native computer use capabilities, supporting screenshot capture, mouse/keyboard control, bash execution, and file editing as built-in tools. This template covers both the Anthropic-native implementation and the general patterns for building computer use agents with any vision model.

When to Use

  • Automating GUI-based workflows in applications that lack APIs (legacy enterprise software, desktop apps)
  • Building testing agents that interact with web applications through visual understanding rather than DOM selectors
  • Creating agents that can navigate complex multi-step web processes (form filling, data extraction, procurement workflows)
  • Implementing agents that need full desktop control (not just browser) for tasks like IDE automation or system administration
  • Building RPA (Robotic Process Automation) replacements that adapt to UI changes instead of breaking on selector changes
  • Prototyping automations before building proper API integrations

Quick Start

# Set up a sandboxed computer use environment with Docker
mkdir -p computer-use-agent/{agent,sandbox,configs}
cd computer-use-agent

# Create Python environment
python -m venv .venv && source .venv/bin/activate
pip install anthropic pillow pyautogui

# For sandboxed execution (strongly recommended):
# create a Dockerfile for a virtual desktop
cat > Dockerfile << 'DOCKER'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
    xvfb x11vnc fluxbox xterm firefox \
    python3 python3-pip scrot supervisor
RUN useradd -m -s /bin/bash agent
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt
COPY --chown=agent:agent . /app
WORKDIR /app
USER agent
EXPOSE 5900 8080
CMD ["supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
DOCKER

# pip requirements must be one per line
printf '%s\n' anthropic pillow pyautogui fastapi uvicorn > requirements.txt
# agent/core.py - Computer use agent with Perception-Reasoning-Action loop
from anthropic import Anthropic
from PIL import Image
import base64
import io
import time


class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop for computer use.
    Captures screenshots, reasons with Claude, executes actions.
    """

    def __init__(self, anthropic_client: Anthropic = None):
        self.client = anthropic_client or Anthropic()
        self.model = "claude-sonnet-4-20250514"
        self.screen_width = 1280
        self.screen_height = 800
        self.max_steps = 50
        self.action_delay = 0.5  # seconds between actions

    def capture_screenshot(self) -> str:
        """Capture screen as base64 PNG."""
        import pyautogui
        screenshot = pyautogui.screenshot()
        screenshot = screenshot.resize(
            (self.screen_width, self.screen_height), Image.LANCZOS
        )
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def run(self, task: str) -> str:
        """Execute a task using the computer."""
        messages = [{"role": "user", "content": [
            {"type": "text", "text": f"Complete this task: {task}"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": self.capture_screenshot(),
            }},
        ]}]
        for step in range(self.max_steps):
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                tools=self._get_tools(),
                messages=messages,
            )
            messages.append({"role": "assistant", "content": response.content})

            # Check for tool use
            tool_uses = [b for b in response.content if b.type == "tool_use"]
            if not tool_uses:
                text = [b.text for b in response.content if hasattr(b, "text")]
                return "\n".join(text)

            # Execute tools and capture new screenshot
            results = []
            for tool_use in tool_uses:
                result = self._execute_tool(tool_use.name, tool_use.input)
                time.sleep(self.action_delay)
                # Capture screenshot after action
                screenshot = self.capture_screenshot()
                results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": [
                        {"type": "text", "text": result},
                        {"type": "image", "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot,
                        }},
                    ],
                })
            messages.append({"role": "user", "content": results})
        return "Task did not complete within step limit"

    def _get_tools(self) -> list:
        return [
            {
                "name": "computer",
                "type": "computer_20241022",
                "display_width_px": self.screen_width,
                "display_height_px": self.screen_height,
            },
            {"name": "bash", "type": "bash_20241022"},
            {"name": "str_replace_editor", "type": "text_editor_20241022"},
        ]

    def _execute_tool(self, name: str, input_data: dict) -> str:
        if name == "computer":
            return self._handle_computer(input_data)
        elif name == "bash":
            return self._handle_bash(input_data)
        elif name == "str_replace_editor":
            return self._handle_editor(input_data)
        return f"Unknown tool: {name}"

    def _handle_computer(self, input_data: dict) -> str:
        import pyautogui
        action = input_data.get("action")
        if action == "screenshot":
            return "Screenshot captured"
        elif action == "click":
            x, y = input_data["coordinate"]
            pyautogui.click(x, y)
            return f"Clicked at ({x}, {y})"
        elif action == "type":
            pyautogui.typewrite(input_data["text"], interval=0.02)
            return f"Typed {len(input_data['text'])} characters"
        elif action == "key":
            pyautogui.press(input_data["key"])
            return f"Pressed {input_data['key']}"
        elif action == "scroll":
            direction = input_data.get("direction", "down")
            amount = input_data.get("amount", 3)
            scroll_val = -amount if direction == "down" else amount
            pyautogui.scroll(scroll_val)
            return f"Scrolled {direction} by {amount}"
        elif action == "mouse_move":
            x, y = input_data["coordinate"]
            pyautogui.moveTo(x, y)
            return f"Moved mouse to ({x}, {y})"
        elif action == "drag":
            sx, sy = input_data["start_coordinate"]
            ex, ey = input_data["end_coordinate"]
            pyautogui.moveTo(sx, sy)
            pyautogui.drag(ex - sx, ey - sy)
            return f"Dragged from ({sx},{sy}) to ({ex},{ey})"
        return f"Unknown computer action: {action}"

    def _handle_bash(self, input_data: dict) -> str:
        import subprocess
        cmd = input_data.get("command", "")
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True,
                text=True, timeout=30, cwd="/home/agent",
            )
            output = result.stdout[:5000]
            if result.returncode != 0:
                output += f"\nSTDERR: {result.stderr[:2000]}"
            return output or "(no output)"
        except subprocess.TimeoutExpired:
            return "Command timed out after 30 seconds"

    def _handle_editor(self, input_data: dict) -> str:
        command = input_data.get("command")
        path = input_data.get("path", "")
        if command == "view":
            try:
                with open(path) as f:
                    return f.read()[:10000]
            except Exception as e:
                return str(e)
        elif command == "str_replace":
            old = input_data.get("old_str", "")
            new = input_data.get("new_str", "")
            try:
                with open(path) as f:
                    content = f.read()
                content = content.replace(old, new, 1)
                with open(path, "w") as f:
                    f.write(content)
                return f"Replaced in {path}"
            except Exception as e:
                return str(e)
        elif command == "create":
            file_text = input_data.get("file_text", "")
            with open(path, "w") as f:
                f.write(file_text)
            return f"Created {path}"
        return f"Unknown editor command: {command}"

Core Concepts

1. The Perception-Reasoning-Action Loop

Computer use agents follow a fundamentally different loop from tool-calling agents. The perception step is visual (screenshot), the reasoning step uses a vision-language model, and the action step translates into coordinate-based operations.

  +--------------+     +---------------+     +------------+
  |  PERCEIVE    |---->|    REASON     |---->|    ACT     |
  | (screenshot) |     | (vision LLM)  |     | (click/    |
  |              |     |               |     |  type/key) |
  +--------------+     +---------------+     +------------+
       ^                                          |
       |              +-----------+               |
       +--------------| FEEDBACK  |<--------------+
                      | (observe  |
                      |  result)  |
                      +-----------+

Key insight: There is a detectable 1-5 second pause during the "reason" step where the agent is completely still. This is normal -- the vision model needs time to process the screenshot and decide on the next action.
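The loop above can be reduced to a minimal sketch with perception, reasoning, and action injected as plain functions, so the control flow can be exercised without a display or an API key. All names here are illustrative, not part of the Anthropic SDK:

```python
def run_loop(perceive, reason, act, max_steps=50):
    """Perceive -> reason -> act until the reasoner declares completion."""
    for _ in range(max_steps):
        observation = perceive()        # screenshot
        decision = reason(observation)  # vision-model call
        if decision["action"] == "done":
            return decision["result"]
        act(decision)                   # click / type / key
    return None  # step limit reached
```

The real agent in `agent/core.py` follows the same shape, with the Anthropic tool-use protocol filling the reason and act roles.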

2. Sandboxed Environment Architecture

Computer use agents MUST run in isolated environments. Giving an AI direct access to your desktop with mouse and keyboard control is a critical security risk. Always use Docker containers with virtual displays.

# docker-compose.yml - Secure computer use environment
version: '3.8'
services:
  computer-use-agent:
    build: .
    ports:
      - "5900:5900"  # VNC for observation
      - "8080:8080"  # API for control
    security_opt:
      - no-new-privileges:true
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    networks:
      - agent-network
    read_only: true
    tmpfs:
      - /tmp
      - /run
    environment:
      - DISPLAY=:99
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}

networks:
  agent-network:
    driver: bridge
    internal: true  # No internet access by default
# supervisord.conf - Run virtual display + VNC + agent
[supervisord]
nodaemon=true

[program:xvfb]
command=Xvfb :99 -screen 0 1280x800x24
autorestart=true

[program:fluxbox]
command=fluxbox
environment=DISPLAY=":99"
autorestart=true

[program:x11vnc]
command=x11vnc -display :99 -forever -nopw
autorestart=true

[program:agent-api]
command=python3 -m uvicorn agent.api:app --host 0.0.0.0 --port 8080
directory=/app
environment=DISPLAY=":99"
autorestart=true

3. Screen Resolution and Coordinate Handling

Vision models process screenshots at fixed resolutions. You must ensure your screenshots match the resolution you tell the model about, or coordinates will be off.

# agent/screen.py - Resolution management
import base64
import io

from PIL import Image


class ScreenManager:
    """
    Manages screen resolution, scaling, and coordinate translation.
    Critical for accurate click targeting.
    """

    def __init__(self, display_width: int = 1280, display_height: int = 800):
        self.display_width = display_width
        self.display_height = display_height

    def capture_and_resize(self) -> tuple[str, tuple[float, float]]:
        """
        Capture screenshot and resize to the model's expected resolution.
        Returns (base64_image, (scale_x, scale_y)).
        """
        import pyautogui
        screenshot = pyautogui.screenshot()
        actual_width, actual_height = screenshot.size

        # Calculate scale factors
        scale_x = actual_width / self.display_width
        scale_y = actual_height / self.display_height

        # Resize to expected dimensions
        resized = screenshot.resize(
            (self.display_width, self.display_height), Image.LANCZOS,
        )
        buffer = io.BytesIO()
        resized.save(buffer, format="PNG")
        b64 = base64.b64encode(buffer.getvalue()).decode()
        return b64, (scale_x, scale_y)

    def translate_coordinates(
        self, model_x: int, model_y: int, scale: tuple[float, float]
    ) -> tuple[int, int]:
        """Translate model coordinates back to actual screen coordinates."""
        actual_x = int(model_x * scale[0])
        actual_y = int(model_y * scale[1])
        return actual_x, actual_y
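As a quick sanity check on the scale arithmetic, assume a 1920x1080 virtual display and the 1280x800 resolution reported to the model:

```python
def scale_factors(actual, model):
    """Per-axis ratio between real display size and model resolution."""
    return (actual[0] / model[0], actual[1] / model[1])

def translate(model_xy, scale):
    """Map a model-space coordinate back onto the real screen."""
    return (int(model_xy[0] * scale[0]), int(model_xy[1] * scale[1]))

scale = scale_factors((1920, 1080), (1280, 800))  # (1.5, 1.35)
# A click the model places at (640, 400) -- the center of its 1280x800
# view -- must land at (960, 540), the center of the real screen:
click = translate((640, 400), scale)
```

If you skip this translation, every click lands up or left of its target by a systematic margin, which is the signature of a resolution mismatch.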

4. Robust Action Execution with Retries

Screen-based interactions are inherently flaky. Elements load asynchronously, animations delay content, and popups block actions. Robust agents retry with visual verification.

# agent/actions.py - Action execution with verification
import time


class RobustActionExecutor:
    """Execute actions with visual verification and retry logic."""

    def __init__(self, screen_manager: ScreenManager, max_retries: int = 3):
        self.screen = screen_manager
        self.max_retries = max_retries

    def click_with_verify(
        self, x: int, y: int, expected_change: str = None
    ) -> dict:
        """Click at coordinates and optionally verify the screen changed."""
        import pyautogui
        before = self.screen.capture_and_resize()[0]
        pyautogui.click(x, y)
        time.sleep(0.5)  # Wait for UI to update
        after = self.screen.capture_and_resize()[0]

        # Simple change detection: compare the encoded screenshots
        changed = before != after
        return {
            "success": True,
            "screen_changed": changed,
            "coordinates": (x, y),
        }

    def wait_for_element(
        self, description: str, llm_client, timeout: int = 10
    ) -> bool:
        """
        Wait until a described element appears on screen.
        Uses the vision model to verify presence.
        """
        start = time.time()
        while time.time() - start < timeout:
            screenshot = self.screen.capture_and_resize()[0]
            response = llm_client.messages.create(
                model="claude-haiku-4",  # Fast model for verification
                max_tokens=100,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": (
                            f"Is this element visible on screen: "
                            f"'{description}'? Reply YES or NO only."
                        )},
                        {"type": "image", "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot,
                        }},
                    ],
                }],
            )
            if "YES" in response.content[0].text.upper():
                return True
            time.sleep(1)
        return False
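A hedged sketch of the retry loop the `max_retries` field implies: the click and change-detection callables are injected here so the logic can be tested without a display. The names are illustrative, not part of the class above.

```python
def click_until_changed(do_click, screen_changed, max_retries=3):
    """Retry a click until the screen visibly changes, up to max_retries."""
    for attempt in range(1, max_retries + 1):
        do_click()
        if screen_changed():
            return {"success": True, "attempts": attempt}
    return {"success": False, "attempts": max_retries}
```

In production the two callables would wrap `pyautogui.click` and a screenshot comparison like the one in `click_with_verify`.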

5. Cost Management

Computer use agents are expensive -- every action requires a screenshot (vision tokens) plus reasoning tokens. A single task can easily consume thousands of tokens.

# agent/cost.py - Token and cost tracking
class CostTracker:
    """Track and limit computer use costs."""

    # Approximate costs per 1K tokens (as of 2025)
    COST_TABLE = {
        "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
        "claude-haiku-4": {"input": 0.0008, "output": 0.004},
    }

    # Approximate tokens per screenshot at 1280x800
    SCREENSHOT_TOKENS = 1500

    def __init__(self, budget_usd: float = 5.0):
        self.budget = budget_usd
        self.total_cost = 0.0
        self.step_count = 0
        self.screenshots_sent = 0

    def record_step(self, model: str, input_tokens: int, output_tokens: int):
        costs = self.COST_TABLE.get(model, {"input": 0.01, "output": 0.03})
        step_cost = (
            (input_tokens / 1000) * costs["input"]
            + (output_tokens / 1000) * costs["output"]
        )
        self.total_cost += step_cost
        self.step_count += 1

    def can_continue(self) -> bool:
        return self.total_cost < self.budget

    def get_summary(self) -> dict:
        return {
            "total_cost_usd": round(self.total_cost, 4),
            "budget_remaining": round(self.budget - self.total_cost, 4),
            "steps": self.step_count,
            "avg_cost_per_step": round(
                self.total_cost / max(self.step_count, 1), 4
            ),
        }
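A back-of-envelope version of the same calculation is useful before a task starts. The ~3000-input-token figure (one screenshot plus accumulated history) and the per-1K prices are the approximations from the COST_TABLE above, not authoritative pricing:

```python
def estimate_task_cost(steps, in_tokens, out_tokens, in_price, out_price):
    """Rough pre-flight estimate; prices are per 1K tokens."""
    per_step = (in_tokens / 1000) * in_price + (out_tokens / 1000) * out_price
    return steps * per_step

# A 30-step task at ~3000 input / ~200 output tokens per step on Sonnet:
cost = estimate_task_cost(30, 3000, 200, 0.003, 0.015)  # ~0.36 USD
```

Comparing this estimate against the task budget before the first screenshot is sent avoids burning half the budget on tasks that were never affordable.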

Configuration Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DISPLAY_WIDTH | int | 1280 | Screenshot width sent to the model |
| DISPLAY_HEIGHT | int | 800 | Screenshot height sent to the model |
| MAX_STEPS | int | 50 | Maximum perception-action cycles |
| ACTION_DELAY | float | 0.5 | Seconds to wait between actions for UI to settle |
| COST_BUDGET_USD | float | 5.0 | Maximum spend per task |
| SANDBOX_TIMEOUT | int | 300 | Maximum seconds for the entire task |
| BASH_TIMEOUT | int | 30 | Maximum seconds per bash command |
| VNC_PORT | int | 5900 | Port for VNC observation |
| API_PORT | int | 8080 | Port for the control API |
| NETWORK_ACCESS | bool | false | Whether the sandbox has internet access |
| MAX_OUTPUT_LENGTH | int | 5000 | Truncate tool output beyond this character count |
| VERIFICATION_MODEL | string | claude-haiku-4 | Fast model used for visual verification checks |
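A minimal loader for these parameters might read them from environment variables with the documented defaults as fallbacks. The variable names match the table; the coercion rules are an assumption, not part of the template:

```python
import os

DEFAULTS = {
    "DISPLAY_WIDTH": 1280, "DISPLAY_HEIGHT": 800, "MAX_STEPS": 50,
    "ACTION_DELAY": 0.5, "COST_BUDGET_USD": 5.0, "SANDBOX_TIMEOUT": 300,
    "BASH_TIMEOUT": 30, "VNC_PORT": 5900, "API_PORT": 8080,
    "NETWORK_ACCESS": False, "MAX_OUTPUT_LENGTH": 5000,
    "VERIFICATION_MODEL": "claude-haiku-4",
}

def load_config(env=None):
    """Merge environment overrides onto defaults, coercing by default type."""
    env = os.environ if env is None else env
    cfg = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        if raw is None:
            cfg[key] = default
        elif isinstance(default, bool):  # check bool before int: bool subclasses int
            cfg[key] = raw.strip().lower() in ("1", "true", "yes")
        else:
            cfg[key] = type(default)(raw)
    return cfg
```

Checking `isinstance(default, bool)` before falling through to `type(default)(raw)` matters because `bool("false")` is truthy in Python.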

Best Practices

  1. Always sandbox computer use agents. Never run them on your host machine with direct access to your desktop, filesystem, and credentials. Use Docker containers with virtual displays. The blast radius of a mistake must be contained.

  2. Use the smallest effective resolution. 1280x800 is a good balance between visual clarity and token cost. Higher resolutions add cost without proportional improvement in action accuracy for most tasks.

  3. Add delays between actions. Web pages and applications need time to render after clicks and keystrokes. A 0.5-1.0 second delay between actions prevents the agent from acting on stale screenshots.

  4. Prefer keyboard shortcuts over mouse clicks when possible. Clicking requires accurate coordinate prediction, which can fail on small targets. Keyboard shortcuts (Ctrl+S, Ctrl+C, Tab, Enter) are far more reliable because they do not depend on coordinate accuracy at all.

  5. Implement cost tracking from the start. Each step costs 1500+ vision tokens plus reasoning. A 30-step task can easily cost $1-5. Set per-task budgets and surface cost data so users understand the trade-off.

  6. Use a fast model for verification checks. When waiting for an element to appear or verifying a click succeeded, use a cheap fast model (Haiku) instead of the main reasoning model (Sonnet/Opus).

  7. Log screenshots at every step for debugging. Save each screenshot with a step number. When an agent fails, you can replay the visual sequence to understand exactly where it went wrong.

  8. Restrict network access in the sandbox by default. Only open access to the specific domains the agent needs. An agent with unrestricted internet access could visit malicious sites or leak data.

  9. Handle dropdowns and scrollable menus carefully. These are the most common failure points for computer use agents. The vision model may struggle with z-index layering and disappearing elements. Use keyboard navigation (arrow keys) as a fallback.

  10. Set up human observation via VNC. Even in production, someone should be able to watch what the agent is doing. VNC provides real-time visibility into the sandboxed desktop.
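Practice 7 (logging screenshots per step) takes only a few lines. This is a sketch; the `run_logs` directory layout and file naming are assumptions:

```python
import base64
import os

def save_step_screenshot(b64_png, step, out_dir="run_logs"):
    """Persist each step's screenshot so failed runs can be replayed visually."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"step_{step:03d}.png")
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_png))
    return path
```

Calling this right after `capture_screenshot()` in the main loop gives you an ordered frame sequence per run, which pairs naturally with VNC observation (practice 10) for post-mortems.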

Troubleshooting

Problem: Agent clicks in the wrong location consistently. Solution: Verify that the screenshot resolution matches the display_width_px and display_height_px you pass to the tool definition. If your virtual display runs at 1920x1080 but you tell the model it is 1280x800, coordinates will be systematically offset. Implement coordinate translation if resolutions differ.

Problem: Agent gets stuck on a loading screen. Solution: Implement wait_for_element with a visual verification loop. The agent should check for the expected element every 1-2 seconds and only proceed when it appears. Set a timeout to prevent infinite waiting.

Problem: Screenshots consume too many tokens and costs are high. Solution: Reduce resolution to 1024x768 for simpler UIs. Consider cropping screenshots to the relevant region if the task only involves a portion of the screen. Use compression (JPEG at 80% quality instead of PNG) to reduce image size without significant quality loss.
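The JPEG suggestion can be a one-function change. This sketch assumes the Pillow API and that your model endpoint accepts `image/jpeg` sources (the Anthropic image block does); `effect_noise` stands in for a visually busy screenshot:

```python
import io

from PIL import Image

def encode_jpeg(screenshot, quality=80):
    """Re-encode a screenshot as JPEG to shrink the payload sent per step."""
    buf = io.BytesIO()
    screenshot.convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

img = Image.effect_noise((640, 400), 64)  # stand-in for a busy screenshot
jpeg_bytes = encode_jpeg(img)
```

Remember to change the `media_type` in the image source block to `image/jpeg` to match the new encoding.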

Problem: Agent cannot interact with dropdown menus. Solution: Dropdown menus are a known weakness. Train the agent to use keyboard navigation: click the dropdown, then use arrow keys to navigate options, then press Enter. This is more reliable than trying to click on specific dropdown items.
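The keyboard fallback can be made deterministic by computing the key sequence up front. This is a hypothetical helper, and it assumes the first option is highlighted when the dropdown opens, which varies by toolkit and must be verified per application:

```python
def dropdown_key_sequence(options, target):
    """Key presses to open a dropdown, move to `target`, and confirm it."""
    idx = options.index(target)  # raises ValueError if the option is absent
    return ["enter"] + ["down"] * idx + ["enter"]  # open, navigate, confirm

# Selecting "Canada" from a country dropdown:
keys = dropdown_key_sequence(["USA", "Canada", "Mexico"], "Canada")
```

Each key in the sequence would be dispatched through the agent's `key` action, avoiding coordinate prediction on overlay elements entirely.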

Problem: Docker container starts but VNC shows a black screen. Solution: Check that Xvfb is running with the correct display number (:99). Verify that the DISPLAY environment variable matches. Check supervisord logs for startup errors. Ensure fluxbox or another window manager is starting correctly.

Problem: Agent runs out of budget mid-task. Solution: Add cost estimation before starting a task. If the estimated cost exceeds the budget, warn the user. Break complex tasks into phases with checkpoints so partial progress is preserved when the budget runs out.
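The phased-budget idea can be sketched as follows: stop cleanly before any phase whose estimate would exceed the remaining budget, so completed work is preserved. The phase structure and names here are illustrative:

```python
def run_phases(phases, budget, run_phase):
    """phases: list of (name, estimated_cost); run_phase(name) -> actual cost."""
    spent, completed = 0.0, []
    for name, estimate in phases:
        if spent + estimate > budget:
            break  # the next phase would likely blow the budget
        spent += run_phase(name)
        completed.append(name)
    return {"completed": completed, "spent": spent}
```

Pairing this with a checkpoint (e.g. saving extracted data after each phase) means a budget exhaustion mid-task degrades to a partial result rather than a total loss.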
