
Computer Use Expert

A skill for building AI agents that interact with computers through vision and action. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · AI research · v1.0.0 · MIT

Computer Use Expert

Overview

Computer Use Expert covers the architecture, implementation, and deployment patterns for building AI agents that interact with computer interfaces through vision and action. These agents perceive screen state through screenshots, reason about what to do next using vision-language models, and execute mouse clicks, keyboard inputs, and shell commands to accomplish tasks on a desktop or in a browser.

Computer use agents represent a fundamentally different paradigm from API-based tool calling. Instead of structured data in and out, the agent sees pixels and must translate visual understanding into precise coordinate-based actions. This makes them capable of automating any software that has a GUI -- including legacy applications with no API -- but also introduces unique challenges around precision, safety, and cost.

Anthropic's Claude was the first frontier model to offer native computer use capabilities, supporting screenshot capture, mouse/keyboard control, bash execution, and file editing as built-in tools. This template covers both the Anthropic-native implementation and the general patterns for building computer use agents with any vision model.

When to Use

  • Automating GUI-based workflows in applications that lack APIs (legacy enterprise software, desktop apps)
  • Building testing agents that interact with web applications through visual understanding rather than DOM selectors
  • Creating agents that can navigate complex multi-step web processes (form filling, data extraction, procurement workflows)
  • Implementing agents that need full desktop control (not just browser) for tasks like IDE automation or system administration
  • Building RPA (Robotic Process Automation) replacements that adapt to UI changes instead of breaking on selector changes
  • Prototyping automations before building proper API integrations

Quick Start

# Set up a sandboxed computer use environment with Docker
mkdir -p computer-use-agent/{agent,sandbox,configs}
cd computer-use-agent

# Create Python environment
python -m venv .venv && source .venv/bin/activate
pip install anthropic pillow pyautogui

# For sandboxed execution (strongly recommended):
# create a Dockerfile for a virtual desktop
cat > Dockerfile << 'DOCKER'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
    xvfb x11vnc fluxbox xterm firefox \
    python3 python3-pip scrot supervisor
RUN useradd -m -s /bin/bash agent
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt
COPY --chown=agent:agent . /app
WORKDIR /app
USER agent
EXPOSE 5900 8080
CMD ["supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
DOCKER

# pip requirements must be one per line
printf '%s\n' anthropic pillow pyautogui fastapi uvicorn > requirements.txt
# agent/core.py - Computer use agent with Perception-Reasoning-Action loop
from anthropic import Anthropic
from PIL import Image
import base64
import io
import time


class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop for computer use.
    Captures screenshots, reasons with Claude, executes actions.
    """

    def __init__(self, anthropic_client: Anthropic = None):
        self.client = anthropic_client or Anthropic()
        self.model = "claude-sonnet-4-20250514"
        self.screen_width = 1280
        self.screen_height = 800
        self.max_steps = 50
        self.action_delay = 0.5  # seconds between actions

    def capture_screenshot(self) -> str:
        """Capture screen as base64 PNG."""
        import pyautogui
        screenshot = pyautogui.screenshot()
        screenshot = screenshot.resize(
            (self.screen_width, self.screen_height), Image.LANCZOS
        )
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def run(self, task: str) -> str:
        """Execute a task using the computer."""
        messages = [{"role": "user", "content": [
            {"type": "text", "text": f"Complete this task: {task}"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": self.capture_screenshot(),
            }},
        ]}]
        for step in range(self.max_steps):
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                tools=self._get_tools(),
                messages=messages,
            )
            messages.append({"role": "assistant", "content": response.content})

            # Check for tool use
            tool_uses = [b for b in response.content if b.type == "tool_use"]
            if not tool_uses:
                text = [b.text for b in response.content if hasattr(b, "text")]
                return "\n".join(text)

            # Execute tools and capture new screenshot
            results = []
            for tool_use in tool_uses:
                result = self._execute_tool(tool_use.name, tool_use.input)
                time.sleep(self.action_delay)
                # Capture screenshot after action
                screenshot = self.capture_screenshot()
                results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": [
                        {"type": "text", "text": result},
                        {"type": "image", "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot,
                        }},
                    ],
                })
            messages.append({"role": "user", "content": results})
        return "Task did not complete within step limit"

    def _get_tools(self) -> list:
        return [
            {
                "name": "computer",
                "type": "computer_20241022",
                "display_width_px": self.screen_width,
                "display_height_px": self.screen_height,
            },
            {"name": "bash", "type": "bash_20241022"},
            {"name": "str_replace_editor", "type": "text_editor_20241022"},
        ]

    def _execute_tool(self, name: str, input_data: dict) -> str:
        if name == "computer":
            return self._handle_computer(input_data)
        elif name == "bash":
            return self._handle_bash(input_data)
        elif name == "str_replace_editor":
            return self._handle_editor(input_data)
        return f"Unknown tool: {name}"

    def _handle_computer(self, input_data: dict) -> str:
        import pyautogui
        action = input_data.get("action")
        if action == "screenshot":
            return "Screenshot captured"
        elif action == "click":
            x, y = input_data["coordinate"]
            pyautogui.click(x, y)
            return f"Clicked at ({x}, {y})"
        elif action == "type":
            pyautogui.typewrite(input_data["text"], interval=0.02)
            return f"Typed {len(input_data['text'])} characters"
        elif action == "key":
            pyautogui.press(input_data["key"])
            return f"Pressed {input_data['key']}"
        elif action == "scroll":
            direction = input_data.get("direction", "down")
            amount = input_data.get("amount", 3)
            scroll_val = -amount if direction == "down" else amount
            pyautogui.scroll(scroll_val)
            return f"Scrolled {direction} by {amount}"
        elif action == "mouse_move":
            x, y = input_data["coordinate"]
            pyautogui.moveTo(x, y)
            return f"Moved mouse to ({x}, {y})"
        elif action == "drag":
            sx, sy = input_data["start_coordinate"]
            ex, ey = input_data["end_coordinate"]
            pyautogui.moveTo(sx, sy)
            pyautogui.drag(ex - sx, ey - sy)
            return f"Dragged from ({sx},{sy}) to ({ex},{ey})"
        return f"Unknown computer action: {action}"

    def _handle_bash(self, input_data: dict) -> str:
        import subprocess
        cmd = input_data.get("command", "")
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True,
                text=True, timeout=30, cwd="/home/agent",
            )
            output = result.stdout[:5000]
            if result.returncode != 0:
                output += f"\nSTDERR: {result.stderr[:2000]}"
            return output or "(no output)"
        except subprocess.TimeoutExpired:
            return "Command timed out after 30 seconds"

    def _handle_editor(self, input_data: dict) -> str:
        command = input_data.get("command")
        path = input_data.get("path", "")
        if command == "view":
            try:
                with open(path) as f:
                    return f.read()[:10000]
            except Exception as e:
                return str(e)
        elif command == "str_replace":
            old = input_data.get("old_str", "")
            new = input_data.get("new_str", "")
            try:
                with open(path) as f:
                    content = f.read()
                content = content.replace(old, new, 1)
                with open(path, "w") as f:
                    f.write(content)
                return f"Replaced in {path}"
            except Exception as e:
                return str(e)
        elif command == "create":
            file_text = input_data.get("file_text", "")
            with open(path, "w") as f:
                f.write(file_text)
            return f"Created {path}"
        return f"Unknown editor command: {command}"

Core Concepts

1. The Perception-Reasoning-Action Loop

Computer use agents follow a fundamentally different loop from tool-calling agents. The perception step is visual (screenshot), the reasoning step uses a vision-language model, and the action step translates into coordinate-based operations.

  +--------------+     +---------------+     +------------+
  |  PERCEIVE    |---->|    REASON     |---->|    ACT     |
  | (screenshot) |     | (vision LLM)  |     | (click/    |
  |              |     |               |     |  type/key) |
  +--------------+     +---------------+     +------------+
       ^                                          |
       |              +-----------+               |
       +--------------| FEEDBACK  |<--------------+
                      | (observe  |
                      |  result)  |
                      +-----------+

Key insight: There is a detectable 1-5 second pause during the "reason" step where the agent is completely still. This is normal -- the vision model needs time to process the screenshot and decide on the next action.
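The loop above can be reduced to a minimal sketch with perception, reasoning, and action injected as plain functions, so the control flow can be exercised without a display or an API key. All names here are illustrative, not part of the Anthropic SDK:

```python
def run_loop(perceive, reason, act, max_steps=50):
    """Perceive -> reason -> act until the reasoner declares completion."""
    for _ in range(max_steps):
        observation = perceive()        # screenshot
        decision = reason(observation)  # vision-model call
        if decision["action"] == "done":
            return decision["result"]
        act(decision)                   # click / type / key
    return None  # step limit reached
```

The real agent in `agent/core.py` follows the same shape, with the Anthropic tool-use protocol filling the reason and act roles.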

2. Sandboxed Environment Architecture

Computer use agents MUST run in isolated environments. Giving an AI direct access to your desktop with mouse and keyboard control is a critical security risk. Always use Docker containers with virtual displays.

# docker-compose.yml - Secure computer use environment
version: '3.8'
services:
  computer-use-agent:
    build: .
    ports:
      - "5900:5900"  # VNC for observation
      - "8080:8080"  # API for control
    security_opt:
      - no-new-privileges:true
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    networks:
      - agent-network
    read_only: true
    tmpfs:
      - /tmp
      - /run
    environment:
      - DISPLAY=:99
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}

networks:
  agent-network:
    driver: bridge
    internal: true  # No internet access by default
# supervisord.conf - Run virtual display + VNC + agent
[supervisord]
nodaemon=true

[program:xvfb]
command=Xvfb :99 -screen 0 1280x800x24
autorestart=true

[program:fluxbox]
command=fluxbox
environment=DISPLAY=":99"
autorestart=true

[program:x11vnc]
command=x11vnc -display :99 -forever -nopw
autorestart=true

[program:agent-api]
command=python3 -m uvicorn agent.api:app --host 0.0.0.0 --port 8080
directory=/app
environment=DISPLAY=":99"
autorestart=true

3. Screen Resolution and Coordinate Handling

Vision models process screenshots at fixed resolutions. You must ensure your screenshots match the resolution you tell the model about, or coordinates will be off.

# agent/screen.py - Resolution management
import base64
import io

from PIL import Image


class ScreenManager:
    """
    Manages screen resolution, scaling, and coordinate translation.
    Critical for accurate click targeting.
    """

    def __init__(self, display_width: int = 1280, display_height: int = 800):
        self.display_width = display_width
        self.display_height = display_height

    def capture_and_resize(self) -> tuple[str, tuple[float, float]]:
        """
        Capture screenshot and resize to the model's expected resolution.
        Returns (base64_image, (scale_x, scale_y)).
        """
        import pyautogui
        screenshot = pyautogui.screenshot()
        actual_width, actual_height = screenshot.size

        # Calculate scale factors
        scale_x = actual_width / self.display_width
        scale_y = actual_height / self.display_height

        # Resize to expected dimensions
        resized = screenshot.resize(
            (self.display_width, self.display_height), Image.LANCZOS,
        )
        buffer = io.BytesIO()
        resized.save(buffer, format="PNG")
        b64 = base64.b64encode(buffer.getvalue()).decode()
        return b64, (scale_x, scale_y)

    def translate_coordinates(
        self, model_x: int, model_y: int, scale: tuple[float, float]
    ) -> tuple[int, int]:
        """Translate model coordinates back to actual screen coordinates."""
        actual_x = int(model_x * scale[0])
        actual_y = int(model_y * scale[1])
        return actual_x, actual_y
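As a quick sanity check on the scale arithmetic, assume a 1920x1080 virtual display and the 1280x800 resolution reported to the model:

```python
def scale_factors(actual, model):
    """Per-axis ratio between real display size and model resolution."""
    return (actual[0] / model[0], actual[1] / model[1])

def translate(model_xy, scale):
    """Map a model-space coordinate back onto the real screen."""
    return (int(model_xy[0] * scale[0]), int(model_xy[1] * scale[1]))

scale = scale_factors((1920, 1080), (1280, 800))  # (1.5, 1.35)
# A click the model places at (640, 400) -- the center of its 1280x800
# view -- must land at (960, 540), the center of the real screen:
click = translate((640, 400), scale)
```

If you skip this translation, every click lands up or left of its target by a systematic margin, which is the signature of a resolution mismatch.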

4. Robust Action Execution with Retries

Screen-based interactions are inherently flaky. Elements load asynchronously, animations delay content, and popups block actions. Robust agents retry with visual verification.

# agent/actions.py - Action execution with verification
import time


class RobustActionExecutor:
    """Execute actions with visual verification and retry logic."""

    def __init__(self, screen_manager: ScreenManager, max_retries: int = 3):
        self.screen = screen_manager
        self.max_retries = max_retries

    def click_with_verify(
        self, x: int, y: int, expected_change: str = None
    ) -> dict:
        """Click at coordinates and optionally verify the screen changed."""
        import pyautogui
        before = self.screen.capture_and_resize()[0]
        pyautogui.click(x, y)
        time.sleep(0.5)  # Wait for UI to update
        after = self.screen.capture_and_resize()[0]

        # Simple change detection: compare the encoded screenshots
        changed = before != after
        return {
            "success": True,
            "screen_changed": changed,
            "coordinates": (x, y),
        }

    def wait_for_element(
        self, description: str, llm_client, timeout: int = 10
    ) -> bool:
        """
        Wait until a described element appears on screen.
        Uses the vision model to verify presence.
        """
        start = time.time()
        while time.time() - start < timeout:
            screenshot = self.screen.capture_and_resize()[0]
            response = llm_client.messages.create(
                model="claude-haiku-4",  # Fast model for verification
                max_tokens=100,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": (
                            f"Is this element visible on screen: "
                            f"'{description}'? Reply YES or NO only."
                        )},
                        {"type": "image", "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot,
                        }},
                    ],
                }],
            )
            if "YES" in response.content[0].text.upper():
                return True
            time.sleep(1)
        return False
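A hedged sketch of the retry loop the `max_retries` field implies: the click and change-detection callables are injected here so the logic can be tested without a display. The names are illustrative, not part of the class above.

```python
def click_until_changed(do_click, screen_changed, max_retries=3):
    """Retry a click until the screen visibly changes, up to max_retries."""
    for attempt in range(1, max_retries + 1):
        do_click()
        if screen_changed():
            return {"success": True, "attempts": attempt}
    return {"success": False, "attempts": max_retries}
```

In production the two callables would wrap `pyautogui.click` and a screenshot comparison like the one in `click_with_verify`.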

5. Cost Management

Computer use agents are expensive -- every action requires a screenshot (vision tokens) plus reasoning tokens. A single task can easily consume thousands of tokens.

# agent/cost.py - Token and cost tracking
class CostTracker:
    """Track and limit computer use costs."""

    # Approximate costs per 1K tokens (as of 2025)
    COST_TABLE = {
        "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
        "claude-haiku-4": {"input": 0.0008, "output": 0.004},
    }

    # Approximate tokens per screenshot at 1280x800
    SCREENSHOT_TOKENS = 1500

    def __init__(self, budget_usd: float = 5.0):
        self.budget = budget_usd
        self.total_cost = 0.0
        self.step_count = 0
        self.screenshots_sent = 0

    def record_step(self, model: str, input_tokens: int, output_tokens: int):
        costs = self.COST_TABLE.get(model, {"input": 0.01, "output": 0.03})
        step_cost = (
            (input_tokens / 1000) * costs["input"]
            + (output_tokens / 1000) * costs["output"]
        )
        self.total_cost += step_cost
        self.step_count += 1

    def can_continue(self) -> bool:
        return self.total_cost < self.budget

    def get_summary(self) -> dict:
        return {
            "total_cost_usd": round(self.total_cost, 4),
            "budget_remaining": round(self.budget - self.total_cost, 4),
            "steps": self.step_count,
            "avg_cost_per_step": round(
                self.total_cost / max(self.step_count, 1), 4
            ),
        }
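A back-of-envelope version of the same calculation is useful before a task starts. The ~3000-input-token figure (one screenshot plus accumulated history) and the per-1K prices are the approximations from the COST_TABLE above, not authoritative pricing:

```python
def estimate_task_cost(steps, in_tokens, out_tokens, in_price, out_price):
    """Rough pre-flight estimate; prices are per 1K tokens."""
    per_step = (in_tokens / 1000) * in_price + (out_tokens / 1000) * out_price
    return steps * per_step

# A 30-step task at ~3000 input / ~200 output tokens per step on Sonnet:
cost = estimate_task_cost(30, 3000, 200, 0.003, 0.015)  # ~0.36 USD
```

Comparing this estimate against the task budget before the first screenshot is sent avoids burning half the budget on tasks that were never affordable.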

Configuration Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DISPLAY_WIDTH | int | 1280 | Screenshot width sent to the model |
| DISPLAY_HEIGHT | int | 800 | Screenshot height sent to the model |
| MAX_STEPS | int | 50 | Maximum perception-action cycles |
| ACTION_DELAY | float | 0.5 | Seconds to wait between actions for UI to settle |
| COST_BUDGET_USD | float | 5.0 | Maximum spend per task |
| SANDBOX_TIMEOUT | int | 300 | Maximum seconds for the entire task |
| BASH_TIMEOUT | int | 30 | Maximum seconds per bash command |
| VNC_PORT | int | 5900 | Port for VNC observation |
| API_PORT | int | 8080 | Port for the control API |
| NETWORK_ACCESS | bool | false | Whether the sandbox has internet access |
| MAX_OUTPUT_LENGTH | int | 5000 | Truncate tool output beyond this character count |
| VERIFICATION_MODEL | string | claude-haiku-4 | Fast model used for visual verification checks |
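A minimal loader for these parameters might read them from environment variables with the documented defaults as fallbacks. The variable names match the table; the coercion rules are an assumption, not part of the template:

```python
import os

DEFAULTS = {
    "DISPLAY_WIDTH": 1280, "DISPLAY_HEIGHT": 800, "MAX_STEPS": 50,
    "ACTION_DELAY": 0.5, "COST_BUDGET_USD": 5.0, "SANDBOX_TIMEOUT": 300,
    "BASH_TIMEOUT": 30, "VNC_PORT": 5900, "API_PORT": 8080,
    "NETWORK_ACCESS": False, "MAX_OUTPUT_LENGTH": 5000,
    "VERIFICATION_MODEL": "claude-haiku-4",
}

def load_config(env=None):
    """Merge environment overrides onto defaults, coercing by default type."""
    env = os.environ if env is None else env
    cfg = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        if raw is None:
            cfg[key] = default
        elif isinstance(default, bool):  # check bool before int: bool subclasses int
            cfg[key] = raw.strip().lower() in ("1", "true", "yes")
        else:
            cfg[key] = type(default)(raw)
    return cfg
```

Checking `isinstance(default, bool)` before falling through to `type(default)(raw)` matters because `bool("false")` is truthy in Python.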

Best Practices

  1. Always sandbox computer use agents. Never run them on your host machine with direct access to your desktop, filesystem, and credentials. Use Docker containers with virtual displays. The blast radius of a mistake must be contained.

  2. Use the smallest effective resolution. 1280x800 is a good balance between visual clarity and token cost. Higher resolutions add cost without proportional improvement in action accuracy for most tasks.

  3. Add delays between actions. Web pages and applications need time to render after clicks and keystrokes. A 0.5-1.0 second delay between actions prevents the agent from acting on stale screenshots.

  4. Prefer keyboard shortcuts over mouse clicks when possible. Clicking requires accurate coordinate prediction, which can fail on small targets. Keyboard shortcuts (Ctrl+S, Ctrl+C, Tab, Enter) are far more reliable because they do not depend on coordinate accuracy at all.

  5. Implement cost tracking from the start. Each step costs 1500+ vision tokens plus reasoning. A 30-step task can easily cost $1-5. Set per-task budgets and surface cost data so users understand the trade-off.

  6. Use a fast model for verification checks. When waiting for an element to appear or verifying a click succeeded, use a cheap fast model (Haiku) instead of the main reasoning model (Sonnet/Opus).

  7. Log screenshots at every step for debugging. Save each screenshot with a step number. When an agent fails, you can replay the visual sequence to understand exactly where it went wrong.

  8. Restrict network access in the sandbox by default. Only open access to the specific domains the agent needs. An agent with unrestricted internet access could visit malicious sites or leak data.

  9. Handle dropdowns and scrollable menus carefully. These are the most common failure points for computer use agents. The vision model may struggle with z-index layering and disappearing elements. Use keyboard navigation (arrow keys) as a fallback.

  10. Set up human observation via VNC. Even in production, someone should be able to watch what the agent is doing. VNC provides real-time visibility into the sandboxed desktop.
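Practice 7 (logging screenshots per step) takes only a few lines. This is a sketch; the `run_logs` directory layout and file naming are assumptions:

```python
import base64
import os

def save_step_screenshot(b64_png, step, out_dir="run_logs"):
    """Persist each step's screenshot so failed runs can be replayed visually."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"step_{step:03d}.png")
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_png))
    return path
```

Calling this right after `capture_screenshot()` in the main loop gives you an ordered frame sequence per run, which pairs naturally with VNC observation (practice 10) for post-mortems.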

Troubleshooting

Problem: Agent clicks in the wrong location consistently. Solution: Verify that the screenshot resolution matches the display_width_px and display_height_px you pass to the tool definition. If your virtual display runs at 1920x1080 but you tell the model it is 1280x800, coordinates will be systematically offset. Implement coordinate translation if resolutions differ.

Problem: Agent gets stuck on a loading screen. Solution: Implement wait_for_element with a visual verification loop. The agent should check for the expected element every 1-2 seconds and only proceed when it appears. Set a timeout to prevent infinite waiting.

Problem: Screenshots consume too many tokens and costs are high. Solution: Reduce resolution to 1024x768 for simpler UIs. Consider cropping screenshots to the relevant region if the task only involves a portion of the screen. Use compression (JPEG at 80% quality instead of PNG) to reduce image size without significant quality loss.
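The JPEG suggestion can be a one-function change. This sketch assumes the Pillow API and that your model endpoint accepts `image/jpeg` sources (the Anthropic image block does); `effect_noise` stands in for a visually busy screenshot:

```python
import io

from PIL import Image

def encode_jpeg(screenshot, quality=80):
    """Re-encode a screenshot as JPEG to shrink the payload sent per step."""
    buf = io.BytesIO()
    screenshot.convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

img = Image.effect_noise((640, 400), 64)  # stand-in for a busy screenshot
jpeg_bytes = encode_jpeg(img)
```

Remember to change the `media_type` in the image source block to `image/jpeg` to match the new encoding.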

Problem: Agent cannot interact with dropdown menus. Solution: Dropdown menus are a known weakness. Train the agent to use keyboard navigation: click the dropdown, then use arrow keys to navigate options, then press Enter. This is more reliable than trying to click on specific dropdown items.
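The keyboard fallback can be made deterministic by computing the key sequence up front. This is a hypothetical helper, and it assumes the first option is highlighted when the dropdown opens, which varies by toolkit and must be verified per application:

```python
def dropdown_key_sequence(options, target):
    """Key presses to open a dropdown, move to `target`, and confirm it."""
    idx = options.index(target)  # raises ValueError if the option is absent
    return ["enter"] + ["down"] * idx + ["enter"]  # open, navigate, confirm

# Selecting "Canada" from a country dropdown:
keys = dropdown_key_sequence(["USA", "Canada", "Mexico"], "Canada")
```

Each key in the sequence would be dispatched through the agent's `key` action, avoiding coordinate prediction on overlay elements entirely.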

Problem: Docker container starts but VNC shows a black screen. Solution: Check that Xvfb is running with the correct display number (:99). Verify that the DISPLAY environment variable matches. Check supervisord logs for startup errors. Ensure fluxbox or another window manager is starting correctly.

Problem: Agent runs out of budget mid-task. Solution: Add cost estimation before starting a task. If the estimated cost exceeds the budget, warn the user. Break complex tasks into phases with checkpoints so partial progress is preserved when the budget runs out.
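The phased-budget idea can be sketched as follows: stop cleanly before any phase whose estimate would exceed the remaining budget, so completed work is preserved. The phase structure and names here are illustrative:

```python
def run_phases(phases, budget, run_phase):
    """phases: list of (name, estimated_cost); run_phase(name) -> actual cost."""
    spent, completed = 0.0, []
    for name, estimate in phases:
        if spent + estimate > budget:
            break  # the next phase would likely blow the budget
        spent += run_phase(name)
        completed.append(name)
    return {"completed": completed, "spent": spent}
```

Pairing this with a checkpoint (e.g. saving extracted data after each phase) means a budget exhaustion mid-task degrades to a partial result rather than a total loss.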
