Last verified: 2026-05-06 · Drift risk: medium · Official sources: Anthropic computer use tool, Anthropic computer use launch
# Anthropic Computer Use
Anthropic introduced computer use in October 2024 as a beta capability. It lets Claude see a screenshot of a desktop environment and return actions — mouse moves, clicks, keyboard input — that your code then executes. The loop continues until the task is complete or the model stops returning tool calls.
## Tool versions
There are two current tool type strings:
| Tool type | Beta header | Supported models |
|---|---|---|
| `computer_20251124` | `computer-use-2025-11-24` | Claude Opus 4.7, Opus 4.6, Sonnet 4.6, Opus 4.5 |
| `computer_20250124` | `computer-use-2025-01-24` | Sonnet 4.5, Haiku 4.5, Opus 4.1, Sonnet 4, Opus 4, Sonnet 3.7 (deprecated) |
Use the latest tool type with the latest supported model unless you have a specific reason to pin an older version. The `computer_20250124` tool type and its beta header are deprecated.
## Action types
`computer_20250124` supports:

- `screenshot` — capture the current display state
- `left_click` — click at `[x, y]`
- `type` — type a text string
- `key` — press a key or key combination (for example, `"ctrl+s"`, `"Return"`)
- `mouse_move` — move cursor to `[x, y]`
`computer_20251124` adds:

- `scroll` — scroll in any direction with amount control
- `left_click_drag` — click and drag between two coordinate pairs
- `right_click`, `middle_click` — additional mouse buttons
- `double_click`, `triple_click` — repeated clicks
- `left_mouse_down`, `left_mouse_up` — fine-grained control for drag operations
- `hold_key` — hold a key down for a specified duration (seconds)
- `wait` — pause between actions
- `zoom` — view a specific region at full resolution (requires `enable_zoom: true` in the tool definition; available in `computer_20251124` only)
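In practice, each action name maps to an environment-specific operation. The sketch below normalizes a tool input into a command tuple an executor could run; the input field names (`coordinate`, `text`, `scroll_direction`, `scroll_amount`) are assumed shapes for illustration, not confirmed by this document:

```python
def to_command(tool_input: dict) -> tuple:
    """Translate an assumed computer tool input into a normalized command tuple."""
    action = tool_input["action"]
    if action in ("left_click", "right_click", "middle_click",
                  "double_click", "triple_click", "mouse_move"):
        x, y = tool_input["coordinate"]       # assumed [x, y] pair
        return (action, x, y)
    if action in ("type", "key"):
        return (action, tool_input["text"])   # assumed text payload
    if action == "scroll":
        x, y = tool_input["coordinate"]
        return (action, x, y, tool_input["scroll_direction"],
                tool_input["scroll_amount"])
    if action == "screenshot":
        return ("screenshot",)
    raise ValueError(f"unhandled action: {action}")
```

A real executor would then pattern-match on the first element of the tuple and drive the display server accordingly.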
## Coordinate system
Screenshots are sent to the API as base64-encoded images. For most models (before Claude Opus 4.7), the API constrains images to a maximum of 1568 pixels on the longest edge and approximately 1.15 megapixels total. This means a 1512x982 screen gets downsampled before analysis.
Claude analyzes the smaller image and returns click coordinates in that downsampled space. Your code must scale those coordinates back up to screen space before executing them:
```python
import math

def get_scale_factor(width: int, height: int) -> float:
    """Factor by which the API will shrink an image of this size."""
    long_edge = max(width, height)
    total_pixels = width * height
    long_edge_scale = 1568 / long_edge
    pixel_scale = math.sqrt(1_150_000 / total_pixels)
    return min(1.0, long_edge_scale, pixel_scale)

# Resize the screenshot before sending
scale = get_scale_factor(screen_width, screen_height)
scaled_width = int(screen_width * scale)
scaled_height = int(screen_height * scale)

# Scale coordinates back up before executing
def to_screen_coords(x: int, y: int) -> tuple[int, int]:
    return int(x / scale), int(y / scale)
```
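Plugging in the 1512x982 screen mentioned above shows how the round trip works in concrete numbers (the example click coordinate is invented for illustration):

```python
import math

def get_scale_factor(width: int, height: int) -> float:
    long_edge_scale = 1568 / max(width, height)
    pixel_scale = math.sqrt(1_150_000 / (width * height))
    return min(1.0, long_edge_scale, pixel_scale)

scale = get_scale_factor(1512, 982)             # ~0.880: the megapixel budget binds first
scaled = (int(1512 * scale), int(982 * scale))  # image sent to the API: 1330x864
# A click the model places at (665, 432) in the downsampled image maps back to screen space:
screen_xy = (int(665 / scale), int(432 / scale))
```

Here the 1568-pixel long-edge limit would allow the image through unchanged, but the ~1.15-megapixel cap does not, so the smaller of the two factors wins.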
Claude Opus 4.7 supports up to 2576 pixels on the long edge, and its coordinates are 1:1 with image pixels, so no scaling is required for that model.
## API parameters
```python
{
    "type": "computer_20251124",   # or computer_20250124
    "name": "computer",            # must be exactly "computer"
    "display_width_px": 1024,      # actual screen width
    "display_height_px": 768,      # actual screen height
    "display_number": 1,           # optional: X11 display number
    "enable_zoom": True,           # optional: computer_20251124 only
}
```
The request also needs the beta header:
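For a raw HTTP request, the header set might look like the sketch below. The `anthropic-beta` header name and `anthropic-version` value are the standard Messages API conventions; the beta value comes from the tool versions table above:

```python
# Illustrative header set for a raw Messages API request with computer use enabled.
# Replace the API key placeholder with a real key; this sketch only builds the dict.
headers = {
    "x-api-key": "YOUR_API_KEY",                  # placeholder
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "computer-use-2025-11-24",
    "content-type": "application/json",
}
```

When using the Python SDK instead, the same opt-in is expressed via the `betas` parameter, as the worked example later in this page shows.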
Each tool definition consumes approximately 735 input tokens. The computer use beta adds 466–499 tokens to the system prompt overhead.
## Sandbox guidance
Anthropic's documentation for computer use is explicit about the sandbox requirements:
- Run the environment in a dedicated virtual machine or container with minimal privileges.
- Do not give the model access to sensitive accounts, login credentials, or payment methods.
- Restrict internet access to an allowlist of domains.
- Require human confirmation before any action with real-world consequences: form submissions, file deletions, financial transactions, accepting terms of service.
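One way to enforce the human-confirmation rule is a gate in the agent loop that pauses before risky actions. This is a minimal sketch under assumed tool-input shapes; the risky-keyword lists are an illustrative policy, not anything Anthropic defines:

```python
# Hypothetical guard: classify actions that need human sign-off before execution.
RISKY_KEYS = ("return", "enter")                  # key presses that submit forms
RISKY_TEXT = ("delete", "purchase", "agree")      # typed text that suggests consequences

def needs_confirmation(tool_input: dict) -> bool:
    action = tool_input.get("action")
    text = str(tool_input.get("text", "")).lower()
    if action == "key" and any(k in text for k in RISKY_KEYS):
        return True
    if action == "type" and any(w in text for w in RISKY_TEXT):
        return True
    return False

def gated_execute(tool_input: dict, execute, confirm=input) -> bool:
    """Run execute(tool_input) only after human confirmation for risky actions."""
    if needs_confirmation(tool_input):
        answer = confirm(f"Allow {tool_input}? [y/N] ")
        if answer.strip().lower() != "y":
            return False                          # action skipped
    execute(tool_input)
    return True
```

A production gate would match on more context (target URL, active application) rather than keywords alone, but the shape of the check is the same.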
The reference implementation uses Docker with a virtual X11 display server (Xvfb), a lightweight window manager (Mutter), a taskbar (Tint2), a set of pre-installed Linux applications, and an agent loop that bridges Claude's action outputs to the display server. See the Anthropic computer use demo repository for the full reference implementation.
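That stack could be sketched as a Dockerfile along these lines. The package names (`xvfb`, `mutter`, `tint2`) are the Debian ones for the components listed above, and `agent_loop.py` is a hypothetical entrypoint; this is an assumption-laden sketch, not the actual reference image:

```dockerfile
FROM debian:bookworm
# Virtual display server, window manager, taskbar, and a browser to drive
RUN apt-get update && apt-get install -y --no-install-recommends \
    xvfb mutter tint2 x11-utils firefox-esr python3 \
    && rm -rf /var/lib/apt/lists/*
ENV DISPLAY=:1
# Start the virtual display at the resolution declared in the tool definition,
# then hand control to the (hypothetical) agent loop that bridges Claude's
# actions to the display server.
CMD Xvfb :1 -screen 0 1024x768x24 & \
    mutter --x11 & tint2 & \
    exec python3 agent_loop.py
```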
## Worked example (pseudocode)
The following pseudocode illustrates the core agent loop pattern from the Anthropic documentation. Details like the actual screenshot library, display server interaction, and action executor are environment-specific.
```python
import anthropic
import base64

client = anthropic.Anthropic()

TOOL_DEFINITION = {
    "type": "computer_20251124",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
    "enable_zoom": False,
}

def run_computer_use_task(task: str, max_turns: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for turn in range(max_turns):
        response = client.beta.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=[TOOL_DEFINITION],
            messages=messages,
            betas=["computer-use-2025-11-24"],
        )
        # Accumulate the assistant turn
        messages.append({"role": "assistant", "content": response.content})

        # Collect tool calls from this turn
        tool_results = []
        for block in response.content:
            if block.type == "tool_use" and block.name == "computer":
                action = block.input["action"]
                if action != "screenshot":
                    # Execute the action (click, type, key, scroll, etc.)
                    execute_action(action, block.input)
                # Capture the (possibly updated) display state and return it
                raw = capture_screenshot()  # returns PNG bytes
                encoded = base64.standard_b64encode(raw).decode()
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": encoded,
                        },
                    }],
                })

        # If no tool calls were made, the task is complete
        if not tool_results:
            # Extract the final text response
            for block in response.content:
                if block.type == "text":
                    return block.text
            return "Task complete."

        # Feed results back for the next turn
        messages.append({"role": "user", "content": tool_results})
    return "Max turns reached."
```
Key points from this pattern:

- Every tool call must receive a tool result, even if it is just a confirmation screenshot.
- The loop terminates when the model returns a turn with no `tool_use` blocks.
- Set a `max_turns` guard. Without it, a confused model or a stalled UI can run indefinitely.
- The `stop_reason` in the response can be `"end_turn"` (model finished), `"tool_use"` (more tool calls pending), or `"max_tokens"` (truncated — usually indicates the context is too long).
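The three stop reasons suggest a simple branching policy for the loop. The sketch below maps each value to a next step; the handling choices (trim on truncation, and so on) are assumptions, not prescribed behavior:

```python
def next_step(stop_reason: str, has_tool_calls: bool) -> str:
    """Map a response's stop_reason to what the agent loop should do next."""
    if stop_reason == "max_tokens":
        return "truncated"   # context too long: trim history or raise max_tokens
    if stop_reason == "tool_use" or has_tool_calls:
        return "continue"    # execute the tool calls and send back tool_result blocks
    return "done"            # end_turn with no tool calls: extract final text
```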
## Known limitations
From the Anthropic documentation:
- Latency per turn makes the approach slow for tasks requiring many sequential actions. Target use cases where speed is not critical.
- Coordinate accuracy degrades on high-density displays and on UIs with small click targets. Use the `zoom` action to inspect dense areas before clicking.
- Scrolling reliability, spreadsheet interactions, and multi-application tasks are harder than single-window web tasks.
- Prompt injection via web page content is a documented risk. Treat any text returned from external URLs as potentially adversarial.