13 KiB
API Reference (v0.1)
Base URL: http://127.0.0.1:8123
If CLICKTHROUGH_TOKEN is set, include header:
x-clickthrough-token: <token>
GET /health
Returns status and runtime safety flags, including exec capability config.
GET /displays
Returns detected displays in API screen order.
{
"ok": true,
"default_screen": 0,
"displays": [
{"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
{"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
]
}
screen is zero-based. screen=0 is the primary display when detectable, falling back to the first monitor reported by the capture backend.
Invalid screen values fall back to 0.
GET /screen
Query params:
screen(int, default0) — zero-based display selector; invalid values fall back to0with_grid(bool, defaulttrue)grid_rows(int, default env or12)grid_cols(int, default env or12)include_labels(bool, defaulttrue)image_format(png|jpeg, defaultpng)jpeg_quality(1-100, default85)asImage(bool, defaultfalse) - iftrue, return raw image bytes only (image/pngorimage/jpeg)
Default response includes base64 image and metadata (meta.region, meta.screen, meta.displays, optional meta.grid).
meta.region uses global desktop coordinates.
These image-returning endpoints do not magically grant the agent live vision.
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's image tool and ask a narrow question about the visible UI.
POST /zoom
Body:
{
"center_x": 1200,
"center_y": 700,
"width": 500,
"height": 350,
"with_grid": true,
"grid_rows": 20,
"grid_cols": 20,
"include_labels": true,
"image_format": "png",
"jpeg_quality": 90
}
Query params:
screen(int, default0) - zero-based display selector; invalid values fall back to0asImage(bool, defaultfalse) - iftrue, return raw image bytes only (image/pngorimage/jpeg)
Default response returns cropped image + region metadata in global pixel coordinates. center_x and center_y are also global coordinates; use the selected display's meta.region from /screen?screen=X as the coordinate base.
POST /zoom is often the best screenshot to hand to the image tool when the agent needs help judging a specific button, icon, or dialog layout.
POST /action
Body: one action.
Important:
- the request body uses
actionplus an optionaltarget - pixel coordinates live inside
targetwhentarget.mode="pixel" - do not send top-level
x/yfields
Query params:
screen(int, default0) - zero-based display selector included in the response metadata; invalid values fall back to0
Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture /screen?screen=X and use that response's meta.region or grid metadata to compute the target.
Pointer target modes
Pixel target
{
"mode": "pixel",
"x": 100,
"y": 200,
"dx": 0,
"dy": 0
}
Grid target
{
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 5,
"col": 9,
"dx": 0.0,
"dy": 0.0
}
dx/dy are normalized offsets in [-1, 1] inside the selected cell.
Action examples
Click:
{
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 7,
"col": 3,
"dx": 0.2,
"dy": -0.1
},
"clicks": 1,
"button": "left"
}
Scroll:
{
"action": "scroll",
"target": {"mode": "pixel", "x": 1300, "y": 740},
"scroll_amount": -500
}
Type text:
{
"action": "type",
"text": "hello world",
"interval_ms": 20
}
Hotkey:
{
"action": "hotkey",
"keys": ["ctrl", "l"]
}
Right click:
{
"action": "right_click",
"target": {"mode": "pixel", "x": 1300, "y": 740}
}
Move only:
{
"action": "move",
"target": {"mode": "pixel", "x": 1300, "y": 740},
"duration_ms": 150
}
GET /windows
List desktop windows using structured filters instead of shelling out.
Query params:
title_contains(optional substring match)title_regex(optional case-insensitive regex)process_name(optional exact process name, e.g.explorer.exe)hwnd(optional exact window handle)visible_only(bool, defaulttrue)
{
"ok": true,
"count": 1,
"windows": [
{
"hwnd": 132640,
"title": "WinDirStat",
"class_name": "WinDirStatMainWindow",
"pid": 18420,
"process_name": "windirstat.exe",
"visible": true,
"enabled": true,
"minimized": false,
"maximized": false,
"foreground": true,
"rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
}
]
}
Notes:
- Currently supported on Windows hosts only.
- Returns
409for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
POST /windows/action
Perform a structured window action against exactly one matched window.
{
"action": "focus",
"title_contains": "WinDirStat",
"visible_only": true,
"timeout_ms": 3000
}
Supported actions:
focusrestoreminimizemaximizeclose
The response includes the matched pre-action window and the final observed window state (or closed=true if it disappeared).
POST /launch
Start an app/process without invoking a shell.
{
"executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
"args": [],
"cwd": "C:/Program Files/WinDirStat",
"wait_for_window": true,
"match": {
"title_contains": "WinDirStat",
"visible_only": true
},
"timeout_ms": 8000
}
Notes:
- Launch uses direct process execution (
subprocess.Popen) rather than PowerShell/CMD. - If
wait_for_window=true, the server polls for a matching window and returnswindow_found. dry_run=truereturns the resolved argv/cwd without launching.
POST /vision/diff
Measure whether a screen region changed meaningfully between two captures.
Query params:
screen(int, default0) - used formode=screenandmode=region
Compare live captures:
{
"mode": "region",
"region_x": 120,
"region_y": 80,
"region_width": 600,
"region_height": 300,
"delay_ms": 400,
"diff_threshold": 0.01
}
Compare provided images:
{
"mode": "image",
"before_image_base64": "iVBORw0KGgoAAA...",
"after_image_base64": "iVBORw0KGgoBBB...",
"diff_threshold": 0.01
}
Response includes:
diff_ratio— average normalized pixel differencechanged— whetherdiff_ratio >= diff_thresholdregion— compared region
POST /vision/stability
Measure whether a screen region stays visually stable over a short interval.
Query params:
screen(int, default0)
{
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"sample_interval_ms": 250,
"duration_ms": 1200,
"diff_threshold": 0.005
}
Response includes:
stablesample_countmax_diff_ratioavg_diff_ratio
POST /wait
Wait on a structured UI condition instead of guessing sleep durations.
Query params:
screen(int, default0) - used for text and visual waits
Wait for text to appear
{
"condition": {
"kind": "text",
"mode": "screen",
"text": "Scan complete",
"match": "contains",
"present": true,
"language_hint": "eng",
"min_confidence": 0.4
},
"timeout_ms": 15000,
"poll_interval_ms": 400
}
Wait for a window state
{
"condition": {
"kind": "window",
"title_contains": "WinDirStat",
"visible_only": true,
"state": "focused"
},
"timeout_ms": 5000,
"poll_interval_ms": 200
}
Window states:
existsfocusedclosed
Wait for visual change or stability
{
"condition": {
"kind": "visual",
"state": "stable",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"diff_threshold": 0.005,
"stable_for_ms": 1000
},
"timeout_ms": 12000,
"poll_interval_ms": 300
}
Visual states:
change— succeeds when the average pixel diff crossesdiff_thresholdstable— succeeds when the diff stays at or belowdiff_thresholdforstable_for_ms
Notes:
- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
- Window waits build on the structured window discovery endpoint.
- Visual waits compare repeated captures of either the full selected display or an explicit region.
POST /action/verify
Execute one action and wait for a structured success condition.
Query params:
screen(int, default0)
{
"action": {
"action": "click",
"target": {"mode": "pixel", "x": 1300, "y": 740}
},
"condition": {
"kind": "text",
"mode": "screen",
"text": "Settings",
"match": "contains",
"present": true,
"language_hint": "eng",
"min_confidence": 0.4
},
"retries": 1,
"timeout_ms": 4000,
"poll_interval_ms": 250,
"retry_delay_ms": 250
}
Condition kinds mirror POST /wait:
textwindowvisual
The response returns per-attempt action output plus structured verification output.
POST /ocr
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
Query params:
screen(int, default0) - zero-based display selector formode=screenandmode=region; invalid values fall back to0
Body:
{
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4
}
Modes:
screen(default): OCR over full selected monitorregion: OCR over explicit region (region_x,region_y,region_width,region_height)image: OCR over providedimage_base64(supports plain base64 or data URL)
Region mode example:
{
"mode": "region",
"region_x": 220,
"region_y": 160,
"region_width": 900,
"region_height": 400,
"language_hint": "eng",
"min_confidence": 0.5
}
Image mode example:
{
"mode": "image",
"image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
"language_hint": "eng"
}
Response shape:
{
"ok": true,
"request_id": "...",
"time_ms": 1710000000000,
"result": {
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4,
"region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
"blocks": [
{
"text": "Settings",
"confidence": 0.9821,
"bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
}
]
}
}
Notes:
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
bboxcoordinates are in global screen space forscreen/region, and image-local forimage.- Requires
tesseractexecutable plus Python packagepytesseract. - If
tesseractis not onPATH, setCLICKTHROUGH_TESSERACT_CMDto the full executable path.
POST /ocr/find
Search OCR output for matching text instead of post-processing raw OCR blocks client-side.
Query params:
screen(int, default0) - used formode=screenandmode=region
{
"mode": "screen",
"query": "Settings",
"match": "contains",
"group_lines": true,
"max_results": 10,
"language_hint": "eng",
"min_confidence": 0.4
}
Modes:
screenregionimage
Options:
match:contains,exact, orregexgroup_lines=true: combine nearby OCR words into line-level candidates before matchingmax_results: result cap after confidence sorting
Response includes:
matches— confidence-sorted candidate matchesmatch_countblocks_considered
POST /exec
Execute a shell command on the host running Clickthrough.
Requirements:
CLICKTHROUGH_EXEC_SECRETmust be configured on the server- send header
x-clickthrough-exec-secret: <secret>
{
"command": "Get-Process | Select-Object -First 5",
"shell": "powershell",
"timeout_s": 20,
"cwd": "C:/Users/Paul",
"dry_run": false
}
Notes:
shellsupportspowershell,bash,cmd- if
shellis omitted, server usesCLICKTHROUGH_EXEC_DEFAULT_SHELL - output is truncated based on
CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS - endpoint can be disabled with
CLICKTHROUGH_EXEC_ENABLED=false - if
CLICKTHROUGH_EXEC_SECRETis missing,/execis blocked (403)
Response includes stdout, stderr, exit_code, timeout state, and execution metadata.
POST /batch
Runs multiple action payloads sequentially.
Query params:
screen(int, default0) - zero-based display selector applied to each action response; invalid values fall back to0
{
"actions": [
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
{"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
],
"stop_on_error": true
}