# API Reference (v0.1)

Base URL: `http://127.0.0.1:8123`

If `CLICKTHROUGH_TOKEN` is set, include header:

```http
x-clickthrough-token: <token>
```

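A minimal client-side sketch of the header rule above, in Python. The helper name `build_headers` and the choice to read the token from the client's own environment are assumptions made for illustration; only the header name and base URL come from this reference.

```python
import os

BASE_URL = "http://127.0.0.1:8123"


def build_headers(token=None):
    """Return request headers, adding x-clickthrough-token only when a token is set."""
    headers = {"Accept": "application/json"}
    if token:
        headers["x-clickthrough-token"] = token
    return headers


# Assumption: the client keeps the same token in its own environment.
headers = build_headers(os.environ.get("CLICKTHROUGH_TOKEN"))
```

The resulting dictionary can be passed to whichever HTTP client the caller already uses.
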
## `GET /health`

Returns status and runtime safety flags, including `exec` capability config.

## `GET /displays`

Returns detected displays in API screen order.

```json
{
  "ok": true,
  "default_screen": 0,
  "displays": [
    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
  ]
}
```

`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
Invalid `screen` values fall back to `0`.

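The fallback rule can be mirrored on the client when picking a display from the response. This is a sketch only; `find_display` is a hypothetical helper, and the sample data is the response shown above.

```python
def find_display(displays_response, screen):
    """Return the display entry for `screen`, falling back to screen 0."""
    by_index = {d["screen"]: d for d in displays_response["displays"]}
    return by_index.get(screen, by_index[0])


sample = {
    "ok": True,
    "default_screen": 0,
    "displays": [
        {"screen": 0, "mss_index": 1, "primary": True,
         "x": 0, "y": 0, "width": 1920, "height": 1080},
        {"screen": 1, "mss_index": 2, "primary": False,
         "x": 1920, "y": 0, "width": 1920, "height": 1080},
    ],
}

second = find_display(sample, 1)    # the display whose origin is x=1920
fallback = find_display(sample, 7)  # unknown index, so screen 0 is returned
```
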
## `GET /screen`

Query params:

- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `with_grid` (bool, default `true`)
- `grid_rows` (int, default env or `12`)
- `grid_cols` (int, default env or `12`)
- `include_labels` (bool, default `true`)
- `image_format` (`png`|`jpeg`, default `png`)
- `jpeg_quality` (1-100, default `85`)
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)

The default response includes a base64-encoded image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
`meta.region` uses global desktop coordinates.

The image-returning endpoints return static captures; they do not give the agent live vision.
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.

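As a rough illustration, the query string for a capture request could be assembled like this. `screen_url` is a hypothetical helper, and sending booleans as lowercase `true`/`false` is an assumption about how the server parses them.

```python
from urllib.parse import urlencode


def screen_url(base_url="http://127.0.0.1:8123", **params):
    """Build a GET /screen URL from keyword arguments named after the query params."""
    # Assumption: booleans are sent as lowercase "true" / "false".
    query = {k: (str(v).lower() if isinstance(v, bool) else v) for k, v in params.items()}
    if not query:
        return f"{base_url}/screen"
    return f"{base_url}/screen?{urlencode(query)}"


url = screen_url(screen=1, grid_rows=20, grid_cols=20,
                 image_format="jpeg", jpeg_quality=70, asImage=True)
```

Parameters that are left out fall back to the server-side defaults listed above.
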
## `POST /zoom`

Body:

```json
{
  "center_x": 1200,
  "center_y": 700,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
}
```

Query params:

- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)

The default response returns the cropped image plus region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.

`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.

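Because `center_x` and `center_y` are global, a point measured relative to one display has to be offset by that display's origin first. The sketch below assumes `meta.region` carries `x`, `y`, `width`, and `height` keys like the display entries in `/displays`; `zoom_body` is a hypothetical helper.

```python
def zoom_body(region, local_x, local_y, width, height):
    """Build a /zoom body centred on a point given relative to a display's top-left corner."""
    return {
        "center_x": region["x"] + local_x,  # shift into global desktop coordinates
        "center_y": region["y"] + local_y,
        "width": width,
        "height": height,
    }


# Second display from the /displays example: its origin sits at x=1920.
region = {"x": 1920, "y": 0, "width": 1920, "height": 1080}
body = zoom_body(region, 600, 400, 500, 350)
```
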
## `POST /action`

Body: one action.

Important:

- the request body uses `action` plus an optional `target`
- pixel coordinates live inside `target` when `target.mode="pixel"`
- do **not** send top-level `x` / `y` fields

Query params:

- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`

Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.

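A small client-side guard can catch the top-level `x` / `y` mistake before a request is sent. Both helpers here are hypothetical and only illustrate the body shape described above.

```python
def pixel_action(action, x, y, **fields):
    """Build an action body with the coordinates nested inside `target`."""
    payload = {"action": action, "target": {"mode": "pixel", "x": x, "y": y}}
    payload.update(fields)
    return payload


def check_payload(payload):
    """Raise if the body uses top-level x / y, which this reference says not to send."""
    misplaced = payload.keys() & {"x", "y"}
    if misplaced:
        raise ValueError(f"move {sorted(misplaced)} into target (mode='pixel')")
    return payload


click = check_payload(pixel_action("click", 100, 200, button="left", clicks=1))
```
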
### Pointer target modes

#### Pixel target

```json
{
  "mode": "pixel",
  "x": 100,
  "y": 200,
  "dx": 0,
  "dy": 0
}
```

#### Grid target

```json
{
  "mode": "grid",
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "rows": 12,
  "cols": 12,
  "row": 5,
  "col": 9,
  "dx": 0.0,
  "dy": 0.0
}
```

`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.

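One plausible reading of the grid target, sketched in Python. It assumes `row` and `col` are zero-based and that an offset of `0` means the cell centre while `-1` and `1` reach the cell edges. The server's actual resolution logic is not shown in this reference, so treat `grid_to_pixel` as an approximation for sanity-checking targets rather than as the implementation.

```python
def grid_to_pixel(target):
    """Estimate the global pixel that a grid target points at."""
    cell_w = target["region_width"] / target["cols"]
    cell_h = target["region_height"] / target["rows"]
    # Centre of the selected cell (assumes zero-based row / col).
    center_x = target["region_x"] + (target["col"] + 0.5) * cell_w
    center_y = target["region_y"] + (target["row"] + 0.5) * cell_h
    # Assumption: dx / dy scale by half a cell, so +/-1 lands on the cell edge.
    x = center_x + target.get("dx", 0.0) * cell_w / 2
    y = center_y + target.get("dy", 0.0) * cell_h / 2
    return round(x), round(y)


target = {
    "mode": "grid",
    "region_x": 0, "region_y": 0, "region_width": 1920, "region_height": 1080,
    "rows": 12, "cols": 12, "row": 5, "col": 9, "dx": 0.0, "dy": 0.0,
}
point = grid_to_pixel(target)
```

With the 12 x 12 grid over a 1920 x 1080 region, each cell is 160 x 90 pixels, so under these assumptions the example target resolves to the centre of that cell.
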
### Action examples

Click:

```json
{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 7,
    "col": 3,
    "dx": 0.2,
    "dy": -0.1
  },
  "clicks": 1,
  "button": "left"
}
```

Scroll:

```json
{
  "action": "scroll",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "scroll_amount": -500
}
```

Type text:

```json
{
  "action": "type",
  "text": "hello world",
  "interval_ms": 20
}
```

Hotkey:

```json
{
  "action": "hotkey",
  "keys": ["ctrl", "l"]
}
```

Right click:

```json
{
  "action": "right_click",
  "target": {"mode": "pixel", "x": 1300, "y": 740}
}
```

Move only:

```json
{
  "action": "move",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "duration_ms": 150
}
```

## `GET /windows`

List desktop windows using structured filters instead of shelling out.

Query params:

- `title_contains` (optional substring match)
- `title_regex` (optional case-insensitive regex)
- `process_name` (optional exact process name, e.g. `explorer.exe`)
- `hwnd` (optional exact window handle)
- `visible_only` (bool, default `true`)

```json
{
  "ok": true,
  "count": 1,
  "windows": [
    {
      "hwnd": 132640,
      "title": "WinDirStat",
      "class_name": "WinDirStatMainWindow",
      "pid": 18420,
      "process_name": "windirstat.exe",
      "visible": true,
      "enabled": true,
      "minimized": false,
      "maximized": false,
      "foreground": true,
      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
    }
  ]
}
```

Notes:

- Currently supported on Windows hosts only.
- Mutation endpoints (for example `POST /windows/action`) return `409` when the filters are ambiguous and would affect multiple windows.

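A client can check for ambiguity itself before calling a mutation endpoint, which avoids a round trip that would end in `409`. This is a sketch; `AmbiguousMatch` and `pick_single_window` are hypothetical names, and the input is a `/windows` response shaped like the example above.

```python
class AmbiguousMatch(Exception):
    """Raised when the filters match more than one window."""


def pick_single_window(response):
    """Return the only matched window, None if nothing matched, or raise if ambiguous."""
    windows = response.get("windows", [])
    if not windows:
        return None
    if len(windows) > 1:
        titles = [w.get("title", "") for w in windows]
        raise AmbiguousMatch(f"{len(windows)} windows matched: {titles}")
    return windows[0]


response = {"ok": True, "count": 1, "windows": [{"hwnd": 132640, "title": "WinDirStat"}]}
window = pick_single_window(response)
```

When the match is ambiguous, narrowing by `hwnd` or `process_name` is usually the simplest way to get down to one window.
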
## `POST /windows/action`

Perform a structured window action against exactly one matched window.

```json
{
  "action": "focus",
  "title_contains": "WinDirStat",
  "visible_only": true,
  "timeout_ms": 3000
}
```

Supported actions:

- `focus`
- `restore`
- `minimize`
- `maximize`
- `close`

The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).

## `POST /launch`

Start an app/process without invoking a shell.

```json
{
  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
  "args": [],
  "cwd": "C:/Program Files/WinDirStat",
  "wait_for_window": true,
  "match": {
    "title_contains": "WinDirStat",
    "visible_only": true
  },
  "timeout_ms": 8000
}
```

Notes:

- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
- `dry_run=true` returns the resolved argv/cwd without launching.

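To illustrate what "without invoking a shell" means in practice, here is a rough sketch of shell-free launching with a dry-run path. It is not the server's code; the `launch` helper and the keys of the dictionary it returns are assumptions for illustration.

```python
import subprocess
import sys


def launch(executable, args=(), cwd=None, dry_run=False):
    """Start a process from an argv list; no shell ever parses the command."""
    argv = [executable, *args]
    if dry_run:
        # Report what would run without starting anything.
        return {"argv": argv, "cwd": cwd, "launched": False}
    proc = subprocess.Popen(argv, cwd=cwd, shell=False)
    return {"argv": argv, "cwd": cwd, "launched": True, "pid": proc.pid}


# Dry run against the current Python interpreter, so nothing is started.
plan = launch(sys.executable, ["-c", "print('hi')"], dry_run=True)
```

Passing an argv list rather than a command string means arguments containing spaces or shell metacharacters reach the target program unchanged.
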
## `POST /wait`

Wait on a structured UI condition instead of guessing sleep durations.

Query params:

- `screen` (int, default `0`) - used for text and visual waits

### Wait for text to appear

```json
{
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Scan complete",
    "match": "contains",
    "present": true,
    "language_hint": "eng",
    "min_confidence": 0.4
  },
  "timeout_ms": 15000,
  "poll_interval_ms": 400
}
```

### Wait for a window state

```json
{
  "condition": {
    "kind": "window",
    "title_contains": "WinDirStat",
    "visible_only": true,
    "state": "focused"
  },
  "timeout_ms": 5000,
  "poll_interval_ms": 200
}
```

Window states:

- `exists`
- `focused`
- `closed`

### Wait for visual change or stability

```json
{
  "condition": {
    "kind": "visual",
    "state": "stable",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "diff_threshold": 0.005,
    "stable_for_ms": 1000
  },
  "timeout_ms": 12000,
  "poll_interval_ms": 300
}
```

Visual states:

- `change` - succeeds when the average pixel diff crosses `diff_threshold`
- `stable` - succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`

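The two visual states can be approximated with the sketch below. It assumes frames are flat sequences of 8-bit channel values and that the diff is the mean absolute difference scaled to `[0, 1]`; the server's exact metric is not specified here, and both function names are hypothetical.

```python
def mean_diff(frame_a, frame_b):
    """Mean absolute difference of two equal-length 8-bit pixel sequences, scaled to [0, 1]."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return total / (255 * len(frame_a))


def visual_state_met(state, diffs, poll_interval_ms, diff_threshold, stable_for_ms=0):
    """Evaluate successive frame-to-frame diffs (oldest first) against a visual condition."""
    if state == "change":
        return any(d > diff_threshold for d in diffs)
    if state == "stable":
        quiet_ms = 0
        for d in diffs:
            # Any diff above the threshold resets the quiet period.
            quiet_ms = quiet_ms + poll_interval_ms if d <= diff_threshold else 0
            if quiet_ms >= stable_for_ms:
                return True
        return False
    raise ValueError(f"unknown visual state: {state}")


settled = visual_state_met("stable", [0.0, 0.001, 0.002, 0.0], 300, 0.005, 1000)
```

In this model a single noisy frame restarts the `stable_for_ms` window, which is why a short `poll_interval_ms` gives a finer-grained stability check.
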
Notes:

- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
- Window waits build on the structured window discovery endpoint.
- Visual waits compare repeated captures of either the full selected display or an explicit region.

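The `timeout_ms` and `poll_interval_ms` fields in the request bodies above suggest a polling loop of roughly the following shape. This is a generic sketch, not the server's implementation; `wait_until` and its injectable `clock` and `sleep` parameters are hypothetical and exist mainly to make the loop testable.

```python
import time


def wait_until(check, timeout_ms, poll_interval_ms, clock=time.monotonic, sleep=time.sleep):
    """Call `check` until it returns a truthy value or the timeout elapses."""
    deadline = clock() + timeout_ms / 1000
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            return None  # timed out without the condition being met
        sleep(poll_interval_ms / 1000)
```

A caller would pass a `check` that evaluates one condition, for example an OCR lookup for text waits or a window query for window waits.
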
## `POST /ocr`

Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.

Query params:

- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`

Body:

```json
{
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
}
```

Modes:

- `screen` (default): OCR over full selected monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)

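For `image` mode, handling both accepted encodings might look like the sketch below. `decode_image_payload` is a hypothetical helper; how the server itself parses `image_base64` is not documented here.

```python
import base64


def decode_image_payload(image_base64):
    """Accept plain base64 or a base64 data URL and return the raw image bytes."""
    if image_base64.startswith("data:"):
        header, sep, payload = image_base64.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a base64 data URL")
    else:
        payload = image_base64
    return base64.b64decode(payload, validate=True)


PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file
plain = base64.b64encode(PNG_MAGIC).decode("ascii")
data_url = "data:image/png;base64," + plain
```

Both `plain` and `data_url` decode to the same bytes, so a caller can send whichever form it already has.
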
Region mode example:

```json
{
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
}
```

Image mode example:

```json
{
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
}
```

Response shape:

```json
{
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
  "result": {
    "mode": "screen",
    "language_hint": "eng",
    "min_confidence": 0.4,
    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
    "blocks": [
      {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
      }
    ]
  }
}
```

Notes:

- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.

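Since `bbox` values for `screen` and `region` modes are global, an OCR block can be turned directly into a pixel target for `POST /action`. The sketch below also shows the documented ordering as a sort key; the helper names are hypothetical, and sorting by the box's top-left corner is one interpretation of "top-to-bottom, then left-to-right".

```python
def confident_blocks(result, min_confidence):
    """Keep blocks at or above a confidence floor, ordered top-to-bottom then left-to-right."""
    blocks = [b for b in result["blocks"] if b["confidence"] >= min_confidence]
    return sorted(blocks, key=lambda b: (b["bbox"]["y"], b["bbox"]["x"]))


def bbox_center_target(bbox):
    """Build a pixel target aimed at the centre of an OCR bounding box."""
    return {
        "mode": "pixel",
        "x": bbox["x"] + bbox["width"] // 2,
        "y": bbox["y"] + bbox["height"] // 2,
    }


result = {
    "blocks": [
        {"text": "Cancel", "confidence": 0.95, "bbox": {"x": 300, "y": 92, "width": 80, "height": 21}},
        {"text": "Settings", "confidence": 0.9821, "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}},
        {"text": "~~", "confidence": 0.12, "bbox": {"x": 10, "y": 10, "width": 8, "height": 8}},
    ]
}
ordered = confident_blocks(result, 0.4)
target = bbox_center_target(ordered[0]["bbox"])
```

For `image` mode the boxes are image-local, so this shortcut does not apply without adding the image's own screen offset.
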
## `POST /exec`

Execute a shell command on the host running Clickthrough.

Requirements:

- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
- send header `x-clickthrough-exec-secret: <secret>`

```json
{
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
}
```

Notes:

- `shell` supports `powershell`, `bash`, `cmd`
- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)

Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.

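Putting the header and body requirements together, a request for `/exec` could be prepared as follows. `exec_request` is a hypothetical helper that only builds the headers and the encoded body; it does not send anything.

```python
import json


def exec_request(command, exec_secret, token=None, shell=None,
                 timeout_s=20, cwd=None, dry_run=False):
    """Return (headers, body_bytes) for POST /exec."""
    headers = {
        "Content-Type": "application/json",
        "x-clickthrough-exec-secret": exec_secret,
    }
    if token:
        headers["x-clickthrough-token"] = token
    body = {"command": command, "timeout_s": timeout_s, "dry_run": dry_run}
    if shell is not None:
        body["shell"] = shell  # omitted so the server default shell applies
    if cwd is not None:
        body["cwd"] = cwd
    return headers, json.dumps(body).encode("utf-8")


headers, data = exec_request("echo hi", "s3cret", shell="bash", dry_run=True)
```

Leaving `shell` out of the body lets the server fall back to `CLICKTHROUGH_EXEC_DEFAULT_SHELL`, as described in the notes above.
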
## `POST /batch`

Runs multiple `action` payloads sequentially.

Query params:

- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`

```json
{
  "actions": [
    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
  ],
  "stop_on_error": true
}
```

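A batch body like the one above can be generated from a list of points. This is a minimal sketch; `click_sequence` is a hypothetical helper that emits a move followed by a click for each point.

```python
def click_sequence(points, stop_on_error=True):
    """Build a /batch body that moves to and then clicks each (x, y) point in order."""
    actions = []
    for x, y in points:
        target = {"mode": "pixel", "x": x, "y": y}
        # Copy the target so the two actions do not share one dict.
        actions.append({"action": "move", "target": dict(target)})
        actions.append({"action": "click", "target": dict(target)})
    return {"actions": actions, "stop_on_error": stop_on_error}


batch = click_sequence([(100, 100), (640, 360)])
```
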