This repository has been archived on 2026-05-20. You can view files and clone it. You cannot open issues or pull requests or push a commit.
Files
clickthrough/docs/API.md
Luna f00c525721
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
feat(ocr): add higher-level text search helpers
2026-05-01 16:23:16 +02:00

514 lines
11 KiB
Markdown

# API Reference (v0.1)
Base URL: `http://127.0.0.1:8123`
If `CLICKTHROUGH_TOKEN` is set, include header:
```http
x-clickthrough-token: <token>
```
## `GET /health`
Returns status and runtime safety flags, including `exec` capability config.
## `GET /displays`
Returns detected displays in API screen order.
```json
{
"ok": true,
"default_screen": 0,
"displays": [
{"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
{"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
]
}
```
`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
Invalid `screen` values fall back to `0`.
## `GET /screen`
Query params:
- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
- `with_grid` (bool, default `true`)
- `grid_rows` (int, default env or `12`)
- `grid_cols` (int, default env or `12`)
- `include_labels` (bool, default `true`)
- `image_format` (`png`|`jpeg`, default `png`)
- `jpeg_quality` (1-100, default `85`)
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
`meta.region` uses global desktop coordinates.
These image-returning endpoints do not magically grant the agent live vision.
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
## `POST /zoom`
Body:
```json
{
"center_x": 1200,
"center_y": 700,
"width": 500,
"height": 350,
"with_grid": true,
"grid_rows": 20,
"grid_cols": 20,
"include_labels": true,
"image_format": "png",
"jpeg_quality": 90
}
```
Query params:
- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
## `POST /action`
Body: one action.
Important:
- the request body uses `action` plus an optional `target`
- pixel coordinates live inside `target` when `target.mode="pixel"`
- do **not** send top-level `x` / `y` fields
Query params:
- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
### Pointer target modes
#### Pixel target
```json
{
"mode": "pixel",
"x": 100,
"y": 200,
"dx": 0,
"dy": 0
}
```
#### Grid target
```json
{
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 5,
"col": 9,
"dx": 0.0,
"dy": 0.0
}
```
`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
### Action examples
Click:
```json
{
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 7,
"col": 3,
"dx": 0.2,
"dy": -0.1
},
"clicks": 1,
"button": "left"
}
```
Scroll:
```json
{
"action": "scroll",
"target": {"mode": "pixel", "x": 1300, "y": 740},
"scroll_amount": -500
}
```
Type text:
```json
{
"action": "type",
"text": "hello world",
"interval_ms": 20
}
```
Hotkey:
```json
{
"action": "hotkey",
"keys": ["ctrl", "l"]
}
```
Right click:
```json
{
"action": "right_click",
"target": {"mode": "pixel", "x": 1300, "y": 740}
}
```
Move only:
```json
{
"action": "move",
"target": {"mode": "pixel", "x": 1300, "y": 740},
"duration_ms": 150
}
```
## `GET /windows`
List desktop windows using structured filters instead of shelling out.
Query params:
- `title_contains` (optional substring match)
- `title_regex` (optional case-insensitive regex)
- `process_name` (optional exact process name, e.g. `explorer.exe`)
- `hwnd` (optional exact window handle)
- `visible_only` (bool, default `true`)
```json
{
"ok": true,
"count": 1,
"windows": [
{
"hwnd": 132640,
"title": "WinDirStat",
"class_name": "WinDirStatMainWindow",
"pid": 18420,
"process_name": "windirstat.exe",
"visible": true,
"enabled": true,
"minimized": false,
"maximized": false,
"foreground": true,
"rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
}
]
}
```
Notes:
- Currently supported on Windows hosts only.
- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
## `POST /windows/action`
Perform a structured window action against exactly one matched window.
```json
{
"action": "focus",
"title_contains": "WinDirStat",
"visible_only": true,
"timeout_ms": 3000
}
```
Supported actions:
- `focus`
- `restore`
- `minimize`
- `maximize`
- `close`
The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).
## `POST /launch`
Start an app/process without invoking a shell.
```json
{
"executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
"args": [],
"cwd": "C:/Program Files/WinDirStat",
"wait_for_window": true,
"match": {
"title_contains": "WinDirStat",
"visible_only": true
},
"timeout_ms": 8000
}
```
Notes:
- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
- `dry_run=true` returns the resolved argv/cwd without launching.
## `POST /wait`
Wait on a structured UI condition instead of guessing sleep durations.
Query params:
- `screen` (int, default `0`) - used for text and visual waits
### Wait for text to appear
```json
{
"condition": {
"kind": "text",
"mode": "screen",
"text": "Scan complete",
"match": "contains",
"present": true,
"language_hint": "eng",
"min_confidence": 0.4
},
"timeout_ms": 15000,
"poll_interval_ms": 400
}
```
### Wait for a window state
```json
{
"condition": {
"kind": "window",
"title_contains": "WinDirStat",
"visible_only": true,
"state": "focused"
},
"timeout_ms": 5000,
"poll_interval_ms": 200
}
```
Window states:
- `exists`
- `focused`
- `closed`
### Wait for visual change or stability
```json
{
"condition": {
"kind": "visual",
"state": "stable",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"diff_threshold": 0.005,
"stable_for_ms": 1000
},
"timeout_ms": 12000,
"poll_interval_ms": 300
}
```
Visual states:
- `change` — succeeds when the average pixel diff crosses `diff_threshold`
- `stable` — succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`
Notes:
- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
- Window waits build on the structured window discovery endpoint.
- Visual waits compare repeated captures of either the full selected display or an explicit region.
## `POST /ocr`
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
Query params:
- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
Body:
```json
{
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4
}
```
Modes:
- `screen` (default): OCR over full selected monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
Region mode example:
```json
{
"mode": "region",
"region_x": 220,
"region_y": 160,
"region_width": 900,
"region_height": 400,
"language_hint": "eng",
"min_confidence": 0.5
}
```
Image mode example:
```json
{
"mode": "image",
"image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
"language_hint": "eng"
}
```
Response shape:
```json
{
"ok": true,
"request_id": "...",
"time_ms": 1710000000000,
"result": {
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4,
"region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
"blocks": [
{
"text": "Settings",
"confidence": 0.9821,
"bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
}
]
}
}
```
Notes:
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
## `POST /ocr/find`
Search OCR output for matching text instead of post-processing raw OCR blocks client-side.
Query params:
- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
```json
{
"mode": "screen",
"query": "Settings",
"match": "contains",
"group_lines": true,
"max_results": 10,
"language_hint": "eng",
"min_confidence": 0.4
}
```
Modes:
- `screen`
- `region`
- `image`
Options:
- `match`: `contains`, `exact`, or `regex`
- `group_lines=true`: combine nearby OCR words into line-level candidates before matching
- `max_results`: result cap after confidence sorting
Response includes:
- `matches` — confidence-sorted candidate matches
- `match_count`
- `blocks_considered`
## `POST /exec`
Execute a shell command on the host running Clickthrough.
Requirements:
- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
- send header `x-clickthrough-exec-secret: <secret>`
```json
{
"command": "Get-Process | Select-Object -First 5",
"shell": "powershell",
"timeout_s": 20,
"cwd": "C:/Users/Paul",
"dry_run": false
}
```
Notes:
- `shell` supports `powershell`, `bash`, `cmd`
- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)
Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
## `POST /batch`
Runs multiple `action` payloads sequentially.
Query params:
- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
```json
{
"actions": [
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
{"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
],
"stop_on_error": true
}
```