feat: migrate to v2-only API and unified response envelope
All checks were successful
python-syntax / syntax-check (push) Successful in 7s
85
README.md
@@ -1,22 +1,25 @@
# Clickthrough

Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions.
Let an agent interact with a computer over HTTP.

## Primary mode (v2)

Use the v2 contract for faster, less OCR-heavy control loops:
- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`

This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.

## What this provides

- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
- **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
- **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability`
- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
- **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
- Screen/region capture with optional OCR and timing stats
- Observation IDs for deterministic follow-up localization
- Text localization and image-tool coordinate localization
- Action execution with resolved target IDs
- Risk-aware action+verification defaults
- Unified response envelope across all endpoints

## Quick start

@@ -30,53 +33,17 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app

Server defaults to `127.0.0.1:8123`.

For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird.
## Fast control loop

`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
1. `POST /v2/observe` on a tight region
2. If OCR is enough, `POST /v2/localize` with `text_query`
3. If ambiguous, ask image tool for one x,y in observation bounds
4. `POST /v2/localize` with `image_tool_point`
5. `POST /v2/act` or `POST /v2/act-verify`
6. Re-observe only changed region
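
The fast control loop above can be sketched as a single observe, localize, act pass. This is a minimal sketch rather than the project's client: `click_target`, the `post` transport, and the `pick_point` callback are hypothetical names, and `post` is assumed to return the `data` field of the response envelope. The request fields follow the v2 examples and models in this commit.

```python
from typing import Callable, Optional

# Hypothetical transport: takes (path, json_body) and returns the envelope's "data" dict.
Post = Callable[[str, dict], dict]


def click_target(
    post: Post,
    region: dict,
    text_query: Optional[str] = None,
    pick_point: Optional[Callable[[dict], dict]] = None,
) -> dict:
    """Run one observe -> localize -> act pass for a click."""
    if (text_query is None) == (pick_point is None):
        raise ValueError("provide exactly one of text_query or pick_point")

    # Step 1: observe a tight region; request OCR only for text localization.
    observation = post("/v2/observe", {
        "mode": "region",
        "region_x": region["x"],
        "region_y": region["y"],
        "region_width": region["width"],
        "region_height": region["height"],
        "include_image": pick_point is not None,
        "ocr_mode": "region" if text_query is not None else "none",
    })

    # Steps 2-4: localize by text, or by a point chosen by an image tool.
    if text_query is not None:
        selector = {"text_query": text_query}
    else:
        selector = {"image_tool_point": pick_point(observation)}
    localized = post(
        "/v2/localize",
        {"observation_id": observation["observation_id"], **selector},
    )

    # Step 5: act on the resolved target.
    return post("/v2/act", {
        "action": {
            "action": "click",
            "target": {"resolved_target_id": localized["resolved_target_id"]},
        }
    })
```

Step 6 (re-observing only the changed region) would be another `/v2/observe` call with a narrower region; it is left out to keep the sketch short.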

## Minimal API flow
## See docs

1. `GET /displays` if you need a non-primary monitor
2. `GET /screen?screen=0` with grid
3. Decide cell / target
4. Optional `POST /zoom?screen=0` for finer targeting
5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow)
6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find`

Important:
- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
- Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
- Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.

See:
- `docs/API.md`
- `docs/coordinate-system.md`
- `skill/SKILL.md`

## Configuration

Environment variables:

- `CLICKTHROUGH_HOST` (default `127.0.0.1`)
- `CLICKTHROUGH_PORT` (default `8123`)
- `CLICKTHROUGH_TOKEN` (optional; if set, require `x-clickthrough-token` header)
- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
- `CLICKTHROUGH_GRID_ROWS` (default `12`)
- `CLICKTHROUGH_GRID_COLS` (default `12`)
- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
- `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)

Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.

## Gitea CI

A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
It runs Python syntax checks (`py_compile`) on every push and pull request.
- `docs/coordinate-system.md`

629
docs/API.md
@@ -1,614 +1,141 @@
# API Reference (v0.1)
# API Reference (v2)

Base URL: `http://127.0.0.1:8123`

If `CLICKTHROUGH_TOKEN` is set, include header:
If `CLICKTHROUGH_TOKEN` is set, include:

```http
x-clickthrough-token: <token>
```
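
A client can attach this header conditionally. The helper below is a small sketch under assumptions: `auth_headers` is a hypothetical name, and it reads the same `CLICKTHROUGH_TOKEN` variable on the client side purely for convenience.

```python
import os
from typing import Optional


def auth_headers(token: Optional[str] = None) -> dict:
    """Return request headers, adding the token header only when a token is set."""
    if token is None:
        # Assumption: the client reuses the server's environment variable name.
        token = os.getenv("CLICKTHROUGH_TOKEN", "")
    return {"x-clickthrough-token": token} if token else {}
```

The resulting dict can be passed as `headers=` to whichever HTTP client is in use.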

## `GET /health`
## Endpoints

Returns status and runtime safety flags, including `exec` capability config.
- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`
- `GET /health`
- `GET /displays`
- `GET /windows`
- `POST /windows/action`
- `POST /launch`
- `POST /exec`

## `GET /displays`
No v1 endpoints are supported.

Returns detected displays in API screen order.

```json
{
  "ok": true,
  "default_screen": 0,
  "displays": [
    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
  ]
}
```

`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
Invalid `screen` values fall back to `0`.

## `GET /screen`

Query params:

- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `with_grid` (bool, default `true`)
- `grid_rows` (int, default env or `12`)
- `grid_cols` (int, default env or `12`)
- `include_labels` (bool, default `true`)
- `image_format` (`png`|`jpeg`, default `png`)
- `jpeg_quality` (1-100, default `85`)
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)

Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
`meta.region` uses global desktop coordinates.

These image-returning endpoints do not magically grant the agent live vision.
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.

## `POST /zoom`

Body:

```json
{
  "center_x": 1200,
  "center_y": 700,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
}
```

Query params:

- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)

Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.

`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.

## `POST /action`

Body: one action.

Important:
- the request body uses `action` plus an optional `target`
- pixel coordinates live inside `target` when `target.mode="pixel"`
- do **not** send top-level `x` / `y` fields

Query params:

- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`

Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.

### Pointer target modes

#### Pixel target

```json
{
  "mode": "pixel",
  "x": 100,
  "y": 200,
  "dx": 0,
  "dy": 0
}
```

#### Grid target

```json
{
  "mode": "grid",
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "rows": 12,
  "cols": 12,
  "row": 5,
  "col": 9,
  "dx": 0.0,
  "dy": 0.0
}
```

`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.

### Action examples

Click:

```json
{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 7,
    "col": 3,
    "dx": 0.2,
    "dy": -0.1
  },
  "clicks": 1,
  "button": "left"
}
```

Scroll:

```json
{
  "action": "scroll",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "scroll_amount": -500
}
```

Type text:

```json
{
  "action": "type",
  "text": "hello world",
  "interval_ms": 20
}
```

Hotkey:

```json
{
  "action": "hotkey",
  "keys": ["ctrl", "l"]
}
```

Right click:

```json
{
  "action": "right_click",
  "target": {"mode": "pixel", "x": 1300, "y": 740}
}
```

Move only:

```json
{
  "action": "move",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "duration_ms": 150
}
```

## `GET /windows`

List desktop windows using structured filters instead of shelling out.

Query params:

- `title_contains` (optional substring match)
- `title_regex` (optional case-insensitive regex)
- `process_name` (optional exact process name, e.g. `explorer.exe`)
- `hwnd` (optional exact window handle)
- `visible_only` (bool, default `true`)

```json
{
  "ok": true,
  "count": 1,
  "windows": [
    {
      "hwnd": 132640,
      "title": "WinDirStat",
      "class_name": "WinDirStatMainWindow",
      "pid": 18420,
      "process_name": "windirstat.exe",
      "visible": true,
      "enabled": true,
      "minimized": false,
      "maximized": false,
      "foreground": true,
      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
    }
  ]
}
```

Notes:
- Currently supported on Windows hosts only.
- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.

## `POST /windows/action`

Perform a structured window action against exactly one matched window.

```json
{
  "action": "focus",
  "title_contains": "WinDirStat",
  "visible_only": true,
  "timeout_ms": 3000
}
```

Supported actions:
- `focus`
- `restore`
- `minimize`
- `maximize`
- `close`

The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).

## `POST /launch`

Start an app/process without invoking a shell.

```json
{
  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
  "args": [],
  "cwd": "C:/Program Files/WinDirStat",
  "wait_for_window": true,
  "match": {
    "title_contains": "WinDirStat",
    "visible_only": true
  },
  "timeout_ms": 8000
}
```

Notes:
- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
- `dry_run=true` returns the resolved argv/cwd without launching.

## `POST /vision/diff`

Measure whether a screen region changed meaningfully between two captures.

Query params:

- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`

Compare live captures:
## `POST /v2/observe`

```json
{
  "mode": "region",
  "region_x": 120,
  "region_y": 80,
  "region_width": 600,
  "region_height": 300,
  "delay_ms": 400,
  "diff_threshold": 0.01
}
```

Compare provided images:

```json
{
  "mode": "image",
  "before_image_base64": "iVBORw0KGgoAAA...",
  "after_image_base64": "iVBORw0KGgoBBB...",
  "diff_threshold": 0.01
}
```

Response includes:
- `diff_ratio` - average normalized pixel difference
- `changed` - whether `diff_ratio >= diff_threshold`
- `region` - compared region

## `POST /vision/stability`

Measure whether a screen region stays visually stable over a short interval.

Query params:

- `screen` (int, default `0`)

```json
{
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "sample_interval_ms": 250,
  "duration_ms": 1200,
  "diff_threshold": 0.005
}
```

Response includes:
- `stable`
- `sample_count`
- `max_diff_ratio`
- `avg_diff_ratio`

## `POST /wait`

Wait on a structured UI condition instead of guessing sleep durations.

Query params:

- `screen` (int, default `0`) - used for text and visual waits

### Wait for text to appear

```json
{
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Scan complete",
    "match": "contains",
    "present": true,
    "region_x": 800,
    "region_y": 420,
    "region_width": 700,
    "region_height": 420,
    "include_image": true,
    "image_format": "jpeg",
    "jpeg_quality": 75,
    "ocr_mode": "region",
    "language_hint": "eng",
    "min_confidence": 0.4
  },
  "timeout_ms": 15000,
  "poll_interval_ms": 400
  "min_confidence": 0.45,
  "max_ocr_area_px": 1500000,
  "group_lines": true
}
```

### Wait for a window state
Returns observation metadata, optional image, OCR blocks/lines, and timing fields.

## `POST /v2/localize`

Text localization:

```json
{
  "condition": {
    "kind": "window",
    "title_contains": "WinDirStat",
    "visible_only": true,
    "state": "focused"
  },
  "timeout_ms": 5000,
  "poll_interval_ms": 200
  "observation_id": "...",
  "text_query": "Save",
  "text_match": "exact",
  "candidate_index": 0
}
```

Window states:
- `exists`
- `focused`
- `closed`

### Wait for visual change or stability
Image-tool point localization:

```json
{
  "condition": {
    "kind": "visual",
    "state": "stable",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "diff_threshold": 0.005,
    "stable_for_ms": 1000
  },
  "timeout_ms": 12000,
  "poll_interval_ms": 300
  "observation_id": "...",
  "image_tool_point": {"x": 312, "y": 188}
}
```

Visual states:
- `change` - succeeds when the average pixel diff crosses `diff_threshold`
- `stable` - succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`
Returns `resolved_target_id`, global pixel, and `localization_confidence`.
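
The two `/v2/localize` request shapes can be built with one helper that enforces the "exactly one selector" rule from `LocalizeRequestV2` in `server/app.py`. This is a client-side sketch: `build_localize_payload` is a hypothetical name, and sending `text_match` and `candidate_index` only with text queries is an assumption, not documented behavior.

```python
from typing import Optional, Tuple


def build_localize_payload(
    observation_id: str,
    text_query: Optional[str] = None,
    image_tool_point: Optional[Tuple[int, int]] = None,
    text_match: str = "contains",
    candidate_index: int = 0,
) -> dict:
    """Build a /v2/localize body with exactly one selector."""
    has_text = bool((text_query or "").strip())
    has_point = image_tool_point is not None
    # Mirrors the server-side validator: both or neither is an error.
    if has_text == has_point:
        raise ValueError("provide exactly one of text_query or image_tool_point")

    payload: dict = {"observation_id": observation_id}
    if has_text:
        payload["text_query"] = text_query
        payload["text_match"] = text_match
        payload["candidate_index"] = candidate_index
    else:
        x, y = image_tool_point
        payload["image_tool_point"] = {"x": x, "y": y}
    return payload
```

Validating on the client avoids a round trip that would otherwise end in a validation error from the server.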

Notes:
- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
- Window waits build on the structured window discovery endpoint.
- Visual waits compare repeated captures of either the full selected display or an explicit region.

## `POST /action/verify`

Execute one action and wait for a structured success condition.

Query params:

- `screen` (int, default `0`)
## `POST /v2/act`

```json
{
  "action": {
    "action": "click",
    "target": {"mode": "pixel", "x": 1300, "y": 740}
    "target": {"resolved_target_id": "..."},
    "button": "left",
    "clicks": 1
  }
}
```

## `POST /v2/act-verify`

```json
{
  "action": {
    "action": "click",
    "target": {"resolved_target_id": "..."}
  },
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Settings",
    "mode": "region",
    "text": "Saved",
    "match": "contains",
    "present": true,
    "language_hint": "eng",
    "region_x": 820,
    "region_y": 420,
    "region_width": 500,
    "region_height": 140,
    "min_confidence": 0.4
  },
  "retries": 1,
  "timeout_ms": 4000,
  "poll_interval_ms": 250,
  "retry_delay_ms": 250
  "risk_level": "low"
}
```

Condition kinds mirror `POST /wait`:
- `text`
- `window`
- `visual`
Risk defaults:
- `low`: retries `0`, timeout `2500ms`
- `high`: retries `1`, timeout `6000ms`
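
One way to read the risk defaults: `risk_level` picks a baseline, and any explicitly supplied field replaces it. The sketch below encodes that reading. The function name is hypothetical, and the override behavior is an assumption inferred from `retries` and `timeout_ms` defaulting to `None` in `ActVerifyRequestV2`; the server's actual resolution logic is not shown in this diff.

```python
from typing import Optional

# Baselines taken from the "Risk defaults" list above.
RISK_DEFAULTS = {
    "low": {"retries": 0, "timeout_ms": 2500},
    "high": {"retries": 1, "timeout_ms": 6000},
}


def effective_verify_settings(
    risk_level: str = "low",
    retries: Optional[int] = None,
    timeout_ms: Optional[int] = None,
) -> dict:
    """Resolve retries/timeout, letting explicit values win over risk defaults."""
    defaults = RISK_DEFAULTS[risk_level]
    return {
        "retries": defaults["retries"] if retries is None else retries,
        "timeout_ms": defaults["timeout_ms"] if timeout_ms is None else timeout_ms,
    }
```

Comparing against `None` rather than relying on truthiness keeps an explicit `retries=0` from being replaced by the default.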

The response returns per-attempt action output plus structured verification output.
## Response envelope

## `POST /ocr`

Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.

Query params:

- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`

Body:

```json
{
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
}
```

Modes:
- `screen` (default): OCR over full selected monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)

Region mode example:

```json
{
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
}
```

Image mode example:

```json
{
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
}
```

Response shape:
Success:

```json
{
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
  "result": {
    "mode": "screen",
    "language_hint": "eng",
    "min_confidence": 0.4,
    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
    "blocks": [
      {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
      }
    ]
  }
  "data": { },
  "error": null
}
```

Notes:
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.

## `POST /ocr/find`

Search OCR output for matching text instead of post-processing raw OCR blocks client-side.

Query params:

- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
Error:

```json
{
  "mode": "screen",
  "query": "Settings",
  "match": "contains",
  "group_lines": true,
  "max_results": 10,
  "language_hint": "eng",
  "min_confidence": 0.4
}
```

Modes:
- `screen`
- `region`
- `image`

Options:
- `match`: `contains`, `exact`, or `regex`
- `group_lines=true`: combine nearby OCR words into line-level candidates before matching
- `max_results`: result cap after confidence sorting

Response includes:
- `matches` - confidence-sorted candidate matches
- `match_count`
- `blocks_considered`

## `POST /exec`

Execute a shell command on the host running Clickthrough.

Requirements:
- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
- send header `x-clickthrough-exec-secret: <secret>`

```json
{
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
}
```

Notes:
- `shell` supports `powershell`, `bash`, `cmd`
- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)

Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.

## `POST /batch`

Runs multiple `action` payloads sequentially.

Query params:

- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`

```json
{
  "actions": [
    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
  ],
  "stop_on_error": true
  "ok": false,
  "request_id": "...",
  "time_ms": 1710000000000,
  "data": null,
  "error": {
    "code": "http_error",
    "message": "...",
    "details": {}
  }
}
```
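
Because every endpoint now returns the same envelope (`ok`, `request_id`, `time_ms`, `data`, `error`), a client can unwrap responses in one place. The following is a sketch, not part of the repository: `ClickthroughError` and `unwrap` are hypothetical names, and only the envelope fields shown in this commit are assumed.

```python
from typing import Any


class ClickthroughError(Exception):
    """Raised when an envelope reports ok=false."""

    def __init__(self, code: str, message: str, details: Any = None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def unwrap(envelope: dict) -> Any:
    """Return envelope["data"] on success, raise ClickthroughError otherwise."""
    if envelope.get("ok"):
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise ClickthroughError(
        error.get("code", "unknown"),
        error.get("message", ""),
        error.get("details"),
    )
```

A caller would typically apply `unwrap` to the parsed JSON body of each response, so that `http_error`, `validation_error`, and `internal_error` codes surface as one exception type.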

@@ -13,23 +13,26 @@ if TOKEN:


def main():
    r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
    r.raise_for_status()
    print("health:", r.json())
    health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
    health.raise_for_status()
    print("health ok:", health.json().get("ok"))

    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
    d.raise_for_status()
    print("displays:", d.json().get("displays", []))

    s = requests.get(
        f"{BASE_URL}/screen",
    observe = requests.post(
        f"{BASE_URL}/v2/observe",
        headers=headers,
        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
        timeout=30,
        params={"screen": SCREEN},
        json={
            "mode": "screen",
            "include_image": False,
            "ocr_mode": "none",
        },
        timeout=20,
    )
    s.raise_for_status()
    payload = s.json()
    print("screen meta:", payload.get("meta", {}))
    observe.raise_for_status()
    payload = observe.json()["data"]
    print("observation_id:", payload["observation_id"])
    print("region:", payload["region"])
    print("timing_ms:", payload["timing_ms"])


if __name__ == "__main__":

667
server/app.py
@@ -8,10 +8,12 @@ import subprocess
import sys
import time
import uuid
from typing import Literal, Optional
from typing import Any, Literal, Optional

from dotenv import load_dotenv
from fastapi import Depends, FastAPI, Header, HTTPException, Response
from fastapi import Depends, FastAPI, Header, HTTPException, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from PIL import ImageChops, ImageStat
from pydantic import BaseModel, Field, model_validator

@@ -21,6 +23,55 @@ load_dotenv(dotenv_path=".env", override=False)
app = FastAPI(title="clickthrough", version="0.1.0")


def _ok(data: Any, status_code: int = 200):
    return JSONResponse(
        status_code=status_code,
        content={
            "ok": True,
            "request_id": _request_id(),
            "time_ms": _now_ms(),
            "data": data,
            "error": None,
        },
    )


def _err(code: str, message: str, status_code: int, details: Any = None):
    return JSONResponse(
        status_code=status_code,
        content={
            "ok": False,
            "request_id": _request_id(),
            "time_ms": _now_ms(),
            "data": None,
            "error": {
                "code": code,
                "message": message,
                "details": details,
            },
        },
    )


@app.exception_handler(HTTPException)
async def _http_exception_handler(_: Request, exc: HTTPException):
    detail = exc.detail
    if isinstance(detail, dict):
        message = str(detail.get("message", "request failed"))
        return _err("http_error", message, exc.status_code, detail)
    return _err("http_error", str(detail), exc.status_code)


@app.exception_handler(Exception)
async def _unhandled_exception_handler(_: Request, exc: Exception):
    return _err("internal_error", "internal server error", 500, {"type": type(exc).__name__})


@app.exception_handler(RequestValidationError)
async def _validation_exception_handler(_: Request, exc: RequestValidationError):
    return _err("validation_error", "request validation failed", 422, exc.errors())


def _env_bool(name: str, default: bool) -> bool:
    raw = os.getenv(name)
    if raw is None:
@@ -288,6 +339,144 @@ class VerifyActionRequest(BaseModel):
|
||||
stop_on_action_error: bool = True
|
||||
|
||||
|
||||
class ObserveRequestV2(BaseModel):
|
||||
mode: Literal["screen", "region"] = "screen"
|
||||
region_x: int | None = Field(default=None, ge=0)
|
||||
region_y: int | None = Field(default=None, ge=0)
|
||||
region_width: int | None = Field(default=None, gt=0)
|
||||
region_height: int | None = Field(default=None, gt=0)
|
||||
include_image: bool = True
|
||||
image_format: Literal["png", "jpeg"] = "jpeg"
|
||||
jpeg_quality: int = Field(default=75, ge=1, le=100)
|
||||
ocr_mode: Literal["none", "region", "screen"] = "none"
|
||||
language_hint: str | None = Field(default=None, min_length=1, max_length=64)
|
||||
min_confidence: float = Field(default=0.4, ge=0.0, le=1.0)
|
||||
max_ocr_area_px: int | None = Field(default=1_500_000, ge=1000)
|
||||
group_lines: bool = True
|
||||
|
||||
@model_validator(mode="after")
|
||||
def _validate_region(self):
|
||||
if self.mode == "region":
|
||||
required = [self.region_x, self.region_y, self.region_width, self.region_height]
|
||||
if any(v is None for v in required):
|
||||
raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
|
||||
return self
|
||||
|
||||
|
||||
class ImageToolPoint(BaseModel):
|
||||
x: int = Field(ge=0)
|
||||
y: int = Field(ge=0)
|
||||
|
||||
|
||||
class LocalizeRequestV2(BaseModel):
|
||||
observation_id: str = Field(min_length=1, max_length=128)
|
||||
text_query: str | None = Field(default=None, max_length=512)
|
||||
text_match: Literal["contains", "exact", "regex"] = "contains"
|
||||
image_tool_point: ImageToolPoint | None = None
|
||||
candidate_index: int = Field(default=0, ge=0)
|
||||
|
||||
@model_validator(mode="after")
|
||||
def _validate_selector(self):
|
||||
has_text = bool((self.text_query or "").strip())
|
||||
has_point = self.image_tool_point is not None
|
||||
if has_text == has_point:
|
||||
raise ValueError("provide exactly one of text_query or image_tool_point")
|
||||
return self
|
||||
|
||||
|
||||
class ActionTargetV2(BaseModel):
|
||||
resolved_target_id: str | None = Field(default=None, max_length=128)
|
||||
pixel_x: int | None = None
|
||||
pixel_y: int | None = None
|
||||
|
||||
@model_validator(mode="after")
|
||||
def _validate_shape(self):
|
||||
has_resolved = bool(self.resolved_target_id)
|
||||
has_pixel = self.pixel_x is not None or self.pixel_y is not None
|
||||
if has_resolved == has_pixel:
|
||||
raise ValueError("provide either resolved_target_id or pixel_x/pixel_y")
|
||||
if has_pixel and (self.pixel_x is None or self.pixel_y is None):
|
||||
raise ValueError("pixel_x and pixel_y are both required")
|
||||
return self
|
||||
|
||||
|
||||
class ActionRequestV2(BaseModel):
    action: Literal[
        "move",
        "click",
        "right_click",
        "double_click",
        "middle_click",
        "scroll",
        "type",
        "hotkey",
    ]
    target: ActionTargetV2 | None = None
    duration_ms: int = Field(default=0, ge=0, le=20000)
    button: Literal["left", "right", "middle"] = "left"
    clicks: int = Field(default=1, ge=1, le=10)
    scroll_amount: int = 0
    text: str = ""
    keys: list[str] = Field(default_factory=list)
    interval_ms: int = Field(default=20, ge=0, le=5000)
    dry_run: bool = False


class ActRequestV2(BaseModel):
    action: ActionRequestV2


class ActVerifyRequestV2(BaseModel):
    action: ActionRequestV2
    condition: WaitTextCondition | WaitWindowCondition | WaitVisualCondition
    risk_level: Literal["low", "high"] = "low"
    retries: int | None = Field(default=None, ge=0, le=10)
    timeout_ms: int | None = Field(default=None, ge=0, le=120000)
    poll_interval_ms: int | None = Field(default=None, ge=50, le=10000)
    retry_delay_ms: int | None = Field(default=None, ge=0, le=60000)
    stop_on_action_error: bool = True


OBSERVATIONS: dict[str, dict[str, Any]] = {}
RESOLVED_TARGETS: dict[str, dict[str, Any]] = {}


def _get_observation(observation_id: str) -> dict[str, Any]:
    observation = OBSERVATIONS.get(observation_id)
    if observation is None:
        raise HTTPException(status_code=404, detail="observation_id not found")
    return observation


def _resolve_v2_action(req: ActionRequestV2) -> ActionRequest:
    target: Target | None = None
    if req.target is not None:
        if req.target.resolved_target_id:
            item = RESOLVED_TARGETS.get(req.target.resolved_target_id)
            if item is None:
                raise HTTPException(status_code=404, detail="resolved_target_id not found")
            target = PixelTarget(mode="pixel", x=item["x"], y=item["y"], dx=0, dy=0)
        else:
            target = PixelTarget(mode="pixel", x=req.target.pixel_x or 0, y=req.target.pixel_y or 0, dx=0, dy=0)
    return ActionRequest(
        action=req.action,
        target=target,
        duration_ms=req.duration_ms,
        button=req.button,
        clicks=req.clicks,
        scroll_amount=req.scroll_amount,
        text=req.text,
        keys=req.keys,
        interval_ms=req.interval_ms,
        dry_run=req.dry_run,
    )


def _risk_defaults(risk_level: str) -> dict[str, int]:
    if risk_level == "high":
        return {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300}
    return {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150}


def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
    token = SETTINGS["token"]
@@ -1377,14 +1566,208 @@ def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
|
||||
}
|
||||
|
||||
|
||||
def _localization_confidence(source: str, confidence: float | None = None) -> str:
    if source == "image_tool_point":
        return "high"
    if source == "ocr" and confidence is not None:
        if confidence >= 0.8:
            return "high"
        if confidence >= 0.55:
            return "medium"
    return "low"


@app.post("/v2/observe")
def observe_v2(req: ObserveRequestV2, screen: int = 0, _: None = Depends(_auth)):
    capture_started = time.perf_counter()
    image, region, mon, displays, screen_selection = _capture_region_image(
        screen,
        req.region_x if req.mode == "region" else None,
        req.region_y if req.mode == "region" else None,
        req.region_width if req.mode == "region" else None,
        req.region_height if req.mode == "region" else None,
    )
    capture_ms = int((time.perf_counter() - capture_started) * 1000)

    encoded = None
    if req.include_image:
        encoded = _encode_image(image, req.image_format, req.jpeg_quality)

    ocr_started = time.perf_counter()
    blocks: list[dict] = []
    grouped_lines: list[dict] = []
    ocr_applied_mode = "none"
    if req.ocr_mode != "none":
        if req.ocr_mode == "screen":
            ocr_image, ocr_region, _, _, _ = _capture_region_image(screen, None, None, None, None)
        else:
            ocr_image, ocr_region = image, region

        area = ocr_region["width"] * ocr_region["height"]
        if req.max_ocr_area_px is not None and area > req.max_ocr_area_px:
            raise HTTPException(
                status_code=400,
                detail=f"ocr area {area} exceeds max_ocr_area_px {req.max_ocr_area_px}",
            )

        blocks = _run_ocr(
            ocr_image,
            req.language_hint,
            req.min_confidence,
            ocr_region["x"],
            ocr_region["y"],
        )
        if req.group_lines:
            grouped_lines = _group_ocr_lines(blocks)
        ocr_applied_mode = req.ocr_mode
    ocr_ms = int((time.perf_counter() - ocr_started) * 1000)

    observation_id = _request_id()
    OBSERVATIONS[observation_id] = {
        "id": observation_id,
        "region": region,
        "screen": screen_selection,
        "display": mon,
        "image_width": image.size[0],
        "image_height": image.size[1],
        "ocr_blocks": blocks,
        "ocr_lines": grouped_lines,
        "created_at_ms": _now_ms(),
    }

    return _ok(
        {
            "observation_id": observation_id,
            "region": region,
            "screen": screen_selection,
            "display": mon,
            "image": {
                "included": req.include_image,
                "format": req.image_format if req.include_image else None,
                "base64": encoded,
                "width": image.size[0],
                "height": image.size[1],
            },
            "ocr": {
                "mode": ocr_applied_mode,
                "min_confidence": req.min_confidence,
                "language_hint": req.language_hint,
                "block_count": len(blocks),
                "line_count": len(grouped_lines),
                "blocks": blocks,
                "lines": grouped_lines,
            },
            "timing_ms": {
                "capture_ms": capture_ms,
                "ocr_ms": ocr_ms if req.ocr_mode != "none" else 0,
                "total_ms": capture_ms + (ocr_ms if req.ocr_mode != "none" else 0),
            },
        }
    )


@app.post("/v2/localize")
def localize_v2(req: LocalizeRequestV2, _: None = Depends(_auth)):
    observation = _get_observation(req.observation_id)
    region = observation["region"]
    image_width = observation["image_width"]
    image_height = observation["image_height"]

    if req.image_tool_point is not None:
        if req.image_tool_point.x >= image_width or req.image_tool_point.y >= image_height:
            raise HTTPException(status_code=400, detail="image_tool_point outside observation image bounds")
        x = region["x"] + req.image_tool_point.x
        y = region["y"] + req.image_tool_point.y
        _enforce_allowed_region(x, y)
        resolved_target_id = _request_id()
        RESOLVED_TARGETS[resolved_target_id] = {
            "id": resolved_target_id,
            "observation_id": req.observation_id,
            "x": x,
            "y": y,
            "source": "image_tool_point",
        }
        return _ok(
            {
                "resolved_target_id": resolved_target_id,
                "source": "image_tool_point",
                "localization_confidence": _localization_confidence("image_tool_point"),
                "pixel": {"x": x, "y": y},
                "observation_region": region,
                "image_bounds": {"width": image_width, "height": image_height},
            }
        )

    lines = observation.get("ocr_lines") or _group_ocr_lines(observation.get("ocr_blocks", []))
    matches = _find_text_matches(lines, req.text_query or "", req.text_match, False, 200)
    if not matches:
        return _err("not_found", "no localization candidates found", 404, {"found": False, "matches": []})
    if req.candidate_index >= len(matches):
        raise HTTPException(status_code=400, detail="candidate_index is outside match results")

    chosen = matches[req.candidate_index]
    bbox = chosen["bbox"]
    x = bbox["x"] + max(1, bbox["width"] // 2)
    y = bbox["y"] + max(1, bbox["height"] // 2)
    _enforce_allowed_region(x, y)
    resolved_target_id = _request_id()
    RESOLVED_TARGETS[resolved_target_id] = {
        "id": resolved_target_id,
        "observation_id": req.observation_id,
        "x": x,
        "y": y,
        "source": "ocr",
        "match": chosen,
    }

    return _ok(
        {
            "resolved_target_id": resolved_target_id,
            "source": "ocr",
            "localization_confidence": _localization_confidence("ocr", chosen.get("confidence")),
            "pixel": {"x": x, "y": y},
            "selected_match": chosen,
            "match_count": len(matches),
        }
    )


@app.post("/v2/act")
def act_v2(req: ActRequestV2, screen: int = 0, _: None = Depends(_auth)):
    legacy_action = _resolve_v2_action(req.action)
    result = _exec_action(legacy_action, screen)
    return _ok(result)


@app.post("/v2/act-verify")
def act_verify_v2(req: ActVerifyRequestV2, screen: int = 0, _: None = Depends(_auth)):
    defaults = _risk_defaults(req.risk_level)
    verify_req = VerifyActionRequest(
        action=_resolve_v2_action(req.action),
        condition=req.condition,
        retries=defaults["retries"] if req.retries is None else req.retries,
        timeout_ms=defaults["timeout_ms"] if req.timeout_ms is None else req.timeout_ms,
        poll_interval_ms=defaults["poll_interval_ms"] if req.poll_interval_ms is None else req.poll_interval_ms,
        retry_delay_ms=defaults["retry_delay_ms"] if req.retry_delay_ms is None else req.retry_delay_ms,
        stop_on_action_error=req.stop_on_action_error,
    )
    result = _run_verified_action(verify_req, screen)
    payload = {
        "risk_level": req.risk_level,
        "defaults_applied": defaults,
        **result,
    }
    if result.get("success", False):
        return _ok(payload)
    return _err("verification_failed", "action verification did not satisfy condition", 409, payload)


@app.get("/health")
|
||||
def health(_: None = Depends(_auth)):
|
||||
return {
|
||||
"ok": True,
|
||||
return _ok(
|
||||
{
|
||||
"service": "clickthrough",
|
||||
"version": app.version,
|
||||
"time_ms": _now_ms(),
|
||||
"request_id": _request_id(),
|
||||
"dry_run": SETTINGS["dry_run"],
|
||||
"allowed_region": SETTINGS["allowed_region"],
|
||||
"exec": {
|
||||
@@ -1395,136 +1778,13 @@ def health(_: None = Depends(_auth)):
|
||||
"max_timeout_s": SETTINGS["exec_max_timeout_s"],
|
||||
},
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
@app.get("/displays")
|
||||
def displays(_: None = Depends(_auth)):
|
||||
detected = _get_displays()
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"displays": detected,
|
||||
"default_screen": 0,
|
||||
}
|
||||
|
||||
|
||||
@app.get("/screen")
|
||||
def screen(
|
||||
with_grid: bool = True,
|
||||
grid_rows: int = SETTINGS["default_grid_rows"],
|
||||
grid_cols: int = SETTINGS["default_grid_cols"],
|
||||
include_labels: bool = True,
|
||||
image_format: Literal["png", "jpeg"] = "png",
|
||||
jpeg_quality: int = 85,
|
||||
asImage: bool = False,
|
||||
screen: int = 0,
|
||||
_: None = Depends(_auth),
|
||||
):
|
||||
req = ScreenRequest(
|
||||
with_grid=with_grid,
|
||||
grid_rows=grid_rows,
|
||||
grid_cols=grid_cols,
|
||||
include_labels=include_labels,
|
||||
image_format=image_format,
|
||||
jpeg_quality=jpeg_quality,
|
||||
)
|
||||
|
||||
base_img, mon, displays, screen_selection = _capture_screen(screen)
|
||||
meta = {"region": mon, "screen": screen_selection, "displays": displays}
|
||||
out_img = base_img
|
||||
|
||||
if req.with_grid:
|
||||
out_img, grid_meta = _draw_grid(base_img, mon["x"], mon["y"], req.grid_rows, req.grid_cols, req.include_labels)
|
||||
meta.update(grid_meta)
|
||||
|
||||
if asImage:
|
||||
image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality)
|
||||
media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png"
|
||||
return Response(content=image_bytes, media_type=media_type)
|
||||
|
||||
encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"image": {
|
||||
"format": req.image_format,
|
||||
"base64": encoded,
|
||||
"width": out_img.size[0],
|
||||
"height": out_img.size[1],
|
||||
},
|
||||
"meta": meta,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/zoom")
|
||||
def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
|
||||
base_img, mon, displays, screen_selection = _capture_screen(screen)
|
||||
|
||||
cx = req.center_x - mon["x"]
|
||||
cy = req.center_y - mon["y"]
|
||||
|
||||
half_w = req.width // 2
|
||||
half_h = req.height // 2
|
||||
|
||||
left = max(0, cx - half_w)
|
||||
top = max(0, cy - half_h)
|
||||
right = min(base_img.size[0], left + req.width)
|
||||
bottom = min(base_img.size[1], top + req.height)
|
||||
|
||||
crop = base_img.crop((left, top, right, bottom))
|
||||
|
||||
region_x = mon["x"] + left
|
||||
region_y = mon["y"] + top
|
||||
|
||||
meta = {
|
||||
"source_monitor": mon,
|
||||
"screen": screen_selection,
|
||||
"displays": displays,
|
||||
"region": {
|
||||
"x": region_x,
|
||||
"y": region_y,
|
||||
"width": crop.size[0],
|
||||
"height": crop.size[1],
|
||||
},
|
||||
}
|
||||
|
||||
out_img = crop
|
||||
if req.with_grid:
|
||||
out_img, grid_meta = _draw_grid(crop, region_x, region_y, req.grid_rows, req.grid_cols, req.include_labels)
|
||||
meta.update(grid_meta)
|
||||
|
||||
if asImage:
|
||||
image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality)
|
||||
media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png"
|
||||
return Response(content=image_bytes, media_type=media_type)
|
||||
|
||||
encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
|
||||
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"image": {
|
||||
"format": req.image_format,
|
||||
"base64": encoded,
|
||||
"width": out_img.size[0],
|
||||
"height": out_img.size[1],
|
||||
},
|
||||
"meta": meta,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/action")
|
||||
def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
result = _exec_action(req, screen)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
return _ok({"displays": detected, "default_screen": 0})
|
||||
|
||||
|
||||
@app.post("/exec")
|
||||
@@ -1540,12 +1800,7 @@ def exec_command(
|
||||
raise HTTPException(status_code=401, detail="invalid exec secret")
|
||||
|
||||
result = _exec_command(req)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
return _ok(result)
|
||||
|
||||
|
||||
@app.get("/windows")
|
||||
@@ -1565,151 +1820,19 @@ def windows(
|
||||
visible_only=visible_only,
|
||||
)
|
||||
matches = _list_windows(query)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"windows": matches,
|
||||
"count": len(matches),
|
||||
}
|
||||
return _ok({"windows": matches, "count": len(matches)})
|
||||
|
||||
|
||||
@app.post("/windows/action")
|
||||
def window_action(req: WindowActionRequest, _: None = Depends(_auth)):
|
||||
result = _apply_window_action(req)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
return _ok(result)
|
||||
|
||||
|
||||
@app.post("/launch")
|
||||
def launch(req: LaunchRequest, _: None = Depends(_auth)):
|
||||
result = _launch_app(req)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/wait")
|
||||
def wait(req: WaitRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
result = _wait_for_condition(req, screen)
|
||||
return {
|
||||
"ok": result.get("satisfied", False),
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/vision/diff")
|
||||
def vision_diff(req: VisionDiffRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
result = _compute_visual_diff(req, screen)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/vision/stability")
|
||||
def vision_stability(req: VisionStabilityRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
result = _measure_stability(req, screen)
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/action/verify")
|
||||
def action_verify(req: VerifyActionRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
result = _run_verified_action(req, screen)
|
||||
return {
|
||||
"ok": result.get("success", False),
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": result,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/ocr")
|
||||
def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen)
|
||||
offset_x = region["x"] if source != "image" else 0
|
||||
offset_y = region["y"] if source != "image" else 0
|
||||
blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
|
||||
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": {
|
||||
"mode": source,
|
||||
"screen": screen_selection if source != "image" else None,
|
||||
"display": mon if source != "image" else None,
|
||||
"language_hint": req.language_hint,
|
||||
"min_confidence": req.min_confidence,
|
||||
"region": region,
|
||||
"blocks": blocks,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
@app.post("/ocr/find")
|
||||
def ocr_find(req: OCRFindRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen)
|
||||
offset_x = region["x"] if source != "image" else 0
|
||||
offset_y = region["y"] if source != "image" else 0
|
||||
blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
|
||||
matches = _find_text_matches(blocks, req.query, req.match, req.group_lines, req.max_results)
|
||||
|
||||
return {
|
||||
"ok": True,
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"result": {
|
||||
"mode": source,
|
||||
"screen": screen_selection if source != "image" else None,
|
||||
"display": mon if source != "image" else None,
|
||||
"language_hint": req.language_hint,
|
||||
"min_confidence": req.min_confidence,
|
||||
"query": req.query,
|
||||
"match": req.match,
|
||||
"group_lines": req.group_lines,
|
||||
"region": region,
|
||||
"matches": matches,
|
||||
"match_count": len(matches),
|
||||
"blocks_considered": len(blocks),
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
@app.post("/batch")
|
||||
def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
|
||||
results = []
|
||||
for index, item in enumerate(req.actions):
|
||||
try:
|
||||
item_result = _exec_action(item, screen)
|
||||
results.append({"index": index, "ok": True, "result": item_result})
|
||||
except Exception as exc:
|
||||
results.append({"index": index, "ok": False, "error": str(exc)})
|
||||
if req.stop_on_error:
|
||||
break
|
||||
|
||||
return {
|
||||
"ok": all(r["ok"] for r in results),
|
||||
"request_id": _request_id(),
|
||||
"time_ms": _now_ms(),
|
||||
"results": results,
|
||||
}
|
||||
return _ok(result)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
422
skill/SKILL.md
@@ -1,381 +1,97 @@
---
name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
---

# Clickthrough HTTP Control
# Clickthrough HTTP Control (v2)

Use a strict observe-decide-act-verify loop.
Agents do not see live desktop video. They operate on snapshots.
Use this loop: **observe -> localize -> act -> verify**.

## Getting a computer instance (user-owned setup)
## Fast defaults

The **user/operator** is responsible for provisioning and exposing the target machine.
The agent should not assume it can self-install this stack.
- Start with `POST /v2/observe` on a tight region, not full screen.
- Set `ocr_mode` to `none` unless text is required immediately.
- Use `image` tool localization for icon-heavy or dense controls.
- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
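
The last default can be sketched as a small request-body builder. This is a minimal sketch, not the project's client: it assumes the `ActVerifyRequestV2` field bounds and the per-risk defaults shown in this commit's server diff. `build_act_verify_body` and `effective_timing` are hypothetical helper names, and `condition` is passed through untouched because its schema is not shown here.

```python
from typing import Any

# Defaults the server applies per risk level when a field is omitted
# (copied from `_risk_defaults` in this commit's server diff).
RISK_DEFAULTS = {
    "low": {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150},
    "high": {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300},
}

# Inclusive bounds taken from the `ActVerifyRequestV2` field definitions.
OVERRIDE_BOUNDS = {
    "retries": (0, 10),
    "timeout_ms": (0, 120000),
    "poll_interval_ms": (50, 10000),
    "retry_delay_ms": (0, 60000),
}


def build_act_verify_body(
    resolved_target_id: str,
    condition: dict[str, Any],
    risk_level: str = "low",
    **overrides: int,
) -> dict[str, Any]:
    """Build a JSON body for POST /v2/act-verify (hypothetical client helper)."""
    if risk_level not in RISK_DEFAULTS:
        raise ValueError("risk_level must be 'low' or 'high'")
    body: dict[str, Any] = {
        "action": {"action": "click", "target": {"resolved_target_id": resolved_target_id}},
        "condition": condition,  # opaque here: the condition schema is not shown
        "risk_level": risk_level,
    }
    for name, value in overrides.items():
        if name not in OVERRIDE_BOUNDS:
            raise ValueError(f"unknown override: {name}")
        low, high = OVERRIDE_BOUNDS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        body[name] = value
    return body


def effective_timing(body: dict[str, Any]) -> dict[str, int]:
    """Return the timing values the server would use for this body."""
    defaults = RISK_DEFAULTS[body["risk_level"]]
    return {name: body.get(name, default) for name, default in defaults.items()}
```

Omitting the overrides lets the server pick retries and timeouts from `risk_level`, which is what replaces a hand-written sleep/poll loop on the client.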

### What the user must do
## Mandatory image-tool click localization

1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
3. Configure secrets on target machine:
   - `CLICKTHROUGH_TOKEN` for general API auth
   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
4. Share connection details with the agent through a secure channel:
   - `base_url`
   - `x-clickthrough-token`
   - `x-clickthrough-exec-secret` (only when `/exec` is needed)
When OCR is weak or ambiguous, ask the `image` tool for a single coordinate inside the image bounds.

### What the agent should do

1. Validate connection with `GET /health` using provided headers.
2. Refuse `/exec` attempts when exec secret is missing/invalid.
3. Ask user for missing setup inputs instead of guessing infrastructure.

## What the agent can actually see

The agent does **not** inherently see the remote desktop.
Clickthrough provides screenshots, OCR data, window metadata, and input endpoints, not native live vision.

That means:
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected

Do not write or think as if the agent is directly watching the screen in real time.
Say what you actually have: screenshots, OCR output, and fresh verification captures.

## Mini API map

- `GET /health` → server status + safety flags
- `GET /displays` → detected displays in zero-based API order
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
- `GET /windows` → discover visible desktop windows and their handles/processes
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
- `POST /launch` → start an app/process without dropping to a shell
- `POST /wait?screen=0` → wait for text, window, or visual state changes
- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
- `POST /vision/stability?screen=0` → measure short-interval visual stability
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /action/verify?screen=0` → execute one action plus structured success verification
- `POST /batch?screen=0` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)

### Display selection

- Use `GET /displays` before operating on multi-monitor systems.
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.

### OCR usage

- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.

### Screenshot + `image` tool usage

Use the OpenClaw `image` tool when OCR is not enough.
This is especially useful for:
- identifying which visible button looks like the primary confirm action
- understanding dialog layout or pane structure
- distinguishing similar nearby controls by icon, spacing, or emphasis
- checking whether a visual state changed after a click
- telling you where something is and where to click when text alone is not reliable

Good pattern:
1. capture with `GET /screen` or `POST /zoom`
2. hand that screenshot to the `image` tool
3. ask a precise question about the visible UI
4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
5. convert the answer into a concrete Clickthrough target
6. act once
7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly

Prefer vision over guessing.
If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
The model should help answer things like:
- which visible button is the real primary action
- whether the target is left/right/top/bottom within the crop
- which of several similar buttons is the one to click
- an approximate click point inside the provided image bounds

Ask narrow questions.
Good:
- "Which button in this dialog is the primary confirmation action?"
- "Is the scan still running, or does this look complete?"
- "Which of these tabs appears selected?"
- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
- "Which visible control says Stop Recording, and where should I click?"

Bad:
- "What should I click?"
- "Use your eyes and do the task"
- anything that assumes the model has live continuity without a new screenshot
- requesting coordinates without telling the model the image bounds or expected output format

### Header requirements

- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.

## `POST /action` request shape (important)

`/action` always expects an `action` plus an optional `target` object.
Do **not** invent top-level `x` / `y` fields.

Minimal pixel click:

```json
{
  "action": "click",
  "target": {"mode": "pixel", "x": 100, "y": 200},
  "button": "left",
  "clicks": 1
}
```

Minimal grid click:

```json
{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 6,
    "col": 8,
    "dx": 0.0,
    "dy": 0.0
  }
}
```

Other canonical examples:

```json
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}
```
Prompt template:
- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."

Rules:
- `dx` / `dy` belong inside `target`, not beside it.
- `type` and `hotkey` usually do not need a `target`.
- For pixel targets, `x` / `y` are global desktop coordinates.
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
- Ask for one point only.
- Include bounds in the prompt.
- If the answer is not a parseable `x,y` pair, re-ask once with a stricter format.
- Send the returned point to `POST /v2/localize` via `image_tool_point`.
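
The rules above can be sketched as a small client-side parser. This is a minimal sketch, not the project's code: `parse_click_point` and `localize_body` are hypothetical helper names, and the bounds check mirrors the server's rejection of points with `x >= width` or `y >= height` shown in the `/v2/localize` handler of this commit.

```python
import json
import re
from typing import Optional

# Fallback pattern for plain replies such as "312,188" or "x=312, y=188".
_PAIR = re.compile(r"(\d+)\s*,\s*(?:y\s*[=:]\s*)?(\d+)")


def parse_click_point(reply: str, width: int, height: int) -> Optional[tuple[int, int]]:
    """Extract one in-bounds (x, y) point from an image-tool reply, else None."""
    point = None
    # Preferred shape: a JSON object like {"x": 312, "y": 188} somewhere in the reply.
    obj = re.search(r"\{[^{}]*\}", reply)
    if obj:
        try:
            data = json.loads(obj.group(0))
            point = (int(data["x"]), int(data["y"]))
        except (ValueError, KeyError, TypeError):
            point = None
    # Fallback shape: a bare "x,y" pair.
    if point is None:
        pair = _PAIR.search(reply)
        if pair:
            point = (int(pair.group(1)), int(pair.group(2)))
    if point is None:
        return None
    x, y = point
    # Reject points outside the observation image, as the server would.
    if not (0 <= x < width and 0 <= y < height):
        return None
    return point


def localize_body(observation_id: str, point: tuple[int, int]) -> dict:
    """Build the JSON body for POST /v2/localize with an image-tool point."""
    return {"observation_id": observation_id, "image_tool_point": {"x": point[0], "y": point[1]}}
```

A `None` result is the cue to re-ask the image tool once with a stricter output format instead of sending a guessed point.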
|
||||
|
||||
## When to use `/exec`
|
||||
## API playbook
|
||||
|
||||
Prefer structured GUI control first:
|
||||
- `/screen`, `/zoom`, `/ocr` to observe
|
||||
- `/action` or `/batch` to interact
|
||||
1. **Observe**
|
||||
|
||||
Use `/exec` only when it is the cleanest available tool for the job, for example:
|
||||
- querying machine state that the GUI does not expose well
|
||||
- performing an explicit user-requested shell/system task
|
||||
- recovering from a blocked GUI flow when normal interaction failed
|
||||
```json
|
||||
POST /v2/observe?screen=0
|
||||
{
|
||||
"mode": "region",
|
||||
"region_x": 820,
|
||||
"region_y": 420,
|
||||
"region_width": 700,
|
||||
"region_height": 420,
|
||||
"include_image": true,
|
||||
"ocr_mode": "none"
|
||||
}
|
||||
```

Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
2. **Localize** (choose one)

## Core workflow (mandatory)
Text:
```json
POST /v2/localize
{"observation_id":"...","text_query":"Save","text_match":"exact"}
```

1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
3. Identify likely target region and compute an initial confidence score.
4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
5. **Before any click**, verify target identity (OCR text/icon/location consistency).
6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
7. Execute one minimal action via `POST /action`.
8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
9. Repeat until objective is complete.
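
The confidence gate in steps 3 to 7 can be sketched as a tiny decision function. This is only an illustration: `next_step` and its return labels are hypothetical, and how the confidence score itself is computed is left to the agent.

```python
ZOOM_THRESHOLD = 0.85  # threshold named in step 4 of the workflow


def next_step(confidence: float, identity_verified: bool) -> str:
    """Pick the next workflow step from the current confidence and verification state."""
    if confidence < ZOOM_THRESHOLD:
        return "zoom"    # step 4: POST /zoom with a denser grid, then re-evaluate
    if not identity_verified:
        return "verify"  # step 5: confirm target identity before any click
    return "act"         # step 7: one minimal action via POST /action
```

Verification stays a separate gate so that a high confidence score alone never triggers a click.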
Image-tool point:
```json
POST /v2/localize
{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
```

## Verify-before-click rules
3. **Act**

- Never click if target identity is ambiguous.
- Require at least two matching signals before click.
- Good signal pairs include:
  - OCR text + expected UI region
  - OCR text + matching button shape/icon nearby
  - dialog title text + expected button position within that dialog
  - known app/window focus + expected control location
  - OCR candidate + vision-model localization inside the same crop
- If confidence is low, do not "test click"; zoom and re-localize first.
- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.
```json
POST /v2/act?screen=0
{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
```

## Precision rules
4. **Verify**

- Prefer grid targets first, then use `dx/dy` for subcell precision.
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
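
To make the `dx/dy` range concrete, here is a rough sketch of how a grid cell plus a sub-cell offset could map to a pixel. The mapping is an assumption for illustration only (zero-indexed `row`/`col`, offsets scaled by half a cell from the cell center); the server's actual convention may differ, and `clamp_offset` and `grid_cell_to_pixel` are hypothetical names.

```python
from typing import Tuple


def clamp_offset(value: float) -> float:
    """Keep a dx/dy offset inside the documented [-1, 1] range."""
    return max(-1.0, min(1.0, value))


def grid_cell_to_pixel(
    region_x: int, region_y: int, region_width: int, region_height: int,
    rows: int, cols: int, row: int, col: int,
    dx: float = 0.0, dy: float = 0.0,
) -> Tuple[int, int]:
    """Map a grid cell and sub-cell offset to a pixel (assumed convention, see lead-in)."""
    cell_w = region_width / cols
    cell_h = region_height / rows
    # Start at the cell center, which corresponds to dx = dy = 0.
    center_x = region_x + (col + 0.5) * cell_w
    center_y = region_y + (row + 0.5) * cell_h
    # Assumed: an offset of +/-1 reaches the cell edge.
    x = center_x + clamp_offset(dx) * cell_w / 2
    y = center_y + clamp_offset(dy) * cell_h / 2
    return round(x), round(y)
```

Under this convention an offset can never leave its cell, which is why zooming is preferable to guessing larger offsets.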

```json
POST /v2/act-verify?screen=0
{
  "action":{"action":"click","target":{"resolved_target_id":"..."}},
  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
  "risk_level":"low"
}
```
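
A request body like the one above could be assembled programmatically. The field names are copied from the example; `act_verify_body` is a hypothetical helper, and only the click action with a visual-change condition is covered.

```python
from typing import Dict, Tuple


def act_verify_body(
    resolved_target_id: str,
    region: Tuple[int, int, int, int],
    risk_level: str = "low",
) -> Dict[str, object]:
    """Build a POST /v2/act-verify body: click a resolved target, expect a visual change."""
    region_x, region_y, region_width, region_height = region
    return {
        "action": {
            "action": "click",
            "target": {"resolved_target_id": resolved_target_id},
        },
        "condition": {
            "kind": "visual",
            "state": "change",
            "region_x": region_x,
            "region_y": region_y,
            "region_width": region_width,
            "region_height": region_height,
        },
        "risk_level": risk_level,
    }
```

Reusing the region from the preceding observe call keeps the verification crop aligned with what was actually inspected.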

## Safety rules
## Risk policy

- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use `/batch`.
- Low risk (navigation, focus, benign clicks): single verification signal.
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
- Never do speculative repeat clicks; switch strategy after one failed verify.
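
The low/high split could be enforced client-side with a small gate like the sketch below. The intent labels and function names are hypothetical; only the `low`/`high` values and the one-check versus two-check requirement come from the policy above.

```python
# Intent labels are illustrative; the policy lists delete/send/purchase/close-lossy.
HIGH_RISK_INTENTS = {"delete", "send", "purchase", "close-lossy"}


def risk_level(intent: str) -> str:
    """Classify an intended action as 'high' or 'low' risk."""
    return "high" if intent in HIGH_RISK_INTENTS else "low"


def required_checks(level: str) -> int:
    """High risk needs two checks before acting; low risk needs one signal."""
    return 2 if level == "high" else 1


def may_act(intent: str, passed_checks: int) -> bool:
    """Allow the action only when enough independent checks have passed."""
    return passed_checks >= required_checks(risk_level(intent))
```

For example, `may_act("delete", 1)` is `False`, so a delete with only an OCR match would be held back until a second signal confirms the target.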

## Reliability rules
## Anti-latency rules

- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
- Never repeat full-screen OCR by default.
- Re-observe only the active pane/region.
- Prefer keyboard + window APIs for app switching.
- Use OCR on region only and cap area with `max_ocr_area_px`.

## Fallback ladder for uncertain targeting
## Setup and auth

1. Full-screen capture with a coarse grid.
2. Zoom into the candidate area with a denser grid.
3. OCR the full screen or the tighter region.
4. Re-anchor on a more reliable nearby control, title, or label.
5. Try a keyboard-first flow if the app supports it.
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.

Do not skip from "uncertain click" straight to random retries.

## Concrete screenshot -> `image` -> action example

Example loop:
1. `GET /screen?screen=0` to capture the current app state
2. if the UI is text-heavy, try `POST /ocr` first
3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
   - "In this save dialog, which visible button is the primary action?"
   - "Is there a dismiss/close button in the top-right of this modal?"
4. map the answer back to a Clickthrough target using the returned grid/region metadata
5. click once with `POST /action`
6. recapture the screen
7. optionally use `POST /wait` or another `image`/OCR check to confirm the result

The key rule is simple: screenshot first, interpret second, click third, verify fourth.
Do not collapse those steps into fake certainty.
When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.

## App-specific playbooks (recommended)

Build per-app routines for repetitive tasks instead of generic clicking.

### Launcher / search / start app playbook

Use this when the goal is "open app X" or "bring up tool Y".

1. check `GET /windows` first in case the app is already open
2. if present, use `POST /windows/action` to focus or restore it
3. if absent, prefer `POST /launch` when you know the executable path
4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
   - open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
   - type exact app name
   - wait for stable results with `POST /wait` or recapture
   - verify the result text with OCR or the `image` tool
   - press Enter or click the exact result once
5. verify the app window now exists or is focused

Do not keep relaunching if the window already exists; that’s sloppy.

### Dialog confirmation playbook

Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.

1. capture the dialog region with `POST /zoom`
2. use OCR first for title/body/button labels
3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
5. for destructive actions, require explicit user confirmation unless already requested
6. click once and verify the dialog disappeared or changed state

Good verification targets:
- dialog title vanished
- expected next window appeared
- destructive side effect is visible and confirmed

### File picker playbook

Use for open/save dialogs.

1. verify the file picker window is focused
2. OCR the visible breadcrumb/path area, filename field, and button row
3. prefer keyboard-first entry when possible:
   - type or paste the target path/name into the focused field
   - use `tab` / `shift+tab` to move predictably between filename and action buttons
4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
5. verify the intended filename/path is visible before confirming
6. activate `Open` / `Save` once and verify the picker closes

If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.

### Browser tab / window playbook

Use for browser navigation, tab targeting, or web app recovery.

1. use `GET /windows` to focus the correct browser window first
2. prefer keyboard-first navigation:
   - `ctrl+l` / `cmd+l` to focus the address bar
   - `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
   - `ctrl+w` only for explicitly requested close actions
3. verify tab or page identity with OCR on the tab strip or page heading
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
5. after navigation, wait for visual stability or expected text before taking the next action
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters

Do not assume a page loaded just because the click landed. Verify it.

### Settings / preferences navigation playbook

Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.

1. identify the current settings page with OCR on the heading/sidebar
2. use OCR to find the specific section label before trying to toggle anything
3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
5. after each change, verify the control state changed visually or via visible text
6. if a save/apply button exists, treat it as a separate confirmation step and verify completion

Settings UIs love hiding side effects. Assume nothing.

### Dense app / control-strip playbook

Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.

1. focus the exact app window with `POST /windows/action`
2. capture the full target display once to confirm the window is actually frontmost
3. crop tightly around the suspected control strip with `POST /zoom`
4. run OCR on the crop, not the full screen
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)

Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.

### Spotify playbook

- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
  1) `Ctrl+L` (search)
  2) type exact query
  3) Enter
  4) verify exact song+artist text
  5) click/double-click row
  6) verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
- Include `x-clickthrough-token` when token auth is enabled.
- `/exec` additionally requires `x-clickthrough-exec-secret`.
- Validate server first: `GET /health`.
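
The header rules above can be sketched as a small helper. The two header names come from the text; `build_headers` is a hypothetical function, and the `Content-Type` default is an assumption about a JSON API rather than a documented requirement.

```python
from typing import Dict, Optional


def build_headers(token: Optional[str] = None, exec_secret: Optional[str] = None) -> Dict[str, str]:
    """Assemble request headers; only include auth headers that are actually configured."""
    headers = {"Content-Type": "application/json"}  # assumed default for JSON bodies
    if token:
        # Sent on every request when token auth is enabled.
        headers["x-clickthrough-token"] = token
    if exec_secret:
        # Only needed for /exec, in addition to the token.
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers
```

Passing `exec_secret` only when building an `/exec` request keeps the secret off routine calls such as `GET /health`.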