From aced5be25e476780edcd5ff05fb6e45da925c8ce Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Paul=20W=C3=A4hner?= Date: Sun, 3 May 2026 19:11:11 +0200 Subject: [PATCH] feat: migrate to v2-only API and unified response envelope --- README.md | 85 ++--- docs/API.md | 641 +++++--------------------------------- examples/quickstart.py | 31 +- server/app.py | 691 ++++++++++++++++++++++++----------------- skill/SKILL.md | 422 ++++--------------------- 5 files changed, 603 insertions(+), 1267 deletions(-) diff --git a/README.md b/README.md index f0be27e..ae7ef4f 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,25 @@ # Clickthrough -Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions. +Let an agent interact with a computer over HTTP. + +## Primary mode (v2) + +Use the v2 contract for faster, less OCR-heavy control loops: +- `POST /v2/observe` +- `POST /v2/localize` +- `POST /v2/act` +- `POST /v2/act-verify` + +This is optimized for agents that cannot directly see the screen and must use screenshot/image tools. ## What this provides -- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes) -- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported) -- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ... 
-- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey -- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action` -- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch` -- **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait` -- **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability` -- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find` -- **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify` -- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec` -- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels -- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction +- Screen/region capture with optional OCR and timing stats +- Observation IDs for deterministic follow-up localization +- Text localization and image-tool coordinate localization +- Action execution with resolved target IDs +- Risk-aware action+verification defaults +- Unified response envelope across all endpoints ## Quick start @@ -30,53 +33,17 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app Server defaults to `127.0.0.1:8123`. -For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird. +## Fast control loop -`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically. +1. `POST /v2/observe` on a tight region +2. If OCR is enough, `POST /v2/localize` with `text_query` +3. If ambiguous, ask image tool for one x,y in observation bounds +4. `POST /v2/localize` with `image_tool_point` +5. 
`POST /v2/act` or `POST /v2/act-verify` +6. Re-observe only changed region -## Minimal API flow +## See docs -1. `GET /displays` if you need a non-primary monitor -2. `GET /screen?screen=0` with grid -3. Decide cell / target -4. Optional `POST /zoom?screen=0` for finer targeting -5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow) -6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find` - -Important: -- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields. -- Pixel coordinates and OCR bounding boxes are always global desktop coordinates. -- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata. -- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation. -- Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`. 
- -See: - `docs/API.md` -- `docs/coordinate-system.md` - `skill/SKILL.md` - -## Configuration - -Environment variables: - -- `CLICKTHROUGH_HOST` (default `127.0.0.1`) -- `CLICKTHROUGH_PORT` (default `8123`) -- `CLICKTHROUGH_TOKEN` (optional; if set, require `x-clickthrough-token` header) -- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`) -- `CLICKTHROUGH_GRID_ROWS` (default `12`) -- `CLICKTHROUGH_GRID_COLS` (default `12`) -- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`) -- `CLICKTHROUGH_EXEC_ENABLED` (default `true`) -- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**) -- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`) -- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`) -- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`) -- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`) -- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable) - -Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing. - -## Gitea CI - -A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`. -It runs Python syntax checks (`py_compile`) on every push and pull request. +- `docs/coordinate-system.md` diff --git a/docs/API.md b/docs/API.md index 76cfa59..54c7466 100644 --- a/docs/API.md +++ b/docs/API.md @@ -1,614 +1,141 @@ -# API Reference (v0.1) +# API Reference (v2) Base URL: `http://127.0.0.1:8123` -If `CLICKTHROUGH_TOKEN` is set, include header: +If `CLICKTHROUGH_TOKEN` is set, include: ```http x-clickthrough-token: ``` -## `GET /health` +## Endpoints -Returns status and runtime safety flags, including `exec` capability config. +- `POST /v2/observe` +- `POST /v2/localize` +- `POST /v2/act` +- `POST /v2/act-verify` +- `GET /health` +- `GET /displays` +- `GET /windows` +- `POST /windows/action` +- `POST /launch` +- `POST /exec` -## `GET /displays` +No v1 endpoints are supported. 
-Returns detected displays in API screen order. - -```json -{ - "ok": true, - "default_screen": 0, - "displays": [ - {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080}, - {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080} - ] -} -``` - -`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend. -Invalid `screen` values fall back to `0`. - -## `GET /screen` - -Query params: - -- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0` -- `with_grid` (bool, default `true`) -- `grid_rows` (int, default env or `12`) -- `grid_cols` (int, default env or `12`) -- `include_labels` (bool, default `true`) -- `image_format` (`png`|`jpeg`, default `png`) -- `jpeg_quality` (1-100, default `85`) -- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`) - -Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`). -`meta.region` uses global desktop coordinates. - -These image-returning endpoints do not magically grant the agent live vision. -If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI. - -## `POST /zoom` - -Body: - -```json -{ - "center_x": 1200, - "center_y": 700, - "width": 500, - "height": 350, - "with_grid": true, - "grid_rows": 20, - "grid_cols": 20, - "include_labels": true, - "image_format": "png", - "jpeg_quality": 90 -} -``` - -Query params: - -- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0` -- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`) - -Default response returns cropped image + region metadata in global pixel coordinates. 
`center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base. - -`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout. - -## `POST /action` - -Body: one action. - -Important: -- the request body uses `action` plus an optional `target` -- pixel coordinates live inside `target` when `target.mode="pixel"` -- do **not** send top-level `x` / `y` fields - -Query params: - -- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0` - -Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target. - -### Pointer target modes - -#### Pixel target - -```json -{ - "mode": "pixel", - "x": 100, - "y": 200, - "dx": 0, - "dy": 0 -} -``` - -#### Grid target - -```json -{ - "mode": "grid", - "region_x": 0, - "region_y": 0, - "region_width": 1920, - "region_height": 1080, - "rows": 12, - "cols": 12, - "row": 5, - "col": 9, - "dx": 0.0, - "dy": 0.0 -} -``` - -`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell. 
- -### Action examples - -Click: - -```json -{ - "action": "click", - "target": { - "mode": "grid", - "region_x": 0, - "region_y": 0, - "region_width": 1920, - "region_height": 1080, - "rows": 12, - "cols": 12, - "row": 7, - "col": 3, - "dx": 0.2, - "dy": -0.1 - }, - "clicks": 1, - "button": "left" -} -``` - -Scroll: - -```json -{ - "action": "scroll", - "target": {"mode": "pixel", "x": 1300, "y": 740}, - "scroll_amount": -500 -} -``` - -Type text: - -```json -{ - "action": "type", - "text": "hello world", - "interval_ms": 20 -} -``` - -Hotkey: - -```json -{ - "action": "hotkey", - "keys": ["ctrl", "l"] -} -``` - -Right click: - -```json -{ - "action": "right_click", - "target": {"mode": "pixel", "x": 1300, "y": 740} -} -``` - -Move only: - -```json -{ - "action": "move", - "target": {"mode": "pixel", "x": 1300, "y": 740}, - "duration_ms": 150 -} -``` - -## `GET /windows` - -List desktop windows using structured filters instead of shelling out. - -Query params: - -- `title_contains` (optional substring match) -- `title_regex` (optional case-insensitive regex) -- `process_name` (optional exact process name, e.g. `explorer.exe`) -- `hwnd` (optional exact window handle) -- `visible_only` (bool, default `true`) - -```json -{ - "ok": true, - "count": 1, - "windows": [ - { - "hwnd": 132640, - "title": "WinDirStat", - "class_name": "WinDirStatMainWindow", - "pid": 18420, - "process_name": "windirstat.exe", - "visible": true, - "enabled": true, - "minimized": false, - "maximized": false, - "foreground": true, - "rect": {"x": 194, "y": 116, "width": 1532, "height": 870} - } - ] -} -``` - -Notes: -- Currently supported on Windows hosts only. -- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows. - -## `POST /windows/action` - -Perform a structured window action against exactly one matched window. 
- -```json -{ - "action": "focus", - "title_contains": "WinDirStat", - "visible_only": true, - "timeout_ms": 3000 -} -``` - -Supported actions: -- `focus` -- `restore` -- `minimize` -- `maximize` -- `close` - -The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared). - -## `POST /launch` - -Start an app/process without invoking a shell. - -```json -{ - "executable": "C:/Program Files/WinDirStat/WinDirStat.exe", - "args": [], - "cwd": "C:/Program Files/WinDirStat", - "wait_for_window": true, - "match": { - "title_contains": "WinDirStat", - "visible_only": true - }, - "timeout_ms": 8000 -} -``` - -Notes: -- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD. -- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`. -- `dry_run=true` returns the resolved argv/cwd without launching. - -## `POST /vision/diff` - -Measure whether a screen region changed meaningfully between two captures. - -Query params: - -- `screen` (int, default `0`) - used for `mode=screen` and `mode=region` - -Compare live captures: +## `POST /v2/observe` ```json { "mode": "region", - "region_x": 120, - "region_y": 80, - "region_width": 600, - "region_height": 300, - "delay_ms": 400, - "diff_threshold": 0.01 + "region_x": 800, + "region_y": 420, + "region_width": 700, + "region_height": 420, + "include_image": true, + "image_format": "jpeg", + "jpeg_quality": 75, + "ocr_mode": "region", + "language_hint": "eng", + "min_confidence": 0.45, + "max_ocr_area_px": 1500000, + "group_lines": true } ``` -Compare provided images: +Returns observation metadata, optional image, OCR blocks/lines, and timing fields. 
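A minimal client-side sketch of the observe call, assuming the server default `http://127.0.0.1:8123` and the unified response envelope (`ok`, `data`, `error`) documented in this reference. The helper names `build_observe_body`, `unwrap`, and `observe` are illustrative and not part of Clickthrough; only the standard library is used.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:8123"  # server default; adjust for your host


def build_observe_body(x, y, width, height, ocr=True):
    """Build a region-mode /v2/observe body (field names from the example above)."""
    return {
        "mode": "region",
        "region_x": x,
        "region_y": y,
        "region_width": width,
        "region_height": height,
        "include_image": False,
        "ocr_mode": "region" if ocr else "none",
    }


def unwrap(envelope):
    """Return envelope['data'] on success; raise with the error code otherwise."""
    if envelope.get("ok"):
        return envelope["data"]
    error = envelope.get("error") or {}
    raise RuntimeError(f"{error.get('code')}: {error.get('message')}")


def observe(body, token=None, screen=0):
    """POST the body to /v2/observe and return the unwrapped data (hypothetical helper)."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["x-clickthrough-token"] = token
    request = urllib.request.Request(
        f"{BASE_URL}/v2/observe?screen={screen}",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=20) as response:
            raw = response.read()
    except urllib.error.HTTPError as exc:
        # Assumption: non-2xx responses still carry the JSON envelope.
        raw = exc.read()
    return unwrap(json.loads(raw))
```

With a running server, `observe(build_observe_body(800, 420, 700, 420))["observation_id"]` would give the ID to pass to `POST /v2/localize`. Keeping the region tight keeps the OCR area below `max_ocr_area_px`.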
+ +## `POST /v2/localize` + +Text localization: ```json { - "mode": "image", - "before_image_base64": "iVBORw0KGgoAAA...", - "after_image_base64": "iVBORw0KGgoBBB...", - "diff_threshold": 0.01 + "observation_id": "...", + "text_query": "Save", + "text_match": "exact", + "candidate_index": 0 } ``` -Response includes: -- `diff_ratio` — average normalized pixel difference -- `changed` — whether `diff_ratio >= diff_threshold` -- `region` — compared region - -## `POST /vision/stability` - -Measure whether a screen region stays visually stable over a short interval. - -Query params: - -- `screen` (int, default `0`) +Image-tool point localization: ```json { - "region_x": 0, - "region_y": 0, - "region_width": 1920, - "region_height": 1080, - "sample_interval_ms": 250, - "duration_ms": 1200, - "diff_threshold": 0.005 + "observation_id": "...", + "image_tool_point": {"x": 312, "y": 188} } ``` -Response includes: -- `stable` -- `sample_count` -- `max_diff_ratio` -- `avg_diff_ratio` +Returns `resolved_target_id`, global pixel, and `localization_confidence`. -## `POST /wait` - -Wait on a structured UI condition instead of guessing sleep durations. 
- -Query params: - -- `screen` (int, default `0`) - used for text and visual waits - -### Wait for text to appear - -```json -{ - "condition": { - "kind": "text", - "mode": "screen", - "text": "Scan complete", - "match": "contains", - "present": true, - "language_hint": "eng", - "min_confidence": 0.4 - }, - "timeout_ms": 15000, - "poll_interval_ms": 400 -} -``` - -### Wait for a window state - -```json -{ - "condition": { - "kind": "window", - "title_contains": "WinDirStat", - "visible_only": true, - "state": "focused" - }, - "timeout_ms": 5000, - "poll_interval_ms": 200 -} -``` - -Window states: -- `exists` -- `focused` -- `closed` - -### Wait for visual change or stability - -```json -{ - "condition": { - "kind": "visual", - "state": "stable", - "region_x": 0, - "region_y": 0, - "region_width": 1920, - "region_height": 1080, - "diff_threshold": 0.005, - "stable_for_ms": 1000 - }, - "timeout_ms": 12000, - "poll_interval_ms": 300 -} -``` - -Visual states: -- `change` — succeeds when the average pixel diff crosses `diff_threshold` -- `stable` — succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms` - -Notes: -- Text waits reuse the OCR pipeline and return matching OCR blocks on success. -- Window waits build on the structured window discovery endpoint. -- Visual waits compare repeated captures of either the full selected display or an explicit region. - -## `POST /action/verify` - -Execute one action and wait for a structured success condition. 
- -Query params: - -- `screen` (int, default `0`) +## `POST /v2/act` ```json { "action": { "action": "click", - "target": {"mode": "pixel", "x": 1300, "y": 740} + "target": {"resolved_target_id": "..."}, + "button": "left", + "clicks": 1 + } +} +``` + +## `POST /v2/act-verify` + +```json +{ + "action": { + "action": "click", + "target": {"resolved_target_id": "..."} }, "condition": { "kind": "text", - "mode": "screen", - "text": "Settings", + "mode": "region", + "text": "Saved", "match": "contains", "present": true, - "language_hint": "eng", + "region_x": 820, + "region_y": 420, + "region_width": 500, + "region_height": 140, "min_confidence": 0.4 }, - "retries": 1, - "timeout_ms": 4000, - "poll_interval_ms": 250, - "retry_delay_ms": 250 + "risk_level": "low" } ``` -Condition kinds mirror `POST /wait`: -- `text` -- `window` -- `visual` +Risk defaults: +- `low`: retries `0`, timeout `2500ms` +- `high`: retries `1`, timeout `6000ms` -The response returns per-attempt action output plus structured verification output. +## Response envelope -## `POST /ocr` - -Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes. 
- -Query params: - -- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0` - -Body: - -```json -{ - "mode": "screen", - "language_hint": "eng", - "min_confidence": 0.4 -} -``` - -Modes: -- `screen` (default): OCR over full selected monitor -- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`) -- `image`: OCR over provided `image_base64` (supports plain base64 or data URL) - -Region mode example: - -```json -{ - "mode": "region", - "region_x": 220, - "region_y": 160, - "region_width": 900, - "region_height": 400, - "language_hint": "eng", - "min_confidence": 0.5 -} -``` - -Image mode example: - -```json -{ - "mode": "image", - "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...", - "language_hint": "eng" -} -``` - -Response shape: +Success: ```json { "ok": true, "request_id": "...", "time_ms": 1710000000000, - "result": { - "mode": "screen", - "language_hint": "eng", - "min_confidence": 0.4, - "region": {"x": 0, "y": 0, "width": 1920, "height": 1080}, - "blocks": [ - { - "text": "Settings", - "confidence": 0.9821, - "bbox": {"x": 144, "y": 92, "width": 96, "height": 21} - } - ] + "data": { }, + "error": null +} +``` + +Error: + +```json +{ + "ok": false, + "request_id": "...", + "time_ms": 1710000000000, + "data": null, + "error": { + "code": "http_error", + "message": "...", + "details": {} } } ``` - -Notes: -- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right). -- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`. -- Requires `tesseract` executable plus Python package `pytesseract`. -- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path. - -## `POST /ocr/find` - -Search OCR output for matching text instead of post-processing raw OCR blocks client-side. 
- -Query params: - -- `screen` (int, default `0`) - used for `mode=screen` and `mode=region` - -```json -{ - "mode": "screen", - "query": "Settings", - "match": "contains", - "group_lines": true, - "max_results": 10, - "language_hint": "eng", - "min_confidence": 0.4 -} -``` - -Modes: -- `screen` -- `region` -- `image` - -Options: -- `match`: `contains`, `exact`, or `regex` -- `group_lines=true`: combine nearby OCR words into line-level candidates before matching -- `max_results`: result cap after confidence sorting - -Response includes: -- `matches` — confidence-sorted candidate matches -- `match_count` -- `blocks_considered` - -## `POST /exec` - -Execute a shell command on the host running Clickthrough. - -Requirements: -- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server -- send header `x-clickthrough-exec-secret: ` - -```json -{ - "command": "Get-Process | Select-Object -First 5", - "shell": "powershell", - "timeout_s": 20, - "cwd": "C:/Users/Paul", - "dry_run": false -} -``` - -Notes: -- `shell` supports `powershell`, `bash`, `cmd` -- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL` -- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` -- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false` -- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`) - -Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata. - -## `POST /batch` - -Runs multiple `action` payloads sequentially. 
- -Query params: - -- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0` - -```json -{ - "actions": [ - {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}}, - {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}} - ], - "stop_on_error": true -} -``` diff --git a/examples/quickstart.py b/examples/quickstart.py index 5aba923..3ad8ce2 100644 --- a/examples/quickstart.py +++ b/examples/quickstart.py @@ -13,23 +13,26 @@ if TOKEN: def main(): - r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10) - r.raise_for_status() - print("health:", r.json()) + health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10) + health.raise_for_status() + print("health ok:", health.json().get("ok")) - d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10) - d.raise_for_status() - print("displays:", d.json().get("displays", [])) - - s = requests.get( - f"{BASE_URL}/screen", + observe = requests.post( + f"{BASE_URL}/v2/observe", headers=headers, - params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12}, - timeout=30, + params={"screen": SCREEN}, + json={ + "mode": "screen", + "include_image": False, + "ocr_mode": "none", + }, + timeout=20, ) - s.raise_for_status() - payload = s.json() - print("screen meta:", payload.get("meta", {})) + observe.raise_for_status() + payload = observe.json()["data"] + print("observation_id:", payload["observation_id"]) + print("region:", payload["region"]) + print("timing_ms:", payload["timing_ms"]) if __name__ == "__main__": diff --git a/server/app.py b/server/app.py index dca2378..bd43ed6 100644 --- a/server/app.py +++ b/server/app.py @@ -8,10 +8,12 @@ import subprocess import sys import time import uuid -from typing import Literal, Optional +from typing import Any, Literal, Optional from dotenv import load_dotenv -from fastapi import Depends, FastAPI, Header, HTTPException, Response +from 
fastapi import Depends, FastAPI, Header, HTTPException, Request +from fastapi.exceptions import RequestValidationError +from fastapi.responses import JSONResponse from PIL import ImageChops, ImageStat from pydantic import BaseModel, Field, model_validator @@ -21,6 +23,55 @@ load_dotenv(dotenv_path=".env", override=False) app = FastAPI(title="clickthrough", version="0.1.0") +def _ok(data: Any, status_code: int = 200): + return JSONResponse( + status_code=status_code, + content={ + "ok": True, + "request_id": _request_id(), + "time_ms": _now_ms(), + "data": data, + "error": None, + }, + ) + + +def _err(code: str, message: str, status_code: int, details: Any = None): + return JSONResponse( + status_code=status_code, + content={ + "ok": False, + "request_id": _request_id(), + "time_ms": _now_ms(), + "data": None, + "error": { + "code": code, + "message": message, + "details": details, + }, + }, + ) + + +@app.exception_handler(HTTPException) +async def _http_exception_handler(_: Request, exc: HTTPException): + detail = exc.detail + if isinstance(detail, dict): + message = str(detail.get("message", "request failed")) + return _err("http_error", message, exc.status_code, detail) + return _err("http_error", str(detail), exc.status_code) + + +@app.exception_handler(Exception) +async def _unhandled_exception_handler(_: Request, exc: Exception): + return _err("internal_error", "internal server error", 500, {"type": type(exc).__name__}) + + +@app.exception_handler(RequestValidationError) +async def _validation_exception_handler(_: Request, exc: RequestValidationError): + return _err("validation_error", "request validation failed", 422, [{k: v for k, v in e.items() if k != "ctx"} for e in exc.errors()]) # "ctx" can hold exception objects that JSONResponse cannot serialize + + def _env_bool(name: str, default: bool) -> bool: raw = os.getenv(name) if raw is None: @@ -288,6 +339,144 @@ class VerifyActionRequest(BaseModel): stop_on_action_error: bool = True +class ObserveRequestV2(BaseModel): + mode: Literal["screen", "region"] = "screen" + region_x: int | None = Field(default=None, ge=0) + region_y: 
int | None = Field(default=None, ge=0) + region_width: int | None = Field(default=None, gt=0) + region_height: int | None = Field(default=None, gt=0) + include_image: bool = True + image_format: Literal["png", "jpeg"] = "jpeg" + jpeg_quality: int = Field(default=75, ge=1, le=100) + ocr_mode: Literal["none", "region", "screen"] = "none" + language_hint: str | None = Field(default=None, min_length=1, max_length=64) + min_confidence: float = Field(default=0.4, ge=0.0, le=1.0) + max_ocr_area_px: int | None = Field(default=1_500_000, ge=1000) + group_lines: bool = True + + @model_validator(mode="after") + def _validate_region(self): + if self.mode == "region": + required = [self.region_x, self.region_y, self.region_width, self.region_height] + if any(v is None for v in required): + raise ValueError("region_x, region_y, region_width, region_height are required for mode=region") + return self + + +class ImageToolPoint(BaseModel): + x: int = Field(ge=0) + y: int = Field(ge=0) + + +class LocalizeRequestV2(BaseModel): + observation_id: str = Field(min_length=1, max_length=128) + text_query: str | None = Field(default=None, max_length=512) + text_match: Literal["contains", "exact", "regex"] = "contains" + image_tool_point: ImageToolPoint | None = None + candidate_index: int = Field(default=0, ge=0) + + @model_validator(mode="after") + def _validate_selector(self): + has_text = bool((self.text_query or "").strip()) + has_point = self.image_tool_point is not None + if has_text == has_point: + raise ValueError("provide exactly one of text_query or image_tool_point") + return self + + +class ActionTargetV2(BaseModel): + resolved_target_id: str | None = Field(default=None, max_length=128) + pixel_x: int | None = None + pixel_y: int | None = None + + @model_validator(mode="after") + def _validate_shape(self): + has_resolved = bool(self.resolved_target_id) + has_pixel = self.pixel_x is not None or self.pixel_y is not None + if has_resolved == has_pixel: + raise ValueError("provide 
either resolved_target_id or pixel_x/pixel_y") + if has_pixel and (self.pixel_x is None or self.pixel_y is None): + raise ValueError("pixel_x and pixel_y are both required") + return self + + +class ActionRequestV2(BaseModel): + action: Literal[ + "move", + "click", + "right_click", + "double_click", + "middle_click", + "scroll", + "type", + "hotkey", + ] + target: ActionTargetV2 | None = None + duration_ms: int = Field(default=0, ge=0, le=20000) + button: Literal["left", "right", "middle"] = "left" + clicks: int = Field(default=1, ge=1, le=10) + scroll_amount: int = 0 + text: str = "" + keys: list[str] = Field(default_factory=list) + interval_ms: int = Field(default=20, ge=0, le=5000) + dry_run: bool = False + + +class ActRequestV2(BaseModel): + action: ActionRequestV2 + + +class ActVerifyRequestV2(BaseModel): + action: ActionRequestV2 + condition: WaitTextCondition | WaitWindowCondition | WaitVisualCondition + risk_level: Literal["low", "high"] = "low" + retries: int | None = Field(default=None, ge=0, le=10) + timeout_ms: int | None = Field(default=None, ge=0, le=120000) + poll_interval_ms: int | None = Field(default=None, ge=50, le=10000) + retry_delay_ms: int | None = Field(default=None, ge=0, le=60000) + stop_on_action_error: bool = True + + +OBSERVATIONS: dict[str, dict[str, Any]] = {} +RESOLVED_TARGETS: dict[str, dict[str, Any]] = {} + + +def _get_observation(observation_id: str) -> dict[str, Any]: + observation = OBSERVATIONS.get(observation_id) + if observation is None: + raise HTTPException(status_code=404, detail="observation_id not found") + return observation + + +def _resolve_v2_action(req: ActionRequestV2) -> ActionRequest: + target: Target | None = None + if req.target is not None: + if req.target.resolved_target_id: + item = RESOLVED_TARGETS.get(req.target.resolved_target_id) + if item is None: + raise HTTPException(status_code=404, detail="resolved_target_id not found") + target = PixelTarget(mode="pixel", x=item["x"], y=item["y"], dx=0, dy=0) + 
else: + target = PixelTarget(mode="pixel", x=req.target.pixel_x or 0, y=req.target.pixel_y or 0, dx=0, dy=0) + return ActionRequest( + action=req.action, + target=target, + duration_ms=req.duration_ms, + button=req.button, + clicks=req.clicks, + scroll_amount=req.scroll_amount, + text=req.text, + keys=req.keys, + interval_ms=req.interval_ms, + dry_run=req.dry_run, + ) + + +def _risk_defaults(risk_level: str) -> dict[str, int]: + if risk_level == "high": + return {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300} + return {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150} + def _auth(x_clickthrough_token: Optional[str] = Header(default=None)): token = SETTINGS["token"] @@ -1377,154 +1566,225 @@ def _exec_action(req: ActionRequest, screen: int = 0) -> dict: } +def _localization_confidence(source: str, confidence: float | None = None) -> str: + if source == "image_tool_point": + return "high" + if source == "ocr" and confidence is not None: + if confidence >= 0.8: + return "high" + if confidence >= 0.55: + return "medium" + return "low" + + +@app.post("/v2/observe") +def observe_v2(req: ObserveRequestV2, screen: int = 0, _: None = Depends(_auth)): + capture_started = time.perf_counter() + image, region, mon, displays, screen_selection = _capture_region_image( + screen, + req.region_x if req.mode == "region" else None, + req.region_y if req.mode == "region" else None, + req.region_width if req.mode == "region" else None, + req.region_height if req.mode == "region" else None, + ) + capture_ms = int((time.perf_counter() - capture_started) * 1000) + + encoded = None + if req.include_image: + encoded = _encode_image(image, req.image_format, req.jpeg_quality) + + ocr_started = time.perf_counter() + blocks: list[dict] = [] + grouped_lines: list[dict] = [] + ocr_applied_mode = "none" + if req.ocr_mode != "none": + if req.ocr_mode == "screen": + ocr_image, ocr_region, _, _, _ = _capture_region_image(screen, None, 
None, None, None) + else: + ocr_image, ocr_region = image, region + + area = ocr_region["width"] * ocr_region["height"] + if req.max_ocr_area_px is not None and area > req.max_ocr_area_px: + raise HTTPException( + status_code=400, + detail=f"ocr area {area} exceeds max_ocr_area_px {req.max_ocr_area_px}", + ) + + blocks = _run_ocr( + ocr_image, + req.language_hint, + req.min_confidence, + ocr_region["x"], + ocr_region["y"], + ) + if req.group_lines: + grouped_lines = _group_ocr_lines(blocks) + ocr_applied_mode = req.ocr_mode + ocr_ms = int((time.perf_counter() - ocr_started) * 1000) + + observation_id = _request_id() + OBSERVATIONS[observation_id] = { + "id": observation_id, + "region": region, + "screen": screen_selection, + "display": mon, + "image_width": image.size[0], + "image_height": image.size[1], + "ocr_blocks": blocks, + "ocr_lines": grouped_lines, + "created_at_ms": _now_ms(), + } + + return _ok( + { + "observation_id": observation_id, + "region": region, + "screen": screen_selection, + "display": mon, + "image": { + "included": req.include_image, + "format": req.image_format if req.include_image else None, + "base64": encoded, + "width": image.size[0], + "height": image.size[1], + }, + "ocr": { + "mode": ocr_applied_mode, + "min_confidence": req.min_confidence, + "language_hint": req.language_hint, + "block_count": len(blocks), + "line_count": len(grouped_lines), + "blocks": blocks, + "lines": grouped_lines, + }, + "timing_ms": { + "capture_ms": capture_ms, + "ocr_ms": ocr_ms if req.ocr_mode != "none" else 0, + "total_ms": capture_ms + (ocr_ms if req.ocr_mode != "none" else 0), + }, + } + ) + + +@app.post("/v2/localize") +def localize_v2(req: LocalizeRequestV2, _: None = Depends(_auth)): + observation = _get_observation(req.observation_id) + region = observation["region"] + image_width = observation["image_width"] + image_height = observation["image_height"] + + if req.image_tool_point is not None: + if req.image_tool_point.x >= image_width or 
req.image_tool_point.y >= image_height: + raise HTTPException(status_code=400, detail="image_tool_point outside observation image bounds") + x = region["x"] + req.image_tool_point.x + y = region["y"] + req.image_tool_point.y + _enforce_allowed_region(x, y) + resolved_target_id = _request_id() + RESOLVED_TARGETS[resolved_target_id] = { + "id": resolved_target_id, + "observation_id": req.observation_id, + "x": x, + "y": y, + "source": "image_tool_point", + } + return _ok( + { + "resolved_target_id": resolved_target_id, + "source": "image_tool_point", + "localization_confidence": _localization_confidence("image_tool_point"), + "pixel": {"x": x, "y": y}, + "observation_region": region, + "image_bounds": {"width": image_width, "height": image_height}, + } + ) + + lines = observation.get("ocr_lines") or _group_ocr_lines(observation.get("ocr_blocks", [])) + matches = _find_text_matches(lines, req.text_query or "", req.text_match, False, 200) + if not matches: + return _err("not_found", "no localization candidates found", 404, {"found": False, "matches": []}) + if req.candidate_index >= len(matches): + raise HTTPException(status_code=400, detail="candidate_index is outside match results") + + chosen = matches[req.candidate_index] + bbox = chosen["bbox"] + x = bbox["x"] + max(1, bbox["width"] // 2) + y = bbox["y"] + max(1, bbox["height"] // 2) + _enforce_allowed_region(x, y) + resolved_target_id = _request_id() + RESOLVED_TARGETS[resolved_target_id] = { + "id": resolved_target_id, + "observation_id": req.observation_id, + "x": x, + "y": y, + "source": "ocr", + "match": chosen, + } + + return _ok( + { + "resolved_target_id": resolved_target_id, + "source": "ocr", + "localization_confidence": _localization_confidence("ocr", chosen.get("confidence")), + "pixel": {"x": x, "y": y}, + "selected_match": chosen, + "match_count": len(matches), + } + ) + + +@app.post("/v2/act") +def act_v2(req: ActRequestV2, screen: int = 0, _: None = Depends(_auth)): + legacy_action = 
_resolve_v2_action(req.action) + result = _exec_action(legacy_action, screen) + return _ok(result) + + +@app.post("/v2/act-verify") +def act_verify_v2(req: ActVerifyRequestV2, screen: int = 0, _: None = Depends(_auth)): + defaults = _risk_defaults(req.risk_level) + verify_req = VerifyActionRequest( + action=_resolve_v2_action(req.action), + condition=req.condition, + retries=defaults["retries"] if req.retries is None else req.retries, + timeout_ms=defaults["timeout_ms"] if req.timeout_ms is None else req.timeout_ms, + poll_interval_ms=defaults["poll_interval_ms"] if req.poll_interval_ms is None else req.poll_interval_ms, + retry_delay_ms=defaults["retry_delay_ms"] if req.retry_delay_ms is None else req.retry_delay_ms, + stop_on_action_error=req.stop_on_action_error, + ) + result = _run_verified_action(verify_req, screen) + payload = { + "risk_level": req.risk_level, + "defaults_applied": defaults, + **result, + } + if result.get("success", False): + return _ok(payload) + return _err("verification_failed", "action verification did not satisfy condition", 409, payload) + + @app.get("/health") def health(_: None = Depends(_auth)): - return { - "ok": True, - "service": "clickthrough", - "version": app.version, - "time_ms": _now_ms(), - "request_id": _request_id(), - "dry_run": SETTINGS["dry_run"], - "allowed_region": SETTINGS["allowed_region"], - "exec": { - "enabled": SETTINGS["exec_enabled"], - "secret_configured": bool(SETTINGS["exec_secret"]), - "default_shell": SETTINGS["exec_default_shell"], - "default_timeout_s": SETTINGS["exec_default_timeout_s"], - "max_timeout_s": SETTINGS["exec_max_timeout_s"], - }, - } + return _ok( + { + "service": "clickthrough", + "version": app.version, + "dry_run": SETTINGS["dry_run"], + "allowed_region": SETTINGS["allowed_region"], + "exec": { + "enabled": SETTINGS["exec_enabled"], + "secret_configured": bool(SETTINGS["exec_secret"]), + "default_shell": SETTINGS["exec_default_shell"], + "default_timeout_s": 
SETTINGS["exec_default_timeout_s"], + "max_timeout_s": SETTINGS["exec_max_timeout_s"], + }, + } + ) @app.get("/displays") def displays(_: None = Depends(_auth)): detected = _get_displays() - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "displays": detected, - "default_screen": 0, - } - - -@app.get("/screen") -def screen( - with_grid: bool = True, - grid_rows: int = SETTINGS["default_grid_rows"], - grid_cols: int = SETTINGS["default_grid_cols"], - include_labels: bool = True, - image_format: Literal["png", "jpeg"] = "png", - jpeg_quality: int = 85, - asImage: bool = False, - screen: int = 0, - _: None = Depends(_auth), -): - req = ScreenRequest( - with_grid=with_grid, - grid_rows=grid_rows, - grid_cols=grid_cols, - include_labels=include_labels, - image_format=image_format, - jpeg_quality=jpeg_quality, - ) - - base_img, mon, displays, screen_selection = _capture_screen(screen) - meta = {"region": mon, "screen": screen_selection, "displays": displays} - out_img = base_img - - if req.with_grid: - out_img, grid_meta = _draw_grid(base_img, mon["x"], mon["y"], req.grid_rows, req.grid_cols, req.include_labels) - meta.update(grid_meta) - - if asImage: - image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality) - media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png" - return Response(content=image_bytes, media_type=media_type) - - encoded = _encode_image(out_img, req.image_format, req.jpeg_quality) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "image": { - "format": req.image_format, - "base64": encoded, - "width": out_img.size[0], - "height": out_img.size[1], - }, - "meta": meta, - } - - -@app.post("/zoom") -def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)): - base_img, mon, displays, screen_selection = _capture_screen(screen) - - cx = req.center_x - mon["x"] - cy = req.center_y - mon["y"] - - half_w = req.width // 2 - 
half_h = req.height // 2 - - left = max(0, cx - half_w) - top = max(0, cy - half_h) - right = min(base_img.size[0], left + req.width) - bottom = min(base_img.size[1], top + req.height) - - crop = base_img.crop((left, top, right, bottom)) - - region_x = mon["x"] + left - region_y = mon["y"] + top - - meta = { - "source_monitor": mon, - "screen": screen_selection, - "displays": displays, - "region": { - "x": region_x, - "y": region_y, - "width": crop.size[0], - "height": crop.size[1], - }, - } - - out_img = crop - if req.with_grid: - out_img, grid_meta = _draw_grid(crop, region_x, region_y, req.grid_rows, req.grid_cols, req.include_labels) - meta.update(grid_meta) - - if asImage: - image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality) - media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png" - return Response(content=image_bytes, media_type=media_type) - - encoded = _encode_image(out_img, req.image_format, req.jpeg_quality) - - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "image": { - "format": req.image_format, - "base64": encoded, - "width": out_img.size[0], - "height": out_img.size[1], - }, - "meta": meta, - } - - -@app.post("/action") -def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)): - result = _exec_action(req, screen) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } + return _ok({"displays": detected, "default_screen": 0}) @app.post("/exec") @@ -1540,12 +1800,7 @@ def exec_command( raise HTTPException(status_code=401, detail="invalid exec secret") result = _exec_command(req) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } + return _ok(result) @app.get("/windows") @@ -1565,151 +1820,19 @@ def windows( visible_only=visible_only, ) matches = _list_windows(query) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "windows": 
matches, - "count": len(matches), - } + return _ok({"windows": matches, "count": len(matches)}) @app.post("/windows/action") def window_action(req: WindowActionRequest, _: None = Depends(_auth)): result = _apply_window_action(req) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } + return _ok(result) @app.post("/launch") def launch(req: LaunchRequest, _: None = Depends(_auth)): result = _launch_app(req) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } - - -@app.post("/wait") -def wait(req: WaitRequest, screen: int = 0, _: None = Depends(_auth)): - result = _wait_for_condition(req, screen) - return { - "ok": result.get("satisfied", False), - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } - - -@app.post("/vision/diff") -def vision_diff(req: VisionDiffRequest, screen: int = 0, _: None = Depends(_auth)): - result = _compute_visual_diff(req, screen) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } - - -@app.post("/vision/stability") -def vision_stability(req: VisionStabilityRequest, screen: int = 0, _: None = Depends(_auth)): - result = _measure_stability(req, screen) - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } - - -@app.post("/action/verify") -def action_verify(req: VerifyActionRequest, screen: int = 0, _: None = Depends(_auth)): - result = _run_verified_action(req, screen) - return { - "ok": result.get("success", False), - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": result, - } - - -@app.post("/ocr") -def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)): - image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen) - offset_x = region["x"] if source != "image" else 0 - offset_y = region["y"] if source != "image" else 0 - blocks = _run_ocr(image, 
req.language_hint, req.min_confidence, offset_x, offset_y) - - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": { - "mode": source, - "screen": screen_selection if source != "image" else None, - "display": mon if source != "image" else None, - "language_hint": req.language_hint, - "min_confidence": req.min_confidence, - "region": region, - "blocks": blocks, - }, - } - - -@app.post("/ocr/find") -def ocr_find(req: OCRFindRequest, screen: int = 0, _: None = Depends(_auth)): - image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen) - offset_x = region["x"] if source != "image" else 0 - offset_y = region["y"] if source != "image" else 0 - blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y) - matches = _find_text_matches(blocks, req.query, req.match, req.group_lines, req.max_results) - - return { - "ok": True, - "request_id": _request_id(), - "time_ms": _now_ms(), - "result": { - "mode": source, - "screen": screen_selection if source != "image" else None, - "display": mon if source != "image" else None, - "language_hint": req.language_hint, - "min_confidence": req.min_confidence, - "query": req.query, - "match": req.match, - "group_lines": req.group_lines, - "region": region, - "matches": matches, - "match_count": len(matches), - "blocks_considered": len(blocks), - }, - } - - -@app.post("/batch") -def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)): - results = [] - for index, item in enumerate(req.actions): - try: - item_result = _exec_action(item, screen) - results.append({"index": index, "ok": True, "result": item_result}) - except Exception as exc: - results.append({"index": index, "ok": False, "error": str(exc)}) - if req.stop_on_error: - break - - return { - "ok": all(r["ok"] for r in results), - "request_id": _request_id(), - "time_ms": _now_ms(), - "results": results, - } + return _ok(result) if __name__ == "__main__": diff --git 
a/skill/SKILL.md b/skill/SKILL.md index cc53f72..334befa 100644 --- a/skill/SKILL.md +++ b/skill/SKILL.md @@ -1,381 +1,97 @@ --- name: clickthrough-http-control -description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification. +description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops. --- -# Clickthrough HTTP Control +# Clickthrough HTTP Control (v2) -Use a strict observe-decide-act-verify loop. +Agents do not see live desktop video. They operate on snapshots. +Use this loop: **observe -> localize -> act -> verify**. -## Getting a computer instance (user-owned setup) +## Fast defaults -The **user/operator** is responsible for provisioning and exposing the target machine. -The agent should not assume it can self-install this stack. +- Start with `POST /v2/observe` on a tight region, not full screen. +- Set `ocr_mode` to `none` unless text is required immediately. +- Use `image` tool localization for icon-heavy or dense controls. +- Use `POST /v2/act-verify` instead of manual sleep/poll loops. -### What the user must do +## Mandatory image-tool click localization -1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`). -2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL. -3. Configure secrets on target machine: - - `CLICKTHROUGH_TOKEN` for general API auth - - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls -4. 
Share connection details with the agent through a secure channel: - - `base_url` - - `x-clickthrough-token` - - `x-clickthrough-exec-secret` (only when `/exec` is needed) +When OCR is weak or ambiguous, ask image tool for one coordinate in bounds. -### What the agent should do - -1. Validate connection with `GET /health` using provided headers. -2. Refuse `/exec` attempts when exec secret is missing/invalid. -3. Ask user for missing setup inputs instead of guessing infrastructure. - -## What the agent can actually see - -The agent does **not** inherently see the remote desktop. -Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision. - -That means: -- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly -- `POST /ocr` returns machine-readable text blocks when text extraction is enough -- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues -- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected - -Do not write or think as if the agent is directly watching the screen in real time. -Say what you actually have: screenshots, OCR output, and fresh verification captures. 
- -## Mini API map - -- `GET /health` → server status + safety flags -- `GET /displays` → detected displays in zero-based API order -- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`) -- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`) -- `GET /windows` → discover visible desktop windows and their handles/processes -- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window -- `POST /launch` → start an app/process without dropping to a shell -- `POST /wait?screen=0` → wait for text, window, or visual state changes -- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change -- `POST /vision/stability?screen=0` → measure short-interval visual stability -- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes -- `POST /ocr/find?screen=0` → search OCR output for matching text candidates -- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...) -- `POST /action/verify?screen=0` → execute one action plus structured success verification -- `POST /batch?screen=0` → sequential action list -- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header) - -### Display selection - -- Use `GET /displays` before operating on multi-monitor systems. -- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`. -- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates. -- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset. -- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets. -- Window rectangles from `GET /windows` are also in global desktop coordinates. 
Use them to sanity-check which monitor the app is really on before clicking. - -### OCR usage - -- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs). -- Use `mode=screen` for discovery, then `mode=region` for precision and speed. -- Use `language_hint` when known (for example `eng`) to improve consistency. -- Filter noise with `min_confidence` (start around `0.4` and tune per app). -- Treat OCR as one signal, not the only signal, before high-impact clicks. -- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed. -- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating. - -### Screenshot + `image` tool usage - -Use the OpenClaw `image` tool when OCR is not enough. -This is especially useful for: -- identifying which visible button looks like the primary confirm action -- understanding dialog layout or pane structure -- distinguishing similar nearby controls by icon, spacing, or emphasis -- checking whether a visual state changed after a click -- telling you where something is and where to click when text alone is not reliable - -Good pattern: -1. capture with `GET /screen` or `POST /zoom` -2. hand that screenshot to the `image` tool -3. ask a precise question about the visible UI -4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop -5. convert the answer into a concrete Clickthrough target -6. act once -7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly - -Prefer vision over guessing. -If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is. 
-The model should help answer things like: -- which visible button is the real primary action -- whether the target is left/right/top/bottom within the crop -- which of several similar buttons is the one to click -- an approximate click point inside the provided image bounds - -Ask narrow questions. -Good: -- "Which button in this dialog is the primary confirmation action?" -- "Is the scan still running, or does this look complete?" -- "Which of these tabs appears selected?" -- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds." -- "Which visible control says Stop Recording, and where should I click?" - -Bad: -- "What should I click?" -- "Use your eyes and do the task" -- anything that assumes the model has live continuity without a new screenshot -- requesting coordinates without telling the model the image bounds or expected output format - -### Header requirements - -- Always send `x-clickthrough-token` when token auth is enabled. -- For `/exec`, also send `x-clickthrough-exec-secret`. - -## `POST /action` request shape (important) - -`/action` always expects an `action` plus an optional `target` object. -Do **not** invent top-level `x` / `y` fields. 
- -Minimal pixel click: - -```json -{ - "action": "click", - "target": {"mode": "pixel", "x": 100, "y": 200}, - "button": "left", - "clicks": 1 -} -``` - -Minimal grid click: - -```json -{ - "action": "click", - "target": { - "mode": "grid", - "region_x": 0, - "region_y": 0, - "region_width": 1920, - "region_height": 1080, - "rows": 12, - "cols": 12, - "row": 6, - "col": 8, - "dx": 0.0, - "dy": 0.0 - } -} -``` - -Other canonical examples: - -```json -{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}} -{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}} -{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}} -{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500} -{"action": "type", "text": "hello world", "interval_ms": 20} -{"action": "hotkey", "keys": ["ctrl", "l"]} -``` +Prompt template: +- "Return one click point as JSON `{\"x\":,\"y\":}` inside this image (`width=W`, `height=H`) for the **** control." Rules: -- `dx` / `dy` belong inside `target`, not beside it. -- `type` and `hotkey` usually do not need a `target`. -- For pixel targets, `x` / `y` are global desktop coordinates. -- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used. +- Ask for one point only. +- Include bounds in the prompt. +- If answer is not parseable `x,y`, re-ask once with stricter format. +- Send returned point to `POST /v2/localize` via `image_tool_point`. -## When to use `/exec` +## API playbook -Prefer structured GUI control first: -- `/screen`, `/zoom`, `/ocr` to observe -- `/action` or `/batch` to interact +1. 
**Observe** -Use `/exec` only when it is the cleanest available tool for the job, for example: -- querying machine state that the GUI does not expose well -- performing an explicit user-requested shell/system task -- recovering from a blocked GUI flow when normal interaction failed +```json +POST /v2/observe?screen=0 +{ + "mode": "region", + "region_x": 820, + "region_y": 420, + "region_width": 700, + "region_height": 420, + "include_image": true, + "ocr_mode": "none" +} +``` -Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`. -Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly. -When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely. +2. **Localize** (choose one) -## Core workflow (mandatory) +Text: +```json +POST /v2/localize +{"observation_id":"...","text_query":"Save","text_match":"exact"} +``` -1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting. -2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display. -3. Identify likely target region and compute an initial confidence score. -4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate. -5. **Before any click**, verify target identity (OCR text/icon/location consistency). -6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough. -7. Execute one minimal action via `POST /action`. -8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change. -9. Repeat until objective is complete. 
+Image-tool point: +```json +POST /v2/localize +{"observation_id":"...","image_tool_point":{"x":312,"y":188}} +``` -## Verify-before-click rules +3. **Act** -- Never click if target identity is ambiguous. -- Require at least two matching signals before click. -- Good signal pairs include: - - OCR text + expected UI region - - OCR text + matching button shape/icon nearby - - dialog title text + expected button position within that dialog - - known app/window focus + expected control location - - OCR candidate + vision-model localization inside the same crop -- If confidence is low, do not "test click"; zoom and re-localize first. -- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question. -- For high-impact actions (close/delete/send/purchase), use two-phase flow: - 1) preview intended coordinate + reason - 2) execute only after explicit confirmation. +```json +POST /v2/act?screen=0 +{"action":{"action":"click","target":{"resolved_target_id":"..."}}} +``` -## Precision rules +4. **Verify** -- Prefer grid targets first, then use `dx/dy` for subcell precision. -- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed. -- Use zoom before guessing offsets. -- Avoid stale coordinates: re-capture before action if UI moved/scrolled. +```json +POST /v2/act-verify?screen=0 +{ + "action":{"action":"click","target":{"resolved_target_id":"..."}}, + "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420}, + "risk_level":"low" +} +``` -## Safety rules +## Risk policy -- Respect `dry_run` and `allowed_region` restrictions from `/health`. -- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`). -- Avoid destructive shortcuts unless explicitly requested. -- Send one action at a time unless deterministic; then use `/batch`. +- Low risk (navigation, focus, benign clicks): single verification signal. 
+- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act. +- Never do speculative repeat clicks; switch strategy after one failed verify. -## Reliability rules +## Anti-latency rules -- After every meaningful action, verify with a fresh screenshot. -- On mismatch, do not spam clicks: zoom, re-localize, and retry once. -- Prefer short, reversible actions over long macros. -- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click. +- Never repeat full-screen OCR by default. +- Re-observe only the active pane/region. +- Prefer keyboard + window APIs for app switching. +- Use OCR on region only and cap area with `max_ocr_area_px`. -## Fallback ladder for uncertain targeting +## Setup and auth -1. Full-screen capture with a coarse grid. -2. Zoom into the candidate area with a denser grid. -3. OCR the full screen or the tighter region. -4. Re-anchor on a more reliable nearby control, title, or label. -5. Try a keyboard-first flow if the app supports it. -6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner. - -Do not skip from "uncertain click" straight to random retries. - -## Concrete screenshot -> `image` -> action example - -Example loop: -1. `GET /screen?screen=0` to capture the current app state -2. if the UI is text-heavy, try `POST /ocr` first -3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like: - - "In this save dialog, which visible button is the primary action?" - - "Is there a dismiss/close button in the top-right of this modal?" -4. map the answer back to a Clickthrough target using the returned grid/region metadata -5. click once with `POST /action` -6. recapture the screen -7. optionally use `POST /wait` or another `image`/OCR check to confirm the result - -The key rule is simple: screenshot first, interpret second, click third, verify fourth. 
-Do not collapse those steps into fake certainty. -When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes. - -## App-specific playbooks (recommended) - -Build per-app routines for repetitive tasks instead of generic clicking. - -### Launcher / search / start app playbook - -Use this when the goal is "open app X" or "bring up tool Y". - -1. check `GET /windows` first in case the app is already open -2. if present, use `POST /windows/action` to focus or restore it -3. if absent, prefer `POST /launch` when you know the executable path -4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow: - - open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host) - - type exact app name - - wait for stable results with `POST /wait` or recapture - - verify the result text with OCR or the `image` tool - - press Enter or click the exact result once -5. verify the app window now exists or is focused - -Do not keep relaunching if the window already exists; that’s sloppy. - -### Dialog confirmation playbook - -Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs. - -1. capture the dialog region with `POST /zoom` -2. use OCR first for title/body/button labels -3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool -4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.) -5. for destructive actions, require explicit user confirmation unless already requested -6. click once and verify the dialog disappeared or changed state - -Good verification targets: -- dialog title vanished -- expected next window appeared -- destructive side effect is visible and confirmed - -### File picker playbook - -Use for open/save dialogs. - -1. verify the file picker window is focused -2. OCR the visible breadcrumb/path area, filename field, and button row -3. 
prefer keyboard-first entry when possible: - - type or paste the target path/name into the focused field - - use `tab` / `shift+tab` to move predictably between filename and action buttons -4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row -5. verify the intended filename/path is visible before confirming -6. activate `Open` / `Save` once and verify the picker closes - -If the picker stays open, stop and inspect why instead of hammering Enter like a maniac. - -### Browser tab / window playbook - -Use for browser navigation, tab targeting, or web app recovery. - -1. use `GET /windows` to focus the correct browser window first -2. prefer keyboard-first navigation: - - `ctrl+l` / `cmd+l` to focus the address bar - - `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known - - `ctrl+w` only for explicitly requested close actions -3. verify tab or page identity with OCR on the tab strip or page heading -4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs -5. after navigation, wait for visual stability or expected text before taking the next action -6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters - -Do not assume a page loaded just because the click landed. Verify it. - -### Settings / preferences navigation playbook - -Use when the task involves toggles, dropdowns, sidebars, or nested settings panels. - -1. identify the current settings page with OCR on the heading/sidebar -2. use OCR to find the specific section label before trying to toggle anything -3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls -4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time -5. 
after each change, verify the control state changed visually or via visible text -6. if a save/apply button exists, treat it as a separate confirmation step and verify completion - -Settings UIs love hiding side effects. Assume nothing. - -### Dense app / control-strip playbook - -Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters. - -1. focus the exact app window with `POST /windows/action` -2. capture the full target display once to confirm the window is actually frontmost -3. crop tightly around the suspected control strip with `POST /zoom` -4. run OCR on the crop, not the full screen -5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons -6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.) - -Do not trust OCR taken from the wrong frontmost window. It will happily waste your time. - -### Spotify playbook - -- Focus app window before search/navigation. -- Prefer keyboard-first flow for song start: - 1) `Ctrl+L` (search) - 2) type exact query - 3) Enter - 4) verify exact song+artist text - 5) click/double-click row - 6) verify now-playing bar -- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows. +- Include `x-clickthrough-token` when token auth is enabled. +- `/exec` additionally requires `x-clickthrough-exec-secret`. +- Validate server first: `GET /health`.