feat: migrate to v2-only API and unified response envelope

2026-05-03 19:11:11 +02:00
parent 2585bc3a7c
commit aced5be25e
5 changed files with 603 additions and 1267 deletions
--- a/README.md
+++ b/README.md
@@ -1,22 +1,25 @@
 # Clickthrough
-Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions.
+Let an agent interact with a computer over HTTP.
 ## Primary mode (v2)
 Use the v2 contract for faster, less OCR-heavy control loops:
 - `POST /v2/observe`
 - `POST /v2/localize`
 - `POST /v2/act`
 - `POST /v2/act-verify`
 This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.
 ## What this provides
- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
+- Screen/region capture with optional OCR and timing stats
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
+- Observation IDs for deterministic follow-up localization
- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
+- Text localization and image-tool coordinate localization
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
+- Action execution with resolved target IDs
- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
+- Risk-aware action+verification defaults
- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
+- Unified response envelope across all endpoints
 - **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
 - **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability`
 - **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
 - **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
 - **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
 - **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
 ## Quick start
@@ -30,53 +33,17 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
 Server defaults to `127.0.0.1:8123`.
-For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird.
+## Fast control loop
-`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
+1. `POST /v2/observe` on a tight region
 2. If OCR is enough, `POST /v2/localize` with `text_query`
 3. If the text match is ambiguous, ask the image tool for a single x,y point within the observation bounds
 4. `POST /v2/localize` with `image_tool_point`
 5. `POST /v2/act` or `POST /v2/act-verify`
 6. Re-observe only the changed region
-## Minimal API flow
+## See docs
 1. `GET /displays` if you need a non-primary monitor
 2. `GET /screen?screen=0` with grid
 3. Decide cell / target
 4. Optional `POST /zoom?screen=0` for finer targeting
 5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow)
 6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find`
 Important:
 - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
 - Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
 - The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
 - When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
 - Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
 See:
 - `docs/API.md`
 - `docs/coordinate-system.md`
 - `skill/SKILL.md`
-
+- `docs/coordinate-system.md`
 ## Configuration
 Environment variables:
 - `CLICKTHROUGH_HOST` (default `127.0.0.1`)
 - `CLICKTHROUGH_PORT` (default `8123`)
 - `CLICKTHROUGH_TOKEN` (optional; if set, require `x-clickthrough-token` header)
 - `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
 - `CLICKTHROUGH_GRID_ROWS` (default `12`)
 - `CLICKTHROUGH_GRID_COLS` (default `12`)
 - `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
 - `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
 - `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
 - `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
 - `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
 - `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
 - `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
 - `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
 Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.
 ## Gitea CI
 A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
 It runs Python syntax checks (`py_compile`) on every push and pull request.
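The fast control loop described in the README changes above can be sketched as a small client helper. This is a minimal sketch, not the project's client: `click_text` and the `post` callable are hypothetical names, and `post` is assumed to send a JSON body to the given path and return the `data` payload of a successful envelope. The request fields mirror the v2 request models added in this commit.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes (path, json_body) and returns the `data`
# payload of a successful response envelope.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def click_text(post: Post, region: Dict[str, int], text: str) -> Dict[str, Any]:
    """Observe a tight region, localize `text` via OCR, then click it."""
    observation = post("/v2/observe", {
        "mode": "region",
        "region_x": region["x"],
        "region_y": region["y"],
        "region_width": region["width"],
        "region_height": region["height"],
        "include_image": False,  # OCR only, skip the image payload
        "ocr_mode": "region",
    })
    located = post("/v2/localize", {
        "observation_id": observation["observation_id"],
        "text_query": text,
        "text_match": "exact",
    })
    return post("/v2/act", {
        "action": {
            "action": "click",
            "target": {"resolved_target_id": located["resolved_target_id"]},
        },
    })
```

Injecting the transport keeps the sketch independent of any HTTP library. In practice `post` could wrap `requests.post` and add the `x-clickthrough-token` header.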
--- a/docs/API.md
+++ b/docs/API.md
@@ -1,614 +1,141 @@
-# API Reference (v0.1)
+# API Reference (v2)
 Base URL: `http://127.0.0.1:8123`
-If `CLICKTHROUGH_TOKEN` is set, include header:
+If `CLICKTHROUGH_TOKEN` is set, include:
 ```http
 x-clickthrough-token: <token>
 ```
-## `GET /health`
+## Endpoints
-Returns status and runtime safety flags, including `exec` capability config.
+- `POST /v2/observe`
 - `POST /v2/localize`
 - `POST /v2/act`
 - `POST /v2/act-verify`
 - `GET /health`
 - `GET /displays`
 - `GET /windows`
 - `POST /windows/action`
 - `POST /launch`
 - `POST /exec`
-## `GET /displays`
+No v1 endpoints are supported.
-Returns detected displays in API screen order.
+## `POST /v2/observe`
 ```json
 {
  "ok": true,
  "default_screen": 0,
  "displays": [
    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
  ]
 }
 ```
 `screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
 Invalid `screen` values fall back to `0`.
 ## `GET /screen`
 Query params:
 - `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
 - `with_grid` (bool, default `true`)
 - `grid_rows` (int, default env or `12`)
 - `grid_cols` (int, default env or `12`)
 - `include_labels` (bool, default `true`)
 - `image_format` (`png`|`jpeg`, default `png`)
 - `jpeg_quality` (1-100, default `85`)
 - `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
 `meta.region` uses global desktop coordinates.
 These image-returning endpoints do not magically grant the agent live vision.
 If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
 ## `POST /zoom`
 Body:
 ```json
 {
  "center_x": 1200,
  "center_y": 700,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
 }
 ```
 Query params:
 - `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
 - `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 `POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
 ## `POST /action`
 Body: one action.
 Important:
 - the request body uses `action` plus an optional `target`
 - pixel coordinates live inside `target` when `target.mode="pixel"`
 - do **not** send top-level `x` / `y` fields
 Query params:
 - `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
 Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
 ### Pointer target modes
 #### Pixel target
 ```json
 {
  "mode": "pixel",
  "x": 100,
  "y": 200,
  "dx": 0,
  "dy": 0
 }
 ```
 #### Grid target
 ```json
 {
  "mode": "grid",
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "rows": 12,
  "cols": 12,
  "row": 5,
  "col": 9,
  "dx": 0.0,
  "dy": 0.0
 }
 ```
 `dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
 ### Action examples
 Click:
 ```json
 {
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 7,
    "col": 3,
    "dx": 0.2,
    "dy": -0.1
  },
  "clicks": 1,
  "button": "left"
 }
 ```
 Scroll:
 ```json
 {
  "action": "scroll",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "scroll_amount": -500
 }
 ```
 Type text:
 ```json
 {
  "action": "type",
  "text": "hello world",
  "interval_ms": 20
 }
 ```
 Hotkey:
 ```json
 {
  "action": "hotkey",
  "keys": ["ctrl", "l"]
 }
 ```
 Right click:
 ```json
 {
  "action": "right_click",
  "target": {"mode": "pixel", "x": 1300, "y": 740}
 }
 ```
 Move only:
 ```json
 {
  "action": "move",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "duration_ms": 150
 }
 ```
 ## `GET /windows`
 List desktop windows using structured filters instead of shelling out.
 Query params:
 - `title_contains` (optional substring match)
 - `title_regex` (optional case-insensitive regex)
 - `process_name` (optional exact process name, e.g. `explorer.exe`)
 - `hwnd` (optional exact window handle)
 - `visible_only` (bool, default `true`)
 ```json
 {
  "ok": true,
  "count": 1,
  "windows": [
    {
      "hwnd": 132640,
      "title": "WinDirStat",
      "class_name": "WinDirStatMainWindow",
      "pid": 18420,
      "process_name": "windirstat.exe",
      "visible": true,
      "enabled": true,
      "minimized": false,
      "maximized": false,
      "foreground": true,
      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
    }
  ]
 }
 ```
 Notes:
 - Currently supported on Windows hosts only.
 - Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
 ## `POST /windows/action`
 Perform a structured window action against exactly one matched window.
 ```json
 {
  "action": "focus",
  "title_contains": "WinDirStat",
  "visible_only": true,
  "timeout_ms": 3000
 }
 ```
 Supported actions:
 - `focus`
 - `restore`
 - `minimize`
 - `maximize`
 - `close`
 The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).
 ## `POST /launch`
 Start an app/process without invoking a shell.
 ```json
 {
  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
  "args": [],
  "cwd": "C:/Program Files/WinDirStat",
  "wait_for_window": true,
  "match": {
    "title_contains": "WinDirStat",
    "visible_only": true
  },
  "timeout_ms": 8000
 }
 ```
 Notes:
 - Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
 - If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
 - `dry_run=true` returns the resolved argv/cwd without launching.
 ## `POST /vision/diff`
 Measure whether a screen region changed meaningfully between two captures.
 Query params:
 - `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
 Compare live captures:
 ```json
 {
  "mode": "region",
-  "region_x": 120,
+  "region_x": 800,
-  "region_y": 80,
+  "region_y": 420,
-  "region_width": 600,
+  "region_width": 700,
-  "region_height": 300,
+  "region_height": 420,
-  "delay_ms": 400,
+  "include_image": true,
-  "diff_threshold": 0.01
+  "image_format": "jpeg",
-}
+  "jpeg_quality": 75,
-```
+  "ocr_mode": "region",
 Compare provided images:
 ```json
 {
  "mode": "image",
  "before_image_base64": "iVBORw0KGgoAAA...",
  "after_image_base64": "iVBORw0KGgoBBB...",
  "diff_threshold": 0.01
 }
 ```
 Response includes:
 - `diff_ratio` — average normalized pixel difference
 - `changed` — whether `diff_ratio >= diff_threshold`
 - `region` — compared region
 ## `POST /vision/stability`
 Measure whether a screen region stays visually stable over a short interval.
 Query params:
 - `screen` (int, default `0`)
 ```json
 {
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "sample_interval_ms": 250,
  "duration_ms": 1200,
  "diff_threshold": 0.005
 }
 ```
 Response includes:
 - `stable`
 - `sample_count`
 - `max_diff_ratio`
 - `avg_diff_ratio`
 ## `POST /wait`
 Wait on a structured UI condition instead of guessing sleep durations.
 Query params:
 - `screen` (int, default `0`) - used for text and visual waits
 ### Wait for text to appear
 ```json
 {
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Scan complete",
    "match": "contains",
    "present": true,
  "language_hint": "eng",
-    "min_confidence": 0.4
+  "min_confidence": 0.45,
-  },
+  "max_ocr_area_px": 1500000,
-  "timeout_ms": 15000,
+  "group_lines": true
  "poll_interval_ms": 400
 }
 ```
-### Wait for a window state
+Returns observation metadata, optional image, OCR blocks/lines, and timing fields.
 ## `POST /v2/localize`
 Text localization:
 ```json
 {
-  "condition": {
+  "observation_id": "...",
-    "kind": "window",
+  "text_query": "Save",
-    "title_contains": "WinDirStat",
+  "text_match": "exact",
-    "visible_only": true,
+  "candidate_index": 0
    "state": "focused"
  },
  "timeout_ms": 5000,
  "poll_interval_ms": 200
 }
 ```
-Window states:
+Image-tool point localization:
 - `exists`
 - `focused`
 - `closed`
 ### Wait for visual change or stability
 ```json
 {
-  "condition": {
+  "observation_id": "...",
-    "kind": "visual",
+  "image_tool_point": {"x": 312, "y": 188}
    "state": "stable",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "diff_threshold": 0.005,
    "stable_for_ms": 1000
  },
  "timeout_ms": 12000,
  "poll_interval_ms": 300
 }
 ```
-Visual states:
+Returns `resolved_target_id`, the global pixel coordinates, and `localization_confidence`.
 - `change` — succeeds when the average pixel diff crosses `diff_threshold`
 - `stable` — succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`
-Notes:
+## `POST /v2/act`
 - Text waits reuse the OCR pipeline and return matching OCR blocks on success.
 - Window waits build on the structured window discovery endpoint.
 - Visual waits compare repeated captures of either the full selected display or an explicit region.
 ## `POST /action/verify`
 Execute one action and wait for a structured success condition.
 Query params:
 - `screen` (int, default `0`)
 ```json
 {
  "action": {
    "action": "click",
-    "target": {"mode": "pixel", "x": 1300, "y": 740}
+    "target": {"resolved_target_id": "..."},
    "button": "left",
    "clicks": 1
  }
 }
 ```
 ## `POST /v2/act-verify`
 ```json
 {
  "action": {
    "action": "click",
    "target": {"resolved_target_id": "..."}
  },
  "condition": {
    "kind": "text",
-    "mode": "screen",
+    "mode": "region",
-    "text": "Settings",
+    "text": "Saved",
    "match": "contains",
    "present": true,
-    "language_hint": "eng",
+    "region_x": 820,
    "region_y": 420,
    "region_width": 500,
    "region_height": 140,
    "min_confidence": 0.4
  },
-  "retries": 1,
+  "risk_level": "low"
  "timeout_ms": 4000,
  "poll_interval_ms": 250,
  "retry_delay_ms": 250
 }
 ```
-Condition kinds mirror `POST /wait`:
+Risk defaults:
- `text`
+- `low`: retries `0`, timeout `2500ms`
- `window`
+- `high`: retries `1`, timeout `6000ms`
 - `visual`
-The response returns per-attempt action output plus structured verification output.
+## Response envelope
-## `POST /ocr`
+Success:
 Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
 Query params:
 - `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
 Body:
 ```json
 {
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
 }
 ```
 Modes:
 - `screen` (default): OCR over full selected monitor
 - `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
 - `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
 Region mode example:
 ```json
 {
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
 }
 ```
 Image mode example:
 ```json
 {
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
 }
 ```
 Response shape:
 ```json
 {
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
-  "result": {
+  "data": { },
-    "mode": "screen",
+  "error": null
-    "language_hint": "eng",
+}
-    "min_confidence": 0.4,
+```
-    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
+
-    "blocks": [
+Error:
-      {
+
-        "text": "Settings",
+```json
-        "confidence": 0.9821,
+{
-        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
+  "ok": false,
-      }
+  "request_id": "...",
-    ]
+  "time_ms": 1710000000000,
  "data": null,
  "error": {
    "code": "http_error",
    "message": "...",
    "details": {}
  }
 }
 ```
 Notes:
 - Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
 - `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
 - Requires `tesseract` executable plus Python package `pytesseract`.
 - If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
 ## `POST /ocr/find`
 Search OCR output for matching text instead of post-processing raw OCR blocks client-side.
 Query params:
 - `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
 ```json
 {
  "mode": "screen",
  "query": "Settings",
  "match": "contains",
  "group_lines": true,
  "max_results": 10,
  "language_hint": "eng",
  "min_confidence": 0.4
 }
 ```
 Modes:
 - `screen`
 - `region`
 - `image`
 Options:
 - `match`: `contains`, `exact`, or `regex`
 - `group_lines=true`: combine nearby OCR words into line-level candidates before matching
 - `max_results`: result cap after confidence sorting
 Response includes:
 - `matches` — confidence-sorted candidate matches
 - `match_count`
 - `blocks_considered`
 ## `POST /exec`
 Execute a shell command on the host running Clickthrough.
 Requirements:
 - `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
 - send header `x-clickthrough-exec-secret: <secret>`
 ```json
 {
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
 }
 ```
 Notes:
 - `shell` supports `powershell`, `bash`, `cmd`
 - if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
 - output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
 - endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
 - if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)
 Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
 ## `POST /batch`
 Runs multiple `action` payloads sequentially.
 Query params:
 - `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
 ```json
 {
  "actions": [
    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
  ],
  "stop_on_error": true
 }
 ```
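Because every endpoint now returns the same envelope (`ok`, `request_id`, `time_ms`, `data`, `error`), a client can unwrap responses in one place. The following is a minimal sketch under that assumption; `EnvelopeError` and `unwrap` are hypothetical names, not part of the project.

```python
from typing import Any, Dict


class EnvelopeError(Exception):
    """Raised when a response envelope reports ok=false."""

    def __init__(self, code: str, message: str, details: Any = None,
                 request_id: Any = None) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details
        self.request_id = request_id


def unwrap(envelope: Dict[str, Any]) -> Any:
    """Return `data` on success, raise EnvelopeError otherwise."""
    if envelope.get("ok") is True:
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise EnvelopeError(
        code=str(error.get("code", "unknown")),
        message=str(error.get("message", "")),
        details=error.get("details"),
        request_id=envelope.get("request_id"),
    )
```

Keeping `request_id` on the exception makes it easier to correlate a client-side failure with server logs.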
--- a/examples/quickstart.py
+++ b/examples/quickstart.py
@@ -13,23 +13,26 @@ if TOKEN:
 def main():
-    r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
+    health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
-    r.raise_for_status()
+    health.raise_for_status()
-    print("health:", r.json())
+    print("health ok:", health.json().get("ok"))
-    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
+    observe = requests.post(
-    d.raise_for_status()
+        f"{BASE_URL}/v2/observe",
    print("displays:", d.json().get("displays", []))
    s = requests.get(
        f"{BASE_URL}/screen",
        headers=headers,
-        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
+        params={"screen": SCREEN},
-        timeout=30,
+        json={
            "mode": "screen",
            "include_image": False,
            "ocr_mode": "none",
        },
        timeout=20,
    )
-    s.raise_for_status()
+    observe.raise_for_status()
-    payload = s.json()
+    payload = observe.json()["data"]
-    print("screen meta:", payload.get("meta", {}))
+    print("observation_id:", payload["observation_id"])
    print("region:", payload["region"])
    print("timing_ms:", payload["timing_ms"])
 if __name__ == "__main__":
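The quickstart prints the observation's `region`. A point reported by an image tool is relative to the observation image, so it has to be shifted by the region origin to become a global desktop coordinate. The sketch below mirrors the bounds check and offset performed by the `/v2/localize` handler in this commit; `image_point_to_global` is a hypothetical client-side helper, and it assumes the image is captured at a 1:1 scale with the region.

```python
from typing import Dict, Tuple


def image_point_to_global(region: Dict[str, int], image_width: int,
                          image_height: int, x: int, y: int) -> Tuple[int, int]:
    """Map an image-local point to global desktop pixels."""
    # Reject points outside the observation image, as the server does.
    if x < 0 or y < 0 or x >= image_width or y >= image_height:
        raise ValueError("point outside observation image bounds")
    # Global coordinate = region origin + image-local offset.
    return region["x"] + x, region["y"] + y
```

Checking this locally before calling `/v2/localize` avoids a round trip for points that the server would reject anyway.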
--- a/server/app.py
+++ b/server/app.py
@@ -8,10 +8,12 @@ import subprocess
 import sys
 import time
 import uuid
-from typing import Literal, Optional
+from typing import Any, Literal, Optional
 from dotenv import load_dotenv
-from fastapi import Depends, FastAPI, Header, HTTPException, Response
+from fastapi import Depends, FastAPI, Header, HTTPException, Request
 from fastapi.exceptions import RequestValidationError
 from fastapi.responses import JSONResponse
 from PIL import ImageChops, ImageStat
 from pydantic import BaseModel, Field, model_validator
@@ -21,6 +23,55 @@ load_dotenv(dotenv_path=".env", override=False)
 app = FastAPI(title="clickthrough", version="0.1.0")
 def _ok(data: Any, status_code: int = 200):
    return JSONResponse(
        status_code=status_code,
        content={
            "ok": True,
            "request_id": _request_id(),
            "time_ms": _now_ms(),
            "data": data,
            "error": None,
        },
    )
 def _err(code: str, message: str, status_code: int, details: Any = None):
    return JSONResponse(
        status_code=status_code,
        content={
            "ok": False,
            "request_id": _request_id(),
            "time_ms": _now_ms(),
            "data": None,
            "error": {
                "code": code,
                "message": message,
                "details": details,
            },
        },
    )
 @app.exception_handler(HTTPException)
 async def _http_exception_handler(_: Request, exc: HTTPException):
    detail = exc.detail
    if isinstance(detail, dict):
        message = str(detail.get("message", "request failed"))
        return _err("http_error", message, exc.status_code, detail)
    return _err("http_error", str(detail), exc.status_code)
 @app.exception_handler(Exception)
 async def _unhandled_exception_handler(_: Request, exc: Exception):
    return _err("internal_error", "internal server error", 500, {"type": type(exc).__name__})
 @app.exception_handler(RequestValidationError)
 async def _validation_exception_handler(_: Request, exc: RequestValidationError):
    return _err("validation_error", "request validation failed", 422, exc.errors())
 def _env_bool(name: str, default: bool) -> bool:
    raw = os.getenv(name)
    if raw is None:
@@ -288,6 +339,144 @@ class VerifyActionRequest(BaseModel):
    stop_on_action_error: bool = True
 class ObserveRequestV2(BaseModel):
    mode: Literal["screen", "region"] = "screen"
    region_x: int | None = Field(default=None, ge=0)
    region_y: int | None = Field(default=None, ge=0)
    region_width: int | None = Field(default=None, gt=0)
    region_height: int | None = Field(default=None, gt=0)
    include_image: bool = True
    image_format: Literal["png", "jpeg"] = "jpeg"
    jpeg_quality: int = Field(default=75, ge=1, le=100)
    ocr_mode: Literal["none", "region", "screen"] = "none"
    language_hint: str | None = Field(default=None, min_length=1, max_length=64)
    min_confidence: float = Field(default=0.4, ge=0.0, le=1.0)
    max_ocr_area_px: int | None = Field(default=1_500_000, ge=1000)
    group_lines: bool = True
    @model_validator(mode="after")
    def _validate_region(self):
        if self.mode == "region":
            required = [self.region_x, self.region_y, self.region_width, self.region_height]
            if any(v is None for v in required):
                raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
        return self
 class ImageToolPoint(BaseModel):
    x: int = Field(ge=0)
    y: int = Field(ge=0)
 class LocalizeRequestV2(BaseModel):
    observation_id: str = Field(min_length=1, max_length=128)
    text_query: str | None = Field(default=None, max_length=512)
    text_match: Literal["contains", "exact", "regex"] = "contains"
    image_tool_point: ImageToolPoint | None = None
    candidate_index: int = Field(default=0, ge=0)
    @model_validator(mode="after")
    def _validate_selector(self):
        has_text = bool((self.text_query or "").strip())
        has_point = self.image_tool_point is not None
        if has_text == has_point:
            raise ValueError("provide exactly one of text_query or image_tool_point")
        return self
 class ActionTargetV2(BaseModel):
    resolved_target_id: str | None = Field(default=None, max_length=128)
    pixel_x: int | None = None
    pixel_y: int | None = None
    @model_validator(mode="after")
    def _validate_shape(self):
        has_resolved = bool(self.resolved_target_id)
        has_pixel = self.pixel_x is not None or self.pixel_y is not None
        if has_resolved == has_pixel:
            raise ValueError("provide exactly one of resolved_target_id or pixel_x/pixel_y")
        if has_pixel and (self.pixel_x is None or self.pixel_y is None):
            raise ValueError("pixel_x and pixel_y are both required")
        return self
 class ActionRequestV2(BaseModel):
    action: Literal[
        "move",
        "click",
        "right_click",
        "double_click",
        "middle_click",
        "scroll",
        "type",
        "hotkey",
    ]
    target: ActionTargetV2 | None = None
    duration_ms: int = Field(default=0, ge=0, le=20000)
    button: Literal["left", "right", "middle"] = "left"
    clicks: int = Field(default=1, ge=1, le=10)
    scroll_amount: int = 0
    text: str = ""
    keys: list[str] = Field(default_factory=list)
    interval_ms: int = Field(default=20, ge=0, le=5000)
    dry_run: bool = False
 class ActRequestV2(BaseModel):
    action: ActionRequestV2
 class ActVerifyRequestV2(BaseModel):
    action: ActionRequestV2
    condition: WaitTextCondition | WaitWindowCondition | WaitVisualCondition
    risk_level: Literal["low", "high"] = "low"
    retries: int | None = Field(default=None, ge=0, le=10)
    timeout_ms: int | None = Field(default=None, ge=0, le=120000)
    poll_interval_ms: int | None = Field(default=None, ge=50, le=10000)
    retry_delay_ms: int | None = Field(default=None, ge=0, le=60000)
    stop_on_action_error: bool = True
 OBSERVATIONS: dict[str, dict[str, Any]] = {}
 RESOLVED_TARGETS: dict[str, dict[str, Any]] = {}
 def _get_observation(observation_id: str) -> dict[str, Any]:
    observation = OBSERVATIONS.get(observation_id)
    if observation is None:
        raise HTTPException(status_code=404, detail="observation_id not found")
    return observation
 def _resolve_v2_action(req: ActionRequestV2) -> ActionRequest:
    target: Target | None = None
    if req.target is not None:
        if req.target.resolved_target_id:
            item = RESOLVED_TARGETS.get(req.target.resolved_target_id)
            if item is None:
                raise HTTPException(status_code=404, detail="resolved_target_id not found")
            target = PixelTarget(mode="pixel", x=item["x"], y=item["y"], dx=0, dy=0)
        else:
            target = PixelTarget(mode="pixel", x=req.target.pixel_x or 0, y=req.target.pixel_y or 0, dx=0, dy=0)
    return ActionRequest(
        action=req.action,
        target=target,
        duration_ms=req.duration_ms,
        button=req.button,
        clicks=req.clicks,
        scroll_amount=req.scroll_amount,
        text=req.text,
        keys=req.keys,
        interval_ms=req.interval_ms,
        dry_run=req.dry_run,
    )
 def _risk_defaults(risk_level: str) -> dict[str, int]:
    if risk_level == "high":
        return {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300}
    return {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150}
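`ActVerifyRequestV2` leaves `retries`, `timeout_ms`, `poll_interval_ms`, and `retry_delay_ms` as `None` by default, which suggests that explicit values override the risk defaults. The handler that combines them is not shown in this part of the diff, so the merge rule below is an assumption; `effective_verify_settings` is a hypothetical helper that uses the same default values as `_risk_defaults`.

```python
from typing import Dict, Optional

# Same values as _risk_defaults in this commit.
RISK_DEFAULTS: Dict[str, Dict[str, int]] = {
    "low": {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150},
    "high": {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300},
}


def effective_verify_settings(risk_level: str,
                              **overrides: Optional[int]) -> Dict[str, int]:
    """Start from the risk defaults, then apply any non-None override."""
    # Any level other than "high" falls back to the "low" defaults.
    settings = dict(RISK_DEFAULTS["high" if risk_level == "high" else "low"])
    for key, value in overrides.items():
        if key not in settings:
            raise KeyError(key)
        # Compare against None so an explicit 0 (e.g. retries=0) still applies.
        if value is not None:
            settings[key] = value
    return settings
```

Testing against `None` rather than truthiness matters here, because `retries=0` and `timeout_ms=0` are valid values under the field constraints.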
 def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
    token = SETTINGS["token"]
@@ -1377,14 +1566,208 @@ def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
    }
 def _localization_confidence(source: str, confidence: float | None = None) -> str:
    if source == "image_tool_point":
        return "high"
    if source == "ocr" and confidence is not None:
        if confidence >= 0.8:
            return "high"
        if confidence >= 0.55:
            return "medium"
    return "low"
 @app.post("/v2/observe")
 def observe_v2(req: ObserveRequestV2, screen: int = 0, _: None = Depends(_auth)):
    capture_started = time.perf_counter()
    image, region, mon, displays, screen_selection = _capture_region_image(
        screen,
        req.region_x if req.mode == "region" else None,
        req.region_y if req.mode == "region" else None,
        req.region_width if req.mode == "region" else None,
        req.region_height if req.mode == "region" else None,
    )
    capture_ms = int((time.perf_counter() - capture_started) * 1000)
    encoded = None
    if req.include_image:
        encoded = _encode_image(image, req.image_format, req.jpeg_quality)
    ocr_started = time.perf_counter()
    blocks: list[dict] = []
    grouped_lines: list[dict] = []
    ocr_applied_mode = "none"
    if req.ocr_mode != "none":
        if req.ocr_mode == "screen":
            ocr_image, ocr_region, _, _, _ = _capture_region_image(screen, None, None, None, None)
        else:
            ocr_image, ocr_region = image, region
        area = ocr_region["width"] * ocr_region["height"]
        if req.max_ocr_area_px is not None and area > req.max_ocr_area_px:
            raise HTTPException(
                status_code=400,
                detail=f"ocr area {area} exceeds max_ocr_area_px {req.max_ocr_area_px}",
            )
        blocks = _run_ocr(
            ocr_image,
            req.language_hint,
            req.min_confidence,
            ocr_region["x"],
            ocr_region["y"],
        )
        if req.group_lines:
            grouped_lines = _group_ocr_lines(blocks)
        ocr_applied_mode = req.ocr_mode
    ocr_ms = int((time.perf_counter() - ocr_started) * 1000)
    observation_id = _request_id()
    OBSERVATIONS[observation_id] = {
        "id": observation_id,
        "region": region,
        "screen": screen_selection,
        "display": mon,
        "image_width": image.size[0],
        "image_height": image.size[1],
        "ocr_blocks": blocks,
        "ocr_lines": grouped_lines,
        "created_at_ms": _now_ms(),
    }
    return _ok(
        {
            "observation_id": observation_id,
            "region": region,
            "screen": screen_selection,
            "display": mon,
            "image": {
                "included": req.include_image,
                "format": req.image_format if req.include_image else None,
                "base64": encoded,
                "width": image.size[0],
                "height": image.size[1],
            },
            "ocr": {
                "mode": ocr_applied_mode,
                "min_confidence": req.min_confidence,
                "language_hint": req.language_hint,
                "block_count": len(blocks),
                "line_count": len(grouped_lines),
                "blocks": blocks,
                "lines": grouped_lines,
            },
            "timing_ms": {
                "capture_ms": capture_ms,
                "ocr_ms": ocr_ms if req.ocr_mode != "none" else 0,
                "total_ms": capture_ms + (ocr_ms if req.ocr_mode != "none" else 0),
            },
        }
    )
@app.post("/v2/localize")
 def localize_v2(req: LocalizeRequestV2, _: None = Depends(_auth)):
    observation = _get_observation(req.observation_id)
    region = observation["region"]
    image_width = observation["image_width"]
    image_height = observation["image_height"]
    if req.image_tool_point is not None:
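        # Negative coordinates are out of bounds too, not only values past the right/bottom edge.
        if not (0 <= req.image_tool_point.x < image_width and 0 <= req.image_tool_point.y < image_height):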
            raise HTTPException(status_code=400, detail="image_tool_point outside observation image bounds")
        x = region["x"] + req.image_tool_point.x
        y = region["y"] + req.image_tool_point.y
        _enforce_allowed_region(x, y)
        resolved_target_id = _request_id()
        RESOLVED_TARGETS[resolved_target_id] = {
            "id": resolved_target_id,
            "observation_id": req.observation_id,
            "x": x,
            "y": y,
            "source": "image_tool_point",
        }
        return _ok(
            {
                "resolved_target_id": resolved_target_id,
                "source": "image_tool_point",
                "localization_confidence": _localization_confidence("image_tool_point"),
                "pixel": {"x": x, "y": y},
                "observation_region": region,
                "image_bounds": {"width": image_width, "height": image_height},
            }
        )
    lines = observation.get("ocr_lines") or _group_ocr_lines(observation.get("ocr_blocks", []))
    matches = _find_text_matches(lines, req.text_query or "", req.text_match, False, 200)
    if not matches:
        return _err("not_found", "no localization candidates found", 404, {"found": False, "matches": []})
    if req.candidate_index >= len(matches):
        raise HTTPException(status_code=400, detail="candidate_index is outside match results")
    chosen = matches[req.candidate_index]
    bbox = chosen["bbox"]
    x = bbox["x"] + max(1, bbox["width"] // 2)
    y = bbox["y"] + max(1, bbox["height"] // 2)
    _enforce_allowed_region(x, y)
    resolved_target_id = _request_id()
    RESOLVED_TARGETS[resolved_target_id] = {
        "id": resolved_target_id,
        "observation_id": req.observation_id,
        "x": x,
        "y": y,
        "source": "ocr",
        "match": chosen,
    }
    return _ok(
        {
            "resolved_target_id": resolved_target_id,
            "source": "ocr",
            "localization_confidence": _localization_confidence("ocr", chosen.get("confidence")),
            "pixel": {"x": x, "y": y},
            "selected_match": chosen,
            "match_count": len(matches),
        }
    )
@app.post("/v2/act")
 def act_v2(req: ActRequestV2, screen: int = 0, _: None = Depends(_auth)):
    legacy_action = _resolve_v2_action(req.action)
    result = _exec_action(legacy_action, screen)
    return _ok(result)
@app.post("/v2/act-verify")
 def act_verify_v2(req: ActVerifyRequestV2, screen: int = 0, _: None = Depends(_auth)):
    defaults = _risk_defaults(req.risk_level)
    verify_req = VerifyActionRequest(
        action=_resolve_v2_action(req.action),
        condition=req.condition,
        retries=defaults["retries"] if req.retries is None else req.retries,
        timeout_ms=defaults["timeout_ms"] if req.timeout_ms is None else req.timeout_ms,
        poll_interval_ms=defaults["poll_interval_ms"] if req.poll_interval_ms is None else req.poll_interval_ms,
        retry_delay_ms=defaults["retry_delay_ms"] if req.retry_delay_ms is None else req.retry_delay_ms,
        stop_on_action_error=req.stop_on_action_error,
    )
    result = _run_verified_action(verify_req, screen)
    payload = {
        "risk_level": req.risk_level,
        "defaults_applied": defaults,
        **result,
    }
    if result.get("success", False):
        return _ok(payload)
    return _err("verification_failed", "action verification did not satisfy condition", 409, payload)
@app.get("/health")
 def health(_: None = Depends(_auth)):
-    return {
+    return _ok(
-        "ok": True,
+        {
            "service": "clickthrough",
            "version": app.version,
-        "time_ms": _now_ms(),
-        "request_id": _request_id(),
            "dry_run": SETTINGS["dry_run"],
            "allowed_region": SETTINGS["allowed_region"],
            "exec": {
@@ -1395,136 +1778,13 @@ def health(_: None = Depends(_auth)):
                "max_timeout_s": SETTINGS["exec_max_timeout_s"],
            },
        }
    )
@app.get("/displays")
 def displays(_: None = Depends(_auth)):
    detected = _get_displays()
-    return {
+    return _ok({"displays": detected, "default_screen": 0})
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "displays": detected,
-        "default_screen": 0,
-    }
@app.get("/screen")
 def screen(
    with_grid: bool = True,
    grid_rows: int = SETTINGS["default_grid_rows"],
    grid_cols: int = SETTINGS["default_grid_cols"],
    include_labels: bool = True,
    image_format: Literal["png", "jpeg"] = "png",
    jpeg_quality: int = 85,
    asImage: bool = False,
    screen: int = 0,
    _: None = Depends(_auth),
 ):
    req = ScreenRequest(
        with_grid=with_grid,
        grid_rows=grid_rows,
        grid_cols=grid_cols,
        include_labels=include_labels,
        image_format=image_format,
        jpeg_quality=jpeg_quality,
    )
    base_img, mon, displays, screen_selection = _capture_screen(screen)
    meta = {"region": mon, "screen": screen_selection, "displays": displays}
    out_img = base_img
    if req.with_grid:
        out_img, grid_meta = _draw_grid(base_img, mon["x"], mon["y"], req.grid_rows, req.grid_cols, req.include_labels)
        meta.update(grid_meta)
    if asImage:
        image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality)
        media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png"
        return Response(content=image_bytes, media_type=media_type)
    encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "image": {
            "format": req.image_format,
            "base64": encoded,
            "width": out_img.size[0],
            "height": out_img.size[1],
        },
        "meta": meta,
    }
@app.post("/zoom")
 def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
    base_img, mon, displays, screen_selection = _capture_screen(screen)
    cx = req.center_x - mon["x"]
    cy = req.center_y - mon["y"]
    half_w = req.width // 2
    half_h = req.height // 2
    left = max(0, cx - half_w)
    top = max(0, cy - half_h)
    right = min(base_img.size[0], left + req.width)
    bottom = min(base_img.size[1], top + req.height)
    crop = base_img.crop((left, top, right, bottom))
    region_x = mon["x"] + left
    region_y = mon["y"] + top
    meta = {
        "source_monitor": mon,
        "screen": screen_selection,
        "displays": displays,
        "region": {
            "x": region_x,
            "y": region_y,
            "width": crop.size[0],
            "height": crop.size[1],
        },
    }
    out_img = crop
    if req.with_grid:
        out_img, grid_meta = _draw_grid(crop, region_x, region_y, req.grid_rows, req.grid_cols, req.include_labels)
        meta.update(grid_meta)
    if asImage:
        image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality)
        media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png"
        return Response(content=image_bytes, media_type=media_type)
    encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "image": {
            "format": req.image_format,
            "base64": encoded,
            "width": out_img.size[0],
            "height": out_img.size[1],
        },
        "meta": meta,
    }
@app.post("/action")
 def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
    result = _exec_action(req, screen)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": result,
    }
@app.post("/exec")
@@ -1540,12 +1800,7 @@ def exec_command(
        raise HTTPException(status_code=401, detail="invalid exec secret")
    result = _exec_command(req)
-    return {
+    return _ok(result)
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
@app.get("/windows")
@@ -1565,151 +1820,19 @@ def windows(
        visible_only=visible_only,
    )
    matches = _list_windows(query)
-    return {
+    return _ok({"windows": matches, "count": len(matches)})
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "windows": matches,
-        "count": len(matches),
-    }
@app.post("/windows/action")
 def window_action(req: WindowActionRequest, _: None = Depends(_auth)):
    result = _apply_window_action(req)
-    return {
+    return _ok(result)
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
@app.post("/launch")
 def launch(req: LaunchRequest, _: None = Depends(_auth)):
    result = _launch_app(req)
-    return {
+    return _ok(result)
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
@app.post("/wait")
 def wait(req: WaitRequest, screen: int = 0, _: None = Depends(_auth)):
    result = _wait_for_condition(req, screen)
    return {
        "ok": result.get("satisfied", False),
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": result,
    }
@app.post("/vision/diff")
 def vision_diff(req: VisionDiffRequest, screen: int = 0, _: None = Depends(_auth)):
    result = _compute_visual_diff(req, screen)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": result,
    }
@app.post("/vision/stability")
 def vision_stability(req: VisionStabilityRequest, screen: int = 0, _: None = Depends(_auth)):
    result = _measure_stability(req, screen)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": result,
    }
@app.post("/action/verify")
 def action_verify(req: VerifyActionRequest, screen: int = 0, _: None = Depends(_auth)):
    result = _run_verified_action(req, screen)
    return {
        "ok": result.get("success", False),
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": result,
    }
@app.post("/ocr")
 def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
    image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen)
    offset_x = region["x"] if source != "image" else 0
    offset_y = region["y"] if source != "image" else 0
    blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": {
            "mode": source,
            "screen": screen_selection if source != "image" else None,
            "display": mon if source != "image" else None,
            "language_hint": req.language_hint,
            "min_confidence": req.min_confidence,
            "region": region,
            "blocks": blocks,
        },
    }
@app.post("/ocr/find")
 def ocr_find(req: OCRFindRequest, screen: int = 0, _: None = Depends(_auth)):
    image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen)
    offset_x = region["x"] if source != "image" else 0
    offset_y = region["y"] if source != "image" else 0
    blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
    matches = _find_text_matches(blocks, req.query, req.match, req.group_lines, req.max_results)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": {
            "mode": source,
            "screen": screen_selection if source != "image" else None,
            "display": mon if source != "image" else None,
            "language_hint": req.language_hint,
            "min_confidence": req.min_confidence,
            "query": req.query,
            "match": req.match,
            "group_lines": req.group_lines,
            "region": region,
            "matches": matches,
            "match_count": len(matches),
            "blocks_considered": len(blocks),
        },
    }
@app.post("/batch")
 def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
    results = []
    for index, item in enumerate(req.actions):
        try:
            item_result = _exec_action(item, screen)
            results.append({"index": index, "ok": True, "result": item_result})
        except Exception as exc:
            results.append({"index": index, "ok": False, "error": str(exc)})
            if req.stop_on_error:
                break
    return {
        "ok": all(r["ok"] for r in results),
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "results": results,
    }
 if __name__ == "__main__":
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,381 +1,97 @@
 ---
 name: clickthrough-http-control
-description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
+description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
 ---
-# Clickthrough HTTP Control
+# Clickthrough HTTP Control (v2)
-Use a strict observe-decide-act-verify loop.
+Agents do not see live desktop video. They operate on snapshots.
 Use this loop: **observe -> localize -> act -> verify**.
-## Getting a computer instance (user-owned setup)
+## Fast defaults
-The **user/operator** is responsible for provisioning and exposing the target machine.
+- Start with `POST /v2/observe` on a tight region, not full screen.
-The agent should not assume it can self-install this stack.
+- Set `ocr_mode` to `none` unless text is required immediately.
 - Use `image` tool localization for icon-heavy or dense controls.
 - Use `POST /v2/act-verify` instead of manual sleep/poll loops.
-### What the user must do
+## Mandatory image-tool click localization
-1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
+When OCR is weak or ambiguous, ask the image tool for a single in-bounds coordinate.
 2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
 3. Configure secrets on target machine:
   - `CLICKTHROUGH_TOKEN` for general API auth
   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
 4. Share connection details with the agent through a secure channel:
   - `base_url`
   - `x-clickthrough-token`
   - `x-clickthrough-exec-secret` (only when `/exec` is needed)
-### What the agent should do
+Prompt template:
-
+- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
 1. Validate connection with `GET /health` using provided headers.
 2. Refuse `/exec` attempts when exec secret is missing/invalid.
 3. Ask user for missing setup inputs instead of guessing infrastructure.
 ## What the agent can actually see
 The agent does **not** inherently see the remote desktop.
 Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
 That means:
 - `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
 - `POST /ocr` returns machine-readable text blocks when text extraction is enough
 - the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
 - every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
 Do not write or think as if the agent is directly watching the screen in real time.
 Say what you actually have: screenshots, OCR output, and fresh verification captures.
 ## Mini API map
 - `GET /health` → server status + safety flags
 - `GET /displays` → detected displays in zero-based API order
 - `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
 - `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
 - `GET /windows` → discover visible desktop windows and their handles/processes
 - `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
 - `POST /launch` → start an app/process without dropping to a shell
 - `POST /wait?screen=0` → wait for text, window, or visual state changes
 - `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
 - `POST /vision/stability?screen=0` → measure short-interval visual stability
 - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
 - `POST /ocr/find?screen=0` → search OCR output for matching text candidates
 - `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
 - `POST /action/verify?screen=0` → execute one action plus structured success verification
 - `POST /batch?screen=0` → sequential action list
 - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
 ### Display selection
 - Use `GET /displays` before operating on multi-monitor systems.
 - Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
 - Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
 - Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
 - If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
 - Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
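
To make the global-coordinate rule concrete, here is a minimal sketch of turning a point inside a captured image into a global desktop coordinate. The helper name `to_global` and the monitor geometry are hypothetical, and the sketch assumes the image was captured at 1:1 scale so that image pixels map directly onto `region` pixels.

```python
def to_global(region: dict, local_x: int, local_y: int) -> tuple[int, int]:
    """Map a point inside a captured image to global desktop coordinates."""
    if not (0 <= local_x < region["width"] and 0 <= local_y < region["height"]):
        raise ValueError("point lies outside the captured region")
    # The region origin is already global, so a plain offset is enough.
    return region["x"] + local_x, region["y"] + local_y


# Hypothetical geometry: a secondary display placed to the left of the primary one.
left_monitor = {"x": -1920, "y": 0, "width": 1920, "height": 1080}
print(to_global(left_monitor, 100, 200))
```

The point of the example is that the same image-local point lands on different global pixels depending on the `region` origin, which is why the `region` from the response that produced the image should be reused.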
 ### OCR usage
 - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
 - Use `mode=screen` for discovery, then `mode=region` for precision and speed.
 - Use `language_hint` when known (for example `eng`) to improve consistency.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.
 - Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
 - OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
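
As a rough illustration of the nested-payload and confidence points above, the following sketch reads blocks from `result.blocks` and drops low-confidence ones on the client side. The function name `confident_blocks` is hypothetical, and the `text` and `confidence` field names on each block are assumptions about the block shape, not a confirmed schema.

```python
def confident_blocks(payload: dict, min_confidence: float = 0.4) -> list[dict]:
    """Read OCR blocks from the nested `result.blocks` field and drop weak ones."""
    # Blocks live under `result`, not at the top level of the response.
    blocks = payload.get("result", {}).get("blocks", [])
    # `confidence` is an assumed field name; blocks without it are treated as weak.
    return [b for b in blocks if b.get("confidence", 0.0) >= min_confidence]


# Hypothetical response, trimmed to the fields this sketch reads.
sample = {
    "ok": True,
    "result": {
        "blocks": [
            {"text": "Save", "confidence": 0.93},
            {"text": "S@v", "confidence": 0.21},
        ]
    },
}
print([b["text"] for b in confident_blocks(sample)])
```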
 ### Screenshot + `image` tool usage
 Use the OpenClaw `image` tool when OCR is not enough.
 This is especially useful for:
 - identifying which visible button looks like the primary confirm action
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
 - telling you where something is and where to click when text alone is not reliable
 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
 4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
 5. convert the answer into a concrete Clickthrough target
 6. act once
 7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
 Prefer vision over guessing.
 If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
 The model should help answer things like:
 - which visible button is the real primary action
 - whether the target is left/right/top/bottom within the crop
 - which of several similar buttons is the one to click
 - an approximate click point inside the provided image bounds
 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
 - "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
 - "Which visible control says Stop Recording, and where should I click?"
 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
 - requesting coordinates without telling the model the image bounds or expected output format
 ### Header requirements
 - Always send `x-clickthrough-token` when token auth is enabled.
 - For `/exec`, also send `x-clickthrough-exec-secret`.
 ## `POST /action` request shape (important)
 `/action` always expects an `action` plus an optional `target` object.
 Do **not** invent top-level `x` / `y` fields.
 Minimal pixel click:
 ```json
 {
  "action": "click",
  "target": {"mode": "pixel", "x": 100, "y": 200},
  "button": "left",
  "clicks": 1
 }
 ```
 Minimal grid click:
 ```json
 {
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 6,
    "col": 8,
    "dx": 0.0,
    "dy": 0.0
  }
 }
 ```
 Other canonical examples:
 ```json
 {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
 {"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
 {"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
 {"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
 {"action": "type", "text": "hello world", "interval_ms": 20}
 {"action": "hotkey", "keys": ["ctrl", "l"]}
 ```
 Rules:
- `dx` / `dy` belong inside `target`, not beside it.
+- Ask for one point only.
- `type` and `hotkey` usually do not need a `target`.
+- Include bounds in the prompt.
- For pixel targets, `x` / `y` are global desktop coordinates.
+- If the answer is not parseable as `x,y`, re-ask once with a stricter format.
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
+- Send the returned point to `POST /v2/localize` via `image_tool_point`.
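
One way to apply the image-tool rules is to parse the reply strictly and treat anything else as a signal to re-ask. The sketch below is illustrative only: `parse_click_point` is a hypothetical helper, and it assumes the image tool answers with the JSON shape requested in the prompt template, possibly surrounded by extra words.

```python
import json
import re


def parse_click_point(reply: str, width: int, height: int):
    """Return (x, y) if the reply holds one in-bounds integer point, else None."""
    match = re.search(r"\{[^{}]*\}", reply)  # first flat JSON object in the reply
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    x, y = data.get("x"), data.get("y")
    # Require real integers; `type(...) is int` also rejects JSON booleans.
    if type(x) is not int or type(y) is not int:
        return None
    if not (0 <= x < width and 0 <= y < height):
        return None
    return x, y


point = parse_click_point('Click here: {"x": 312, "y": 188}', 700, 420)
print(point)
```

A `None` result corresponds to the "re-ask once with a stricter format" rule; a tuple can be sent as `image_tool_point`.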
-## When to use `/exec`
+## API playbook
-Prefer structured GUI control first:
+1. **Observe**
 - `/screen`, `/zoom`, `/ocr` to observe
 - `/action` or `/batch` to interact
-Use `/exec` only when it is the cleanest available tool for the job, for example:
+```json
- querying machine state that the GUI does not expose well
+POST /v2/observe?screen=0
- performing an explicit user-requested shell/system task
+{
- recovering from a blocked GUI flow when normal interaction failed
+  "mode": "region",
  "region_x": 820,
  "region_y": 420,
  "region_width": 700,
  "region_height": 420,
  "include_image": true,
  "ocr_mode": "none"
 }
 ```
-Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
+2. **Localize** (choose one)
 Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
 When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
-## Core workflow (mandatory)
+Text:
 ```json
 POST /v2/localize
 {"observation_id":"...","text_query":"Save","text_match":"exact"}
 ```
-1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
+Image-tool point:
-2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
+```json
-3. Identify likely target region and compute an initial confidence score.
+POST /v2/localize
-4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
+{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
-5. **Before any click**, verify target identity (OCR text/icon/location consistency).
+```
 6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
 7. Execute one minimal action via `POST /action`.
 8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
 9. Repeat until objective is complete.
-## Verify-before-click rules
+3. **Act**
- Never click if target identity is ambiguous.
+```json
- Require at least two matching signals before click.
+POST /v2/act?screen=0
- Good signal pairs include:
+{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
-  - OCR text + expected UI region
+```
  - OCR text + matching button shape/icon nearby
  - dialog title text + expected button position within that dialog
  - known app/window focus + expected control location
  - OCR candidate + vision-model localization inside the same crop
 - If confidence is low, do not "test click"; zoom and re-localize first.
 - If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.
-## Precision rules
+4. **Verify**
- Prefer grid targets first, then use `dx/dy` for subcell precision.
+```json
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
+POST /v2/act-verify?screen=0
- Use zoom before guessing offsets.
+{
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
+  "action":{"action":"click","target":{"resolved_target_id":"..."}},
  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
  "risk_level":"low"
 }
 ```
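
The four playbook steps can be chained in a small client helper. This is a sketch under stated assumptions, not the project's client: `click_with_verify`, `post`, and `pick_point` are hypothetical names, and `post(path, body)` is assumed to send the JSON body with the auth header and return the endpoint's payload as a dict (already unwrapped from the response envelope, whose exact shape is not shown here).

```python
from typing import Callable

# Hypothetical transport and image-tool hooks supplied by the caller.
Post = Callable[[str, dict], dict]
PickPoint = Callable[[dict], tuple[int, int]]


def click_with_verify(post: Post, pick_point: PickPoint, region: dict,
                      risk_level: str = "low") -> dict:
    """Run observe -> localize -> act-verify for one click inside `region`."""
    bounds = {
        "region_x": region["x"],
        "region_y": region["y"],
        "region_width": region["width"],
        "region_height": region["height"],
    }
    # 1. Observe a tight region with OCR switched off.
    observation = post(
        "/v2/observe",
        {"mode": "region", **bounds, "include_image": True, "ocr_mode": "none"},
    )
    # 2. Localize: `pick_point` stands in for the image tool and returns an
    #    image-local point for this observation.
    x, y = pick_point(observation)
    target = post(
        "/v2/localize",
        {
            "observation_id": observation["observation_id"],
            "image_tool_point": {"x": x, "y": y},
        },
    )
    # 3 and 4. Act and verify in a single call.
    action = {
        "action": "click",
        "target": {"resolved_target_id": target["resolved_target_id"]},
    }
    condition = {"kind": "visual", "state": "change", **bounds}
    return post(
        "/v2/act-verify",
        {"action": action, "condition": condition, "risk_level": risk_level},
    )
```

Passing the transport in as a callable keeps the sketch independent of any HTTP library and makes the request sequence easy to check with a fake.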
-## Safety rules
+## Risk policy
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
+- Low risk (navigation, focus, benign clicks): single verification signal.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
+- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before acting.
- Avoid destructive shortcuts unless explicitly requested.
+- Never do speculative repeat clicks; switch strategy after one failed verify.
 - Send one action at a time unless deterministic; then use `/batch`.
-## Reliability rules
+## Anti-latency rules
- After every meaningful action, verify with a fresh screenshot.
+- Never repeat full-screen OCR by default.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
+- Re-observe only the active pane/region.
- Prefer short, reversible actions over long macros.
+- Prefer keyboard + window APIs for app switching.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
+- Use OCR on region only and cap area with `max_ocr_area_px`.
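
The area cap can also be checked on the client before sending a request, so an oversized region never reaches the server. The sketch below mirrors the width-times-height comparison that `observe_v2` applies to `max_ocr_area_px`; the helper name `ocr_area_allowed` and the example cap are hypothetical.

```python
from typing import Optional


def ocr_area_allowed(width: int, height: int, max_ocr_area_px: Optional[int]) -> bool:
    """Return True when an OCR region fits under the optional area cap."""
    if max_ocr_area_px is None:
        return True  # no cap configured
    return width * height <= max_ocr_area_px


# Hypothetical cap of 300,000 px: a 700x420 pane fits, a full HD screen does not.
print(ocr_area_allowed(700, 420, 300_000))
print(ocr_area_allowed(1920, 1080, 300_000))
```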
-## Fallback ladder for uncertain targeting
+## Setup and auth
-1. Full-screen capture with a coarse grid.
+- Include `x-clickthrough-token` when token auth is enabled.
-2. Zoom into the candidate area with a denser grid.
+- `/exec` additionally requires `x-clickthrough-exec-secret`.
-3. OCR the full screen or the tighter region.
+- Validate server first: `GET /health`.
 4. Re-anchor on a more reliable nearby control, title, or label.
 5. Try a keyboard-first flow if the app supports it.
 6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
 Do not skip from "uncertain click" straight to random retries.
 ## Concrete screenshot -> `image` -> action example
 Example loop:
 1. `GET /screen?screen=0` to capture the current app state
 2. if the UI is text-heavy, try `POST /ocr` first
 3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
   - "In this save dialog, which visible button is the primary action?"
   - "Is there a dismiss/close button in the top-right of this modal?"
 4. map the answer back to a Clickthrough target using the returned grid/region metadata
 5. click once with `POST /action`
 6. recapture the screen
 7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
 When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
 ## App-specific playbooks (recommended)
 Build per-app routines for repetitive tasks instead of generic clicking.
 ### Launcher / search / start app playbook
 Use this when the goal is "open app X" or "bring up tool Y".
 1. check `GET /windows` first in case the app is already open
 2. if present, use `POST /windows/action` to focus or restore it
 3. if absent, prefer `POST /launch` when you know the executable path
 4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
   - open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
   - type exact app name
   - wait for stable results with `POST /wait` or recapture
   - verify the result text with OCR or the `image` tool
   - press Enter or click the exact result once
 5. verify the app window now exists or is focused
 Do not keep relaunching if the window already exists; that’s sloppy.
 ### Dialog confirmation playbook
 Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
 1. capture the dialog region with `POST /zoom`
 2. use OCR first for title/body/button labels
 3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
 4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
 5. for destructive actions, require explicit user confirmation unless already requested
 6. click once and verify the dialog disappeared or changed state
 Good verification targets:
 - dialog title vanished
 - expected next window appeared
 - destructive side effect is visible and confirmed
 ### File picker playbook
 Use for open/save dialogs.
 1. verify the file picker window is focused
 2. OCR the visible breadcrumb/path area, filename field, and button row
 3. prefer keyboard-first entry when possible:
   - type or paste the target path/name into the focused field
   - use `tab` / `shift+tab` to move predictably between filename and action buttons
 4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
 5. verify the intended filename/path is visible before confirming
 6. activate `Open` / `Save` once and verify the picker closes
 If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
 ### Browser tab / window playbook
 Use for browser navigation, tab targeting, or web app recovery.
 1. use `GET /windows` to find the correct browser window, then focus it with `POST /windows/action`
 2. prefer keyboard-first navigation:
   - `ctrl+l` / `cmd+l` to focus the address bar
   - `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
   - `ctrl+w` only for explicitly requested close actions
 3. verify tab or page identity with OCR on the tab strip or page heading
 4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
 5. after navigation, wait for visual stability or expected text before taking the next action
 6. on shopping or checkout pages, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
 Do not assume a page loaded just because the click landed. Verify it.
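When `POST /wait` does not fit, the same "wait for expected text" idea can be approximated client-side with a bounded polling helper. This is a generic sketch: `wait_until` and its parameters are hypothetical names, and `check` stands in for whatever OCR or window query confirms the page state.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # give up instead of polling forever
        sleep(interval_s)
```

The injectable `clock` and `sleep` exist only so the helper can be exercised without real delays.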
 ### Settings / preferences navigation playbook
 Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
 1. identify the current settings page with OCR on the heading/sidebar
 2. use OCR to find the specific section label before trying to toggle anything
 3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
 4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
 5. after each change, verify the control state changed visually or via visible text
 6. if a save/apply button exists, treat it as a separate confirmation step and verify completion
 Settings UIs love hiding side effects. Assume nothing.
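Steps 4 and 5 amount to "apply one change, verify it, and stop at the first change that does not verify". A minimal sketch, where `apply` and `verify` are hypothetical callbacks wrapping your own click and OCR logic:

```python
from typing import Callable, Iterable, List, Optional, Tuple


def apply_one_at_a_time(
    changes: Iterable[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply changes in order; return (verified changes, first failure or None)."""
    verified: List[str] = []
    for change in changes:
        apply(change)
        if not verify(change):
            # Stop here so later changes are not stacked on an unknown state.
            return verified, change
        verified.append(change)
    return verified, None
```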
 ### Dense app / control-strip playbook
 Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
 1. focus the exact app window with `POST /windows/action`
 2. capture the full target display once to confirm the window is actually frontmost
 3. crop tightly around the suspected control strip with `POST /zoom`
 4. run OCR on the crop, not the full screen
 5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
 6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
 Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
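For step 3, the crop for a bottom-right control cluster can be derived from the window bounds. The `Rect` type and both helpers are illustrative assumptions; how the result maps onto `POST /zoom` parameters depends on that endpoint's schema, which is not described in this section.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def bottom_right_crop(window: Rect, crop_w: int, crop_h: int) -> Rect:
    """Return a crop anchored to the window's bottom-right corner, clamped to fit."""
    w = min(crop_w, window.width)
    h = min(crop_h, window.height)
    return Rect(window.x + window.width - w, window.y + window.height - h, w, h)


def center_of(rect: Rect) -> Tuple[int, int]:
    """Center point of a rectangle, for endpoints that crop around a point."""
    return rect.x + rect.width // 2, rect.y + rect.height // 2
```

With a window at `Rect(100, 50, 1280, 720)` and a 400x200 crop, this yields `Rect(980, 570, 400, 200)` with center `(1180, 670)`.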
 ### Spotify playbook
 - Focus app window before search/navigation.
 - Prefer keyboard-first flow for song start:
   1. `Ctrl+L` (search)
   2. type exact query
   3. Enter
   4. verify exact song+artist text
   5. click/double-click row
   6. verify now-playing bar
 - If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
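The now-playing check can be a strict text comparison on OCR output rather than a visual guess. A minimal sketch, assuming you already have the OCR text of the now-playing bar as a string; the function names are hypothetical:

```python
def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so OCR line breaks do not matter."""
    return " ".join(text.casefold().split())


def now_playing_matches(ocr_text: str, title: str, artist: str) -> bool:
    """True only when both the title and the artist appear in the OCR text."""
    seen = normalize(ocr_text)
    return normalize(title) in seen and normalize(artist) in seen
```

Requiring both title and artist avoids accepting a cover or a same-named track by a different artist.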