From aced5be25e476780edcd5ff05fb6e45da925c8ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Paul=20W=C3=A4hner?= <space@reversed.dev>
Date: Sun, 3 May 2026 19:11:11 +0200
Subject: [PATCH] feat: migrate to v2-only API and unified response envelope

---
 README.md              |  85 ++---
 docs/API.md            | 641 +++++---------------------------------
 examples/quickstart.py |  31 +-
 server/app.py          | 691 ++++++++++++++++++++++++-----------------
 skill/SKILL.md         | 422 ++++---------------------
 5 files changed, 603 insertions(+), 1267 deletions(-)

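Reviewer note on the unified response envelope: callers now have to read `data` and `error` rather than top-level fields. A minimal client-side sketch of that handling follows. It is not part of this patch; `EnvelopeError` and `unwrap_envelope` are hypothetical names, and the sample payloads only mirror the shapes documented in docs/API.md.

```python
from typing import Any, Optional


class EnvelopeError(Exception):
    """Raised when an envelope reports ok=false (hypothetical client-side helper)."""

    def __init__(self, code: str, message: str, details: Any = None,
                 request_id: Optional[str] = None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details
        self.request_id = request_id


def unwrap_envelope(payload: dict) -> Any:
    """Return payload["data"] on success; raise EnvelopeError otherwise."""
    if payload.get("ok") is True:
        return payload.get("data")
    # "error" is null on success and an object on failure; tolerate a missing object.
    error = payload.get("error") or {}
    raise EnvelopeError(
        code=str(error.get("code", "unknown_error")),
        message=str(error.get("message", "request failed")),
        details=error.get("details"),
        request_id=payload.get("request_id"),
    )


# Sample envelopes shaped like the documented success and error responses.
success = {"ok": True, "request_id": "r1", "time_ms": 1710000000000,
           "data": {"observation_id": "obs-1"}, "error": None}
failure = {"ok": False, "request_id": "r2", "time_ms": 1710000000000, "data": None,
           "error": {"code": "http_error", "message": "observation_id not found",
                     "details": None}}

print(unwrap_envelope(success)["observation_id"])
try:
    unwrap_envelope(failure)
except EnvelopeError as exc:
    print(exc.code, exc.request_id)
```

With `requests`, the same helper would presumably wrap `response.json()`, which keeps the `["data"]` indexing seen in examples/quickstart.py in one place.
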
diff --git a/README.md b/README.md
index f0be27e..ae7ef4f 100644
--- a/README.md
+++ b/README.md
@@ -1,22 +1,25 @@
 # Clickthrough
 
-Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions.
+Let an agent interact with a computer over HTTP.
+
+## Primary mode (v2)
+
+Use the v2 contract for faster, less OCR-heavy control loops:
+- `POST /v2/observe`
+- `POST /v2/localize`
+- `POST /v2/act`
+- `POST /v2/act-verify`
+
+The v2 contract is optimized for agents that cannot see the screen directly and must rely on screenshot/image tools.
 
 ## What this provides
 
-- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
-- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
-- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
-- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
-- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
-- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
-- **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
-- **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability`
-- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
-- **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify`
-- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
-- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
-- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
+- Screen/region capture with optional OCR and timing stats
+- Observation IDs for deterministic follow-up localization
+- Text localization and image-tool coordinate localization
+- Action execution with resolved target IDs
+- Risk-aware action+verification defaults
+- Unified response envelope across all endpoints
 
 ## Quick start
 
@@ -30,53 +33,17 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
 
 Server defaults to `127.0.0.1:8123`.
 
-For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird.
+## Fast control loop
 
-`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
+1. `POST /v2/observe` on a tight region
+2. If OCR is enough, `POST /v2/localize` with `text_query`
+3. If the OCR result is ambiguous, ask an image tool for a single `x,y` point within the observation bounds
+4. `POST /v2/localize` with `image_tool_point`
+5. `POST /v2/act` or `POST /v2/act-verify`
+6. Re-observe only the changed region
 
-## Minimal API flow
+## Documentation
 
-1. `GET /displays` if you need a non-primary monitor
-2. `GET /screen?screen=0` with grid
-3. Decide cell / target
-4. Optional `POST /zoom?screen=0` for finer targeting
-5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow)
-6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find`
-
-Important:
-- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
-- Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
-- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
-- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
-- Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
-
-See:
 - `docs/API.md`
-- `docs/coordinate-system.md`
 - `skill/SKILL.md`
-
-## Configuration
-
-Environment variables:
-
-- `CLICKTHROUGH_HOST` (default `127.0.0.1`)
-- `CLICKTHROUGH_PORT` (default `8123`)
-- `CLICKTHROUGH_TOKEN` (optional; if set, require `x-clickthrough-token` header)
-- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
-- `CLICKTHROUGH_GRID_ROWS` (default `12`)
-- `CLICKTHROUGH_GRID_COLS` (default `12`)
-- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
-- `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
-- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
-- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
-- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
-- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
-- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
-- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
-
-Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.
-
-## Gitea CI
-
-A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
-It runs Python syntax checks (`py_compile`) on every push and pull request.
+- `docs/coordinate-system.md`
diff --git a/docs/API.md b/docs/API.md
index 76cfa59..54c7466 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -1,614 +1,141 @@
-# API Reference (v0.1)
+# API Reference (v2)
 
 Base URL: `http://127.0.0.1:8123`
 
-If `CLICKTHROUGH_TOKEN` is set, include header:
+If `CLICKTHROUGH_TOKEN` is set, include:
 
 ```http
 x-clickthrough-token: <token>
 ```
 
-## `GET /health`
+## Endpoints
 
-Returns status and runtime safety flags, including `exec` capability config.
+- `POST /v2/observe`
+- `POST /v2/localize`
+- `POST /v2/act`
+- `POST /v2/act-verify`
+- `GET /health`
+- `GET /displays`
+- `GET /windows`
+- `POST /windows/action`
+- `POST /launch`
+- `POST /exec`
 
-## `GET /displays`
+No v1 endpoints are supported.
 
-Returns detected displays in API screen order.
-
-```json
-{
-  "ok": true,
-  "default_screen": 0,
-  "displays": [
-    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
-    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
-  ]
-}
-```
-
-`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
-Invalid `screen` values fall back to `0`.
-
-## `GET /screen`
-
-Query params:
-
-- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
-- `with_grid` (bool, default `true`)
-- `grid_rows` (int, default env or `12`)
-- `grid_cols` (int, default env or `12`)
-- `include_labels` (bool, default `true`)
-- `image_format` (`png`|`jpeg`, default `png`)
-- `jpeg_quality` (1-100, default `85`)
-- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
-
-Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
-`meta.region` uses global desktop coordinates.
-
-These image-returning endpoints do not magically grant the agent live vision.
-If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
-
-## `POST /zoom`
-
-Body:
-
-```json
-{
-  "center_x": 1200,
-  "center_y": 700,
-  "width": 500,
-  "height": 350,
-  "with_grid": true,
-  "grid_rows": 20,
-  "grid_cols": 20,
-  "include_labels": true,
-  "image_format": "png",
-  "jpeg_quality": 90
-}
-```
-
-Query params:
-
-- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
-- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
-
-Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
-
-`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
-
-## `POST /action`
-
-Body: one action.
-
-Important:
-- the request body uses `action` plus an optional `target`
-- pixel coordinates live inside `target` when `target.mode="pixel"`
-- do **not** send top-level `x` / `y` fields
-
-Query params:
-
-- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
-
-Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
-
-### Pointer target modes
-
-#### Pixel target
-
-```json
-{
-  "mode": "pixel",
-  "x": 100,
-  "y": 200,
-  "dx": 0,
-  "dy": 0
-}
-```
-
-#### Grid target
-
-```json
-{
-  "mode": "grid",
-  "region_x": 0,
-  "region_y": 0,
-  "region_width": 1920,
-  "region_height": 1080,
-  "rows": 12,
-  "cols": 12,
-  "row": 5,
-  "col": 9,
-  "dx": 0.0,
-  "dy": 0.0
-}
-```
-
-`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
-
-### Action examples
-
-Click:
-
-```json
-{
-  "action": "click",
-  "target": {
-    "mode": "grid",
-    "region_x": 0,
-    "region_y": 0,
-    "region_width": 1920,
-    "region_height": 1080,
-    "rows": 12,
-    "cols": 12,
-    "row": 7,
-    "col": 3,
-    "dx": 0.2,
-    "dy": -0.1
-  },
-  "clicks": 1,
-  "button": "left"
-}
-```
-
-Scroll:
-
-```json
-{
-  "action": "scroll",
-  "target": {"mode": "pixel", "x": 1300, "y": 740},
-  "scroll_amount": -500
-}
-```
-
-Type text:
-
-```json
-{
-  "action": "type",
-  "text": "hello world",
-  "interval_ms": 20
-}
-```
-
-Hotkey:
-
-```json
-{
-  "action": "hotkey",
-  "keys": ["ctrl", "l"]
-}
-```
-
-Right click:
-
-```json
-{
-  "action": "right_click",
-  "target": {"mode": "pixel", "x": 1300, "y": 740}
-}
-```
-
-Move only:
-
-```json
-{
-  "action": "move",
-  "target": {"mode": "pixel", "x": 1300, "y": 740},
-  "duration_ms": 150
-}
-```
-
-## `GET /windows`
-
-List desktop windows using structured filters instead of shelling out.
-
-Query params:
-
-- `title_contains` (optional substring match)
-- `title_regex` (optional case-insensitive regex)
-- `process_name` (optional exact process name, e.g. `explorer.exe`)
-- `hwnd` (optional exact window handle)
-- `visible_only` (bool, default `true`)
-
-```json
-{
-  "ok": true,
-  "count": 1,
-  "windows": [
-    {
-      "hwnd": 132640,
-      "title": "WinDirStat",
-      "class_name": "WinDirStatMainWindow",
-      "pid": 18420,
-      "process_name": "windirstat.exe",
-      "visible": true,
-      "enabled": true,
-      "minimized": false,
-      "maximized": false,
-      "foreground": true,
-      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
-    }
-  ]
-}
-```
-
-Notes:
-- Currently supported on Windows hosts only.
-- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
-
-## `POST /windows/action`
-
-Perform a structured window action against exactly one matched window.
-
-```json
-{
-  "action": "focus",
-  "title_contains": "WinDirStat",
-  "visible_only": true,
-  "timeout_ms": 3000
-}
-```
-
-Supported actions:
-- `focus`
-- `restore`
-- `minimize`
-- `maximize`
-- `close`
-
-The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).
-
-## `POST /launch`
-
-Start an app/process without invoking a shell.
-
-```json
-{
-  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
-  "args": [],
-  "cwd": "C:/Program Files/WinDirStat",
-  "wait_for_window": true,
-  "match": {
-    "title_contains": "WinDirStat",
-    "visible_only": true
-  },
-  "timeout_ms": 8000
-}
-```
-
-Notes:
-- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
-- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
-- `dry_run=true` returns the resolved argv/cwd without launching.
-
-## `POST /vision/diff`
-
-Measure whether a screen region changed meaningfully between two captures.
-
-Query params:
-
-- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
-
-Compare live captures:
+## `POST /v2/observe`
 
 ```json
 {
   "mode": "region",
-  "region_x": 120,
-  "region_y": 80,
-  "region_width": 600,
-  "region_height": 300,
-  "delay_ms": 400,
-  "diff_threshold": 0.01
+  "region_x": 800,
+  "region_y": 420,
+  "region_width": 700,
+  "region_height": 420,
+  "include_image": true,
+  "image_format": "jpeg",
+  "jpeg_quality": 75,
+  "ocr_mode": "region",
+  "language_hint": "eng",
+  "min_confidence": 0.45,
+  "max_ocr_area_px": 1500000,
+  "group_lines": true
 }
 ```
 
-Compare provided images:
+Returns an `observation_id`, the captured `region`, screen/display metadata, the optional base64 image, OCR blocks and grouped lines, and `timing_ms`.
+
+## `POST /v2/localize`
+
+Text localization:
 
 ```json
 {
-  "mode": "image",
-  "before_image_base64": "iVBORw0KGgoAAA...",
-  "after_image_base64": "iVBORw0KGgoBBB...",
-  "diff_threshold": 0.01
+  "observation_id": "...",
+  "text_query": "Save",
+  "text_match": "exact",
+  "candidate_index": 0
 }
 ```
 
-Response includes:
-- `diff_ratio` — average normalized pixel difference
-- `changed` — whether `diff_ratio >= diff_threshold`
-- `region` — compared region
-
-## `POST /vision/stability`
-
-Measure whether a screen region stays visually stable over a short interval.
-
-Query params:
-
-- `screen` (int, default `0`)
+Image-tool point localization:
 
 ```json
 {
-  "region_x": 0,
-  "region_y": 0,
-  "region_width": 1920,
-  "region_height": 1080,
-  "sample_interval_ms": 250,
-  "duration_ms": 1200,
-  "diff_threshold": 0.005
+  "observation_id": "...",
+  "image_tool_point": {"x": 312, "y": 188}
 }
 ```
 
-Response includes:
-- `stable`
-- `sample_count`
-- `max_diff_ratio`
-- `avg_diff_ratio`
+Returns `resolved_target_id`, the target's global pixel coordinates, and `localization_confidence` (`high`, `medium`, or `low`).
 
-## `POST /wait`
-
-Wait on a structured UI condition instead of guessing sleep durations.
-
-Query params:
-
-- `screen` (int, default `0`) - used for text and visual waits
-
-### Wait for text to appear
-
-```json
-{
-  "condition": {
-    "kind": "text",
-    "mode": "screen",
-    "text": "Scan complete",
-    "match": "contains",
-    "present": true,
-    "language_hint": "eng",
-    "min_confidence": 0.4
-  },
-  "timeout_ms": 15000,
-  "poll_interval_ms": 400
-}
-```
-
-### Wait for a window state
-
-```json
-{
-  "condition": {
-    "kind": "window",
-    "title_contains": "WinDirStat",
-    "visible_only": true,
-    "state": "focused"
-  },
-  "timeout_ms": 5000,
-  "poll_interval_ms": 200
-}
-```
-
-Window states:
-- `exists`
-- `focused`
-- `closed`
-
-### Wait for visual change or stability
-
-```json
-{
-  "condition": {
-    "kind": "visual",
-    "state": "stable",
-    "region_x": 0,
-    "region_y": 0,
-    "region_width": 1920,
-    "region_height": 1080,
-    "diff_threshold": 0.005,
-    "stable_for_ms": 1000
-  },
-  "timeout_ms": 12000,
-  "poll_interval_ms": 300
-}
-```
-
-Visual states:
-- `change` — succeeds when the average pixel diff crosses `diff_threshold`
-- `stable` — succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`
-
-Notes:
-- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
-- Window waits build on the structured window discovery endpoint.
-- Visual waits compare repeated captures of either the full selected display or an explicit region.
-
-## `POST /action/verify`
-
-Execute one action and wait for a structured success condition.
-
-Query params:
-
-- `screen` (int, default `0`)
+## `POST /v2/act`
 
 ```json
 {
   "action": {
     "action": "click",
-    "target": {"mode": "pixel", "x": 1300, "y": 740}
+    "target": {"resolved_target_id": "..."},
+    "button": "left",
+    "clicks": 1
+  }
+}
+```
+
+## `POST /v2/act-verify`
+
+```json
+{
+  "action": {
+    "action": "click",
+    "target": {"resolved_target_id": "..."}
   },
   "condition": {
     "kind": "text",
-    "mode": "screen",
-    "text": "Settings",
+    "mode": "region",
+    "text": "Saved",
     "match": "contains",
     "present": true,
-    "language_hint": "eng",
+    "region_x": 820,
+    "region_y": 420,
+    "region_width": 500,
+    "region_height": 140,
     "min_confidence": 0.4
   },
-  "retries": 1,
-  "timeout_ms": 4000,
-  "poll_interval_ms": 250,
-  "retry_delay_ms": 250
+  "risk_level": "low"
 }
 ```
 
-Condition kinds mirror `POST /wait`:
-- `text`
-- `window`
-- `visual`
+Risk defaults:
+- `low`: retries `0`, timeout `2500ms`, poll interval `200ms`, retry delay `150ms`
+- `high`: retries `1`, timeout `6000ms`, poll interval `250ms`, retry delay `300ms`
 
-The response returns per-attempt action output plus structured verification output.
+## Response envelope
 
-## `POST /ocr`
-
-Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
-
-Query params:
-
-- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
-
-Body:
-
-```json
-{
-  "mode": "screen",
-  "language_hint": "eng",
-  "min_confidence": 0.4
-}
-```
-
-Modes:
-- `screen` (default): OCR over full selected monitor
-- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
-- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
-
-Region mode example:
-
-```json
-{
-  "mode": "region",
-  "region_x": 220,
-  "region_y": 160,
-  "region_width": 900,
-  "region_height": 400,
-  "language_hint": "eng",
-  "min_confidence": 0.5
-}
-```
-
-Image mode example:
-
-```json
-{
-  "mode": "image",
-  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
-  "language_hint": "eng"
-}
-```
-
-Response shape:
+Success:
 
 ```json
 {
   "ok": true,
   "request_id": "...",
   "time_ms": 1710000000000,
-  "result": {
-    "mode": "screen",
-    "language_hint": "eng",
-    "min_confidence": 0.4,
-    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
-    "blocks": [
-      {
-        "text": "Settings",
-        "confidence": 0.9821,
-        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
-      }
-    ]
+  "data": { },
+  "error": null
+}
+```
+
+Error:
+
+```json
+{
+  "ok": false,
+  "request_id": "...",
+  "time_ms": 1710000000000,
+  "data": null,
+  "error": {
+    "code": "http_error",
+    "message": "...",
+    "details": {}
   }
 }
 ```
-
-Notes:
-- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
-- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
-- Requires `tesseract` executable plus Python package `pytesseract`.
-- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
-
-## `POST /ocr/find`
-
-Search OCR output for matching text instead of post-processing raw OCR blocks client-side.
-
-Query params:
-
-- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
-
-```json
-{
-  "mode": "screen",
-  "query": "Settings",
-  "match": "contains",
-  "group_lines": true,
-  "max_results": 10,
-  "language_hint": "eng",
-  "min_confidence": 0.4
-}
-```
-
-Modes:
-- `screen`
-- `region`
-- `image`
-
-Options:
-- `match`: `contains`, `exact`, or `regex`
-- `group_lines=true`: combine nearby OCR words into line-level candidates before matching
-- `max_results`: result cap after confidence sorting
-
-Response includes:
-- `matches` — confidence-sorted candidate matches
-- `match_count`
-- `blocks_considered`
-
-## `POST /exec`
-
-Execute a shell command on the host running Clickthrough.
-
-Requirements:
-- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
-- send header `x-clickthrough-exec-secret: <secret>`
-
-```json
-{
-  "command": "Get-Process | Select-Object -First 5",
-  "shell": "powershell",
-  "timeout_s": 20,
-  "cwd": "C:/Users/Paul",
-  "dry_run": false
-}
-```
-
-Notes:
-- `shell` supports `powershell`, `bash`, `cmd`
-- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
-- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
-- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
-- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)
-
-Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
-
-## `POST /batch`
-
-Runs multiple `action` payloads sequentially.
-
-Query params:
-
-- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
-
-```json
-{
-  "actions": [
-    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
-    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
-  ],
-  "stop_on_error": true
-}
-```
diff --git a/examples/quickstart.py b/examples/quickstart.py
index 5aba923..3ad8ce2 100644
--- a/examples/quickstart.py
+++ b/examples/quickstart.py
@@ -13,23 +13,26 @@ if TOKEN:
 
 
 def main():
-    r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
-    r.raise_for_status()
-    print("health:", r.json())
+    health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
+    health.raise_for_status()
+    print("health ok:", health.json().get("ok"))
 
-    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
-    d.raise_for_status()
-    print("displays:", d.json().get("displays", []))
-
-    s = requests.get(
-        f"{BASE_URL}/screen",
+    observe = requests.post(
+        f"{BASE_URL}/v2/observe",
         headers=headers,
-        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
-        timeout=30,
+        params={"screen": SCREEN},
+        json={
+            "mode": "screen",
+            "include_image": False,
+            "ocr_mode": "none",
+        },
+        timeout=20,
     )
-    s.raise_for_status()
-    payload = s.json()
-    print("screen meta:", payload.get("meta", {}))
+    observe.raise_for_status()
+    payload = observe.json()["data"]
+    print("observation_id:", payload["observation_id"])
+    print("region:", payload["region"])
+    print("timing_ms:", payload["timing_ms"])
 
 
 if __name__ == "__main__":
diff --git a/server/app.py b/server/app.py
index dca2378..bd43ed6 100644
--- a/server/app.py
+++ b/server/app.py
@@ -8,10 +8,12 @@ import subprocess
 import sys
 import time
 import uuid
-from typing import Literal, Optional
+from typing import Any, Literal, Optional
 
 from dotenv import load_dotenv
-from fastapi import Depends, FastAPI, Header, HTTPException, Response
+from fastapi import Depends, FastAPI, Header, HTTPException, Request
+from fastapi.exceptions import RequestValidationError
+from fastapi.responses import JSONResponse
 from PIL import ImageChops, ImageStat
 from pydantic import BaseModel, Field, model_validator
 
@@ -21,6 +23,55 @@ load_dotenv(dotenv_path=".env", override=False)
 app = FastAPI(title="clickthrough", version="0.1.0")
 
 
+def _ok(data: Any, status_code: int = 200):
+    return JSONResponse(
+        status_code=status_code,
+        content={
+            "ok": True,
+            "request_id": _request_id(),
+            "time_ms": _now_ms(),
+            "data": data,
+            "error": None,
+        },
+    )
+
+
+def _err(code: str, message: str, status_code: int, details: Any = None):
+    return JSONResponse(
+        status_code=status_code,
+        content={
+            "ok": False,
+            "request_id": _request_id(),
+            "time_ms": _now_ms(),
+            "data": None,
+            "error": {
+                "code": code,
+                "message": message,
+                "details": details,
+            },
+        },
+    )
+
+
+@app.exception_handler(HTTPException)
+async def _http_exception_handler(_: Request, exc: HTTPException):
+    detail = exc.detail
+    if isinstance(detail, dict):
+        message = str(detail.get("message", "request failed"))
+        return _err("http_error", message, exc.status_code, detail)
+    return _err("http_error", str(detail), exc.status_code)
+
+
+@app.exception_handler(Exception)
+async def _unhandled_exception_handler(_: Request, exc: Exception):
+    return _err("internal_error", "internal server error", 500, {"type": type(exc).__name__})
+
+
+@app.exception_handler(RequestValidationError)
+async def _validation_exception_handler(_: Request, exc: RequestValidationError):
+    return _err("validation_error", "request validation failed", 422, [{k: v for k, v in e.items() if k != "ctx"} for e in exc.errors()])
+
+
 def _env_bool(name: str, default: bool) -> bool:
     raw = os.getenv(name)
     if raw is None:
@@ -288,6 +339,144 @@ class VerifyActionRequest(BaseModel):
     stop_on_action_error: bool = True
 
 
+class ObserveRequestV2(BaseModel):
+    mode: Literal["screen", "region"] = "screen"
+    region_x: int | None = Field(default=None, ge=0)
+    region_y: int | None = Field(default=None, ge=0)
+    region_width: int | None = Field(default=None, gt=0)
+    region_height: int | None = Field(default=None, gt=0)
+    include_image: bool = True
+    image_format: Literal["png", "jpeg"] = "jpeg"
+    jpeg_quality: int = Field(default=75, ge=1, le=100)
+    ocr_mode: Literal["none", "region", "screen"] = "none"
+    language_hint: str | None = Field(default=None, min_length=1, max_length=64)
+    min_confidence: float = Field(default=0.4, ge=0.0, le=1.0)
+    max_ocr_area_px: int | None = Field(default=1_500_000, ge=1000)
+    group_lines: bool = True
+
+    @model_validator(mode="after")
+    def _validate_region(self):
+        if self.mode == "region":
+            required = [self.region_x, self.region_y, self.region_width, self.region_height]
+            if any(v is None for v in required):
+                raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
+        return self
+
+
+class ImageToolPoint(BaseModel):
+    x: int = Field(ge=0)
+    y: int = Field(ge=0)
+
+
+class LocalizeRequestV2(BaseModel):
+    observation_id: str = Field(min_length=1, max_length=128)
+    text_query: str | None = Field(default=None, max_length=512)
+    text_match: Literal["contains", "exact", "regex"] = "contains"
+    image_tool_point: ImageToolPoint | None = None
+    candidate_index: int = Field(default=0, ge=0)
+
+    @model_validator(mode="after")
+    def _validate_selector(self):
+        has_text = bool((self.text_query or "").strip())
+        has_point = self.image_tool_point is not None
+        if has_text == has_point:
+            raise ValueError("provide exactly one of text_query or image_tool_point")
+        return self
+
+
+class ActionTargetV2(BaseModel):
+    resolved_target_id: str | None = Field(default=None, max_length=128)
+    pixel_x: int | None = None
+    pixel_y: int | None = None
+
+    @model_validator(mode="after")
+    def _validate_shape(self):
+        has_resolved = bool(self.resolved_target_id)
+        has_pixel = self.pixel_x is not None or self.pixel_y is not None
+        if has_resolved == has_pixel:
+            raise ValueError("provide either resolved_target_id or pixel_x/pixel_y")
+        if has_pixel and (self.pixel_x is None or self.pixel_y is None):
+            raise ValueError("pixel_x and pixel_y are both required")
+        return self
+
+
+class ActionRequestV2(BaseModel):
+    action: Literal[
+        "move",
+        "click",
+        "right_click",
+        "double_click",
+        "middle_click",
+        "scroll",
+        "type",
+        "hotkey",
+    ]
+    target: ActionTargetV2 | None = None
+    duration_ms: int = Field(default=0, ge=0, le=20000)
+    button: Literal["left", "right", "middle"] = "left"
+    clicks: int = Field(default=1, ge=1, le=10)
+    scroll_amount: int = 0
+    text: str = ""
+    keys: list[str] = Field(default_factory=list)
+    interval_ms: int = Field(default=20, ge=0, le=5000)
+    dry_run: bool = False
+
+
+class ActRequestV2(BaseModel):
+    action: ActionRequestV2
+
+
+class ActVerifyRequestV2(BaseModel):
+    action: ActionRequestV2
+    condition: WaitTextCondition | WaitWindowCondition | WaitVisualCondition
+    risk_level: Literal["low", "high"] = "low"
+    retries: int | None = Field(default=None, ge=0, le=10)
+    timeout_ms: int | None = Field(default=None, ge=0, le=120000)
+    poll_interval_ms: int | None = Field(default=None, ge=50, le=10000)
+    retry_delay_ms: int | None = Field(default=None, ge=0, le=60000)
+    stop_on_action_error: bool = True
+
+
+OBSERVATIONS: dict[str, dict[str, Any]] = {}
+RESOLVED_TARGETS: dict[str, dict[str, Any]] = {}
+
+
+def _get_observation(observation_id: str) -> dict[str, Any]:
+    observation = OBSERVATIONS.get(observation_id)
+    if observation is None:
+        raise HTTPException(status_code=404, detail="observation_id not found")
+    return observation
+
+
+def _resolve_v2_action(req: ActionRequestV2) -> ActionRequest:
+    target: Target | None = None
+    if req.target is not None:
+        if req.target.resolved_target_id:
+            item = RESOLVED_TARGETS.get(req.target.resolved_target_id)
+            if item is None:
+                raise HTTPException(status_code=404, detail="resolved_target_id not found")
+            target = PixelTarget(mode="pixel", x=item["x"], y=item["y"], dx=0, dy=0)
+        else:
+            target = PixelTarget(mode="pixel", x=req.target.pixel_x or 0, y=req.target.pixel_y or 0, dx=0, dy=0)
+    return ActionRequest(
+        action=req.action,
+        target=target,
+        duration_ms=req.duration_ms,
+        button=req.button,
+        clicks=req.clicks,
+        scroll_amount=req.scroll_amount,
+        text=req.text,
+        keys=req.keys,
+        interval_ms=req.interval_ms,
+        dry_run=req.dry_run,
+    )
+
+
+def _risk_defaults(risk_level: str) -> dict[str, int]:
+    if risk_level == "high":
+        return {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300}
+    return {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150}
+
 
 def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
     token = SETTINGS["token"]
@@ -1377,154 +1566,225 @@ def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
     }
 
 
+def _localization_confidence(source: str, confidence: float | None = None) -> str:
+    if source == "image_tool_point":
+        return "high"
+    if source == "ocr" and confidence is not None:
+        if confidence >= 0.8:
+            return "high"
+        if confidence >= 0.55:
+            return "medium"
+    return "low"
+
+
+@app.post("/v2/observe")
+def observe_v2(req: ObserveRequestV2, screen: int = 0, _: None = Depends(_auth)):
+    capture_started = time.perf_counter()
+    image, region, mon, displays, screen_selection = _capture_region_image(
+        screen,
+        req.region_x if req.mode == "region" else None,
+        req.region_y if req.mode == "region" else None,
+        req.region_width if req.mode == "region" else None,
+        req.region_height if req.mode == "region" else None,
+    )
+    capture_ms = int((time.perf_counter() - capture_started) * 1000)
+
+    encoded = None
+    if req.include_image:
+        encoded = _encode_image(image, req.image_format, req.jpeg_quality)
+
+    ocr_started = time.perf_counter()
+    blocks: list[dict] = []
+    grouped_lines: list[dict] = []
+    ocr_applied_mode = "none"
+    if req.ocr_mode != "none":
+        if req.ocr_mode == "screen":
+            ocr_image, ocr_region, _, _, _ = _capture_region_image(screen, None, None, None, None)
+        else:
+            ocr_image, ocr_region = image, region
+
+        area = ocr_region["width"] * ocr_region["height"]
+        if req.max_ocr_area_px is not None and area > req.max_ocr_area_px:
+            raise HTTPException(
+                status_code=400,
+                detail=f"ocr area {area} exceeds max_ocr_area_px {req.max_ocr_area_px}",
+            )
+
+        blocks = _run_ocr(
+            ocr_image,
+            req.language_hint,
+            req.min_confidence,
+            ocr_region["x"],
+            ocr_region["y"],
+        )
+        if req.group_lines:
+            grouped_lines = _group_ocr_lines(blocks)
+        ocr_applied_mode = req.ocr_mode
+    ocr_ms = int((time.perf_counter() - ocr_started) * 1000)
+
+    observation_id = _request_id()
+    OBSERVATIONS[observation_id] = {
+        "id": observation_id,
+        "region": region,
+        "screen": screen_selection,
+        "display": mon,
+        "image_width": image.size[0],
+        "image_height": image.size[1],
+        "ocr_blocks": blocks,
+        "ocr_lines": grouped_lines,
+        "created_at_ms": _now_ms(),
+    }
+
+    return _ok(
+        {
+            "observation_id": observation_id,
+            "region": region,
+            "screen": screen_selection,
+            "display": mon,
+            "image": {
+                "included": req.include_image,
+                "format": req.image_format if req.include_image else None,
+                "base64": encoded,
+                "width": image.size[0],
+                "height": image.size[1],
+            },
+            "ocr": {
+                "mode": ocr_applied_mode,
+                "min_confidence": req.min_confidence,
+                "language_hint": req.language_hint,
+                "block_count": len(blocks),
+                "line_count": len(grouped_lines),
+                "blocks": blocks,
+                "lines": grouped_lines,
+            },
+            "timing_ms": {
+                "capture_ms": capture_ms,
+                "ocr_ms": ocr_ms if req.ocr_mode != "none" else 0,
+                "total_ms": capture_ms + (ocr_ms if req.ocr_mode != "none" else 0),
+            },
+        }
+    )
+
+
+@app.post("/v2/localize")
+def localize_v2(req: LocalizeRequestV2, _: None = Depends(_auth)):
+    observation = _get_observation(req.observation_id)
+    region = observation["region"]
+    image_width = observation["image_width"]
+    image_height = observation["image_height"]
+
+    if req.image_tool_point is not None:
+        if not (0 <= req.image_tool_point.x < image_width and 0 <= req.image_tool_point.y < image_height):
+            raise HTTPException(status_code=400, detail="image_tool_point outside observation image bounds")
+        x = region["x"] + req.image_tool_point.x
+        y = region["y"] + req.image_tool_point.y
+        _enforce_allowed_region(x, y)
+        resolved_target_id = _request_id()
+        RESOLVED_TARGETS[resolved_target_id] = {
+            "id": resolved_target_id,
+            "observation_id": req.observation_id,
+            "x": x,
+            "y": y,
+            "source": "image_tool_point",
+        }
+        return _ok(
+            {
+                "resolved_target_id": resolved_target_id,
+                "source": "image_tool_point",
+                "localization_confidence": _localization_confidence("image_tool_point"),
+                "pixel": {"x": x, "y": y},
+                "observation_region": region,
+                "image_bounds": {"width": image_width, "height": image_height},
+            }
+        )
+
+    lines = observation.get("ocr_lines") or _group_ocr_lines(observation.get("ocr_blocks", []))
+    matches = _find_text_matches(lines, req.text_query or "", req.text_match, False, 200)
+    if not matches:
+        return _err("not_found", "no localization candidates found", 404, {"found": False, "matches": []})
+    if req.candidate_index >= len(matches):
+        raise HTTPException(status_code=400, detail="candidate_index is outside match results")
+
+    chosen = matches[req.candidate_index]
+    bbox = chosen["bbox"]
+    x = bbox["x"] + max(1, bbox["width"] // 2)
+    y = bbox["y"] + max(1, bbox["height"] // 2)
+    _enforce_allowed_region(x, y)
+    resolved_target_id = _request_id()
+    RESOLVED_TARGETS[resolved_target_id] = {
+        "id": resolved_target_id,
+        "observation_id": req.observation_id,
+        "x": x,
+        "y": y,
+        "source": "ocr",
+        "match": chosen,
+    }
+
+    return _ok(
+        {
+            "resolved_target_id": resolved_target_id,
+            "source": "ocr",
+            "localization_confidence": _localization_confidence("ocr", chosen.get("confidence")),
+            "pixel": {"x": x, "y": y},
+            "selected_match": chosen,
+            "match_count": len(matches),
+        }
+    )
+
+
+@app.post("/v2/act")
+def act_v2(req: ActRequestV2, screen: int = 0, _: None = Depends(_auth)):
+    legacy_action = _resolve_v2_action(req.action)
+    result = _exec_action(legacy_action, screen)
+    return _ok(result)
+
+
+@app.post("/v2/act-verify")
+def act_verify_v2(req: ActVerifyRequestV2, screen: int = 0, _: None = Depends(_auth)):
+    defaults = _risk_defaults(req.risk_level)
+    verify_req = VerifyActionRequest(
+        action=_resolve_v2_action(req.action),
+        condition=req.condition,
+        retries=defaults["retries"] if req.retries is None else req.retries,
+        timeout_ms=defaults["timeout_ms"] if req.timeout_ms is None else req.timeout_ms,
+        poll_interval_ms=defaults["poll_interval_ms"] if req.poll_interval_ms is None else req.poll_interval_ms,
+        retry_delay_ms=defaults["retry_delay_ms"] if req.retry_delay_ms is None else req.retry_delay_ms,
+        stop_on_action_error=req.stop_on_action_error,
+    )
+    result = _run_verified_action(verify_req, screen)
+    payload = {
+        "risk_level": req.risk_level,
+        "defaults_applied": defaults,
+        **result,
+    }
+    if result.get("success", False):
+        return _ok(payload)
+    return _err("verification_failed", "action verification did not satisfy condition", 409, payload)
+
+
 @app.get("/health")
 def health(_: None = Depends(_auth)):
-    return {
-        "ok": True,
-        "service": "clickthrough",
-        "version": app.version,
-        "time_ms": _now_ms(),
-        "request_id": _request_id(),
-        "dry_run": SETTINGS["dry_run"],
-        "allowed_region": SETTINGS["allowed_region"],
-        "exec": {
-            "enabled": SETTINGS["exec_enabled"],
-            "secret_configured": bool(SETTINGS["exec_secret"]),
-            "default_shell": SETTINGS["exec_default_shell"],
-            "default_timeout_s": SETTINGS["exec_default_timeout_s"],
-            "max_timeout_s": SETTINGS["exec_max_timeout_s"],
-        },
-    }
+    return _ok(
+        {
+            "service": "clickthrough",
+            "version": app.version,
+            "dry_run": SETTINGS["dry_run"],
+            "allowed_region": SETTINGS["allowed_region"],
+            "exec": {
+                "enabled": SETTINGS["exec_enabled"],
+                "secret_configured": bool(SETTINGS["exec_secret"]),
+                "default_shell": SETTINGS["exec_default_shell"],
+                "default_timeout_s": SETTINGS["exec_default_timeout_s"],
+                "max_timeout_s": SETTINGS["exec_max_timeout_s"],
+            },
+        }
+    )
 
 
 @app.get("/displays")
 def displays(_: None = Depends(_auth)):
     detected = _get_displays()
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "displays": detected,
-        "default_screen": 0,
-    }
-
-
-@app.get("/screen")
-def screen(
-    with_grid: bool = True,
-    grid_rows: int = SETTINGS["default_grid_rows"],
-    grid_cols: int = SETTINGS["default_grid_cols"],
-    include_labels: bool = True,
-    image_format: Literal["png", "jpeg"] = "png",
-    jpeg_quality: int = 85,
-    asImage: bool = False,
-    screen: int = 0,
-    _: None = Depends(_auth),
-):
-    req = ScreenRequest(
-        with_grid=with_grid,
-        grid_rows=grid_rows,
-        grid_cols=grid_cols,
-        include_labels=include_labels,
-        image_format=image_format,
-        jpeg_quality=jpeg_quality,
-    )
-
-    base_img, mon, displays, screen_selection = _capture_screen(screen)
-    meta = {"region": mon, "screen": screen_selection, "displays": displays}
-    out_img = base_img
-
-    if req.with_grid:
-        out_img, grid_meta = _draw_grid(base_img, mon["x"], mon["y"], req.grid_rows, req.grid_cols, req.include_labels)
-        meta.update(grid_meta)
-
-    if asImage:
-        image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality)
-        media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png"
-        return Response(content=image_bytes, media_type=media_type)
-
-    encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "image": {
-            "format": req.image_format,
-            "base64": encoded,
-            "width": out_img.size[0],
-            "height": out_img.size[1],
-        },
-        "meta": meta,
-    }
-
-
-@app.post("/zoom")
-def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
-    base_img, mon, displays, screen_selection = _capture_screen(screen)
-
-    cx = req.center_x - mon["x"]
-    cy = req.center_y - mon["y"]
-
-    half_w = req.width // 2
-    half_h = req.height // 2
-
-    left = max(0, cx - half_w)
-    top = max(0, cy - half_h)
-    right = min(base_img.size[0], left + req.width)
-    bottom = min(base_img.size[1], top + req.height)
-
-    crop = base_img.crop((left, top, right, bottom))
-
-    region_x = mon["x"] + left
-    region_y = mon["y"] + top
-
-    meta = {
-        "source_monitor": mon,
-        "screen": screen_selection,
-        "displays": displays,
-        "region": {
-            "x": region_x,
-            "y": region_y,
-            "width": crop.size[0],
-            "height": crop.size[1],
-        },
-    }
-
-    out_img = crop
-    if req.with_grid:
-        out_img, grid_meta = _draw_grid(crop, region_x, region_y, req.grid_rows, req.grid_cols, req.include_labels)
-        meta.update(grid_meta)
-
-    if asImage:
-        image_bytes = _serialize_image(out_img, req.image_format, req.jpeg_quality)
-        media_type = "image/jpeg" if req.image_format == "jpeg" else "image/png"
-        return Response(content=image_bytes, media_type=media_type)
-
-    encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
-
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "image": {
-            "format": req.image_format,
-            "base64": encoded,
-            "width": out_img.size[0],
-            "height": out_img.size[1],
-        },
-        "meta": meta,
-    }
-
-
-@app.post("/action")
-def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
-    result = _exec_action(req, screen)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
+    return _ok({"displays": detected, "default_screen": 0})
 
 
 @app.post("/exec")
@@ -1540,12 +1800,7 @@ def exec_command(
         raise HTTPException(status_code=401, detail="invalid exec secret")
 
     result = _exec_command(req)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
+    return _ok(result)
 
 
 @app.get("/windows")
@@ -1565,151 +1820,19 @@ def windows(
         visible_only=visible_only,
     )
     matches = _list_windows(query)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "windows": matches,
-        "count": len(matches),
-    }
+    return _ok({"windows": matches, "count": len(matches)})
 
 
 @app.post("/windows/action")
 def window_action(req: WindowActionRequest, _: None = Depends(_auth)):
     result = _apply_window_action(req)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
+    return _ok(result)
 
 
 @app.post("/launch")
 def launch(req: LaunchRequest, _: None = Depends(_auth)):
     result = _launch_app(req)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
-
-
-@app.post("/wait")
-def wait(req: WaitRequest, screen: int = 0, _: None = Depends(_auth)):
-    result = _wait_for_condition(req, screen)
-    return {
-        "ok": result.get("satisfied", False),
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
-
-
-@app.post("/vision/diff")
-def vision_diff(req: VisionDiffRequest, screen: int = 0, _: None = Depends(_auth)):
-    result = _compute_visual_diff(req, screen)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
-
-
-@app.post("/vision/stability")
-def vision_stability(req: VisionStabilityRequest, screen: int = 0, _: None = Depends(_auth)):
-    result = _measure_stability(req, screen)
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
-
-
-@app.post("/action/verify")
-def action_verify(req: VerifyActionRequest, screen: int = 0, _: None = Depends(_auth)):
-    result = _run_verified_action(req, screen)
-    return {
-        "ok": result.get("success", False),
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": result,
-    }
-
-
-@app.post("/ocr")
-def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
-    image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen)
-    offset_x = region["x"] if source != "image" else 0
-    offset_y = region["y"] if source != "image" else 0
-    blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
-
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": {
-            "mode": source,
-            "screen": screen_selection if source != "image" else None,
-            "display": mon if source != "image" else None,
-            "language_hint": req.language_hint,
-            "min_confidence": req.min_confidence,
-            "region": region,
-            "blocks": blocks,
-        },
-    }
-
-
-@app.post("/ocr/find")
-def ocr_find(req: OCRFindRequest, screen: int = 0, _: None = Depends(_auth)):
-    image, region, mon, displays, screen_selection, source = _capture_ocr_source(req, screen)
-    offset_x = region["x"] if source != "image" else 0
-    offset_y = region["y"] if source != "image" else 0
-    blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
-    matches = _find_text_matches(blocks, req.query, req.match, req.group_lines, req.max_results)
-
-    return {
-        "ok": True,
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "result": {
-            "mode": source,
-            "screen": screen_selection if source != "image" else None,
-            "display": mon if source != "image" else None,
-            "language_hint": req.language_hint,
-            "min_confidence": req.min_confidence,
-            "query": req.query,
-            "match": req.match,
-            "group_lines": req.group_lines,
-            "region": region,
-            "matches": matches,
-            "match_count": len(matches),
-            "blocks_considered": len(blocks),
-        },
-    }
-
-
-@app.post("/batch")
-def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
-    results = []
-    for index, item in enumerate(req.actions):
-        try:
-            item_result = _exec_action(item, screen)
-            results.append({"index": index, "ok": True, "result": item_result})
-        except Exception as exc:
-            results.append({"index": index, "ok": False, "error": str(exc)})
-            if req.stop_on_error:
-                break
-
-    return {
-        "ok": all(r["ok"] for r in results),
-        "request_id": _request_id(),
-        "time_ms": _now_ms(),
-        "results": results,
-    }
+    return _ok(result)
 
 
 if __name__ == "__main__":
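
The coordinate mapping in `localize_v2` and the default merging in `act_verify_v2` can be sketched as standalone helpers for a client or a unit test. This is a minimal sketch, not code from `server/app.py`: the names `image_point_to_global`, `bbox_center`, `RISK_DEFAULTS`, and `effective_verify_settings` are hypothetical. It assumes the observation image maps 1:1 to region pixels, as the handler does when it adds the point to `region["x"]` and `region["y"]`. The lower-bound check is an addition of this sketch.

```python
from typing import Optional

# Hypothetical helpers mirroring the v2 handlers; not part of server/app.py.


def image_point_to_global(
    region: dict, image_width: int, image_height: int, x: int, y: int
) -> tuple[int, int]:
    """Map an image-local point to global desktop coordinates.

    Assumes one image pixel equals one desktop pixel inside `region`.
    """
    if not (0 <= x < image_width and 0 <= y < image_height):
        raise ValueError("point outside observation image bounds")
    return region["x"] + x, region["y"] + y


def bbox_center(bbox: dict) -> tuple[int, int]:
    """Center of an OCR bounding box, offset by at least one pixel per axis."""
    return (
        bbox["x"] + max(1, bbox["width"] // 2),
        bbox["y"] + max(1, bbox["height"] // 2),
    )


# Values copied from _risk_defaults in this patch.
RISK_DEFAULTS = {
    "high": {"retries": 1, "timeout_ms": 6000, "poll_interval_ms": 250, "retry_delay_ms": 300},
    "low": {"retries": 0, "timeout_ms": 2500, "poll_interval_ms": 200, "retry_delay_ms": 150},
}


def effective_verify_settings(risk_level: str, **overrides: Optional[int]) -> dict[str, int]:
    """Apply risk-level defaults, letting any non-None override win."""
    base = RISK_DEFAULTS["high"] if risk_level == "high" else RISK_DEFAULTS["low"]
    return {
        key: base[key] if overrides.get(key) is None else overrides[key]
        for key in base
    }


if __name__ == "__main__":
    region = {"x": 820, "y": 420, "width": 700, "height": 420}
    print(image_point_to_global(region, 700, 420, 312, 188))
    print(bbox_center({"x": 10, "y": 20, "width": 40, "height": 10}))
    print(effective_verify_settings("high", timeout_ms=8000))
```

Any `risk_level` other than `"high"` falls back to the low-risk values, matching `_risk_defaults`. An explicit `0` override is kept because only `None` triggers the default.
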
diff --git a/skill/SKILL.md b/skill/SKILL.md
index cc53f72..334befa 100644
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,381 +1,97 @@
 ---
 name: clickthrough-http-control
-description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
+description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
 ---
 
-# Clickthrough HTTP Control
+# Clickthrough HTTP Control (v2)
 
-Use a strict observe-decide-act-verify loop.
+Agents do not see live desktop video. They operate on snapshots.
+Use this loop: **observe -> localize -> act -> verify**.
 
-## Getting a computer instance (user-owned setup)
+## Fast defaults
 
-The **user/operator** is responsible for provisioning and exposing the target machine.
-The agent should not assume it can self-install this stack.
+- Start with `POST /v2/observe` on a tight region, not full screen.
+- Set `ocr_mode` to `none` unless text is required immediately.
+- Use `image` tool localization for icon-heavy or dense controls.
+- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
 
-### What the user must do
+## Mandatory image-tool click localization
 
-1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
-2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
-3. Configure secrets on target machine:
-   - `CLICKTHROUGH_TOKEN` for general API auth
-   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
-4. Share connection details with the agent through a secure channel:
-   - `base_url`
-   - `x-clickthrough-token`
-   - `x-clickthrough-exec-secret` (only when `/exec` is needed)
+When OCR is weak or ambiguous, ask the image tool for a single coordinate within the image bounds.
 
-### What the agent should do
-
-1. Validate connection with `GET /health` using provided headers.
-2. Refuse `/exec` attempts when exec secret is missing/invalid.
-3. Ask user for missing setup inputs instead of guessing infrastructure.
-
-## What the agent can actually see
-
-The agent does **not** inherently see the remote desktop.
-Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
-
-That means:
-- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
-- `POST /ocr` returns machine-readable text blocks when text extraction is enough
-- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
-- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
-
-Do not write or think as if the agent is directly watching the screen in real time.
-Say what you actually have: screenshots, OCR output, and fresh verification captures.
-
-## Mini API map
-
-- `GET /health` → server status + safety flags
-- `GET /displays` → detected displays in zero-based API order
-- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
-- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
-- `GET /windows` → discover visible desktop windows and their handles/processes
-- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
-- `POST /launch` → start an app/process without dropping to a shell
-- `POST /wait?screen=0` → wait for text, window, or visual state changes
-- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
-- `POST /vision/stability?screen=0` → measure short-interval visual stability
-- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
-- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
-- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
-- `POST /action/verify?screen=0` → execute one action plus structured success verification
-- `POST /batch?screen=0` → sequential action list
-- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
-
-### Display selection
-
-- Use `GET /displays` before operating on multi-monitor systems.
-- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
-- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
-- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
-- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
-- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
-
-### OCR usage
-
-- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
-- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
-- Use `language_hint` when known (for example `eng`) to improve consistency.
-- Filter noise with `min_confidence` (start around `0.4` and tune per app).
-- Treat OCR as one signal, not the only signal, before high-impact clicks.
-- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
-- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
-
-### Screenshot + `image` tool usage
-
-Use the OpenClaw `image` tool when OCR is not enough.
-This is especially useful for:
-- identifying which visible button looks like the primary confirm action
-- understanding dialog layout or pane structure
-- distinguishing similar nearby controls by icon, spacing, or emphasis
-- checking whether a visual state changed after a click
-- telling you where something is and where to click when text alone is not reliable
-
-Good pattern:
-1. capture with `GET /screen` or `POST /zoom`
-2. hand that screenshot to the `image` tool
-3. ask a precise question about the visible UI
-4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
-5. convert the answer into a concrete Clickthrough target
-6. act once
-7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
-
-Prefer vision over guessing.
-If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
-The model should help answer things like:
-- which visible button is the real primary action
-- whether the target is left/right/top/bottom within the crop
-- which of several similar buttons is the one to click
-- an approximate click point inside the provided image bounds
-
-Ask narrow questions.
-Good:
-- "Which button in this dialog is the primary confirmation action?"
-- "Is the scan still running, or does this look complete?"
-- "Which of these tabs appears selected?"
-- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
-- "Which visible control says Stop Recording, and where should I click?"
-
-Bad:
-- "What should I click?"
-- "Use your eyes and do the task"
-- anything that assumes the model has live continuity without a new screenshot
-- requesting coordinates without telling the model the image bounds or expected output format
-
-### Header requirements
-
-- Always send `x-clickthrough-token` when token auth is enabled.
-- For `/exec`, also send `x-clickthrough-exec-secret`.
-
-## `POST /action` request shape (important)
-
-`/action` always expects an `action` plus an optional `target` object.
-Do **not** invent top-level `x` / `y` fields.
-
-Minimal pixel click:
-
-```json
-{
-  "action": "click",
-  "target": {"mode": "pixel", "x": 100, "y": 200},
-  "button": "left",
-  "clicks": 1
-}
-```
-
-Minimal grid click:
-
-```json
-{
-  "action": "click",
-  "target": {
-    "mode": "grid",
-    "region_x": 0,
-    "region_y": 0,
-    "region_width": 1920,
-    "region_height": 1080,
-    "rows": 12,
-    "cols": 12,
-    "row": 6,
-    "col": 8,
-    "dx": 0.0,
-    "dy": 0.0
-  }
-}
-```
-
-Other canonical examples:
-
-```json
-{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
-{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
-{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
-{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
-{"action": "type", "text": "hello world", "interval_ms": 20}
-{"action": "hotkey", "keys": ["ctrl", "l"]}
-```
+Prompt template:
+- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
 
 Rules:
-- `dx` / `dy` belong inside `target`, not beside it.
-- `type` and `hotkey` usually do not need a `target`.
-- For pixel targets, `x` / `y` are global desktop coordinates.
-- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
+- Ask for one point only.
+- Include the image bounds (`width`, `height`) in the prompt.
+- If the answer does not parse as `{"x":<int>,"y":<int>}`, re-ask once with a stricter format.
+- Send the returned point to `POST /v2/localize` via `image_tool_point`.
 
-## When to use `/exec`
+## API playbook
 
-Prefer structured GUI control first:
-- `/screen`, `/zoom`, `/ocr` to observe
-- `/action` or `/batch` to interact
+1. **Observe**
 
-Use `/exec` only when it is the cleanest available tool for the job, for example:
-- querying machine state that the GUI does not expose well
-- performing an explicit user-requested shell/system task
-- recovering from a blocked GUI flow when normal interaction failed
+```json
+POST /v2/observe?screen=0
+{
+  "mode": "region",
+  "region_x": 820,
+  "region_y": 420,
+  "region_width": 700,
+  "region_height": 420,
+  "include_image": true,
+  "ocr_mode": "none"
+}
+```
 
-Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
-Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
-When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
+2. **Localize** (choose one)
 
-## Core workflow (mandatory)
+Text:
+```json
+POST /v2/localize
+{"observation_id":"...","text_query":"Save","text_match":"exact"}
+```
 
-1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
-2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
-3. Identify likely target region and compute an initial confidence score.
-4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
-5. **Before any click**, verify target identity (OCR text/icon/location consistency).
-6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
-7. Execute one minimal action via `POST /action`.
-8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
-9. Repeat until objective is complete.
+Image-tool point:
+```json
+POST /v2/localize
+{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
+```
 
-## Verify-before-click rules
+3. **Act**
 
-- Never click if target identity is ambiguous.
-- Require at least two matching signals before click.
-- Good signal pairs include:
-  - OCR text + expected UI region
-  - OCR text + matching button shape/icon nearby
-  - dialog title text + expected button position within that dialog
-  - known app/window focus + expected control location
-  - OCR candidate + vision-model localization inside the same crop
-- If confidence is low, do not "test click"; zoom and re-localize first.
-- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
-- For high-impact actions (close/delete/send/purchase), use two-phase flow:
-  1) preview intended coordinate + reason
-  2) execute only after explicit confirmation.
+```json
+POST /v2/act?screen=0
+{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
+```
 
-## Precision rules
+4. **Verify**
 
-- Prefer grid targets first, then use `dx/dy` for subcell precision.
-- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
-- Use zoom before guessing offsets.
-- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
+```json
+POST /v2/act-verify?screen=0
+{
+  "action":{"action":"click","target":{"resolved_target_id":"..."}},
+  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
+  "risk_level":"low"
+}
+```
 
-## Safety rules
+## Risk policy
 
-- Respect `dry_run` and `allowed_region` restrictions from `/health`.
-- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
-- Avoid destructive shortcuts unless explicitly requested.
-- Send one action at a time unless deterministic; then use `/batch`.
+- Low risk (navigation, focus, benign clicks): single verification signal.
+- High risk (delete/send/purchase/close-lossy): use `risk_level=high` (server defaults: 1 retry, 6000 ms timeout) and require two checks before acting.
+- Never do speculative repeat clicks; switch strategy after one failed verify.
 
-## Reliability rules
+## Anti-latency rules
 
-- After every meaningful action, verify with a fresh screenshot.
-- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
-- Prefer short, reversible actions over long macros.
-- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
+- Never repeat full-screen OCR by default.
+- Re-observe only the active pane/region.
+- Prefer keyboard + window APIs for app switching.
+- Run OCR on the observed region only and cap its area with `max_ocr_area_px`; the server returns HTTP 400 when the cap is exceeded.
 
-## Fallback ladder for uncertain targeting
+## Setup and auth
 
-1. Full-screen capture with a coarse grid.
-2. Zoom into the candidate area with a denser grid.
-3. OCR the full screen or the tighter region.
-4. Re-anchor on a more reliable nearby control, title, or label.
-5. Try a keyboard-first flow if the app supports it.
-6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
-
-Do not skip from "uncertain click" straight to random retries.
-
-## Concrete screenshot -> `image` -> action example
-
-Example loop:
-1. `GET /screen?screen=0` to capture the current app state
-2. if the UI is text-heavy, try `POST /ocr` first
-3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
-   - "In this save dialog, which visible button is the primary action?"
-   - "Is there a dismiss/close button in the top-right of this modal?"
-4. map the answer back to a Clickthrough target using the returned grid/region metadata
-5. click once with `POST /action`
-6. recapture the screen
-7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
-
-The key rule is simple: screenshot first, interpret second, click third, verify fourth.
-Do not collapse those steps into fake certainty.
-When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
-
-## App-specific playbooks (recommended)
-
-Build per-app routines for repetitive tasks instead of generic clicking.
-
-### Launcher / search / start app playbook
-
-Use this when the goal is "open app X" or "bring up tool Y".
-
-1. check `GET /windows` first in case the app is already open
-2. if present, use `POST /windows/action` to focus or restore it
-3. if absent, prefer `POST /launch` when you know the executable path
-4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
-   - open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
-   - type exact app name
-   - wait for stable results with `POST /wait` or recapture
-   - verify the result text with OCR or the `image` tool
-   - press Enter or click the exact result once
-5. verify the app window now exists or is focused
-
-Do not keep relaunching if the window already exists; that’s sloppy.
-
-### Dialog confirmation playbook
-
-Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
-
-1. capture the dialog region with `POST /zoom`
-2. use OCR first for title/body/button labels
-3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
-4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
-5. for destructive actions, require explicit user confirmation unless already requested
-6. click once and verify the dialog disappeared or changed state
-
-Good verification targets:
-- dialog title vanished
-- expected next window appeared
-- destructive side effect is visible and confirmed
-
-### File picker playbook
-
-Use for open/save dialogs.
-
-1. verify the file picker window is focused
-2. OCR the visible breadcrumb/path area, filename field, and button row
-3. prefer keyboard-first entry when possible:
-   - type or paste the target path/name into the focused field
-   - use `tab` / `shift+tab` to move predictably between filename and action buttons
-4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
-5. verify the intended filename/path is visible before confirming
-6. activate `Open` / `Save` once and verify the picker closes
-
-If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
-
-### Browser tab / window playbook
-
-Use for browser navigation, tab targeting, or web app recovery.
-
-1. use `GET /windows` to focus the correct browser window first
-2. prefer keyboard-first navigation:
-   - `ctrl+l` / `cmd+l` to focus the address bar
-   - `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
-   - `ctrl+w` only for explicitly requested close actions
-3. verify tab or page identity with OCR on the tab strip or page heading
-4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
-5. after navigation, wait for visual stability or expected text before taking the next action
-6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
-
-Do not assume a page loaded just because the click landed. Verify it.
-
-### Settings / preferences navigation playbook
-
-Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
-
-1. identify the current settings page with OCR on the heading/sidebar
-2. use OCR to find the specific section label before trying to toggle anything
-3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
-4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
-5. after each change, verify the control state changed visually or via visible text
-6. if a save/apply button exists, treat it as a separate confirmation step and verify completion
-
-Settings UIs love hiding side effects. Assume nothing.
-
-### Dense app / control-strip playbook
-
-Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
-
-1. focus the exact app window with `POST /windows/action`
-2. capture the full target display once to confirm the window is actually frontmost
-3. crop tightly around the suspected control strip with `POST /zoom`
-4. run OCR on the crop, not the full screen
-5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
-6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
-
-Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
-
-### Spotify playbook
-
-- Focus app window before search/navigation.
-- Prefer keyboard-first flow for song start:
-  1) `Ctrl+L` (search)
-  2) type exact query
-  3) Enter
-  4) verify exact song+artist text
-  5) click/double-click row
-  6) verify now-playing bar
-- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
+- Include the `x-clickthrough-token` header when token auth is enabled.
+- `/exec` additionally requires the `x-clickthrough-exec-secret` header.
+- Validate the server first with `GET /health`.