feat: migrate to v2-only API and unified response envelope
All checks were successful
python-syntax / syntax-check (push) Successful in 7s

This commit is contained in:
2026-05-03 19:11:11 +02:00
parent 2585bc3a7c
commit aced5be25e
5 changed files with 603 additions and 1267 deletions

View File

@@ -1,381 +1,97 @@
---
name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
---
# Clickthrough HTTP Control
# Clickthrough HTTP Control (v2)
Use a strict observe-decide-act-verify loop.
Agents do not see live desktop video. They operate on snapshots.
Use this loop: **observe -> localize -> act -> verify**.
## Getting a computer instance (user-owned setup)
## Fast defaults
The **user/operator** is responsible for provisioning and exposing the target machine.
The agent should not assume it can self-install this stack.
- Start with `POST /v2/observe` on a tight region, not full screen.
- Set `ocr_mode` to `none` unless text is required immediately.
- Use `image` tool localization for icon-heavy or dense controls.
- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
### What the user must do
## Mandatory image-tool click localization
1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
3. Configure secrets on target machine:
- `CLICKTHROUGH_TOKEN` for general API auth
- `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
4. Share connection details with the agent through a secure channel:
- `base_url`
- `x-clickthrough-token`
- `x-clickthrough-exec-secret` (only when `/exec` is needed)
When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
### What the agent should do
1. Validate connection with `GET /health` using provided headers.
2. Refuse `/exec` attempts when exec secret is missing/invalid.
3. Ask user for missing setup inputs instead of guessing infrastructure.
## What the agent can actually see
The agent does **not** inherently see the remote desktop.
Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
That means:
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
Do not write or think as if the agent is directly watching the screen in real time.
Say what you actually have: screenshots, OCR output, and fresh verification captures.
## Mini API map
- `GET /health` → server status + safety flags
- `GET /displays` → detected displays in zero-based API order
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
- `GET /windows` → discover visible desktop windows and their handles/processes
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
- `POST /launch` → start an app/process without dropping to a shell
- `POST /wait?screen=0` → wait for text, window, or visual state changes
- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
- `POST /vision/stability?screen=0` → measure short-interval visual stability
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /action/verify?screen=0` → execute one action plus structured success verification
- `POST /batch?screen=0` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
### Display selection
- Use `GET /displays` before operating on multi-monitor systems.
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
### OCR usage
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
### Screenshot + `image` tool usage
Use the OpenClaw `image` tool when OCR is not enough.
This is especially useful for:
- identifying which visible button looks like the primary confirm action
- understanding dialog layout or pane structure
- distinguishing similar nearby controls by icon, spacing, or emphasis
- checking whether a visual state changed after a click
- telling you where something is and where to click when text alone is not reliable
Good pattern:
1. capture with `GET /screen` or `POST /zoom`
2. hand that screenshot to the `image` tool
3. ask a precise question about the visible UI
4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
5. convert the answer into a concrete Clickthrough target
6. act once
7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
Prefer vision over guessing.
If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
The model should help answer things like:
- which visible button is the real primary action
- whether the target is left/right/top/bottom within the crop
- which of several similar buttons is the one to click
- an approximate click point inside the provided image bounds
Ask narrow questions.
Good:
- "Which button in this dialog is the primary confirmation action?"
- "Is the scan still running, or does this look complete?"
- "Which of these tabs appears selected?"
- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
- "Which visible control says Stop Recording, and where should I click?"
Bad:
- "What should I click?"
- "Use your eyes and do the task"
- anything that assumes the model has live continuity without a new screenshot
- requesting coordinates without telling the model the image bounds or expected output format
### Header requirements
- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.
## `POST /action` request shape (important)
`/action` always expects an `action` plus an optional `target` object.
Do **not** invent top-level `x` / `y` fields.
Minimal pixel click:
```json
{
"action": "click",
"target": {"mode": "pixel", "x": 100, "y": 200},
"button": "left",
"clicks": 1
}
```
Minimal grid click:
```json
{
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 6,
"col": 8,
"dx": 0.0,
"dy": 0.0
}
}
```
Other canonical examples:
```json
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}
```
Prompt template:
- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
Rules:
- `dx` / `dy` belong inside `target`, not beside it.
- `type` and `hotkey` usually do not need a `target`.
- For pixel targets, `x` / `y` are global desktop coordinates.
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
- Ask for one point only.
- Include bounds in the prompt.
- If answer is not parseable `x,y`, re-ask once with stricter format.
- Send returned point to `POST /v2/localize` via `image_tool_point`.
## When to use `/exec`
## API playbook
Prefer structured GUI control first:
- `/screen`, `/zoom`, `/ocr` to observe
- `/action` or `/batch` to interact
1. **Observe**
Use `/exec` only when it is the cleanest available tool for the job, for example:
- querying machine state that the GUI does not expose well
- performing an explicit user-requested shell/system task
- recovering from a blocked GUI flow when normal interaction failed
```json
POST /v2/observe?screen=0
{
"mode": "region",
"region_x": 820,
"region_y": 420,
"region_width": 700,
"region_height": 420,
"include_image": true,
"ocr_mode": "none"
}
```
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
2. **Localize** (choose one)
## Core workflow (mandatory)
Text:
```json
POST /v2/localize
{"observation_id":"...","text_query":"Save","text_match":"exact"}
```
1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
3. Identify likely target region and compute an initial confidence score.
4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
5. **Before any click**, verify target identity (OCR text/icon/location consistency).
6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
7. Execute one minimal action via `POST /action`.
8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
9. Repeat until objective is complete.
Image-tool point:
```json
POST /v2/localize
{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
```
## Verify-before-click rules
3. **Act**
- Never click if target identity is ambiguous.
- Require at least two matching signals before click.
- Good signal pairs include:
- OCR text + expected UI region
- OCR text + matching button shape/icon nearby
- dialog title text + expected button position within that dialog
- known app/window focus + expected control location
- OCR candidate + vision-model localization inside the same crop
- If confidence is low, do not "test click"; zoom and re-localize first.
- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
1) preview intended coordinate + reason
2) execute only after explicit confirmation.
```json
POST /v2/act?screen=0
{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
```
## Precision rules
4. **Verify**
- Prefer grid targets first, then use `dx/dy` for subcell precision.
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
```json
POST /v2/act-verify?screen=0
{
"action":{"action":"click","target":{"resolved_target_id":"..."}},
"condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
"risk_level":"low"
}
```
## Safety rules
## Risk policy
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use `/batch`.
- Low risk (navigation, focus, benign clicks): single verification signal.
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
- Never do speculative repeat clicks; switch strategy after one failed verify.
## Reliability rules
## Anti-latency rules
- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
- Never repeat full-screen OCR by default.
- Re-observe only the active pane/region.
- Prefer keyboard + window APIs for app switching.
- Use OCR on region only and cap area with `max_ocr_area_px`.
## Fallback ladder for uncertain targeting
## Setup and auth
1. Full-screen capture with a coarse grid.
2. Zoom into the candidate area with a denser grid.
3. OCR the full screen or the tighter region.
4. Re-anchor on a more reliable nearby control, title, or label.
5. Try a keyboard-first flow if the app supports it.
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
Do not skip from "uncertain click" straight to random retries.
## Concrete screenshot -> `image` -> action example
Example loop:
1. `GET /screen?screen=0` to capture the current app state
2. if the UI is text-heavy, try `POST /ocr` first
3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
- "In this save dialog, which visible button is the primary action?"
- "Is there a dismiss/close button in the top-right of this modal?"
4. map the answer back to a Clickthrough target using the returned grid/region metadata
5. click once with `POST /action`
6. recapture the screen
7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
The key rule is simple: screenshot first, interpret second, click third, verify fourth.
Do not collapse those steps into fake certainty.
When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
## App-specific playbooks (recommended)
Build per-app routines for repetitive tasks instead of generic clicking.
### Launcher / search / start app playbook
Use this when the goal is "open app X" or "bring up tool Y".
1. check `GET /windows` first in case the app is already open
2. if present, use `POST /windows/action` to focus or restore it
3. if absent, prefer `POST /launch` when you know the executable path
4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
- open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
- type exact app name
- wait for stable results with `POST /wait` or recapture
- verify the result text with OCR or the `image` tool
- press Enter or click the exact result once
5. verify the app window now exists or is focused
Do not keep relaunching if the window already exists; thats sloppy.
### Dialog confirmation playbook
Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
1. capture the dialog region with `POST /zoom`
2. use OCR first for title/body/button labels
3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
5. for destructive actions, require explicit user confirmation unless already requested
6. click once and verify the dialog disappeared or changed state
Good verification targets:
- dialog title vanished
- expected next window appeared
- destructive side effect is visible and confirmed
### File picker playbook
Use for open/save dialogs.
1. verify the file picker window is focused
2. OCR the visible breadcrumb/path area, filename field, and button row
3. prefer keyboard-first entry when possible:
- type or paste the target path/name into the focused field
- use `tab` / `shift+tab` to move predictably between filename and action buttons
4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
5. verify the intended filename/path is visible before confirming
6. activate `Open` / `Save` once and verify the picker closes
If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
### Browser tab / window playbook
Use for browser navigation, tab targeting, or web app recovery.
1. use `GET /windows` to focus the correct browser window first
2. prefer keyboard-first navigation:
- `ctrl+l` / `cmd+l` to focus the address bar
- `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
- `ctrl+w` only for explicitly requested close actions
3. verify tab or page identity with OCR on the tab strip or page heading
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
5. after navigation, wait for visual stability or expected text before taking the next action
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
Do not assume a page loaded just because the click landed. Verify it.
### Settings / preferences navigation playbook
Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
1. identify the current settings page with OCR on the heading/sidebar
2. use OCR to find the specific section label before trying to toggle anything
3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
5. after each change, verify the control state changed visually or via visible text
6. if a save/apply button exists, treat it as a separate confirmation step and verify completion
Settings UIs love hiding side effects. Assume nothing.
### Dense app / control-strip playbook
Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
1. focus the exact app window with `POST /windows/action`
2. capture the full target display once to confirm the window is actually frontmost
3. crop tightly around the suspected control strip with `POST /zoom`
4. run OCR on the crop, not the full screen
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
### Spotify playbook
- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
1) `Ctrl+L` (search)
2) type exact query
3) Enter
4) verify exact song+artist text
5) click/double-click row
6) verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
- Include `x-clickthrough-token` when token auth is enabled.
- `/exec` additionally requires `x-clickthrough-exec-secret`.
- Validate server first: `GET /health`.