feat: migrate to v2-only API and unified response envelope
All checks were successful
python-syntax / syntax-check (push) Successful in 7s
All checks were successful
python-syntax / syntax-check (push) Successful in 7s
This commit is contained in:
422
skill/SKILL.md
422
skill/SKILL.md
@@ -1,381 +1,97 @@
|
||||
---
|
||||
name: clickthrough-http-control
|
||||
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
|
||||
description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
|
||||
---
|
||||
|
||||
# Clickthrough HTTP Control
|
||||
# Clickthrough HTTP Control (v2)
|
||||
|
||||
Use a strict observe-decide-act-verify loop.
|
||||
Agents do not see live desktop video. They operate on snapshots.
|
||||
Use this loop: **observe -> localize -> act -> verify**.
|
||||
|
||||
## Getting a computer instance (user-owned setup)
|
||||
## Fast defaults
|
||||
|
||||
The **user/operator** is responsible for provisioning and exposing the target machine.
|
||||
The agent should not assume it can self-install this stack.
|
||||
- Start with `POST /v2/observe` on a tight region, not full screen.
|
||||
- Set `ocr_mode` to `none` unless text is required immediately.
|
||||
- Use `image` tool localization for icon-heavy or dense controls.
|
||||
- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
|
||||
|
||||
### What the user must do
|
||||
## Mandatory image-tool click localization
|
||||
|
||||
1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
|
||||
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
|
||||
3. Configure secrets on target machine:
|
||||
- `CLICKTHROUGH_TOKEN` for general API auth
|
||||
- `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
|
||||
4. Share connection details with the agent through a secure channel:
|
||||
- `base_url`
|
||||
- `x-clickthrough-token`
|
||||
- `x-clickthrough-exec-secret` (only when `/exec` is needed)
|
||||
When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
|
||||
|
||||
### What the agent should do
|
||||
|
||||
1. Validate connection with `GET /health` using provided headers.
|
||||
2. Refuse `/exec` attempts when exec secret is missing/invalid.
|
||||
3. Ask user for missing setup inputs instead of guessing infrastructure.
|
||||
|
||||
## What the agent can actually see
|
||||
|
||||
The agent does **not** inherently see the remote desktop.
|
||||
Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
|
||||
|
||||
That means:
|
||||
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
|
||||
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
|
||||
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
|
||||
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
|
||||
|
||||
Do not write or think as if the agent is directly watching the screen in real time.
|
||||
Say what you actually have: screenshots, OCR output, and fresh verification captures.
|
||||
|
||||
## Mini API map
|
||||
|
||||
- `GET /health` → server status + safety flags
|
||||
- `GET /displays` → detected displays in zero-based API order
|
||||
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
|
||||
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
|
||||
- `GET /windows` → discover visible desktop windows and their handles/processes
|
||||
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
|
||||
- `POST /launch` → start an app/process without dropping to a shell
|
||||
- `POST /wait?screen=0` → wait for text, window, or visual state changes
|
||||
- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
|
||||
- `POST /vision/stability?screen=0` → measure short-interval visual stability
|
||||
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
|
||||
- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
|
||||
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
|
||||
- `POST /action/verify?screen=0` → execute one action plus structured success verification
|
||||
- `POST /batch?screen=0` → sequential action list
|
||||
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
|
||||
|
||||
### Display selection
|
||||
|
||||
- Use `GET /displays` before operating on multi-monitor systems.
|
||||
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
|
||||
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
||||
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
|
||||
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
|
||||
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
|
||||
|
||||
### OCR usage
|
||||
|
||||
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
|
||||
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
|
||||
- Use `language_hint` when known (for example `eng`) to improve consistency.
|
||||
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
|
||||
- Treat OCR as one signal, not the only signal, before high-impact clicks.
|
||||
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
|
||||
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
|
||||
|
||||
### Screenshot + `image` tool usage
|
||||
|
||||
Use the OpenClaw `image` tool when OCR is not enough.
|
||||
This is especially useful for:
|
||||
- identifying which visible button looks like the primary confirm action
|
||||
- understanding dialog layout or pane structure
|
||||
- distinguishing similar nearby controls by icon, spacing, or emphasis
|
||||
- checking whether a visual state changed after a click
|
||||
- telling you where something is and where to click when text alone is not reliable
|
||||
|
||||
Good pattern:
|
||||
1. capture with `GET /screen` or `POST /zoom`
|
||||
2. hand that screenshot to the `image` tool
|
||||
3. ask a precise question about the visible UI
|
||||
4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
|
||||
5. convert the answer into a concrete Clickthrough target
|
||||
6. act once
|
||||
7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
|
||||
|
||||
Prefer vision over guessing.
|
||||
If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
|
||||
The model should help answer things like:
|
||||
- which visible button is the real primary action
|
||||
- whether the target is left/right/top/bottom within the crop
|
||||
- which of several similar buttons is the one to click
|
||||
- an approximate click point inside the provided image bounds
|
||||
|
||||
Ask narrow questions.
|
||||
Good:
|
||||
- "Which button in this dialog is the primary confirmation action?"
|
||||
- "Is the scan still running, or does this look complete?"
|
||||
- "Which of these tabs appears selected?"
|
||||
- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
|
||||
- "Which visible control says Stop Recording, and where should I click?"
|
||||
|
||||
Bad:
|
||||
- "What should I click?"
|
||||
- "Use your eyes and do the task"
|
||||
- anything that assumes the model has live continuity without a new screenshot
|
||||
- requesting coordinates without telling the model the image bounds or expected output format
|
||||
|
||||
### Header requirements
|
||||
|
||||
- Always send `x-clickthrough-token` when token auth is enabled.
|
||||
- For `/exec`, also send `x-clickthrough-exec-secret`.
|
||||
|
||||
## `POST /action` request shape (important)
|
||||
|
||||
`/action` always expects an `action` plus an optional `target` object.
|
||||
Do **not** invent top-level `x` / `y` fields.
|
||||
|
||||
Minimal pixel click:
|
||||
|
||||
```json
|
||||
{
|
||||
"action": "click",
|
||||
"target": {"mode": "pixel", "x": 100, "y": 200},
|
||||
"button": "left",
|
||||
"clicks": 1
|
||||
}
|
||||
```
|
||||
|
||||
Minimal grid click:
|
||||
|
||||
```json
|
||||
{
|
||||
"action": "click",
|
||||
"target": {
|
||||
"mode": "grid",
|
||||
"region_x": 0,
|
||||
"region_y": 0,
|
||||
"region_width": 1920,
|
||||
"region_height": 1080,
|
||||
"rows": 12,
|
||||
"cols": 12,
|
||||
"row": 6,
|
||||
"col": 8,
|
||||
"dx": 0.0,
|
||||
"dy": 0.0
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Other canonical examples:
|
||||
|
||||
```json
|
||||
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
||||
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
||||
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
||||
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
|
||||
{"action": "type", "text": "hello world", "interval_ms": 20}
|
||||
{"action": "hotkey", "keys": ["ctrl", "l"]}
|
||||
```
|
||||
Prompt template:
|
||||
- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
|
||||
|
||||
Rules:
|
||||
- `dx` / `dy` belong inside `target`, not beside it.
|
||||
- `type` and `hotkey` usually do not need a `target`.
|
||||
- For pixel targets, `x` / `y` are global desktop coordinates.
|
||||
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
|
||||
- Ask for one point only.
|
||||
- Include bounds in the prompt.
|
||||
- If answer is not parseable `x,y`, re-ask once with stricter format.
|
||||
- Send returned point to `POST /v2/localize` via `image_tool_point`.
|
||||
|
||||
## When to use `/exec`
|
||||
## API playbook
|
||||
|
||||
Prefer structured GUI control first:
|
||||
- `/screen`, `/zoom`, `/ocr` to observe
|
||||
- `/action` or `/batch` to interact
|
||||
1. **Observe**
|
||||
|
||||
Use `/exec` only when it is the cleanest available tool for the job, for example:
|
||||
- querying machine state that the GUI does not expose well
|
||||
- performing an explicit user-requested shell/system task
|
||||
- recovering from a blocked GUI flow when normal interaction failed
|
||||
```json
|
||||
POST /v2/observe?screen=0
|
||||
{
|
||||
"mode": "region",
|
||||
"region_x": 820,
|
||||
"region_y": 420,
|
||||
"region_width": 700,
|
||||
"region_height": 420,
|
||||
"include_image": true,
|
||||
"ocr_mode": "none"
|
||||
}
|
||||
```
|
||||
|
||||
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
|
||||
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
|
||||
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
|
||||
2. **Localize** (choose one)
|
||||
|
||||
## Core workflow (mandatory)
|
||||
Text:
|
||||
```json
|
||||
POST /v2/localize
|
||||
{"observation_id":"...","text_query":"Save","text_match":"exact"}
|
||||
```
|
||||
|
||||
1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
|
||||
2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
||||
3. Identify likely target region and compute an initial confidence score.
|
||||
4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
||||
5. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
||||
6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
|
||||
7. Execute one minimal action via `POST /action`.
|
||||
8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
|
||||
9. Repeat until objective is complete.
|
||||
Image-tool point:
|
||||
```json
|
||||
POST /v2/localize
|
||||
{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
|
||||
```
|
||||
|
||||
## Verify-before-click rules
|
||||
3. **Act**
|
||||
|
||||
- Never click if target identity is ambiguous.
|
||||
- Require at least two matching signals before click.
|
||||
- Good signal pairs include:
|
||||
- OCR text + expected UI region
|
||||
- OCR text + matching button shape/icon nearby
|
||||
- dialog title text + expected button position within that dialog
|
||||
- known app/window focus + expected control location
|
||||
- OCR candidate + vision-model localization inside the same crop
|
||||
- If confidence is low, do not "test click"; zoom and re-localize first.
|
||||
- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
|
||||
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
||||
1) preview intended coordinate + reason
|
||||
2) execute only after explicit confirmation.
|
||||
```json
|
||||
POST /v2/act?screen=0
|
||||
{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
|
||||
```
|
||||
|
||||
## Precision rules
|
||||
4. **Verify**
|
||||
|
||||
- Prefer grid targets first, then use `dx/dy` for subcell precision.
|
||||
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
|
||||
- Use zoom before guessing offsets.
|
||||
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
|
||||
```json
|
||||
POST /v2/act-verify?screen=0
|
||||
{
|
||||
"action":{"action":"click","target":{"resolved_target_id":"..."}},
|
||||
"condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
|
||||
"risk_level":"low"
|
||||
}
|
||||
```
|
||||
|
||||
## Safety rules
|
||||
## Risk policy
|
||||
|
||||
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
|
||||
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
|
||||
- Avoid destructive shortcuts unless explicitly requested.
|
||||
- Send one action at a time unless deterministic; then use `/batch`.
|
||||
- Low risk (navigation, focus, benign clicks): single verification signal.
|
||||
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
|
||||
- Never do speculative repeat clicks; switch strategy after one failed verify.
|
||||
|
||||
## Reliability rules
|
||||
## Anti-latency rules
|
||||
|
||||
- After every meaningful action, verify with a fresh screenshot.
|
||||
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
|
||||
- Prefer short, reversible actions over long macros.
|
||||
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
|
||||
- Never repeat full-screen OCR by default.
|
||||
- Re-observe only the active pane/region.
|
||||
- Prefer keyboard + window APIs for app switching.
|
||||
- Use OCR on region only and cap area with `max_ocr_area_px`.
|
||||
|
||||
## Fallback ladder for uncertain targeting
|
||||
## Setup and auth
|
||||
|
||||
1. Full-screen capture with a coarse grid.
|
||||
2. Zoom into the candidate area with a denser grid.
|
||||
3. OCR the full screen or the tighter region.
|
||||
4. Re-anchor on a more reliable nearby control, title, or label.
|
||||
5. Try a keyboard-first flow if the app supports it.
|
||||
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
|
||||
|
||||
Do not skip from "uncertain click" straight to random retries.
|
||||
|
||||
## Concrete screenshot -> `image` -> action example
|
||||
|
||||
Example loop:
|
||||
1. `GET /screen?screen=0` to capture the current app state
|
||||
2. if the UI is text-heavy, try `POST /ocr` first
|
||||
3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
|
||||
- "In this save dialog, which visible button is the primary action?"
|
||||
- "Is there a dismiss/close button in the top-right of this modal?"
|
||||
4. map the answer back to a Clickthrough target using the returned grid/region metadata
|
||||
5. click once with `POST /action`
|
||||
6. recapture the screen
|
||||
7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
|
||||
|
||||
The key rule is simple: screenshot first, interpret second, click third, verify fourth.
|
||||
Do not collapse those steps into fake certainty.
|
||||
When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
|
||||
|
||||
## App-specific playbooks (recommended)
|
||||
|
||||
Build per-app routines for repetitive tasks instead of generic clicking.
|
||||
|
||||
### Launcher / search / start app playbook
|
||||
|
||||
Use this when the goal is "open app X" or "bring up tool Y".
|
||||
|
||||
1. check `GET /windows` first in case the app is already open
|
||||
2. if present, use `POST /windows/action` to focus or restore it
|
||||
3. if absent, prefer `POST /launch` when you know the executable path
|
||||
4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
|
||||
- open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
|
||||
- type exact app name
|
||||
- wait for stable results with `POST /wait` or recapture
|
||||
- verify the result text with OCR or the `image` tool
|
||||
- press Enter or click the exact result once
|
||||
5. verify the app window now exists or is focused
|
||||
|
||||
Do not keep relaunching if the window already exists; that’s sloppy.
|
||||
|
||||
### Dialog confirmation playbook
|
||||
|
||||
Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
|
||||
|
||||
1. capture the dialog region with `POST /zoom`
|
||||
2. use OCR first for title/body/button labels
|
||||
3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
|
||||
4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
|
||||
5. for destructive actions, require explicit user confirmation unless already requested
|
||||
6. click once and verify the dialog disappeared or changed state
|
||||
|
||||
Good verification targets:
|
||||
- dialog title vanished
|
||||
- expected next window appeared
|
||||
- destructive side effect is visible and confirmed
|
||||
|
||||
### File picker playbook
|
||||
|
||||
Use for open/save dialogs.
|
||||
|
||||
1. verify the file picker window is focused
|
||||
2. OCR the visible breadcrumb/path area, filename field, and button row
|
||||
3. prefer keyboard-first entry when possible:
|
||||
- type or paste the target path/name into the focused field
|
||||
- use `tab` / `shift+tab` to move predictably between filename and action buttons
|
||||
4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
|
||||
5. verify the intended filename/path is visible before confirming
|
||||
6. activate `Open` / `Save` once and verify the picker closes
|
||||
|
||||
If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
|
||||
|
||||
### Browser tab / window playbook
|
||||
|
||||
Use for browser navigation, tab targeting, or web app recovery.
|
||||
|
||||
1. use `GET /windows` to focus the correct browser window first
|
||||
2. prefer keyboard-first navigation:
|
||||
- `ctrl+l` / `cmd+l` to focus the address bar
|
||||
- `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
|
||||
- `ctrl+w` only for explicitly requested close actions
|
||||
3. verify tab or page identity with OCR on the tab strip or page heading
|
||||
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
|
||||
5. after navigation, wait for visual stability or expected text before taking the next action
|
||||
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
|
||||
|
||||
Do not assume a page loaded just because the click landed. Verify it.
|
||||
|
||||
### Settings / preferences navigation playbook
|
||||
|
||||
Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
|
||||
|
||||
1. identify the current settings page with OCR on the heading/sidebar
|
||||
2. use OCR to find the specific section label before trying to toggle anything
|
||||
3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
|
||||
4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
|
||||
5. after each change, verify the control state changed visually or via visible text
|
||||
6. if a save/apply button exists, treat it as a separate confirmation step and verify completion
|
||||
|
||||
Settings UIs love hiding side effects. Assume nothing.
|
||||
|
||||
### Dense app / control-strip playbook
|
||||
|
||||
Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
|
||||
|
||||
1. focus the exact app window with `POST /windows/action`
|
||||
2. capture the full target display once to confirm the window is actually frontmost
|
||||
3. crop tightly around the suspected control strip with `POST /zoom`
|
||||
4. run OCR on the crop, not the full screen
|
||||
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
|
||||
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
|
||||
|
||||
Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
|
||||
|
||||
### Spotify playbook
|
||||
|
||||
- Focus app window before search/navigation.
|
||||
- Prefer keyboard-first flow for song start:
|
||||
1) `Ctrl+L` (search)
|
||||
2) type exact query
|
||||
3) Enter
|
||||
4) verify exact song+artist text
|
||||
5) click/double-click row
|
||||
6) verify now-playing bar
|
||||
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
|
||||
- Include `x-clickthrough-token` when token auth is enabled.
|
||||
- `/exec` additionally requires `x-clickthrough-exec-secret`.
|
||||
- Validate server first: `GET /health`.
|
||||
|
||||
Reference in New Issue
Block a user