--- name: clickthrough-http-control description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops. --- # Clickthrough HTTP Control (v2) Agents do not see live desktop video. They operate on snapshots. Use this loop: **observe -> localize -> act -> verify**. ## Fast defaults - Start with `POST /v2/observe` on a tight region, not full screen. - Set `ocr_mode` to `none` unless text is required immediately. - Use `image` tool localization for icon-heavy or dense controls. - Use `POST /v2/act-verify` instead of manual sleep/poll loops. ## Mandatory image-tool click localization When OCR is weak or ambiguous, ask image tool for one coordinate in bounds. Prompt template: - "Return one click point as JSON `{\"x\":,\"y\":}` inside this image (`width=W`, `height=H`) for the **** control." Rules: - Ask for one point only. - Include bounds in the prompt. - If answer is not parseable `x,y`, re-ask once with stricter format. - Send returned point to `POST /v2/localize` via `image_tool_point`. ## API playbook 1. **Observe** ```json POST /v2/observe?screen=0 { "mode": "region", "region_x": 820, "region_y": 420, "region_width": 700, "region_height": 420, "include_image": true, "ocr_mode": "none" } ``` 2. **Localize** (choose one) Text: ```json POST /v2/localize {"observation_id":"...","text_query":"Save","text_match":"exact"} ``` Image-tool point: ```json POST /v2/localize {"observation_id":"...","image_tool_point":{"x":312,"y":188}} ``` 3. **Act** ```json POST /v2/act?screen=0 {"action":{"action":"click","target":{"resolved_target_id":"..."}}} ``` 4. **Verify** ```json POST /v2/act-verify?screen=0 { "action":{"action":"click","target":{"resolved_target_id":"..."}}, "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420}, "risk_level":"low" } ``` ## Risk policy - Low risk (navigation, focus, benign clicks): single verification signal. - High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act. - Never do speculative repeat clicks; switch strategy after one failed verify. ## Anti-latency rules - Never repeat full-screen OCR by default. - Re-observe only the active pane/region. - Prefer keyboard + window APIs for app switching. - Use OCR on region only and cap area with `max_ocr_area_px`. ## Setup and auth - Include `x-clickthrough-token` when token auth is enabled. - `/exec` additionally requires `x-clickthrough-exec-secret`. - Validate server first: `GET /health`.