refactor: simplify to see/interact/exec and split server modules

2026-05-03 20:07:12 +02:00
parent aced5be25e
commit 1c03cab457
8 changed files with 911 additions and 1928 deletions
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,97 +1,60 @@
 ---
 name: clickthrough-http-control
-description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
+description: Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell).
 ---

-# Clickthrough HTTP Control (v2)
+# Clickthrough Computer Control

-Agents do not see live desktop video. They operate on snapshots.
-Use this loop: **observe -> localize -> act -> verify**.
+Use exactly 3 methods:
+- `see`
+- `interact`
+- `exec`

-## Fast defaults
+## Method 1: See

- Start with `POST /v2/observe` on a tight region, not full screen.
- Set `ocr_mode` to `none` unless text is required immediately.
- Use `image` tool localization for icon-heavy or dense controls.
- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
-
-## Mandatory image-tool click localization
-
-When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
-
-Prompt template:
- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
+Use `POST /see` to capture full screen or a region with a grid overlay.
+Use `POST /see/zoom` to capture a tighter crop with a denser grid.

 Rules:
- Ask for one point only.
- Include bounds in the prompt.
- If answer is not parseable `x,y`, re-ask once with stricter format.
- Send returned point to `POST /v2/localize` via `image_tool_point`.
+- Start with coarse grid (`12x12`).
+- For precision, zoom and use denser grid (`20x20` or higher).
+- Always use returned `meta.region` and `meta.grid` when computing click targets.
+- Coordinates are global desktop coordinates.

-## API playbook
+## Method 2: Interact

-1. **Observe**
+Use `POST /interact` for one action at a time.

-```json
-POST /v2/observe?screen=0
-{
-  "mode": "region",
-  "region_x": 820,
-  "region_y": 420,
-  "region_width": 700,
-  "region_height": 420,
-  "include_image": true,
-  "ocr_mode": "none"
-}
-```
+Mouse actions:
+- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`

-2. **Localize** (choose one)
+Keyboard actions:
+- `type`, `hotkey`

-Text:
-```json
-POST /v2/localize
-{"observation_id":"...","text_query":"Save","text_match":"exact"}
-```
+Rules:
+- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
+- Use `pixel` only when you already have reliable coordinates.
+- After each important action, call `see` again before continuing.

-Image-tool point:
-```json
-POST /v2/localize
-{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
-```
+## Method 3: Exec

-3. **Act**
+Use `POST /exec` only for shell/system tasks.

-```json
-POST /v2/act?screen=0
-{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
-```
+Rules:
+- Requires `x-clickthrough-exec-secret`.
+- Do not use exec for normal clicking/typing flows.
+- Prefer GUI interaction first; exec is fallback or explicit shell task.

-4. **Verify**
+## Lightweight Procedure

-```json
-POST /v2/act-verify?screen=0
-{
-  "action":{"action":"click","target":{"resolved_target_id":"..."}},
-  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
-  "risk_level":"low"
-}
-```
+1. `see` capture.
+2. If needed, `see/zoom` refine.
+3. `interact` one step.
+4. `see` verify.
+5. Repeat.

-## Risk policy
+## Quick Safety Rules

- Low risk (navigation, focus, benign clicks): single verification signal.
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
- Never do speculative repeat clicks; switch strategy after one failed verify.
-
-## Anti-latency rules
-
- Never repeat full-screen OCR by default.
- Re-observe only the active pane/region.
- Prefer keyboard + window APIs for app switching.
- Use OCR on region only and cap area with `max_ocr_area_px`.
-
-## Setup and auth
-
- Include `x-clickthrough-token` when token auth is enabled.
- `/exec` additionally requires `x-clickthrough-exec-secret`.
- Validate server first: `GET /health`.
+- Never click with stale screenshots.
+- Never send multiple uncertain clicks in a row.
+- If localization is ambiguous, re-capture with a tighter zoom.