--- name: clickthrough-http-control description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification. --- # Clickthrough HTTP Control Use a strict observe-decide-act-verify loop. ## Getting a computer instance (user-owned setup) The **user/operator** is responsible for provisioning and exposing the target machine. The agent should not assume it can self-install this stack. ### What the user must do 1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`). 2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL. 3. Configure secrets on target machine: - `CLICKTHROUGH_TOKEN` for general API auth - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls 4. Share connection details with the agent through a secure channel: - `base_url` - `x-clickthrough-token` - `x-clickthrough-exec-secret` (only when `/exec` is needed) ### What the agent should do 1. Validate connection with `GET /health` using provided headers. 2. Refuse `/exec` attempts when exec secret is missing/invalid. 3. Ask user for missing setup inputs instead of guessing infrastructure. ## Mini API map - `GET /health` → server status + safety flags - `GET /displays` → detected displays in zero-based API order - `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`) - `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`) - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes - `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...) - `POST /batch?screen=0` → sequential action list - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header) ### Display selection - Use `GET /displays` before operating on multi-monitor systems. - Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`. - Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates. - Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset. - If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets. ### OCR usage - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs). - Use `mode=screen` for discovery, then `mode=region` for precision and speed. - Use `language_hint` when known (for example `eng`) to improve consistency. - Filter noise with `min_confidence` (start around `0.4` and tune per app). - Treat OCR as one signal, not the only signal, before high-impact clicks. ### Header requirements - Always send `x-clickthrough-token` when token auth is enabled. - For `/exec`, also send `x-clickthrough-exec-secret`. ## `POST /action` request shape (important) `/action` always expects an `action` plus an optional `target` object. Do **not** invent top-level `x` / `y` fields. Minimal pixel click: ```json { "action": "click", "target": {"mode": "pixel", "x": 100, "y": 200}, "button": "left", "clicks": 1 } ``` Minimal grid click: ```json { "action": "click", "target": { "mode": "grid", "region_x": 0, "region_y": 0, "region_width": 1920, "region_height": 1080, "rows": 12, "cols": 12, "row": 6, "col": 8, "dx": 0.0, "dy": 0.0 } } ``` Other canonical examples: ```json {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}} {"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}} {"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}} {"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500} {"action": "type", "text": "hello world", "interval_ms": 20} {"action": "hotkey", "keys": ["ctrl", "l"]} ``` Rules: - `dx` / `dy` belong inside `target`, not beside it. - `type` and `hotkey` usually do not need a `target`. - For pixel targets, `x` / `y` are global desktop coordinates. - For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used. ## When to use `/exec` Prefer structured GUI control first: - `/screen`, `/zoom`, `/ocr` to observe - `/action` or `/batch` to interact Use `/exec` only when it is the cleanest available tool for the job, for example: - launching an app that is not already visible - querying machine state that the GUI does not expose well - performing an explicit user-requested shell/system task - recovering from a blocked GUI flow when normal interaction failed Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly. ## Core workflow (mandatory) 1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display. 2. Identify likely target region and compute an initial confidence score. 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate. 4. **Before any click**, verify target identity (OCR text/icon/location consistency). 5. Execute one minimal action via `POST /action`. 6. Re-capture with `GET /screen` and verify the expected state change. 7. Repeat until objective is complete. ## Verify-before-click rules - Never click if target identity is ambiguous. - Require at least two matching signals before click. - Good signal pairs include: - OCR text + expected UI region - OCR text + matching button shape/icon nearby - dialog title text + expected button position within that dialog - known app/window focus + expected control location - If confidence is low, do not "test click"; zoom and re-localize first. - For high-impact actions (close/delete/send/purchase), use two-phase flow: 1) preview intended coordinate + reason 2) execute only after explicit confirmation. ## Precision rules - Prefer grid targets first, then use `dx/dy` for subcell precision. - Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed. - Use zoom before guessing offsets. - Avoid stale coordinates: re-capture before action if UI moved/scrolled. ## Safety rules - Respect `dry_run` and `allowed_region` restrictions from `/health`. - Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`). - Avoid destructive shortcuts unless explicitly requested. - Send one action at a time unless deterministic; then use `/batch`. ## Reliability rules - After every meaningful action, verify with a fresh screenshot. - On mismatch, do not spam clicks: zoom, re-localize, and retry once. - Prefer short, reversible actions over long macros. - If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click. ## Fallback ladder for uncertain targeting 1. Full-screen capture with a coarse grid. 2. Zoom into the candidate area with a denser grid. 3. OCR the full screen or the tighter region. 4. Re-anchor on a more reliable nearby control, title, or label. 5. Try a keyboard-first flow if the app supports it. 6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner. Do not skip from "uncertain click" straight to random retries. ## App-specific playbooks (recommended) Build per-app routines for repetitive tasks instead of generic clicking. ### Spotify playbook - Focus app window before search/navigation. - Prefer keyboard-first flow for song start: 1) `Ctrl+L` (search) 2) type exact query 3) Enter 4) verify exact song+artist text 5) click/double-click row 6) verify now-playing bar - If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.