207 lines
8.3 KiB
Markdown
207 lines
8.3 KiB
Markdown
---
|
|
name: clickthrough-http-control
|
|
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
|
|
---
|
|
|
|
# Clickthrough HTTP Control
|
|
|
|
Use a strict observe-decide-act-verify loop.
|
|
|
|
## Getting a computer instance (user-owned setup)
|
|
|
|
The **user/operator** is responsible for provisioning and exposing the target machine.
|
|
The agent should not assume it can self-install this stack.
|
|
|
|
### What the user must do
|
|
|
|
1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
|
|
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
|
|
3. Configure secrets on target machine:
|
|
- `CLICKTHROUGH_TOKEN` for general API auth
|
|
- `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
|
|
4. Share connection details with the agent through a secure channel:
|
|
- `base_url`
|
|
- `x-clickthrough-token`
|
|
- `x-clickthrough-exec-secret` (only when `/exec` is needed)
|
|
|
|
### What the agent should do
|
|
|
|
1. Validate connection with `GET /health` using provided headers.
|
|
2. Refuse `/exec` attempts when exec secret is missing/invalid.
|
|
3. Ask user for missing setup inputs instead of guessing infrastructure.
|
|
|
|
## Mini API map
|
|
|
|
- `GET /health` → server status + safety flags
|
|
- `GET /displays` → detected displays in zero-based API order
|
|
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
|
|
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
|
|
- `GET /windows` → discover visible desktop windows and their handles/processes
|
|
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
|
|
- `POST /launch` → start an app/process without dropping to a shell
|
|
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
|
|
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
|
|
- `POST /batch?screen=0` → sequential action list
|
|
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
|
|
|
|
### Display selection
|
|
|
|
- Use `GET /displays` before operating on multi-monitor systems.
|
|
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
|
|
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
|
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
|
|
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
|
|
|
|
### OCR usage
|
|
|
|
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
|
|
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
|
|
- Use `language_hint` when known (for example `eng`) to improve consistency.
|
|
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
|
|
- Treat OCR as one signal, not the only signal, before high-impact clicks.
|
|
|
|
### Header requirements
|
|
|
|
- Always send `x-clickthrough-token` when token auth is enabled.
|
|
- For `/exec`, also send `x-clickthrough-exec-secret`.
|
|
|
|
## `POST /action` request shape (important)
|
|
|
|
`/action` always expects an `action` plus an optional `target` object.
|
|
Do **not** invent top-level `x` / `y` fields.
|
|
|
|
Minimal pixel click:
|
|
|
|
```json
|
|
{
|
|
"action": "click",
|
|
"target": {"mode": "pixel", "x": 100, "y": 200},
|
|
"button": "left",
|
|
"clicks": 1
|
|
}
|
|
```
|
|
|
|
Minimal grid click:
|
|
|
|
```json
|
|
{
|
|
"action": "click",
|
|
"target": {
|
|
"mode": "grid",
|
|
"region_x": 0,
|
|
"region_y": 0,
|
|
"region_width": 1920,
|
|
"region_height": 1080,
|
|
"rows": 12,
|
|
"cols": 12,
|
|
"row": 6,
|
|
"col": 8,
|
|
"dx": 0.0,
|
|
"dy": 0.0
|
|
}
|
|
}
|
|
```
|
|
|
|
Other canonical examples:
|
|
|
|
```json
|
|
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
|
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
|
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
|
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
|
|
{"action": "type", "text": "hello world", "interval_ms": 20}
|
|
{"action": "hotkey", "keys": ["ctrl", "l"]}
|
|
```
|
|
|
|
Rules:
|
|
- `dx` / `dy` belong inside `target`, not beside it.
|
|
- `type` and `hotkey` usually do not need a `target`.
|
|
- For pixel targets, `x` / `y` are global desktop coordinates.
|
|
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
|
|
|
|
## When to use `/exec`
|
|
|
|
Prefer structured GUI control first:
|
|
- `/screen`, `/zoom`, `/ocr` to observe
|
|
- `/action` or `/batch` to interact
|
|
|
|
Use `/exec` only when it is the cleanest available tool for the job, for example:
|
|
- querying machine state that the GUI does not expose well
|
|
- performing an explicit user-requested shell/system task
|
|
- recovering from a blocked GUI flow when normal interaction failed
|
|
|
|
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
|
|
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
|
|
|
|
## Core workflow (mandatory)
|
|
|
|
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
|
2. Identify likely target region and compute an initial confidence score.
|
|
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
|
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
|
5. Execute one minimal action via `POST /action`.
|
|
6. Re-capture with `GET /screen` and verify the expected state change.
|
|
7. Repeat until objective is complete.
|
|
|
|
## Verify-before-click rules
|
|
|
|
- Never click if target identity is ambiguous.
|
|
- Require at least two matching signals before click.
|
|
- Good signal pairs include:
|
|
- OCR text + expected UI region
|
|
- OCR text + matching button shape/icon nearby
|
|
- dialog title text + expected button position within that dialog
|
|
- known app/window focus + expected control location
|
|
- If confidence is low, do not "test click"; zoom and re-localize first.
|
|
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
|
1) preview intended coordinate + reason
|
|
2) execute only after explicit confirmation.
|
|
|
|
## Precision rules
|
|
|
|
- Prefer grid targets first, then use `dx/dy` for subcell precision.
|
|
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
|
|
- Use zoom before guessing offsets.
|
|
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
|
|
|
|
## Safety rules
|
|
|
|
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
|
|
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
|
|
- Avoid destructive shortcuts unless explicitly requested.
|
|
- Send one action at a time unless deterministic; then use `/batch`.
|
|
|
|
## Reliability rules
|
|
|
|
- After every meaningful action, verify with a fresh screenshot.
|
|
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
|
|
- Prefer short, reversible actions over long macros.
|
|
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
|
|
|
|
## Fallback ladder for uncertain targeting
|
|
|
|
1. Full-screen capture with a coarse grid.
|
|
2. Zoom into the candidate area with a denser grid.
|
|
3. OCR the full screen or the tighter region.
|
|
4. Re-anchor on a more reliable nearby control, title, or label.
|
|
5. Try a keyboard-first flow if the app supports it.
|
|
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
|
|
|
|
Do not skip from "uncertain click" straight to random retries.
|
|
|
|
## App-specific playbooks (recommended)
|
|
|
|
Build per-app routines for repetitive tasks instead of generic clicking.
|
|
|
|
### Spotify playbook
|
|
|
|
- Focus app window before search/navigation.
|
|
- Prefer keyboard-first flow for song start:
|
|
1) `Ctrl+L` (search)
|
|
2) type exact query
|
|
3) Enter
|
|
4) verify exact song+artist text
|
|
5) click/double-click row
|
|
6) verify now-playing bar
|
|
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
|