8.4 KiB
name, description
| name | description |
|---|---|
| clickthrough-http-control | Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification. |
Clickthrough HTTP Control
Use a strict observe-decide-act-verify loop.
Getting a computer instance (user-owned setup)
The user/operator is responsible for provisioning and exposing the target machine. The agent should not assume it can self-install this stack.
What the user must do
- Install dependencies and run Clickthrough on the target computer (default bind:
127.0.0.1:8123). - Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
- Configure secrets on target machine:
CLICKTHROUGH_TOKENfor general API authCLICKTHROUGH_EXEC_SECRETfor/execcalls
- Share connection details with the agent through a secure channel:
base_urlx-clickthrough-tokenx-clickthrough-exec-secret(only when/execis needed)
What the agent should do
- Validate connection with
GET /healthusing provided headers. - Refuse
/execattempts when exec secret is missing/invalid. - Ask user for missing setup inputs instead of guessing infrastructure.
Mini API map
GET /health→ server status + safety flagsGET /displays→ detected displays in zero-based API orderGET /screen?screen=0→ full screenshot (JSON with base64 by default, or raw image withasImage=true)POST /zoom?screen=0→ cropped screenshot around point/region (also supportsasImage=true)GET /windows→ discover visible desktop windows and their handles/processesPOST /windows/action→ focus/restore/minimize/maximize/close a matched windowPOST /launch→ start an app/process without dropping to a shellPOST /wait?screen=0→ wait for text, window, or visual state changesPOST /ocr→ text extraction with bounding boxes from full screen, region, or provided image bytesPOST /action?screen=0→ single interaction (move,click,scroll,type,hotkey, ...)POST /batch?screen=0→ sequential action listPOST /exec→ PowerShell/Bash/CMD command execution (requires configured exec secret + header)
Display selection
- Use
GET /displaysbefore operating on multi-monitor systems. - Use
?screen=Xon/screen,/zoom,/ocr,/action, and/batch; invalid values fall back toscreen=0. - Treat returned
regionand OCR bounding boxes as global desktop coordinates, not screen-local coordinates. - Do not assume
screen=1starts at(0,0); it may start at(1920,0),(-1920,0), or another global offset. - If a screenshot came from
/screen?screen=1, keep using that response'sregionmetadata when forming later/actiontargets.
OCR usage
- Prefer
POST /ocrwhen targeting text-heavy UI (menus, labels, buttons, dialogs). - Use
mode=screenfor discovery, thenmode=regionfor precision and speed. - Use
language_hintwhen known (for exampleeng) to improve consistency. - Filter noise with
min_confidence(start around0.4and tune per app). - Treat OCR as one signal, not the only signal, before high-impact clicks.
Header requirements
- Always send
x-clickthrough-tokenwhen token auth is enabled. - For
/exec, also sendx-clickthrough-exec-secret.
POST /action request shape (important)
/action always expects an action plus an optional target object.
Do not invent top-level x / y fields.
Minimal pixel click:
{
"action": "click",
"target": {"mode": "pixel", "x": 100, "y": 200},
"button": "left",
"clicks": 1
}
Minimal grid click:
{
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 6,
"col": 8,
"dx": 0.0,
"dy": 0.0
}
}
Other canonical examples:
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}
Rules:
dx/dybelong insidetarget, not beside it.typeandhotkeyusually do not need atarget.- For pixel targets,
x/yare global desktop coordinates. - For grid targets, copy the exact
region_*,rows, andcolsbasis from the screenshot/zoom you actually used.
When to use /exec
Prefer structured GUI control first:
/screen,/zoom,/ocrto observe/actionor/batchto interact
Use /exec only when it is the cleanest available tool for the job, for example:
- querying machine state that the GUI does not expose well
- performing an explicit user-requested shell/system task
- recovering from a blocked GUI flow when normal interaction failed
Prefer GET /windows, POST /windows/action, and POST /launch for app lifecycle tasks before falling back to /exec.
Avoid using /exec for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
Core workflow (mandatory)
- Call
GET /screen?screen=0with coarse grid (e.g., 12x12), or another selected display. - Identify likely target region and compute an initial confidence score.
- If confidence < 0.85, call
POST /zoomwith denser grid (e.g., 20x20) and re-evaluate. - Before any click, verify target identity (OCR text/icon/location consistency).
- Execute one minimal action via
POST /action. - Re-capture with
GET /screenor usePOST /waitto verify the expected state change. - Repeat until objective is complete.
Verify-before-click rules
- Never click if target identity is ambiguous.
- Require at least two matching signals before click.
- Good signal pairs include:
- OCR text + expected UI region
- OCR text + matching button shape/icon nearby
- dialog title text + expected button position within that dialog
- known app/window focus + expected control location
- If confidence is low, do not "test click"; zoom and re-localize first.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
- preview intended coordinate + reason
- execute only after explicit confirmation.
Precision rules
- Prefer grid targets first, then use
dx/dyfor subcell precision. - Keep
dx/dyin[-1,1]; start at0,0and only offset when needed. - Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
Safety rules
- Respect
dry_runandallowed_regionrestrictions from/health. - Respect
/execsecurity requirements (CLICKTHROUGH_EXEC_SECRET+x-clickthrough-exec-secret). - Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use
/batch.
Reliability rules
- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
Fallback ladder for uncertain targeting
- Full-screen capture with a coarse grid.
- Zoom into the candidate area with a denser grid.
- OCR the full screen or the tighter region.
- Re-anchor on a more reliable nearby control, title, or label.
- Try a keyboard-first flow if the app supports it.
- Use
/execonly if GUI control is blocked and shell-level intervention is genuinely cleaner.
Do not skip from "uncertain click" straight to random retries.
App-specific playbooks (recommended)
Build per-app routines for repetitive tasks instead of generic clicking.
Spotify playbook
- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
Ctrl+L(search)- type exact query
- Enter
- verify exact song+artist text
- click/double-click row
- verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.