This repository has been archived on 2026-05-20. You can view files and clone it. You cannot open issues or pull requests or push a commit.
Files
clickthrough/skill/SKILL.md
Luna 5122d416e8
All checks were successful
python-syntax / syntax-check (push) Successful in 7s
feat(wait): add structured wait endpoint
2026-05-01 15:55:29 +02:00

8.4 KiB

name, description
name description
clickthrough-http-control Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.

Clickthrough HTTP Control

Use a strict observe-decide-act-verify loop.

Getting a computer instance (user-owned setup)

The user/operator is responsible for provisioning and exposing the target machine. The agent should not assume it can self-install this stack.

What the user must do

  1. Install dependencies and run Clickthrough on the target computer (default bind: 127.0.0.1:8123).
  2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
  3. Configure secrets on target machine:
    • CLICKTHROUGH_TOKEN for general API auth
    • CLICKTHROUGH_EXEC_SECRET for /exec calls
  4. Share connection details with the agent through a secure channel:
    • base_url
    • x-clickthrough-token
    • x-clickthrough-exec-secret (only when /exec is needed)

What the agent should do

  1. Validate connection with GET /health using provided headers.
  2. Refuse /exec attempts when exec secret is missing/invalid.
  3. Ask user for missing setup inputs instead of guessing infrastructure.

Mini API map

  • GET /health → server status + safety flags
  • GET /displays → detected displays in zero-based API order
  • GET /screen?screen=0 → full screenshot (JSON with base64 by default, or raw image with asImage=true)
  • POST /zoom?screen=0 → cropped screenshot around point/region (also supports asImage=true)
  • GET /windows → discover visible desktop windows and their handles/processes
  • POST /windows/action → focus/restore/minimize/maximize/close a matched window
  • POST /launch → start an app/process without dropping to a shell
  • POST /wait?screen=0 → wait for text, window, or visual state changes
  • POST /ocr → text extraction with bounding boxes from full screen, region, or provided image bytes
  • POST /action?screen=0 → single interaction (move, click, scroll, type, hotkey, ...)
  • POST /batch?screen=0 → sequential action list
  • POST /exec → PowerShell/Bash/CMD command execution (requires configured exec secret + header)

Display selection

  • Use GET /displays before operating on multi-monitor systems.
  • Use ?screen=X on /screen, /zoom, /ocr, /action, and /batch; invalid values fall back to screen=0.
  • Treat returned region and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
  • Do not assume screen=1 starts at (0,0); it may start at (1920,0), (-1920,0), or another global offset.
  • If a screenshot came from /screen?screen=1, keep using that response's region metadata when forming later /action targets.

OCR usage

  • Prefer POST /ocr when targeting text-heavy UI (menus, labels, buttons, dialogs).
  • Use mode=screen for discovery, then mode=region for precision and speed.
  • Use language_hint when known (for example eng) to improve consistency.
  • Filter noise with min_confidence (start around 0.4 and tune per app).
  • Treat OCR as one signal, not the only signal, before high-impact clicks.

Header requirements

  • Always send x-clickthrough-token when token auth is enabled.
  • For /exec, also send x-clickthrough-exec-secret.

POST /action request shape (important)

/action always expects an action plus an optional target object. Do not invent top-level x / y fields.

Minimal pixel click:

{
  "action": "click",
  "target": {"mode": "pixel", "x": 100, "y": 200},
  "button": "left",
  "clicks": 1
}

Minimal grid click:

{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 6,
    "col": 8,
    "dx": 0.0,
    "dy": 0.0
  }
}

Other canonical examples:

{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}

Rules:

  • dx / dy belong inside target, not beside it.
  • type and hotkey usually do not need a target.
  • For pixel targets, x / y are global desktop coordinates.
  • For grid targets, copy the exact region_*, rows, and cols basis from the screenshot/zoom you actually used.

When to use /exec

Prefer structured GUI control first:

  • /screen, /zoom, /ocr to observe
  • /action or /batch to interact

Use /exec only when it is the cleanest available tool for the job, for example:

  • querying machine state that the GUI does not expose well
  • performing an explicit user-requested shell/system task
  • recovering from a blocked GUI flow when normal interaction failed

Prefer GET /windows, POST /windows/action, and POST /launch for app lifecycle tasks before falling back to /exec. Avoid using /exec for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.

Core workflow (mandatory)

  1. Call GET /screen?screen=0 with coarse grid (e.g., 12x12), or another selected display.
  2. Identify likely target region and compute an initial confidence score.
  3. If confidence < 0.85, call POST /zoom with denser grid (e.g., 20x20) and re-evaluate.
  4. Before any click, verify target identity (OCR text/icon/location consistency).
  5. Execute one minimal action via POST /action.
  6. Re-capture with GET /screen or use POST /wait to verify the expected state change.
  7. Repeat until objective is complete.

Verify-before-click rules

  • Never click if target identity is ambiguous.
  • Require at least two matching signals before click.
  • Good signal pairs include:
    • OCR text + expected UI region
    • OCR text + matching button shape/icon nearby
    • dialog title text + expected button position within that dialog
    • known app/window focus + expected control location
  • If confidence is low, do not "test click"; zoom and re-localize first.
  • For high-impact actions (close/delete/send/purchase), use two-phase flow:
    1. preview intended coordinate + reason
    2. execute only after explicit confirmation.

Precision rules

  • Prefer grid targets first, then use dx/dy for subcell precision.
  • Keep dx/dy in [-1,1]; start at 0,0 and only offset when needed.
  • Use zoom before guessing offsets.
  • Avoid stale coordinates: re-capture before action if UI moved/scrolled.

Safety rules

  • Respect dry_run and allowed_region restrictions from /health.
  • Respect /exec security requirements (CLICKTHROUGH_EXEC_SECRET + x-clickthrough-exec-secret).
  • Avoid destructive shortcuts unless explicitly requested.
  • Send one action at a time unless deterministic; then use /batch.

Reliability rules

  • After every meaningful action, verify with a fresh screenshot.
  • On mismatch, do not spam clicks: zoom, re-localize, and retry once.
  • Prefer short, reversible actions over long macros.
  • If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.

Fallback ladder for uncertain targeting

  1. Full-screen capture with a coarse grid.
  2. Zoom into the candidate area with a denser grid.
  3. OCR the full screen or the tighter region.
  4. Re-anchor on a more reliable nearby control, title, or label.
  5. Try a keyboard-first flow if the app supports it.
  6. Use /exec only if GUI control is blocked and shell-level intervention is genuinely cleaner.

Do not skip from "uncertain click" straight to random retries.

Build per-app routines for repetitive tasks instead of generic clicking.

Spotify playbook

  • Focus app window before search/navigation.
  • Prefer keyboard-first flow for song start:
    1. Ctrl+L (search)
    2. type exact query
    3. Enter
    4. verify exact song+artist text
    5. click/double-click row
    6. verify now-playing bar
  • If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.