space/clickthrough

Fork 0

Files

Luna a6d7e37beb

python-syntax / syntax-check (push) Successful in 4s

Details

python-syntax / syntax-check (pull_request) Successful in 10s

Details

docs(skill): include OCR endpoint workflow guidance

2026-04-06 13:50:34 +02:00

4.8 KiB

Raw Permalink Blame History

name, description

name	description
clickthrough-http-control	Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.

Clickthrough HTTP Control

Use a strict observe-decide-act-verify loop.

Getting a computer instance (user-owned setup)

The user/operator is responsible for provisioning and exposing the target machine. The agent should not assume it can self-install this stack.

What the user must do

Install dependencies and run Clickthrough on the target computer (default bind: 127.0.0.1:8123).
Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
Configure secrets on target machine:
- CLICKTHROUGH_TOKEN for general API auth
- CLICKTHROUGH_EXEC_SECRET for /exec calls
Share connection details with the agent through a secure channel:
- base_url
- x-clickthrough-token
- x-clickthrough-exec-secret (only when /exec is needed)

What the agent should do

Validate connection with GET /health using provided headers.
Refuse /exec attempts when exec secret is missing/invalid.
Ask user for missing setup inputs instead of guessing infrastructure.

Mini API map

GET /health → server status + safety flags
GET /screen → full screenshot (JSON with base64 by default, or raw image with asImage=true)
POST /zoom → cropped screenshot around point/region (also supports asImage=true)
POST /ocr → text extraction with bounding boxes from full screen, region, or provided image bytes
POST /action → single interaction (move, click, scroll, type, hotkey, ...)
POST /batch → sequential action list
POST /exec → PowerShell/Bash/CMD command execution (requires configured exec secret + header)

OCR usage

Prefer POST /ocr when targeting text-heavy UI (menus, labels, buttons, dialogs).
Use mode=screen for discovery, then mode=region for precision and speed.
Use language_hint when known (for example eng) to improve consistency.
Filter noise with min_confidence (start around 0.4 and tune per app).
Treat OCR as one signal, not the only signal, before high-impact clicks.

Header requirements

Always send x-clickthrough-token when token auth is enabled.
For /exec, also send x-clickthrough-exec-secret.

Core workflow (mandatory)

Call GET /screen with coarse grid (e.g., 12x12).
Identify likely target region and compute an initial confidence score.
If confidence < 0.85, call POST /zoom with denser grid (e.g., 20x20) and re-evaluate.
Before any click, verify target identity (OCR text/icon/location consistency).
Execute one minimal action via POST /action.
Re-capture with GET /screen and verify the expected state change.
Repeat until objective is complete.

Verify-before-click rules

Never click if target identity is ambiguous.
Require at least two matching signals before click (example: OCR text + expected UI region).
If confidence is low, do not "test click"; zoom and re-localize first.
For high-impact actions (close/delete/send/purchase), use two-phase flow:
1. preview intended coordinate + reason
2. execute only after explicit confirmation.

Precision rules

Prefer grid targets first, then use dx/dy for subcell precision.
Keep dx/dy in [-1,1]; start at 0,0 and only offset when needed.
Use zoom before guessing offsets.
Avoid stale coordinates: re-capture before action if UI moved/scrolled.

Safety rules

Respect dry_run and allowed_region restrictions from /health.
Respect /exec security requirements (CLICKTHROUGH_EXEC_SECRET + x-clickthrough-exec-secret).
Avoid destructive shortcuts unless explicitly requested.
Send one action at a time unless deterministic; then use /batch.

Reliability rules

After every meaningful action, verify with a fresh screenshot.
On mismatch, do not spam clicks: zoom, re-localize, and retry once.
Prefer short, reversible actions over long macros.
If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.

App-specific playbooks (recommended)

Build per-app routines for repetitive tasks instead of generic clicking.

Spotify playbook

Focus app window before search/navigation.
Prefer keyboard-first flow for song start:
1. Ctrl+L (search)
2. type exact query
3. Enter
4. verify exact song+artist text
5. click/double-click row
6. verify now-playing bar
If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.

4.8 KiB Raw Permalink Blame History