clickthrough/skill/SKILL.md

---
name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
---

# Clickthrough HTTP Control

Use a strict observe-decide-act-verify loop.

## Getting a computer instance (user-owned setup)

The **user/operator** is responsible for provisioning and exposing the target machine.
The agent should not assume it can self-install this stack.

### What the user must do

1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
3. Configure secrets on target machine:
   - `CLICKTHROUGH_TOKEN` for general API auth
   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
4. Share connection details with the agent through a secure channel:
   - `base_url`
   - `x-clickthrough-token`
   - `x-clickthrough-exec-secret` (only when `/exec` is needed)

### What the agent should do

1. Validate connection with `GET /health` using provided headers.
2. Refuse `/exec` attempts when exec secret is missing/invalid.
3. Ask user for missing setup inputs instead of guessing infrastructure.

## Mini API map

- `GET /health` → server status + safety flags
- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /batch` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)

### OCR usage

- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.

### Header requirements

- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.

## Core workflow (mandatory)

1. Call `GET /screen` with coarse grid (e.g., 12x12).
2. Identify likely target region and compute an initial confidence score.
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
5. Execute one minimal action via `POST /action`.
6. Re-capture with `GET /screen` and verify the expected state change.
7. Repeat until objective is complete.

## Verify-before-click rules

- Never click if target identity is ambiguous.
- Require at least two matching signals before click (example: OCR text + expected UI region).
- If confidence is low, do not "test click"; zoom and re-localize first.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.

## Precision rules

- Prefer grid targets first, then use `dx/dy` for subcell precision.
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.

## Safety rules

- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use `/batch`.

## Reliability rules

- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.

## App-specific playbooks (recommended)

Build per-app routines for repetitive tasks instead of generic clicking.

### Spotify playbook

- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
  1) `Ctrl+L` (search)
  2) type exact query
  3) Enter
  4) verify exact song+artist text
  5) click/double-click row
  6) verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.