diff --git a/TODO.md b/TODO.md index a5e56aa..dc97073 100644 --- a/TODO.md +++ b/TODO.md @@ -21,5 +21,6 @@ - [x] Add exec configuration via env (`CLICKTHROUGH_EXEC_*`) - [x] Document exec API + config - [x] Create backlog issues for OCR/find/window/input/session-state improvements -- [ ] Open PR for exec feature branch and review/merge +- [x] Open PR for exec feature branch and review/merge - [x] Require configured exec secret + per-request exec secret header +- [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook diff --git a/skill/SKILL.md b/skill/SKILL.md index f96c3c2..7416de6 100644 --- a/skill/SKILL.md +++ b/skill/SKILL.md @@ -7,24 +7,36 @@ description: Control a local computer through the Clickthrough HTTP server using Use a strict observe-decide-act-verify loop. -## Workflow +## Core workflow (mandatory) 1. Call `GET /screen` with coarse grid (e.g., 12x12). -2. Identify likely cell/region for the target UI element. -3. If confidence is low, call `POST /zoom` centered on the candidate and use denser grid (e.g., 20x20). -4. Execute one minimal action via `POST /action`. -5. Re-capture with `GET /screen` and verify the expected state change. -6. Repeat until objective is complete. +2. Identify likely target region and compute an initial confidence score. +3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate. +4. **Before any click**, verify target identity (text/icon/location consistency). +5. Execute one minimal action via `POST /action`. +6. Re-capture with `GET /screen` and verify the expected state change. +7. Repeat until objective is complete. + +## Verify-before-click rules + +- Never click if target identity is ambiguous. +- Require at least two matching signals before click (example: expected text + expected UI region). +- If confidence is low, do not "test click"; zoom and re-localize first. +- For high-impact actions (close/delete/send/purchase), use two-phase flow: + 1) preview intended coordinate + reason + 2) execute only after explicit confirmation. ## Precision rules - Prefer grid targets first, then use `dx/dy` for subcell precision. - Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed. - Use zoom before guessing offsets. +- Avoid stale coordinates: re-capture before action if UI moved/scrolled. ## Safety rules - Respect `dry_run` and `allowed_region` restrictions from `/health`. +- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`). - Avoid destructive shortcuts unless explicitly requested. - Send one action at a time unless deterministic; then use `/batch`. @@ -33,3 +45,20 @@ Use a strict observe-decide-act-verify loop. - After every meaningful action, verify with a fresh screenshot. - On mismatch, do not spam clicks: zoom, re-localize, and retry once. - Prefer short, reversible actions over long macros. +- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click. + +## App-specific playbooks (recommended) + +Build per-app routines for repetitive tasks instead of generic clicking. + +### Spotify playbook + +- Focus app window before search/navigation. +- Prefer keyboard-first flow for song start: + 1) `Ctrl+L` (search) + 2) type exact query + 3) Enter + 4) verify exact song+artist text + 5) click/double-click row + 6) verify now-playing bar +- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.