From 1429e90be297b68446e879ed0dbde85f80703e4f Mon Sep 17 00:00:00 2001 From: Luna Date: Fri, 1 May 2026 15:40:38 +0200 Subject: [PATCH] docs: tighten clickthrough skill and API guidance Closes #10 --- README.md | 5 +++ docs/API.md | 24 ++++++++++++++ skill/SKILL.md | 88 +++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 116 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 4658a10..bd29d03 100644 --- a/README.md +++ b/README.md @@ -38,6 +38,11 @@ For OCR support, install the native `tesseract` binary on the host (in addition 5. `POST /action?screen=0` to execute 6. `GET /screen?screen=0` again to verify result +Important: +- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields. +- Pixel coordinates and OCR bounding boxes are always global desktop coordinates. +- Prefer structured GUI interaction first; use `/exec` for launch, recovery, or explicit system-level tasks. + See: - `docs/API.md` - `docs/coordinate-system.md` diff --git a/docs/API.md b/docs/API.md index 26e10af..dfbb104 100644 --- a/docs/API.md +++ b/docs/API.md @@ -76,6 +76,11 @@ Default response returns cropped image + region metadata in global pixel coordin Body: one action. +Important: +- the request body uses `action` plus an optional `target` +- pixel coordinates live inside `target` when `target.mode="pixel"` +- do **not** send top-level `x` / `y` fields + Query params: - `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0` @@ -170,6 +175,25 @@ Hotkey: } ``` +Right click: + +```json +{ + "action": "right_click", + "target": {"mode": "pixel", "x": 1300, "y": 740} +} +``` + +Move only: + +```json +{ + "action": "move", + "target": {"mode": "pixel", "x": 1300, "y": 740}, + "duration_ms": 150 +} +``` + ## `POST /ocr` Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes. diff --git a/skill/SKILL.md b/skill/SKILL.md index 6a06972..548163a 100644 --- a/skill/SKILL.md +++ b/skill/SKILL.md @@ -46,6 +46,8 @@ The agent should not assume it can self-install this stack. - Use `GET /displays` before operating on multi-monitor systems. - Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`. - Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates. +- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset. +- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets. ### OCR usage @@ -60,6 +62,74 @@ The agent should not assume it can self-install this stack. - Always send `x-clickthrough-token` when token auth is enabled. - For `/exec`, also send `x-clickthrough-exec-secret`. +## `POST /action` request shape (important) + +`/action` always expects an `action` plus an optional `target` object. +Do **not** invent top-level `x` / `y` fields. + +Minimal pixel click: + +```json +{ + "action": "click", + "target": {"mode": "pixel", "x": 100, "y": 200}, + "button": "left", + "clicks": 1 +} +``` + +Minimal grid click: + +```json +{ + "action": "click", + "target": { + "mode": "grid", + "region_x": 0, + "region_y": 0, + "region_width": 1920, + "region_height": 1080, + "rows": 12, + "cols": 12, + "row": 6, + "col": 8, + "dx": 0.0, + "dy": 0.0 + } +} +``` + +Other canonical examples: + +```json +{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}} +{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}} +{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}} +{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500} +{"action": "type", "text": "hello world", "interval_ms": 20} +{"action": "hotkey", "keys": ["ctrl", "l"]} +``` + +Rules: +- `dx` / `dy` belong inside `target`, not beside it. +- `type` and `hotkey` usually do not need a `target`. +- For pixel targets, `x` / `y` are global desktop coordinates. +- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used. + +## When to use `/exec` + +Prefer structured GUI control first: +- `/screen`, `/zoom`, `/ocr` to observe +- `/action` or `/batch` to interact + +Use `/exec` only when it is the cleanest available tool for the job, for example: +- launching an app that is not already visible +- querying machine state that the GUI does not expose well +- performing an explicit user-requested shell/system task +- recovering from a blocked GUI flow when normal interaction failed + +Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly. + ## Core workflow (mandatory) 1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display. @@ -73,7 +143,12 @@ The agent should not assume it can self-install this stack. ## Verify-before-click rules - Never click if target identity is ambiguous. -- Require at least two matching signals before click (example: OCR text + expected UI region). +- Require at least two matching signals before click. +- Good signal pairs include: + - OCR text + expected UI region + - OCR text + matching button shape/icon nearby + - dialog title text + expected button position within that dialog + - known app/window focus + expected control location - If confidence is low, do not "test click"; zoom and re-localize first. - For high-impact actions (close/delete/send/purchase), use two-phase flow: 1) preview intended coordinate + reason @@ -100,6 +175,17 @@ The agent should not assume it can self-install this stack. - Prefer short, reversible actions over long macros. - If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click. +## Fallback ladder for uncertain targeting + +1. Full-screen capture with a coarse grid. +2. Zoom into the candidate area with a denser grid. +3. OCR the full screen or the tighter region. +4. Re-anchor on a more reliable nearby control, title, or label. +5. Try a keyboard-first flow if the app supports it. +6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner. + +Do not skip from "uncertain click" straight to random retries. + ## App-specific playbooks (recommended) Build per-app routines for repetitive tasks instead of generic clicking.