docs: tighten clickthrough skill and API guidance
All checks were successful
python-syntax / syntax-check (push) Successful in 43s
All checks were successful
python-syntax / syntax-check (push) Successful in 43s
Closes #10
This commit is contained in:
@@ -38,6 +38,11 @@ For OCR support, install the native `tesseract` binary on the host (in addition
|
|||||||
5. `POST /action?screen=0` to execute
|
5. `POST /action?screen=0` to execute
|
||||||
6. `GET /screen?screen=0` again to verify result
|
6. `GET /screen?screen=0` again to verify result
|
||||||
|
|
||||||
|
Important:
|
||||||
|
- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
|
||||||
|
- Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
|
||||||
|
- Prefer structured GUI interaction first; use `/exec` for launch, recovery, or explicit system-level tasks.
|
||||||
|
|
||||||
See:
|
See:
|
||||||
- `docs/API.md`
|
- `docs/API.md`
|
||||||
- `docs/coordinate-system.md`
|
- `docs/coordinate-system.md`
|
||||||
|
|||||||
24
docs/API.md
24
docs/API.md
@@ -76,6 +76,11 @@ Default response returns cropped image + region metadata in global pixel coordin
|
|||||||
|
|
||||||
Body: one action.
|
Body: one action.
|
||||||
|
|
||||||
|
Important:
|
||||||
|
- the request body uses `action` plus an optional `target`
|
||||||
|
- pixel coordinates live inside `target` when `target.mode="pixel"`
|
||||||
|
- do **not** send top-level `x` / `y` fields
|
||||||
|
|
||||||
Query params:
|
Query params:
|
||||||
|
|
||||||
- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
|
- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
|
||||||
@@ -170,6 +175,25 @@ Hotkey:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Right click:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"action": "right_click",
|
||||||
|
"target": {"mode": "pixel", "x": 1300, "y": 740}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Move only:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"action": "move",
|
||||||
|
"target": {"mode": "pixel", "x": 1300, "y": 740},
|
||||||
|
"duration_ms": 150
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
## `POST /ocr`
|
## `POST /ocr`
|
||||||
|
|
||||||
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
|
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
|
||||||
|
|||||||
@@ -46,6 +46,8 @@ The agent should not assume it can self-install this stack.
|
|||||||
- Use `GET /displays` before operating on multi-monitor systems.
|
- Use `GET /displays` before operating on multi-monitor systems.
|
||||||
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
|
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
|
||||||
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
||||||
|
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
|
||||||
|
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
|
||||||
|
|
||||||
### OCR usage
|
### OCR usage
|
||||||
|
|
||||||
@@ -60,6 +62,74 @@ The agent should not assume it can self-install this stack.
|
|||||||
- Always send `x-clickthrough-token` when token auth is enabled.
|
- Always send `x-clickthrough-token` when token auth is enabled.
|
||||||
- For `/exec`, also send `x-clickthrough-exec-secret`.
|
- For `/exec`, also send `x-clickthrough-exec-secret`.
|
||||||
|
|
||||||
|
## `POST /action` request shape (important)
|
||||||
|
|
||||||
|
`/action` always expects an `action` plus an optional `target` object.
|
||||||
|
Do **not** invent top-level `x` / `y` fields.
|
||||||
|
|
||||||
|
Minimal pixel click:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"action": "click",
|
||||||
|
"target": {"mode": "pixel", "x": 100, "y": 200},
|
||||||
|
"button": "left",
|
||||||
|
"clicks": 1
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Minimal grid click:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"action": "click",
|
||||||
|
"target": {
|
||||||
|
"mode": "grid",
|
||||||
|
"region_x": 0,
|
||||||
|
"region_y": 0,
|
||||||
|
"region_width": 1920,
|
||||||
|
"region_height": 1080,
|
||||||
|
"rows": 12,
|
||||||
|
"cols": 12,
|
||||||
|
"row": 6,
|
||||||
|
"col": 8,
|
||||||
|
"dx": 0.0,
|
||||||
|
"dy": 0.0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Other canonical examples:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
||||||
|
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
||||||
|
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
||||||
|
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
|
||||||
|
{"action": "type", "text": "hello world", "interval_ms": 20}
|
||||||
|
{"action": "hotkey", "keys": ["ctrl", "l"]}
|
||||||
|
```
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
- `dx` / `dy` belong inside `target`, not beside it.
|
||||||
|
- `type` and `hotkey` usually do not need a `target`.
|
||||||
|
- For pixel targets, `x` / `y` are global desktop coordinates.
|
||||||
|
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
|
||||||
|
|
||||||
|
## When to use `/exec`
|
||||||
|
|
||||||
|
Prefer structured GUI control first:
|
||||||
|
- `/screen`, `/zoom`, `/ocr` to observe
|
||||||
|
- `/action` or `/batch` to interact
|
||||||
|
|
||||||
|
Use `/exec` only when it is the cleanest available tool for the job, for example:
|
||||||
|
- launching an app that is not already visible
|
||||||
|
- querying machine state that the GUI does not expose well
|
||||||
|
- performing an explicit user-requested shell/system task
|
||||||
|
- recovering from a blocked GUI flow when normal interaction failed
|
||||||
|
|
||||||
|
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
|
||||||
|
|
||||||
## Core workflow (mandatory)
|
## Core workflow (mandatory)
|
||||||
|
|
||||||
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
||||||
@@ -73,7 +143,12 @@ The agent should not assume it can self-install this stack.
|
|||||||
## Verify-before-click rules
|
## Verify-before-click rules
|
||||||
|
|
||||||
- Never click if target identity is ambiguous.
|
- Never click if target identity is ambiguous.
|
||||||
- Require at least two matching signals before click (example: OCR text + expected UI region).
|
- Require at least two matching signals before click.
|
||||||
|
- Good signal pairs include:
|
||||||
|
- OCR text + expected UI region
|
||||||
|
- OCR text + matching button shape/icon nearby
|
||||||
|
- dialog title text + expected button position within that dialog
|
||||||
|
- known app/window focus + expected control location
|
||||||
- If confidence is low, do not "test click"; zoom and re-localize first.
|
- If confidence is low, do not "test click"; zoom and re-localize first.
|
||||||
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
||||||
1) preview intended coordinate + reason
|
1) preview intended coordinate + reason
|
||||||
@@ -100,6 +175,17 @@ The agent should not assume it can self-install this stack.
|
|||||||
- Prefer short, reversible actions over long macros.
|
- Prefer short, reversible actions over long macros.
|
||||||
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
|
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
|
||||||
|
|
||||||
|
## Fallback ladder for uncertain targeting
|
||||||
|
|
||||||
|
1. Full-screen capture with a coarse grid.
|
||||||
|
2. Zoom into the candidate area with a denser grid.
|
||||||
|
3. OCR the full screen or the tighter region.
|
||||||
|
4. Re-anchor on a more reliable nearby control, title, or label.
|
||||||
|
5. Try a keyboard-first flow if the app supports it.
|
||||||
|
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
|
||||||
|
|
||||||
|
Do not skip from "uncertain click" straight to random retries.
|
||||||
|
|
||||||
## App-specific playbooks (recommended)
|
## App-specific playbooks (recommended)
|
||||||
|
|
||||||
Build per-app routines for repetitive tasks instead of generic clicking.
|
Build per-app routines for repetitive tasks instead of generic clicking.
|
||||||
|
|||||||
Reference in New Issue
Block a user