docs(skill): refine clickthrough control guidance
This commit is contained in:
@@ -70,6 +70,7 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
|
||||
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
||||
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
|
||||
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
|
||||
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
|
||||
|
||||
### OCR usage
|
||||
|
||||
@@ -78,6 +79,8 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
|
||||
- Use `language_hint` when known (for example `eng`) to improve consistency.
|
||||
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
|
||||
- Treat OCR as one signal, not the only signal, before high-impact clicks.
|
||||
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
|
||||
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
|
||||
|
||||
### Screenshot + `image` tool usage
|
||||
|
||||
@@ -179,17 +182,19 @@ Use `/exec` only when it is the cleanest available tool for the job, for example
|
||||
|
||||
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
|
||||
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
|
||||
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
|
||||
|
||||
## Core workflow (mandatory)
|
||||
|
||||
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
||||
2. Identify likely target region and compute an initial confidence score.
|
||||
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
||||
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
||||
5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
|
||||
6. Execute one minimal action via `POST /action`.
|
||||
7. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
|
||||
8. Repeat until objective is complete.
|
||||
1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
|
||||
2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
||||
3. Identify likely target region and compute an initial confidence score.
|
||||
4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
||||
5. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
||||
6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
|
||||
7. Execute one minimal action via `POST /action`.
|
||||
8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
|
||||
9. Repeat until objective is complete.
|
||||
|
||||
## Verify-before-click rules
|
||||
|
||||
@@ -317,6 +322,7 @@ Use for browser navigation, tab targeting, or web app recovery.
|
||||
3. verify tab or page identity with OCR on the tab strip or page heading
|
||||
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
|
||||
5. after navigation, wait for visual stability or expected text before taking the next action
|
||||
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
|
||||
|
||||
Do not assume a page loaded just because the click landed. Verify it.
|
||||
|
||||
@@ -333,6 +339,19 @@ Use when the task involves toggles, dropdowns, sidebars, or nested settings pane
|
||||
|
||||
Settings UIs love hiding side effects. Assume nothing.
|
||||
|
||||
### Dense app / control-strip playbook
|
||||
|
||||
Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
|
||||
|
||||
1. focus the exact app window with `POST /windows/action`
|
||||
2. capture the full target display once to confirm the window is actually frontmost
|
||||
3. crop tightly around the suspected control strip with `POST /zoom`
|
||||
4. run OCR on the crop, not the full screen
|
||||
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
|
||||
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
|
||||
|
||||
Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
|
||||
|
||||
### Spotify playbook
|
||||
|
||||
- Focus app window before search/navigation.
|
||||
|
||||
Reference in New Issue
Block a user