docs(skill): refine clickthrough control guidance

2026-05-01 17:15:19 +02:00
parent c66779d929
commit 66615c8a81
1 changed files with 27 additions and 8 deletions
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -70,6 +70,7 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
 - Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
 - Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
 - If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
+- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.

 ### OCR usage

@@ -78,6 +79,8 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
 - Use `language_hint` when known (for example `eng`) to improve consistency.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.
+- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
+- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.

 ### Screenshot + `image` tool usage

@@ -179,17 +182,19 @@ Use `/exec` only when it is the cleanest available tool for the job, for example

 Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
 Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
+When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.

 ## Core workflow (mandatory)

-1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
-2. Identify likely target region and compute an initial confidence score.
-3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
-4. **Before any click**, verify target identity (OCR text/icon/location consistency).
-5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
-6. Execute one minimal action via `POST /action`.
-7. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
-8. Repeat until objective is complete.
+1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
+2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
+3. Identify likely target region and compute an initial confidence score.
+4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
+5. **Before any click**, verify target identity (OCR text/icon/location consistency).
+6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
+7. Execute one minimal action via `POST /action`.
+8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
+9. Repeat until objective is complete.

 ## Verify-before-click rules

@@ -317,6 +322,7 @@ Use for browser navigation, tab targeting, or web app recovery.
 3. verify tab or page identity with OCR on the tab strip or page heading
 4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
 5. after navigation, wait for visual stability or expected text before taking the next action
+6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters

 Do not assume a page loaded just because the click landed. Verify it.

@@ -333,6 +339,19 @@ Use when the task involves toggles, dropdowns, sidebars, or nested settings pane

 Settings UIs love hiding side effects. Assume nothing.

+### Dense app / control-strip playbook
+
+Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
+
+1. focus the exact app window with `POST /windows/action`
+2. capture the full target display once to confirm the window is actually frontmost
+3. crop tightly around the suspected control strip with `POST /zoom`
+4. run OCR on the crop, not the full screen
+5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
+6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
+
+Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
+
 ### Spotify playbook

 - Focus app window before search/navigation.