docs(skill): prefer vision for target localization

docs(skill): refine clickthrough control guidance
2026-05-01 17:16:37 +02:00 · 2026-05-01 17:15:19 +02:00
1 changed files with 46 additions and 11 deletions
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -70,6 +70,7 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
 - Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
 - Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
 - If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
 - Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
 ### OCR usage
@@ -78,6 +79,8 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
 - Use `language_hint` when known (for example `eng`) to improve consistency.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.
 - Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
 - OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
 ### Screenshot + `image` tool usage
@@ -87,25 +90,38 @@ This is especially useful for:
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
 - telling you where something is and where to click when text alone is not reliable
 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
-4. convert the answer into a concrete Clickthrough target
+4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
-5. act once
+5. convert the answer into a concrete Clickthrough target
-6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+6. act once
 7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
 Prefer vision over guessing.
 If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
 The model should help answer things like:
 - which visible button is the real primary action
 - whether the target is left/right/top/bottom within the crop
 - which of several similar buttons is the one to click
 - an approximate click point inside the provided image bounds
 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
 - "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
 - "Which visible control says Stop Recording, and where should I click?"
 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
 - requesting coordinates without telling the model the image bounds or expected output format
 ### Header requirements
@@ -179,17 +195,19 @@ Use `/exec` only when it is the cleanest available tool for the job, for example
 Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
 Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
 When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
 ## Core workflow (mandatory)
-1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
+1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
-2. Identify likely target region and compute an initial confidence score.
+2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
-3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
+3. Identify likely target region and compute an initial confidence score.
-4. **Before any click**, verify target identity (OCR text/icon/location consistency).
+4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
-5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
+5. **Before any click**, verify target identity (OCR text/icon/location consistency).
-6. Execute one minimal action via `POST /action`.
+6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
-7. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
+7. Execute one minimal action via `POST /action`.
-8. Repeat until objective is complete.
+8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
 9. Repeat until objective is complete.
 ## Verify-before-click rules
@@ -200,7 +218,9 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
  - OCR text + matching button shape/icon nearby
  - dialog title text + expected button position within that dialog
  - known app/window focus + expected control location
  - OCR candidate + vision-model localization inside the same crop
 - If confidence is low, do not "test click"; zoom and re-localize first.
 - If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.
@@ -252,6 +272,7 @@ Example loop:
 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
 When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
 ## App-specific playbooks (recommended)
@@ -317,6 +338,7 @@ Use for browser navigation, tab targeting, or web app recovery.
 3. verify tab or page identity with OCR on the tab strip or page heading
 4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
 5. after navigation, wait for visual stability or expected text before taking the next action
 6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
 Do not assume a page loaded just because the click landed. Verify it.
@@ -333,6 +355,19 @@ Use when the task involves toggles, dropdowns, sidebars, or nested settings pane
 Settings UIs love hiding side effects. Assume nothing.
 ### Dense app / control-strip playbook
 Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
 1. focus the exact app window with `POST /windows/action`
 2. capture the full target display once to confirm the window is actually frontmost
 3. crop tightly around the suspected control strip with `POST /zoom`
 4. run OCR on the crop, not the full screen
 5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
 6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
 Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
 ### Spotify playbook
 - Focus app window before search/navigation.
Author	SHA1	Message	Date
Luna	2585bc3a7c	docs(skill): prefer vision for target localization All checks were successful python-syntax / syntax-check (push) Successful in 11s Details	2026-05-01 17:16:37 +02:00
Luna	66615c8a81	docs(skill): refine clickthrough control guidance	2026-05-01 17:15:19 +02:00