Compare commits

..

2 Commits

Author SHA1 Message Date
2585bc3a7c docs(skill): prefer vision for target localization
All checks were successful
python-syntax / syntax-check (push) Successful in 11s
2026-05-01 17:16:37 +02:00
66615c8a81 docs(skill): refine clickthrough control guidance 2026-05-01 17:15:19 +02:00

View File

@@ -70,6 +70,7 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates. - Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset. - Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets. - If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
### OCR usage ### OCR usage
@@ -78,6 +79,8 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
- Use `language_hint` when known (for example `eng`) to improve consistency. - Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app). - Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks. - Treat OCR as one signal, not the only signal, before high-impact clicks.
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
### Screenshot + `image` tool usage ### Screenshot + `image` tool usage
@@ -87,25 +90,38 @@ This is especially useful for:
- understanding dialog layout or pane structure - understanding dialog layout or pane structure
- distinguishing similar nearby controls by icon, spacing, or emphasis - distinguishing similar nearby controls by icon, spacing, or emphasis
- checking whether a visual state changed after a click - checking whether a visual state changed after a click
- telling you where something is and where to click when text alone is not reliable
Good pattern: Good pattern:
1. capture with `GET /screen` or `POST /zoom` 1. capture with `GET /screen` or `POST /zoom`
2. hand that screenshot to the `image` tool 2. hand that screenshot to the `image` tool
3. ask a precise question about the visible UI 3. ask a precise question about the visible UI
4. convert the answer into a concrete Clickthrough target 4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
5. act once 5. convert the answer into a concrete Clickthrough target
6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly 6. act once
7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
Prefer vision over guessing.
If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
The model should help answer things like:
- which visible button is the real primary action
- whether the target is left/right/top/bottom within the crop
- which of several similar buttons is the one to click
- an approximate click point inside the provided image bounds
Ask narrow questions. Ask narrow questions.
Good: Good:
- "Which button in this dialog is the primary confirmation action?" - "Which button in this dialog is the primary confirmation action?"
- "Is the scan still running, or does this look complete?" - "Is the scan still running, or does this look complete?"
- "Which of these tabs appears selected?" - "Which of these tabs appears selected?"
- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
- "Which visible control says Stop Recording, and where should I click?"
Bad: Bad:
- "What should I click?" - "What should I click?"
- "Use your eyes and do the task" - "Use your eyes and do the task"
- anything that assumes the model has live continuity without a new screenshot - anything that assumes the model has live continuity without a new screenshot
- requesting coordinates without telling the model the image bounds or expected output format
### Header requirements ### Header requirements
@@ -179,17 +195,19 @@ Use `/exec` only when it is the cleanest available tool for the job, for example
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`. Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly. Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
## Core workflow (mandatory) ## Core workflow (mandatory)
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display. 1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
2. Identify likely target region and compute an initial confidence score. 2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate. 3. Identify likely target region and compute an initial confidence score.
4. **Before any click**, verify target identity (OCR text/icon/location consistency). 4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough. 5. **Before any click**, verify target identity (OCR text/icon/location consistency).
6. Execute one minimal action via `POST /action`. 6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
7. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change. 7. Execute one minimal action via `POST /action`.
8. Repeat until objective is complete. 8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
9. Repeat until objective is complete.
## Verify-before-click rules ## Verify-before-click rules
@@ -200,7 +218,9 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
- OCR text + matching button shape/icon nearby - OCR text + matching button shape/icon nearby
- dialog title text + expected button position within that dialog - dialog title text + expected button position within that dialog
- known app/window focus + expected control location - known app/window focus + expected control location
- OCR candidate + vision-model localization inside the same crop
- If confidence is low, do not "test click"; zoom and re-localize first. - If confidence is low, do not "test click"; zoom and re-localize first.
- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
- For high-impact actions (close/delete/send/purchase), use two-phase flow: - For high-impact actions (close/delete/send/purchase), use two-phase flow:
1) preview intended coordinate + reason 1) preview intended coordinate + reason
2) execute only after explicit confirmation. 2) execute only after explicit confirmation.
@@ -252,6 +272,7 @@ Example loop:
The key rule is simple: screenshot first, interpret second, click third, verify fourth. The key rule is simple: screenshot first, interpret second, click third, verify fourth.
Do not collapse those steps into fake certainty. Do not collapse those steps into fake certainty.
When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
## App-specific playbooks (recommended) ## App-specific playbooks (recommended)
@@ -317,6 +338,7 @@ Use for browser navigation, tab targeting, or web app recovery.
3. verify tab or page identity with OCR on the tab strip or page heading 3. verify tab or page identity with OCR on the tab strip or page heading
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs 4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
5. after navigation, wait for visual stability or expected text before taking the next action 5. after navigation, wait for visual stability or expected text before taking the next action
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
Do not assume a page loaded just because the click landed. Verify it. Do not assume a page loaded just because the click landed. Verify it.
@@ -333,6 +355,19 @@ Use when the task involves toggles, dropdowns, sidebars, or nested settings pane
Settings UIs love hiding side effects. Assume nothing. Settings UIs love hiding side effects. Assume nothing.
### Dense app / control-strip playbook
Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
1. focus the exact app window with `POST /windows/action`
2. capture the full target display once to confirm the window is actually frontmost
3. crop tightly around the suspected control strip with `POST /zoom`
4. run OCR on the crop, not the full screen
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
### Spotify playbook ### Spotify playbook
- Focus app window before search/navigation. - Focus app window before search/navigation.