Compare commits
2 Commits
c66779d929
...
2585bc3a7c
| Author | SHA1 | Date | |
|---|---|---|---|
| 2585bc3a7c | |||
| 66615c8a81 |
@@ -70,6 +70,7 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
|
||||
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
||||
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
|
||||
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
|
||||
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
|
||||
|
||||
### OCR usage
|
||||
|
||||
@@ -78,6 +79,8 @@ Say what you actually have: screenshots, OCR output, and fresh verification capt
|
||||
- Use `language_hint` when known (for example `eng`) to improve consistency.
|
||||
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
|
||||
- Treat OCR as one signal, not the only signal, before high-impact clicks.
|
||||
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
|
||||
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
|
||||
|
||||
### Screenshot + `image` tool usage
|
||||
|
||||
@@ -87,25 +90,38 @@ This is especially useful for:
|
||||
- understanding dialog layout or pane structure
|
||||
- distinguishing similar nearby controls by icon, spacing, or emphasis
|
||||
- checking whether a visual state changed after a click
|
||||
- telling you where something is and where to click when text alone is not reliable
|
||||
|
||||
Good pattern:
|
||||
1. capture with `GET /screen` or `POST /zoom`
|
||||
2. hand that screenshot to the `image` tool
|
||||
3. ask a precise question about the visible UI
|
||||
4. convert the answer into a concrete Clickthrough target
|
||||
5. act once
|
||||
6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
|
||||
4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
|
||||
5. convert the answer into a concrete Clickthrough target
|
||||
6. act once
|
||||
7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
|
||||
|
||||
Prefer vision over guessing.
|
||||
If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
|
||||
The model should help answer things like:
|
||||
- which visible button is the real primary action
|
||||
- whether the target is left/right/top/bottom within the crop
|
||||
- which of several similar buttons is the one to click
|
||||
- an approximate click point inside the provided image bounds
|
||||
|
||||
Ask narrow questions.
|
||||
Good:
|
||||
- "Which button in this dialog is the primary confirmation action?"
|
||||
- "Is the scan still running, or does this look complete?"
|
||||
- "Which of these tabs appears selected?"
|
||||
- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
|
||||
- "Which visible control says Stop Recording, and where should I click?"
|
||||
|
||||
Bad:
|
||||
- "What should I click?"
|
||||
- "Use your eyes and do the task"
|
||||
- anything that assumes the model has live continuity without a new screenshot
|
||||
- requesting coordinates without telling the model the image bounds or expected output format
|
||||
|
||||
### Header requirements
|
||||
|
||||
@@ -179,17 +195,19 @@ Use `/exec` only when it is the cleanest available tool for the job, for example
|
||||
|
||||
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
|
||||
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
|
||||
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
|
||||
|
||||
## Core workflow (mandatory)
|
||||
|
||||
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
||||
2. Identify likely target region and compute an initial confidence score.
|
||||
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
||||
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
||||
5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
|
||||
6. Execute one minimal action via `POST /action`.
|
||||
7. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
|
||||
8. Repeat until objective is complete.
|
||||
1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
|
||||
2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
||||
3. Identify likely target region and compute an initial confidence score.
|
||||
4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
||||
5. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
||||
6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
|
||||
7. Execute one minimal action via `POST /action`.
|
||||
8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
|
||||
9. Repeat until objective is complete.
|
||||
|
||||
## Verify-before-click rules
|
||||
|
||||
@@ -200,7 +218,9 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
|
||||
- OCR text + matching button shape/icon nearby
|
||||
- dialog title text + expected button position within that dialog
|
||||
- known app/window focus + expected control location
|
||||
- OCR candidate + vision-model localization inside the same crop
|
||||
- If confidence is low, do not "test click"; zoom and re-localize first.
|
||||
- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
|
||||
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
||||
1) preview intended coordinate + reason
|
||||
2) execute only after explicit confirmation.
|
||||
@@ -252,6 +272,7 @@ Example loop:
|
||||
|
||||
The key rule is simple: screenshot first, interpret second, click third, verify fourth.
|
||||
Do not collapse those steps into fake certainty.
|
||||
When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
|
||||
|
||||
## App-specific playbooks (recommended)
|
||||
|
||||
@@ -317,6 +338,7 @@ Use for browser navigation, tab targeting, or web app recovery.
|
||||
3. verify tab or page identity with OCR on the tab strip or page heading
|
||||
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
|
||||
5. after navigation, wait for visual stability or expected text before taking the next action
|
||||
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
|
||||
|
||||
Do not assume a page loaded just because the click landed. Verify it.
|
||||
|
||||
@@ -333,6 +355,19 @@ Use when the task involves toggles, dropdowns, sidebars, or nested settings pane
|
||||
|
||||
Settings UIs love hiding side effects. Assume nothing.
|
||||
|
||||
### Dense app / control-strip playbook
|
||||
|
||||
Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
|
||||
|
||||
1. focus the exact app window with `POST /windows/action`
|
||||
2. capture the full target display once to confirm the window is actually frontmost
|
||||
3. crop tightly around the suspected control strip with `POST /zoom`
|
||||
4. run OCR on the crop, not the full screen
|
||||
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
|
||||
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
|
||||
|
||||
Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
|
||||
|
||||
### Spotify playbook
|
||||
|
||||
- Focus app window before search/navigation.
|
||||
|
||||
Reference in New Issue
Block a user