docs(skill): prefer vision for target localization
All checks were successful
python-syntax / syntax-check (push) Successful in 11s
@@ -90,25 +90,38 @@ This is especially useful for:
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
+- telling you where something is and where to click when text alone is not reliable

 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
-4. convert the answer into a concrete Clickthrough target
-5. act once
-6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
+5. convert the answer into a concrete Clickthrough target
+6. act once
+7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+
+Prefer vision over guessing.
+If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
+The model should help answer things like:
+- which visible button is the real primary action
+- whether the target is left/right/top/bottom within the crop
+- which of several similar buttons is the one to click
+- an approximate click point inside the provided image bounds

 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
+- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
+- "Which visible control says Stop Recording, and where should I click?"

 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
+- requesting coordinates without telling the model the image bounds or expected output format

 ### Header requirements
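The coordinate request added in the hunk above (state the image bounds, ask for one `x,y` point, then convert it into a click target) could be sketched in Python roughly as follows. This is only an illustration: `Crop`, `build_localization_prompt`, `parse_click_point`, and `to_screen_point` are hypothetical names, not part of Clickthrough, and the sketch assumes the vision model replies with a plain `x,y` pair and that crop pixels map 1:1 onto screen pixels.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Hypothetical record of a zoomed capture: its screen origin and pixel size."""
    left: int
    top: int
    width: int
    height: int


def build_localization_prompt(target: str, crop: Crop) -> str:
    # Tell the model the image bounds and the expected output format,
    # mirroring the "Good" question style in the skill text.
    return (
        f"Where is {target} in this {crop.width}x{crop.height} crop? "
        "Return one x,y coordinate inside the image bounds."
    )


def parse_click_point(reply: str, crop: Crop) -> tuple[int, int]:
    """Extract the first 'x,y' pair from the reply and reject out-of-bounds points."""
    match = re.search(r"(\d+)\s*,\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"no x,y coordinate found in reply: {reply!r}")
    x, y = int(match.group(1)), int(match.group(2))
    if not (0 <= x < crop.width and 0 <= y < crop.height):
        raise ValueError(
            f"point ({x}, {y}) is outside the {crop.width}x{crop.height} crop"
        )
    return x, y


def to_screen_point(point: tuple[int, int], crop: Crop) -> tuple[int, int]:
    # Assumption: no scaling between crop and screen, so only the origin shifts.
    x, y = point
    return crop.left + x, crop.top + y
```

Rejecting an out-of-bounds reply, rather than clamping it, keeps the caller on the "recrop and ask again" path instead of clicking a guessed location.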
@@ -205,7 +218,9 @@ When a task can be completed with window focus/restore, keyboard shortcuts, scre
 - OCR text + matching button shape/icon nearby
 - dialog title text + expected button position within that dialog
 - known app/window focus + expected control location
+- OCR candidate + vision-model localization inside the same crop
 - If confidence is low, do not "test click"; zoom and re-localize first.
+- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
 1) preview intended coordinate + reason
 2) execute only after explicit confirmation.
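One way to read the rules in this hunk is as a small decision function: require the OCR candidate and the vision point to agree, and gate high-impact actions behind a preview. The Python sketch below is an assumption-laden illustration, not Clickthrough's implementation; `Box`, `signals_agree`, `plan_click`, the 8-pixel margin, and the returned dictionary shape are all hypothetical.

```python
from dataclasses import dataclass

# Action kinds the skill text lists as high-impact.
HIGH_IMPACT = {"close", "delete", "send", "purchase"}


@dataclass(frozen=True)
class Box:
    """Hypothetical OCR candidate bounding box, in the same coordinates as the vision point."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int, margin: int = 0) -> bool:
        return (
            self.left - margin <= x <= self.right + margin
            and self.top - margin <= y <= self.bottom + margin
        )


def signals_agree(ocr_box: Box, vision_point: tuple[int, int], margin: int = 8) -> bool:
    # Two independent signals agree when the vision point lands on (or near) the OCR box.
    x, y = vision_point
    return ocr_box.contains(x, y, margin)


def plan_click(action_kind: str, ocr_box: Box, vision_point: tuple[int, int],
               confirmed: bool = False) -> dict:
    """Decide whether to re-localize, preview, or execute; never 'test click'."""
    if not signals_agree(ocr_box, vision_point):
        return {"status": "relocalize",
                "reason": "OCR and vision disagree; recrop and ask a narrower question"}
    if action_kind in HIGH_IMPACT and not confirmed:
        # Phase 1 of the two-phase flow: show the intended coordinate and the reason.
        return {"status": "preview", "point": vision_point,
                "reason": f"OCR box and vision point agree for '{action_kind}'"}
    return {"status": "execute", "point": vision_point}
```

A caller would show the `preview` result to the user and call `plan_click` again with `confirmed=True` only after explicit confirmation.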
@@ -257,6 +272,7 @@ Example loop:

 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
+When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.

 ## App-specific playbooks (recommended)
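The "screenshot first, interpret second, click third, verify fourth" rule from the hunk above can be sketched as a small Python driver. The request and response shapes of `GET /screen`, `POST /zoom`, and `POST /action/verify` are not shown in this diff, so the sketch takes the four steps as injected callables; `act_once_and_verify` and its parameters are hypothetical names.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def act_once_and_verify(
    capture: Callable[[], object],
    localize: Callable[[object], Optional[Point]],
    click: Callable[[Point], None],
    verify: Callable[[object], bool],
    max_localize_attempts: int = 2,
) -> bool:
    """Run one screenshot -> interpret -> click -> verify cycle.

    `localize` returns None when it is not confident; in that case the loop
    recaptures instead of clicking. The click itself happens at most once.
    """
    point: Optional[Point] = None
    for _ in range(max_localize_attempts):
        screenshot = capture()          # screenshot first
        point = localize(screenshot)    # interpret second
        if point is not None:
            break
    if point is None:
        return False                    # never invent coordinates
    click(point)                        # click third, exactly once
    return verify(capture())            # verify fourth, on a fresh screenshot
```

A failed verification returns `False` rather than clicking again, which leaves the decision to retry with the caller.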