docs(skill): prefer vision for target localization
All checks were successful
python-syntax / syntax-check (push) Successful in 11s
@@ -90,25 +90,38 @@ This is especially useful for:
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
+- telling you where something is and where to click when text alone is not reliable
 
 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
-4. convert the answer into a concrete Clickthrough target
-5. act once
-6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
+5. convert the answer into a concrete Clickthrough target
+6. act once
+7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
 
+Prefer vision over guessing.
+If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
+The model should help answer things like:
+- which visible button is the real primary action
+- whether the target is left/right/top/bottom within the crop
+- which of several similar buttons is the one to click
+- an approximate click point inside the provided image bounds
+
 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
+- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
+- "Which visible control says Stop Recording, and where should I click?"
 
 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
+- requesting coordinates without telling the model the image bounds or expected output format
 
 ### Header requirements
 
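The "Good" and "Bad" examples in this hunk come down to two habits: state the image bounds and the expected output format in the question, and reject any answer that falls outside those bounds. A minimal sketch of both, assuming the vision model replies in free text containing an `x,y` pair; `build_localization_prompt` and `parse_click_point` are hypothetical helper names, not part of the skill or of Clickthrough:

```python
import re
from typing import Optional, Tuple


def build_localization_prompt(target: str, width: int, height: int) -> str:
    """Build a narrow question that states the image bounds and the reply format."""
    return (
        f"Where is {target} in this {width}x{height} crop? "
        "Return one x,y coordinate inside the image bounds, "
        "formatted exactly as 'x,y' with integer pixel values."
    )


def parse_click_point(reply: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Extract the first 'x,y' pair from the reply.

    Returns None when no pair is present or when the pair lies outside
    the crop, so the caller can recapture instead of clicking a guess.
    """
    match = re.search(r"(\d+)\s*,\s*(\d+)", reply)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    if not (0 <= x < width and 0 <= y < height):
        return None  # out-of-bounds answers are rejected, not clamped
    return (x, y)


if __name__ == "__main__":
    print(build_localization_prompt("the orange Buy Now button", 620, 890))
    print(parse_click_point("The button is at 310, 845.", 620, 890))
```

Rejecting an out-of-bounds point rather than clamping it keeps a bad answer from turning silently into a click near the crop edge.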
@@ -205,7 +218,9 @@ When a task can be completed with window focus/restore, keyboard shortcuts, scre
 - OCR text + matching button shape/icon nearby
 - dialog title text + expected button position within that dialog
 - known app/window focus + expected control location
+- OCR candidate + vision-model localization inside the same crop
 - If confidence is low, do not "test click"; zoom and re-localize first.
+- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
 1) preview intended coordinate + reason
 2) execute only after explicit confirmation.
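The new "OCR candidate + vision-model localization inside the same crop" rule can be read as an agreement check: accept a click point only when the vision answer lands on or near the OCR hit, and otherwise recrop. The sketch below illustrates that reading. The `Box` shape, the 8-pixel margin, and the crop-to-screen mapping (crop origin plus point divided by zoom scale) are all assumptions for illustration; the actual `/zoom` coordinate semantics are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in crop pixels (hypothetical shape for an OCR hit)."""
    x: int
    y: int
    width: int
    height: int

    def contains(self, point: Point, margin: int = 0) -> bool:
        px, py = point
        return (self.x - margin <= px < self.x + self.width + margin
                and self.y - margin <= py < self.y + self.height + margin)


def agreed_target(ocr_box: Box, vision_point: Point, margin: int = 8) -> Optional[Point]:
    """Return the vision point only when it lands on (or near) the OCR box."""
    if ocr_box.contains(vision_point, margin):
        return vision_point
    return None  # signals disagree: recrop and ask a narrower question


def crop_to_screen(point: Point, crop_origin: Point, scale: float = 1.0) -> Point:
    """Map a crop-local point to screen pixels.

    Assumes the crop was taken at `crop_origin` and magnified by `scale`;
    this mapping is an assumption, not a documented Clickthrough contract.
    """
    return (crop_origin[0] + round(point[0] / scale),
            crop_origin[1] + round(point[1] / scale))


if __name__ == "__main__":
    box = Box(x=100, y=200, width=80, height=30)
    point = agreed_target(box, (140, 215))
    if point is not None:
        print(crop_to_screen(point, crop_origin=(1000, 300)))
```

For the high-impact actions listed above, the resulting screen coordinate would be what gets shown in the preview step before any confirmation.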
@@ -257,6 +272,7 @@ Example loop:
 
 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
+When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
 
 ## App-specific playbooks (recommended)
 
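The "screenshot first, interpret second, click third, verify fourth" rule can be sketched as a small loop. The four callables are hypothetical stand-ins, injected so that nothing here assumes the request or response shapes of `GET /screen`, the `image` tool, or `POST /action/verify`:

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def click_with_verification(
    capture: Callable[[], bytes],
    localize: Callable[[bytes], Optional[Point]],
    click: Callable[[Point], None],
    verify: Callable[[bytes], bool],
    max_attempts: int = 2,
) -> bool:
    """Run capture -> localize -> click -> verify, never clicking without a target."""
    for _ in range(max_attempts):
        screenshot = capture()        # 1. screenshot first
        point = localize(screenshot)  # 2. interpret second (e.g. ask the vision model)
        if point is None:
            continue                  # no confident target: recapture instead of guessing
        click(point)                  # 3. click third, once per capture
        if verify(capture()):         # 4. verify fourth, on a fresh screenshot
            return True
    return False


if __name__ == "__main__":
    clicks = []
    ok = click_with_verification(
        capture=lambda: b"fake-png",
        localize=lambda shot: (310, 845),
        click=clicks.append,
        verify=lambda shot: True,
    )
    print(ok, clicks)
```

For high-impact actions, `max_attempts=1` avoids repeating a click whose effect merely failed to verify.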