docs(skill): prefer vision for target localization

2026-05-01 17:16:37 +02:00
parent 66615c8a81
commit 2585bc3a7c
1 changed file with 19 additions and 3 deletions
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -90,25 +90,38 @@ This is especially useful for:
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
 - telling you where something is and where to click when text alone is not reliable
 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
-4. convert the answer into a concrete Clickthrough target
+4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
-5. act once
+5. convert the answer into a concrete Clickthrough target
-6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+6. act once
 7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
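
Step 5 of the pattern above (turning the model's answer into a concrete target) can be sketched as a coordinate mapping. This is a minimal sketch, not Clickthrough's own API: the `Crop` type and `crop_point_to_screen` function are hypothetical names, and it assumes the caller knows the crop's screen origin and size from the capture request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Hypothetical description of a captured region (not a Clickthrough type)."""
    left: int           # screen x of the crop's top-left corner
    top: int            # screen y of the crop's top-left corner
    screen_width: int   # width of the region on screen, in screen pixels
    screen_height: int  # height of the region on screen, in screen pixels
    image_width: int    # width of the image handed to the vision model
    image_height: int   # height of the image handed to the vision model


def crop_point_to_screen(crop: Crop, x: float, y: float) -> tuple[int, int]:
    """Map a point in image space to absolute screen coordinates."""
    # Reject answers that fall outside the image the model was shown.
    if not (0 <= x < crop.image_width and 0 <= y < crop.image_height):
        raise ValueError(f"point ({x}, {y}) is outside the image bounds")
    # A zoomed capture may be scaled, so rescale before adding the offset.
    scale_x = crop.screen_width / crop.image_width
    scale_y = crop.screen_height / crop.image_height
    return (crop.left + round(x * scale_x), crop.top + round(y * scale_y))
```

Rejecting out-of-bounds points instead of clamping them keeps a bad model answer from silently becoming a click on a neighboring control.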
 Prefer vision over guessing.
 If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
 The model should help answer things like:
 - which visible button is the real primary action
 - whether the target is left/right/top/bottom within the crop
 - which of several similar buttons is the one to click
 - an approximate click point inside the provided image bounds
 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
 - "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
 - "Which visible control says Stop Recording, and where should I click?"
 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
 - requesting coordinates without telling the model the image bounds or expected output format
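
The good and bad examples above suggest building the localization question from the crop size and checking the reply before trusting it. The following is a rough sketch under stated assumptions: `build_localization_prompt` and `parse_click_point` are hypothetical helpers, and the reply is assumed to contain a plain `x,y` pair as requested in the prompt.

```python
import re
from typing import Optional


def build_localization_prompt(target: str, width: int, height: int) -> str:
    """State the image bounds and the expected output format explicitly."""
    return (
        f"Where is {target} in this {width}x{height} crop? "
        "Return one x,y coordinate inside the image bounds."
    )


# Matches the first "x,y" pair in the reply, e.g. "412, 733".
_POINT = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_click_point(
    reply: str, width: int, height: int
) -> Optional[tuple[float, float]]:
    """Return (x, y) if the reply holds an in-bounds point, else None."""
    match = _POINT.search(reply)
    if match is None:
        return None  # no coordinate at all: re-ask instead of guessing
    x, y = float(match.group(1)), float(match.group(2))
    if 0 <= x < width and 0 <= y < height:
        return (x, y)
    return None  # out of bounds: treat as a failed localization
```

A `None` result is a signal to recrop and ask a narrower question, not to fall back on an invented coordinate.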
 ### Header requirements
@@ -205,7 +218,9 @@ When a task can be completed with window focus/restore, keyboard shortcuts, scre
  - OCR text + matching button shape/icon nearby
  - dialog title text + expected button position within that dialog
  - known app/window focus + expected control location
  - OCR candidate + vision-model localization inside the same crop
 - If confidence is low, do not "test click"; zoom and re-localize first.
 - If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.
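
The two-phase flow for high-impact actions could look roughly like the sketch below. Everything here is illustrative: `PlannedClick`, `preview`, and `execute` are hypothetical names, and the `click` and `confirm` callables stand in for whatever actually performs the click and asks for confirmation.

```python
from dataclasses import dataclass
from typing import Callable

# Action kinds the text above treats as high impact.
HIGH_IMPACT = {"close", "delete", "send", "purchase"}


@dataclass(frozen=True)
class PlannedClick:
    x: int
    y: int
    kind: str    # e.g. "purchase" or "select"
    reason: str  # why this coordinate was chosen


def preview(plan: PlannedClick) -> str:
    """Phase 1: describe the intended coordinate and the reason for it."""
    return f"About to {plan.kind} at ({plan.x}, {plan.y}): {plan.reason}"


def execute(
    plan: PlannedClick,
    click: Callable[[int, int], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Phase 2: click only after explicit confirmation for high-impact kinds."""
    if plan.kind in HIGH_IMPACT and not confirm(preview(plan)):
        return False  # not confirmed, so nothing is clicked
    click(plan.x, plan.y)
    return True
```

Passing `click` and `confirm` in as callables keeps the gate testable without touching a real screen.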
@@ -257,6 +272,7 @@ Example loop:
 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
 When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
 ## App-specific playbooks (recommended)