docs(skill): prefer vision for target localization

2026-05-01 17:16:37 +02:00
parent 66615c8a81
commit 2585bc3a7c
1 changed files with 19 additions and 3 deletions
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -90,25 +90,38 @@ This is especially useful for:
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
+- telling you where something is and where to click when text alone is not reliable

 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
-4. convert the answer into a concrete Clickthrough target
-5. act once
-6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
+5. convert the answer into a concrete Clickthrough target
+6. act once
+7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
+
+Prefer vision over guessing.
+If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
+The model should help answer things like:
+- which visible button is the real primary action
+- whether the target is left/right/top/bottom within the crop
+- which of several similar buttons is the one to click
+- what approximate click point, inside the provided image bounds, lands on the target

 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
+- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
+- "Which visible control says Stop Recording, and where should I click?"

 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
+- requesting coordinates without telling the model the image bounds or expected output format

 ### Header requirements

@@ -205,7 +218,9 @@ When a task can be completed with window focus/restore, keyboard shortcuts, scre
  - OCR text + matching button shape/icon nearby
  - dialog title text + expected button position within that dialog
  - known app/window focus + expected control location
+  - OCR candidate + vision-model localization inside the same crop
 - If confidence is low, do not "test click"; zoom and re-localize first.
+- If OCR and layout disagree, trust neither blindly; recrop and ask the vision model a narrower localization question.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.
@@ -257,6 +272,7 @@ Example loop:

 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
+When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.

 ## App-specific playbooks (recommended)