diff --git a/skill/SKILL.md b/skill/SKILL.md index 5b6646d..cc53f72 100644 --- a/skill/SKILL.md +++ b/skill/SKILL.md @@ -90,25 +90,38 @@ This is especially useful for: - understanding dialog layout or pane structure - distinguishing similar nearby controls by icon, spacing, or emphasis - checking whether a visual state changed after a click +- telling you where something is and where to click when text alone is not reliable Good pattern: 1. capture with `GET /screen` or `POST /zoom` 2. hand that screenshot to the `image` tool 3. ask a precise question about the visible UI -4. convert the answer into a concrete Clickthrough target -5. act once -6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly +4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop +5. convert the answer into a concrete Clickthrough target +6. act once +7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly + +Prefer vision over guessing. +If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is. +The model should help answer things like: +- which visible button is the real primary action +- whether the target is left/right/top/bottom within the crop +- which of several similar buttons is the one to click +- an approximate click point inside the provided image bounds Ask narrow questions. Good: - "Which button in this dialog is the primary confirmation action?" - "Is the scan still running, or does this look complete?" - "Which of these tabs appears selected?" +- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds." +- "Which visible control says Stop Recording, and where should I click?" Bad: - "What should I click?" - "Use your eyes and do the task" - anything that assumes the model has live continuity without a new screenshot +- requesting coordinates without telling the model the image bounds or expected output format ### Header requirements @@ -205,7 +218,9 @@ When a task can be completed with window focus/restore, keyboard shortcuts, scre - OCR text + matching button shape/icon nearby - dialog title text + expected button position within that dialog - known app/window focus + expected control location + - OCR candidate + vision-model localization inside the same crop - If confidence is low, do not "test click"; zoom and re-localize first. +- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question. - For high-impact actions (close/delete/send/purchase), use two-phase flow: 1) preview intended coordinate + reason 2) execute only after explicit confirmation. @@ -257,6 +272,7 @@ Example loop: The key rule is simple: screenshot first, interpret second, click third, verify fourth. Do not collapse those steps into fake certainty. +When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes. ## App-specific playbooks (recommended)