docs(skill): explain screenshot analysis with image tool

2026-05-01 16:03:43 +02:00
parent 5122d416e8
commit b5fdd82494
3 changed files with 69 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -44,6 +44,8 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 Important:
 - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
 - Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
 - The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
 - When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
 - Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
 See:
--- a/docs/API.md
+++ b/docs/API.md
@@ -46,6 +46,9 @@ Query params:
 Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
 `meta.region` uses global desktop coordinates.
 These image-returning endpoints do not magically grant the agent live vision.
 If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
 ## `POST /zoom`
 Body:
@@ -72,6 +75,8 @@ Query params:
 Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 `POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
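Because every coordinate is global, a point located inside a per-display or cropped image has to be offset by that image's `meta.region` before it is used as a target. The sketch below assumes `meta.region` exposes `x`, `y`, `width`, and `height` keys and that the image is not scaled relative to the desktop; neither detail is confirmed here, and `to_global` is a hypothetical helper, not part of the API.

```python
def to_global(region: dict, local_x: int, local_y: int) -> tuple[int, int]:
    """Offset image-local pixel coordinates by the region's global origin.

    Assumes region = {"x", "y", "width", "height"} (hypothetical key names)
    and a 1:1 pixel scale between the returned image and the desktop.
    """
    inside_x = 0 <= local_x < region["width"]
    inside_y = 0 <= local_y < region["height"]
    if not (inside_x and inside_y):
        raise ValueError("point lies outside the captured region")
    return region["x"] + local_x, region["y"] + local_y


# Example: a display whose top-left corner sits at (1920, 0) globally.
region = {"x": 1920, "y": 0, "width": 2560, "height": 1440}
print(to_global(region, 100, 50))  # (2020, 50)
```

Rejecting out-of-region points early is a deliberate choice: a bad offset should fail loudly rather than become a click on the wrong display.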
 ## `POST /action`
 Body: one action.
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -30,6 +30,20 @@ The agent should not assume it can self-install this stack.
 2. Refuse `/exec` attempts when exec secret is missing/invalid.
 3. Ask user for missing setup inputs instead of guessing infrastructure.
 ## What the agent can actually see
 The agent does **not** inherently see the remote desktop.
 Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
 That means:
 - `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
 - `POST /ocr` returns machine-readable text blocks when text extraction is enough
 - the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
 - every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
 Do not write or think as if the agent is directly watching the screen in real time.
 Say what you actually have: screenshots, OCR output, and fresh verification captures.
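The freshness rule above can be made mechanical. This is a minimal sketch, not part of Clickthrough: `ScreenState` is a hypothetical client-side tracker that marks the last screenshot stale as soon as an action is sent and refuses to hand it out until a new capture is recorded.

```python
class ScreenState:
    """Tracks whether the most recent screenshot still reflects the UI."""

    def __init__(self) -> None:
        self._shot = None
        self._stale = True  # nothing captured yet

    def record_capture(self, shot: str) -> None:
        # Call after GET /screen or POST /zoom returns.
        self._shot = shot
        self._stale = False

    def record_action(self) -> None:
        # Call after POST /action: the UI may have changed.
        self._stale = True

    def current(self) -> str:
        if self._stale:
            raise RuntimeError("screenshot is stale; recapture first")
        return self._shot


state = ScreenState()
state.record_capture("<base64 screenshot>")
state.record_action()
# state.current() would now raise until record_capture() is called again.
```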
 ## Mini API map
 - `GET /health` → server status + safety flags
@@ -61,6 +75,34 @@ The agent should not assume it can self-install this stack.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.
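As an illustration of the threshold, the same filter can be applied client-side to blocks that were already returned. The block shape (`text` and `confidence` keys) and the `>=` comparison are assumptions for this sketch, and `filter_blocks` is a hypothetical helper.

```python
def filter_blocks(blocks: list[dict], min_confidence: float = 0.4) -> list[dict]:
    """Keep OCR blocks at or above the confidence threshold.

    Blocks without a "confidence" key are treated as 0.0 and dropped.
    """
    return [b for b in blocks if b.get("confidence", 0.0) >= min_confidence]


blocks = [
    {"text": "Save", "confidence": 0.93},
    {"text": "5ave", "confidence": 0.21},
    {"text": "Cancel", "confidence": 0.40},
]
print([b["text"] for b in filter_blocks(blocks)])  # ['Save', 'Cancel']
```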
 ### Screenshot + `image` tool usage
 Use the OpenClaw `image` tool when OCR is not enough.
 This is especially useful for:
 - identifying which visible button looks like the primary confirm action
 - understanding dialog layout or pane structure
 - distinguishing similar nearby controls by icon, spacing, or emphasis
 - checking whether a visual state changed after a click
 Good pattern:
 1. capture with `GET /screen` or `POST /zoom`
 2. hand that screenshot to the `image` tool
 3. ask a precise question about the visible UI
 4. convert the answer into a concrete Clickthrough target
 5. act once
 6. recapture and verify again

 Ask narrow questions.
 Good:
 - "Which button in this dialog is the primary confirmation action?"
 - "Is the scan still running, or does this look complete?"
 - "Which of these tabs appears selected?"
 Bad:
 - "What should I click?"
 - "Use your eyes and do the task"
 - anything that assumes the model has live continuity without a new screenshot
 ### Header requirements
 - Always send `x-clickthrough-token` when token auth is enabled.
@@ -140,9 +182,10 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).
-5. Execute one minimal action via `POST /action`.
-6. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
-7. Repeat until objective is complete.
+5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
+6. Execute one minimal action via `POST /action`.
+7. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
+8. Repeat until objective is complete.
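The branching in steps 3 to 6 can be sketched as a small decision function. The `0.85` threshold comes from step 3; the returned labels and the `ocr_sufficient` flag are hypothetical names for this sketch, and how confidence is computed is left open here.

```python
ZOOM_THRESHOLD = 0.85  # from step 3 of the workflow


def next_step(confidence: float, ocr_sufficient: bool) -> str:
    """Pick the next move before any click is sent."""
    if confidence < ZOOM_THRESHOLD:
        return "zoom"    # POST /zoom with a denser grid, then re-evaluate
    if not ocr_sufficient:
        return "image"   # inspect the screenshot with the image tool
    return "action"      # one minimal POST /action


print(next_step(0.60, True))   # zoom
print(next_step(0.90, False))  # image
print(next_step(0.90, True))   # action
```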
 ## Verify-before-click rules
@@ -190,6 +233,22 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
 Do not skip from "uncertain click" straight to random retries.
 ## Concrete screenshot -> `image` -> action example
 Example loop:
 1. `GET /screen?screen=0` to capture the current app state
 2. if the UI is text-heavy, try `POST /ocr` first
 3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
   - "In this save dialog, which visible button is the primary action?"
   - "Is there a dismiss/close button in the top-right of this modal?"
 4. map the answer back to a Clickthrough target using the returned grid/region metadata
 5. click once with `POST /action`
 6. recapture the screen
 7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
 The key rule is simple: screenshot first, interpret second, click third, verify fourth.
 Do not collapse those steps into fake certainty.
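The four-phase rule can be sketched with the network and tool calls injected as plain callables, so the ordering is explicit and testable. Every name below (`one_verified_click` and its parameters) is hypothetical; the real calls behind them would be `GET /screen`, the `image` tool or `POST /ocr`, and `POST /action`.

```python
from typing import Any, Callable, Optional


def one_verified_click(
    capture: Callable[[], Any],
    interpret: Callable[[Any], str],
    to_target: Callable[[str, Any], Optional[dict]],
    click: Callable[[dict], None],
    verify: Callable[[Any], bool],
) -> bool:
    """Screenshot first, interpret second, click third, verify fourth."""
    shot = capture()                    # 1. screenshot
    answer = interpret(shot)            # 2. narrow question about that screenshot
    target = to_target(answer, shot)    # map the answer to a concrete target
    if target is None:
        return False                    # no confident target: do not click
    click(target)                       # 3. act exactly once
    fresh = capture()                   # 4. recapture; never reuse the old shot
    return verify(fresh)


# Stubbed usage: the "UI" changes only after the click.
ui = {"dialog_open": True}
ok = one_verified_click(
    capture=lambda: dict(ui),
    interpret=lambda shot: "Save",
    to_target=lambda answer, shot: {"label": answer},
    click=lambda target: ui.update(dialog_open=False),
    verify=lambda shot: shot["dialog_open"] is False,
)
print(ok)  # True
```

Returning `False` when no target is found keeps an uncertain interpretation from turning into a click.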
 ## App-specific playbooks (recommended)
 Build per-app routines for repetitive tasks instead of generic clicking.