docs(skill): explain screenshot analysis with image tool

2026-05-01 16:03:43 +02:00
parent 5122d416e8
commit b5fdd82494
3 changed files with 69 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -44,6 +44,8 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 Important:
 - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
 - Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
+- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
+- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
 - Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.

 See:
--- a/docs/API.md
+++ b/docs/API.md
@@ -46,6 +46,9 @@ Query params:
 Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
 `meta.region` uses global desktop coordinates.

+These image-returning endpoints do not give the agent live vision; each response is a static snapshot.
+If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
+
 ## `POST /zoom`

 Body:
@@ -72,6 +75,8 @@ Query params:

 Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.

+A `POST /zoom` crop is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
+
 ## `POST /action`

 Body: one action.
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -30,6 +30,20 @@ The agent should not assume it can self-install this stack.
 2. Refuse `/exec` attempts when exec secret is missing/invalid.
 3. Ask user for missing setup inputs instead of guessing infrastructure.

+## What the agent can actually see
+
+The agent does **not** inherently see the remote desktop.
+Clickthrough provides screenshots, OCR data, window metadata, and input endpoints, not native live vision.
+
+That means:
+- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
+- `POST /ocr` returns machine-readable text blocks when text extraction is enough
+- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
+- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
+
+Do not write or think as if the agent is directly watching the screen in real time.
+Say what you actually have: screenshots, OCR output, and fresh verification captures.
+
 ## Mini API map

 - `GET /health` → server status + safety flags
@@ -61,6 +75,34 @@ The agent should not assume it can self-install this stack.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.

+### Screenshot + `image` tool usage
+
+Use the OpenClaw `image` tool when OCR is not enough.
+This is especially useful for:
+- identifying which visible button looks like the primary confirm action
+- understanding dialog layout or pane structure
+- distinguishing similar nearby controls by icon, spacing, or emphasis
+- checking whether a visual state changed after a click
+
+Good pattern:
+1. capture with `GET /screen` or `POST /zoom`
+2. hand that screenshot to the `image` tool
+3. ask a precise question about the visible UI
+4. convert the answer into a concrete Clickthrough target
+5. act once
+6. recapture and verify again
+
+Ask narrow questions.
+Good:
+- "Which button in this dialog is the primary confirmation action?"
+- "Is the scan still running, or does this look complete?"
+- "Which of these tabs appears selected?"
+
+Bad:
+- "What should I click?"
+- "Use your eyes and do the task"
+- anything that assumes the model has live continuity without a new screenshot
+
 ### Header requirements

 - Always send `x-clickthrough-token` when token auth is enabled.
@@ -140,9 +182,10 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).
-5. Execute one minimal action via `POST /action`.
-6. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
-7. Repeat until objective is complete.
+5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
+6. Execute one minimal action via `POST /action`.
+7. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
+8. Repeat until objective is complete.

 ## Verify-before-click rules

@@ -190,6 +233,22 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh

 Do not skip from "uncertain click" straight to random retries.

+## Concrete screenshot -> `image` -> action example
+
+Example loop:
+1. `GET /screen?screen=0` to capture the current app state
+2. if the UI is text-heavy, try `POST /ocr` first
+3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
+   - "In this save dialog, which visible button is the primary action?"
+   - "Is there a dismiss/close button in the top-right of this modal?"
+4. map the answer back to a Clickthrough target using the returned grid/region metadata
+5. click once with `POST /action`
+6. recapture the screen
+7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
+
+The key rule is simple: screenshot first, interpret second, click third, verify fourth.
+Do not collapse those steps into fake certainty.
+
 ## App-specific playbooks (recommended)

 Build per-app routines for repetitive tasks instead of generic clicking.
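
Review note, not part of the commit: the screenshot -> `image` -> action loop added to `skill/SKILL.md` can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions. `capture`, `ocr_lookup`, `image_lookup`, `act`, and `verify` are hypothetical callables standing in for `GET /screen`, `POST /ocr`, the OpenClaw `image` tool, `POST /action`, and a fresh verification check. None of these names or signatures come from Clickthrough or OpenClaw, and the shape of the target dict is left to the caller.

```python
from typing import Callable, Optional

# Hypothetical type: whatever the caller sends as the `target` object of POST /action.
Target = dict


def screenshot_image_action(
    capture: Callable[[], bytes],
    ocr_lookup: Callable[[bytes], Optional[Target]],
    image_lookup: Callable[[bytes, str], Optional[Target]],
    act: Callable[[Target], None],
    verify: Callable[[bytes], bool],
    question: str,
) -> bool:
    """Run one pass: screenshot, interpret, click once, verify on a fresh capture."""
    shot = capture()                      # screenshot first
    target = ocr_lookup(shot)             # try OCR when text extraction may be enough
    if target is None:
        # OCR did not answer the question: ask the image tool a narrow question.
        target = image_lookup(shot, question)
    if target is None:
        return False                      # no target identified, so no click
    act(target)                           # act exactly once
    fresh = capture()                     # recapture; the old screenshot is now stale
    return verify(fresh)                  # confirm the expected state change
```

The function never retries and never reuses the pre-click screenshot for verification, which mirrors the "act once, then recapture" rule in the added text. A caller would wrap it in its own loop and decide what to do when it returns `False`.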