docs(skill): explain screenshot analysis with image tool
All checks were successful
python-syntax / syntax-check (push) Successful in 11s
@@ -44,6 +44,8 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 Important:
 - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
 - Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
+- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
+- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
 - Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.

 See:
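A minimal sketch of the payload shape the first bullet in this hunk describes. Only the `action` plus `target` structure comes from the text; the action name `"click"` and the `x` / `y` keys inside `target` are illustrative assumptions, not confirmed fields of the Clickthrough API.

```python
import json


def build_click_action(x: int, y: int) -> dict:
    """Wrap global desktop coordinates in a `target` object for POST /action."""
    # Assumption: "click" and the x / y keys inside "target" are hypothetical names.
    return {"action": "click", "target": {"x": x, "y": y}}


def has_raw_coordinates(payload: dict) -> bool:
    """Return True if the payload uses the discouraged top-level x / y fields."""
    return "x" in payload or "y" in payload


payload = build_click_action(640, 360)
print(json.dumps(payload))
```

A check like `has_raw_coordinates` could run before sending, so a malformed body is caught on the client side rather than by the server.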
@@ -46,6 +46,9 @@ Query params:
 Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
 `meta.region` uses global desktop coordinates.

+These image-returning endpoints do not magically grant the agent live vision.
+If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
+
 ## `POST /zoom`

 Body:
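For the `/screen` response described in this hunk, the sketch below shows one way a caller might pull out the screenshot bytes before handing them to an image-interpretation step. The key name `image` and the `x` / `y` / `width` / `height` fields of `meta.region` are assumptions made for illustration; the text only states that the response carries a base64 image and `meta.region`.

```python
import base64


def extract_screenshot(response: dict, image_key: str = "image") -> tuple[bytes, dict]:
    """Decode the base64 screenshot and return it with its region metadata."""
    # Assumption: the base64 payload lives under `image_key`; adjust to the real field.
    image_bytes = base64.b64decode(response[image_key])
    region = response["meta"]["region"]
    return image_bytes, region


# Stand-in response so the sketch runs without a server.
fake_response = {
    "image": base64.b64encode(b"\x89PNG placeholder bytes").decode("ascii"),
    "meta": {"region": {"x": 0, "y": 0, "width": 1920, "height": 1080}},
}

image_bytes, region = extract_screenshot(fake_response)
```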
@@ -72,6 +75,8 @@ Query params:

 Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.

+`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
+
 ## `POST /action`

 Body: one action.
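The coordinate-base rule for `/zoom` can be sketched as a small conversion from display-local pixels to global desktop pixels. The `x` / `y` keys of the region dictionary and the example display layout are assumptions; only `center_x`, `center_y`, and the use of `meta.region` as the base come from the text.

```python
def to_global(local_x: int, local_y: int, region: dict) -> tuple[int, int]:
    """Offset display-local pixels by the display's global origin."""
    # Assumption: meta.region exposes its top-left corner as "x" and "y".
    return region["x"] + local_x, region["y"] + local_y


# Hypothetical second display placed to the right of a 1920-pixel-wide primary.
second_display_region = {"x": 1920, "y": 0, "width": 2560, "height": 1440}

center_x, center_y = to_global(300, 200, second_display_region)
zoom_body = {"center_x": center_x, "center_y": center_y}
```

Here a point 300 pixels into the second display becomes `center_x = 2220` in global coordinates, which is the value `/zoom` expects.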
@@ -30,6 +30,20 @@ The agent should not assume it can self-install this stack.
 2. Refuse `/exec` attempts when exec secret is missing/invalid.
 3. Ask user for missing setup inputs instead of guessing infrastructure.

+## What the agent can actually see
+
+The agent does **not** inherently see the remote desktop.
+Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
+
+That means:
+- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
+- `POST /ocr` returns machine-readable text blocks when text extraction is enough
+- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
+- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
+
+Do not write or think as if the agent is directly watching the screen in real time.
+Say what you actually have: screenshots, OCR output, and fresh verification captures.
+
 ## Mini API map

 - `GET /health` → server status + safety flags
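The freshness rule added in this hunk ("every visual conclusion is only as fresh as the last screenshot") could be made explicit with a small bookkeeping helper. This is a sketch only; `CaptureFreshness` and its method names are hypothetical and not part of Clickthrough.

```python
from dataclasses import dataclass


@dataclass
class CaptureFreshness:
    """Track whether the most recent screenshot still reflects the desktop."""

    has_capture: bool = False
    actions_since_capture: int = 0

    def record_capture(self) -> None:
        # A new screenshot resets staleness.
        self.has_capture = True
        self.actions_since_capture = 0

    def record_action(self) -> None:
        # Any input sent after the capture may have changed the UI.
        self.actions_since_capture += 1

    @property
    def is_fresh(self) -> bool:
        return self.has_capture and self.actions_since_capture == 0


state = CaptureFreshness()
state.record_capture()
fresh_before_click = state.is_fresh
state.record_action()
fresh_after_click = state.is_fresh
```

After `record_action()`, `is_fresh` stays false until another capture is recorded, which mirrors the "recapture before assuming the UI changed" guidance.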
@@ -61,6 +75,34 @@ The agent should not assume it can self-install this stack.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.

+### Screenshot + `image` tool usage
+
+Use the OpenClaw `image` tool when OCR is not enough.
+This is especially useful for:
+- identifying which visible button looks like the primary confirm action
+- understanding dialog layout or pane structure
+- distinguishing similar nearby controls by icon, spacing, or emphasis
+- checking whether a visual state changed after a click
+
+Good pattern:
+1. capture with `GET /screen` or `POST /zoom`
+2. hand that screenshot to the `image` tool
+3. ask a precise question about the visible UI
+4. convert the answer into a concrete Clickthrough target
+5. act once
+6. recapture and verify again
+
+Ask narrow questions.
+Good:
+- "Which button in this dialog is the primary confirmation action?"
+- "Is the scan still running, or does this look complete?"
+- "Which of these tabs appears selected?"
+
+Bad:
+- "What should I click?"
+- "Use your eyes and do the task"
+- anything that assumes the model has live continuity without a new screenshot
+
 ### Header requirements

 - Always send `x-clickthrough-token` when token auth is enabled.
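The `min_confidence` advice in the context lines of this hunk can be illustrated with a client-side filter over OCR results. The block shape (`text`, `confidence`) is an assumption for the sketch, and the server may already apply the same threshold when `min_confidence` is passed with the request.

```python
def filter_ocr_blocks(blocks: list[dict], min_confidence: float = 0.4) -> list[dict]:
    """Keep only OCR blocks at or above the confidence threshold."""
    # Assumption: each block carries a float "confidence" between 0 and 1.
    return [b for b in blocks if b.get("confidence", 0.0) >= min_confidence]


sample_blocks = [
    {"text": "Save", "confidence": 0.93},
    {"text": "Sa~e", "confidence": 0.21},
    {"text": "Cancel", "confidence": 0.40},
]

kept = filter_ocr_blocks(sample_blocks)
kept_texts = [b["text"] for b in kept]
```

If the surviving blocks still do not answer the question (for example, which of two buttons is the primary one), that is the point where the screenshot goes to the `image` tool.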
@@ -140,9 +182,10 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).
-5. Execute one minimal action via `POST /action`.
-6. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
-7. Repeat until objective is complete.
+5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
+6. Execute one minimal action via `POST /action`.
+7. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
+8. Repeat until objective is complete.

 ## Verify-before-click rules
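The branching in steps 3 to 6 of the revised workflow can be sketched as a small decision function. The `0.85` threshold comes from step 3; the function name, the string labels, and the `ocr_sufficient` flag are hypothetical names chosen for the example.

```python
ZOOM_THRESHOLD = 0.85  # from step 3 of the workflow


def next_step(confidence: float, ocr_sufficient: bool) -> str:
    """Pick the next workflow step from the current evidence."""
    if confidence < ZOOM_THRESHOLD:
        # Step 3: zoom in with a denser grid and re-evaluate.
        return "zoom"
    if not ocr_sufficient:
        # Step 5: ask the image tool about the screenshot.
        return "image_tool"
    # Step 6: target identity is verified, so send one minimal action.
    return "act"


decisions = [
    next_step(0.60, ocr_sufficient=True),
    next_step(0.90, ocr_sufficient=False),
    next_step(0.90, ocr_sufficient=True),
]
```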
@@ -190,6 +233,22 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh

 Do not skip from "uncertain click" straight to random retries.

+## Concrete screenshot -> `image` -> action example
+
+Example loop:
+1. `GET /screen?screen=0` to capture the current app state
+2. if the UI is text-heavy, try `POST /ocr` first
+3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
+   - "In this save dialog, which visible button is the primary action?"
+   - "Is there a dismiss/close button in the top-right of this modal?"
+4. map the answer back to a Clickthrough target using the returned grid/region metadata
+5. click once with `POST /action`
+6. recapture the screen
+7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
+
+The key rule is simple: screenshot first, interpret second, click third, verify fourth.
+Do not collapse those steps into fake certainty.
+
 ## App-specific playbooks (recommended)

 Build per-app routines for repetitive tasks instead of generic clicking.
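The "screenshot first, interpret second, click third, verify fourth" rule from this hunk can be sketched as a function that takes each stage as a callable. All names here are hypothetical, and the stubs stand in for real calls to `GET /screen`, OCR or the `image` tool, and `POST /action`; no Clickthrough request is actually made.

```python
from typing import Callable


def run_verified_click(
    capture: Callable[[], dict],
    interpret: Callable[[dict], dict],
    act: Callable[[dict], None],
    verify: Callable[[dict], bool],
) -> bool:
    """Run one capture, interpret, act, recapture, verify cycle."""
    before = capture()          # fresh screenshot
    target = interpret(before)  # OCR or image-tool answer mapped to a target
    act(target)                 # exactly one action
    after = capture()           # recapture instead of assuming success
    return verify(after)


# Stubs that record call order so the sequence can be inspected.
calls: list[str] = []


def fake_capture() -> dict:
    calls.append("capture")
    return {"frame": calls.count("capture")}


def fake_interpret(shot: dict) -> dict:
    calls.append("interpret")
    return {"x": 100, "y": 200}


def fake_act(target: dict) -> None:
    calls.append("act")


def fake_verify(shot: dict) -> bool:
    calls.append("verify")
    return shot["frame"] == 2  # verification must see the second capture


ok = run_verified_click(fake_capture, fake_interpret, fake_act, fake_verify)
```

Passing the stages in as callables keeps the ordering in one place, so a caller cannot skip the recapture between acting and verifying.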