diff --git a/README.md b/README.md
index 5604177..2bbcffe 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,8 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 Important:
 - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
 - Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
+- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
+- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
 - Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
 
 See:
diff --git a/docs/API.md b/docs/API.md
index ff22f01..69680ef 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -46,6 +46,9 @@ Query params:
 Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
 `meta.region` uses global desktop coordinates.
 
+These image-returning endpoints do not magically grant the agent live vision.
+If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
+
 ## `POST /zoom`
 
 Body:
@@ -72,6 +75,8 @@ Query params:
 Default response returns cropped image + region metadata in global pixel coordinates.
 `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 
+`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
+
 ## `POST /action`
 
 Body: one action.
diff --git a/skill/SKILL.md b/skill/SKILL.md
index f4354a7..6e5b70c 100644
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -30,6 +30,20 @@ The agent should not assume it can self-install this stack.
 2. Refuse `/exec` attempts when exec secret is missing/invalid.
 3. Ask user for missing setup inputs instead of guessing infrastructure.
 
+## What the agent can actually see
+
+The agent does **not** inherently see the remote desktop.
+Clickthrough provides screenshots, OCR data, window metadata, and input endpoints, not native live vision.
+
+That means:
+- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
+- `POST /ocr` returns machine-readable text blocks when text extraction is enough
+- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
+- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
+
+Do not write or think as if the agent is directly watching the screen in real time.
+Say what you actually have: screenshots, OCR output, and fresh verification captures.
+
 ## Mini API map
 
 - `GET /health` → server status + safety flags
@@ -61,6 +75,34 @@ The agent should not assume it can self-install this stack.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.
 
+### Screenshot + `image` tool usage
+
+Use the OpenClaw `image` tool when OCR is not enough.
+This is especially useful for:
+- identifying which visible button looks like the primary confirm action
+- understanding dialog layout or pane structure
+- distinguishing similar nearby controls by icon, spacing, or emphasis
+- checking whether a visual state changed after a click
+
+Good pattern:
+1. capture with `GET /screen` or `POST /zoom`
+2. hand that screenshot to the `image` tool
+3. ask a precise question about the visible UI
+4. convert the answer into a concrete Clickthrough target
+5. act once
+6. recapture and verify again
+
+Ask narrow questions.
+Good:
+- "Which button in this dialog is the primary confirmation action?"
+- "Is the scan still running, or does this look complete?"
+- "Which of these tabs appears selected?"
+
+Bad:
+- "What should I click?"
+- "Use your eyes and do the task"
+- anything that assumes the model has live continuity without a new screenshot
+
 ### Header requirements
 
 - Always send `x-clickthrough-token` when token auth is enabled.
@@ -140,9 +182,10 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).
-5. Execute one minimal action via `POST /action`.
-6. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
-7. Repeat until objective is complete.
+5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
+6. Execute one minimal action via `POST /action`.
+7. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
+8. Repeat until objective is complete.
 
 ## Verify-before-click rules
 
@@ -190,6 +233,22 @@ Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry wh
 
 Do not skip from "uncertain click" straight to random retries.
 
+## Concrete screenshot -> `image` -> action example
+
+Example loop:
+1. `GET /screen?screen=0` to capture the current app state
+2. if the UI is text-heavy, try `POST /ocr` first
+3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
+   - "In this save dialog, which visible button is the primary action?"
+   - "Is there a dismiss/close button in the top-right of this modal?"
+4. map the answer back to a Clickthrough target using the returned grid/region metadata
+5. click once with `POST /action`
+6. recapture the screen
+7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
+
+The key rule is simple: screenshot first, interpret second, click third, verify fourth.
+Do not collapse those steps into fake certainty.
+
 ## App-specific playbooks (recommended)
 
 Build per-app routines for repetitive tasks instead of generic clicking.
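
Reviewer sketch (not part of the patch): the "screenshot first, interpret second, click third, verify fourth" loop and the "map the answer back to a Clickthrough target" step could look roughly like the Python below. Everything named here is an assumption made for illustration: the `Region` field names, the zero-based grid indexing, and the four injected callables (`capture`, `interpret`, `act`, `verify`) are hypothetical stand-ins for `GET /screen`, OCR or the `image` tool, `POST /action`, and a follow-up check. The sketch deliberately makes no HTTP calls and does not guess the real `target` payload shape.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Region:
    """Rectangle in global desktop pixels (field names are assumed, not from the API docs)."""
    x: int
    y: int
    width: int
    height: int


def grid_cell_center(region: Region, rows: int, cols: int, row: int, col: int) -> Point:
    """Return the global (x, y) center of a zero-based grid cell inside `region`."""
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("grid cell is outside the captured region")
    cell_w = region.width / cols
    cell_h = region.height / rows
    # Offset by the region origin so the result stays in global coordinates.
    return (
        round(region.x + (col + 0.5) * cell_w),
        round(region.y + (row + 0.5) * cell_h),
    )


def act_once_and_verify(
    capture: Callable[[], Any],
    interpret: Callable[[Any], Optional[Point]],
    act: Callable[[Point], None],
    verify: Callable[[Any], bool],
) -> bool:
    """Run one screenshot -> interpret -> act -> verify cycle.

    All four callables are hypothetical hooks supplied by the caller.
    Returns False without acting when `interpret` yields no target.
    """
    shot = capture()              # screenshot first
    target = interpret(shot)      # interpret second (OCR or `image` tool answer)
    if target is None:
        return False              # no confident target: do not click
    act(target)                   # click third, exactly once
    return verify(capture())      # verify fourth, on a fresh capture
```

Passing the hooks in as callables keeps the ordering rule testable without a running Clickthrough server, and returning early when there is no target avoids the "uncertain click" case the skill text warns about.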