docs(skill): explain using OpenClaw image tool with screenshots #17

luna · 2026-05-01T15:59:08+02:00

## Why

The Clickthrough skill currently assumes the agent can interpret screenshots directly through the normal observe-decide-act loop. In practice, the agent often needs to hand screenshots to OpenClaw's `image` tool for visual interpretation, because the agent does not actually see the remote desktop on its own.

Without explicit guidance, this creates confusion in two places:

- the skill can sound like the agent natively sees the screen
- screenshot analysis workflows are underspecified, especially when an OCR-free visual judgment is needed

## Scope

Document and demonstrate how to use the OpenClaw `image` tool alongside Clickthrough screenshots, including:

- when to use `/screen` or `/zoom` and pass the returned image to `image`
- how to ask `image` precise questions about UI state, buttons, dialogs, and visual changes
- the difference between OCR-driven targeting and image-model-assisted interpretation
- caveats: static screenshot only, no true live vision, no hidden cursor intent, no continuity unless recaptured
- examples of a good observe-decide-act-verify loop that explicitly uses screenshot -> image analysis -> action -> recapture
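A minimal sketch of the screenshot -> image analysis -> action -> recapture loop, in Python. Everything in it is hypothetical: `Screenshot`, `observe_act_verify`, and the three callables stand in for whatever wraps Clickthrough's `/screen` or `/zoom`, the OpenClaw `image` tool, and a single input action; none of their real interfaces are modeled. What it illustrates is the structure: ask a narrow question about one specific capture, refuse to act on a state the model did not confirm, and verify against a fresh capture rather than the one taken before the action.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins. The real Clickthrough /screen and /zoom responses and
# the OpenClaw `image` tool have their own interfaces, not modeled here.
@dataclass(frozen=True)
class Screenshot:
    png: bytes  # static pixels only: no live view, no cursor intent
    seq: int    # capture counter; a later capture supersedes earlier ones

CaptureFn = Callable[[], Screenshot]           # e.g. wraps /screen or /zoom
AskImageFn = Callable[[Screenshot, str], str]  # e.g. wraps the `image` tool
ActFn = Callable[[], None]                     # e.g. one click or keypress


def yes_no(question: str) -> str:
    """Turn a narrow UI question into one with a checkable answer format."""
    return f"{question} Answer with exactly YES, NO, or UNSURE."


def confirmed(answer: str) -> bool:
    """Only an explicit YES counts; NO, UNSURE, or free text do not."""
    return answer.strip().upper().startswith("YES")


def observe_act_verify(
    capture: CaptureFn,
    ask_image: AskImageFn,
    act: ActFn,
    precondition: str,
    postcondition: str,
    max_attempts: int = 3,
) -> bool:
    """Screenshot -> image analysis -> action -> recapture, with bounded retries."""
    for _ in range(max_attempts):
        before = capture()  # observe: a single static capture
        if not confirmed(ask_image(before, yes_no(precondition))):
            return False  # decide: do not act on a state the model did not confirm
        act()
        after = capture()  # verify against a fresh capture, never `before`
        if after.seq <= before.seq:
            raise RuntimeError("verification must use a new screenshot")
        if confirmed(ask_image(after, yes_no(postcondition))):
            return True
    return False
```

Treating `UNSURE` as a non-answer is the design choice that matters here: the loop stops instead of acting on visual certainty the model never reported, and each question (for example "Is a dialog titled 'Save changes?' visible?") is tied to exactly one capture.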

## Done when

- the skill clearly states that the agent does not inherently see the remote desktop
- the workflow for using Clickthrough screenshots with the OpenClaw `image` tool is documented with concrete examples
- agents are less likely to hallucinate visual certainty from screenshots they have not actually analyzed

luna closed this issue on 2026-05-01 16:03:44 +02:00

This repo is archived. You cannot comment on issues.

Reference: space/clickthrough#17