docs(skill): explain using OpenClaw image tool with screenshots #17

Closed
opened 2026-05-01 15:59:08 +02:00 by luna · 0 comments
Collaborator

Why

The Clickthrough skill currently assumes the agent can interpret screenshots directly through the normal observe-decide-act loop. In practice, the agent often needs to hand screenshots to OpenClaw's `image` tool for visual interpretation, because the agent does not actually see the remote desktop on its own.

Without explicit guidance, this creates confusion in two places:

  • the skill can sound like the agent natively sees the screen
  • screenshot analysis workflows are underspecified, especially when an OCR-free visual judgment is needed

Scope

Document and demonstrate how to use the OpenClaw `image` tool alongside Clickthrough screenshots, including:

  • when to use `/screen` or `/zoom` and pass the returned image to `image`
  • how to ask `image` precise questions about UI state, buttons, dialogs, and visual changes
  • the difference between OCR-driven targeting and image-model-assisted interpretation
  • caveats: static screenshot only, no true live vision, no hidden cursor intent, no continuity unless recaptured
  • examples of a good observe-decide-act-verify loop that explicitly uses screenshot -> image analysis -> action -> recapture
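
As a starting point for the examples requested above, the screenshot -> image analysis -> action -> recapture loop could be sketched roughly as follows. This is only an illustration under stated assumptions: `capture`, `analyze`, and `act` are hypothetical callables that a real skill would back with a `/screen` or `/zoom` request, a call to the `image` tool, and a Clickthrough input action. None of the names or signatures below come from Clickthrough or OpenClaw, and the substring check on the answer is deliberately crude.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical adapter types; NOT real Clickthrough or OpenClaw signatures.
Capture = Callable[[], bytes]          # would wrap a /screen or /zoom request
Analyze = Callable[[bytes, str], str]  # would wrap a call to the image tool
Act = Callable[[], None]               # would wrap a click or keypress


@dataclass
class StepResult:
    acted: bool            # whether the action was performed
    before: str            # image answer that justified or blocked the action
    after: Optional[str]   # image answer for the recaptured screenshot, if any
    verified: bool         # whether the recapture showed the expected state


def observe_decide_act_verify(
    capture: Capture,
    analyze: Analyze,
    act: Act,
    precondition_question: str,
    expected_before: str,
    verify_question: str,
    expected_after: str,
) -> StepResult:
    # Observe: a screenshot is static, so interpret it explicitly.
    before = analyze(capture(), precondition_question)

    # Decide: act only on an analyzed answer, never on an assumed screen state.
    if expected_before.lower() not in before.lower():
        return StepResult(acted=False, before=before, after=None, verified=False)

    act()

    # Verify: there is no continuity between screenshots, so recapture.
    after = analyze(capture(), verify_question)
    verified = expected_after.lower() in after.lower()
    return StepResult(acted=True, before=before, after=after, verified=verified)
```

The questions passed in are meant to be narrow and answerable from a single frame, for example "Is a Save dialog visible? Answer yes or no." rather than "What is happening on screen?". Keeping the three callables injectable also makes the loop testable without a remote desktop.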

Done when

  • the skill clearly states that the agent does not inherently see the remote desktop
  • the workflow for using Clickthrough screenshots with the OpenClaw `image` tool is documented with concrete examples
  • agents are less likely to hallucinate visual certainty from screenshots they have not actually analyzed
luna closed this issue 2026-05-01 16:03:44 +02:00
Reference: space/clickthrough#17