| name | description |
|---|---|
| clickthrough-http-control | Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification. |
Clickthrough HTTP Control
Use a strict observe-decide-act-verify loop.
Getting a computer instance (user-owned setup)
The user/operator is responsible for provisioning and exposing the target machine. The agent should not assume it can self-install this stack.
What the user must do
- Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
- Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
- Configure secrets on target machine:
  - `CLICKTHROUGH_TOKEN` for general API auth
  - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
- Share connection details with the agent through a secure channel:
  - `base_url`
  - `x-clickthrough-token`
  - `x-clickthrough-exec-secret` (only when `/exec` is needed)
What the agent should do
- Validate connection with `GET /health` using provided headers.
- Refuse `/exec` attempts when exec secret is missing/invalid.
- Ask user for missing setup inputs instead of guessing infrastructure.
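The agent-side checks above can be sketched as a small header builder. This is a minimal sketch, not part of Clickthrough: the function name `build_headers` is hypothetical, and only the two header names and the `/exec` path come from this document.

```python
from typing import Dict, Optional


def build_headers(path: str, token: Optional[str],
                  exec_secret: Optional[str] = None) -> Dict[str, str]:
    """Build request headers, refusing /exec calls that lack an exec secret."""
    headers: Dict[str, str] = {}
    if token:
        # Sent on every request when token auth is enabled.
        headers["x-clickthrough-token"] = token

    # Strip any query string and trailing slash before comparing the path.
    is_exec = path.split("?", 1)[0].rstrip("/") == "/exec"
    if is_exec:
        if not exec_secret:
            # Refuse locally rather than sending a request that should fail.
            raise PermissionError("exec secret missing; ask the user for it")
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers
```

Raising before any request is sent keeps the refusal on the agent side, so a missing secret turns into a question for the user rather than a failed call.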
What the agent can actually see
The agent does not inherently see the remote desktop. Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
That means:
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
Do not write or think as if the agent is directly watching the screen in real time. Say what you actually have: screenshots, OCR output, and fresh verification captures.
Mini API map
- `GET /health` → server status + safety flags
- `GET /displays` → detected displays in zero-based API order
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
- `GET /windows` → discover visible desktop windows and their handles/processes
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
- `POST /launch` → start an app/process without dropping to a shell
- `POST /wait?screen=0` → wait for text, window, or visual state changes
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /batch?screen=0` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
Display selection
- Use `GET /displays` before operating on multi-monitor systems.
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
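To illustrate the global-coordinate rule, here is a minimal sketch that maps a pixel position inside a captured image to a global desktop coordinate. The `region` keys (`x`, `y`, `width`, `height`) and the helper name `local_to_global` are assumptions for illustration; check the actual `/screen` response for the real field names.

```python
from typing import Dict, Tuple


def local_to_global(region: Dict[str, int], local_x: int, local_y: int) -> Tuple[int, int]:
    """Map a pixel inside a screenshot to global desktop coordinates.

    `region` is assumed to describe where the captured area sits on the
    global desktop: its top-left corner (x, y) plus its width and height.
    """
    if not (0 <= local_x < region["width"] and 0 <= local_y < region["height"]):
        raise ValueError("point lies outside the captured region")
    return region["x"] + local_x, region["y"] + local_y


# Hypothetical layout: a second monitor placed to the left of the primary one.
left_monitor = {"x": -1920, "y": 0, "width": 1920, "height": 1080}
print(local_to_global(left_monitor, 100, 200))  # (-1820, 200)
```

The negative result is the point of the example: a pixel target of `(100, 200)` would land on the primary display, not on the screen that was actually captured.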
OCR usage
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.
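As a rough illustration of confidence filtering on the agent side, the sketch below searches a list of OCR blocks for a label and drops low-confidence noise. The block shape (`text` and `confidence` keys) and the function name `find_text` are assumptions; the real `/ocr` response may use different field names.

```python
from typing import Dict, List


def find_text(blocks: List[Dict], query: str, min_confidence: float = 0.4) -> List[Dict]:
    """Return OCR blocks containing `query`, highest confidence first."""
    needle = query.casefold()
    hits = [
        block for block in blocks
        if block["confidence"] >= min_confidence
        and needle in block["text"].casefold()
    ]
    return sorted(hits, key=lambda block: block["confidence"], reverse=True)


# Hypothetical OCR output for a save dialog.
blocks = [
    {"text": "Save", "confidence": 0.93},
    {"text": "Save As...", "confidence": 0.71},
    {"text": "5ave", "confidence": 0.22},
    {"text": "Cancel", "confidence": 0.95},
]
print([b["text"] for b in find_text(blocks, "save")])  # ['Save', 'Save As...']
```

Two hits for one query is itself a signal: the target is still ambiguous, so a second check (region, dialog title, or the `image` tool) is needed before clicking.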
Screenshot + image tool usage
Use the OpenClaw image tool when OCR is not enough.
This is especially useful for:
- identifying which visible button looks like the primary confirm action
- understanding dialog layout or pane structure
- distinguishing similar nearby controls by icon, spacing, or emphasis
- checking whether a visual state changed after a click
Good pattern:
- capture with `GET /screen` or `POST /zoom`
- hand that screenshot to the `image` tool
- ask a precise question about the visible UI
- convert the answer into a concrete Clickthrough target
- act once
- recapture and verify again
Ask narrow questions. Good:
- "Which button in this dialog is the primary confirmation action?"
- "Is the scan still running, or does this look complete?"
- "Which of these tabs appears selected?"
Bad:
- "What should I click?"
- "Use your eyes and do the task"
- anything that assumes the model has live continuity without a new screenshot
Header requirements
- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.
POST /action request shape (important)
/action always expects an action plus an optional target object.
Do not invent top-level x / y fields.
Minimal pixel click:
{
"action": "click",
"target": {"mode": "pixel", "x": 100, "y": 200},
"button": "left",
"clicks": 1
}
Minimal grid click:
{
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 6,
"col": 8,
"dx": 0.0,
"dy": 0.0
}
}
Other canonical examples:
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}
Rules:
- `dx`/`dy` belong inside `target`, not beside it.
- `type` and `hotkey` usually do not need a `target`.
- For pixel targets, `x`/`y` are global desktop coordinates.
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
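One way to make these rules hard to break is to build payloads through small helpers instead of hand-writing dictionaries. The sketch below is illustrative only: the helper names are hypothetical, and the payload fields are copied from the examples above.

```python
from typing import Any, Dict, Optional


def pixel_target(x: int, y: int) -> Dict[str, Any]:
    """Target a global desktop pixel."""
    return {"mode": "pixel", "x": x, "y": y}


def grid_target(region: Dict[str, int], rows: int, cols: int, row: int, col: int,
                dx: float = 0.0, dy: float = 0.0) -> Dict[str, Any]:
    """Target a grid cell, reusing the exact region and grid basis of a capture."""
    return {
        "mode": "grid",
        "region_x": region["x"], "region_y": region["y"],
        "region_width": region["width"], "region_height": region["height"],
        "rows": rows, "cols": cols, "row": row, "col": col,
        "dx": dx, "dy": dy,  # offsets live inside the target
    }


def action_payload(action: str, target: Optional[Dict[str, Any]] = None,
                   **fields: Any) -> Dict[str, Any]:
    """Assemble a /action body and reject coordinates placed at the top level."""
    stray = {"x", "y", "dx", "dy"} & fields.keys()
    if stray:
        raise ValueError(f"{sorted(stray)} must go inside 'target'")
    payload: Dict[str, Any] = {"action": action, **fields}
    if target is not None:
        payload["target"] = target
    return payload
```

For example, `action_payload("click", pixel_target(100, 200), button="left", clicks=1)` reproduces the minimal pixel click, while `action_payload("click", x=100, y=200)` raises instead of producing a malformed request.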
When to use /exec
Prefer structured GUI control first:
- `/screen`, `/zoom`, `/ocr` to observe
- `/action` or `/batch` to interact
Use /exec only when it is the cleanest available tool for the job, for example:
- querying machine state that the GUI does not expose well
- performing an explicit user-requested shell/system task
- recovering from a blocked GUI flow when normal interaction failed
Prefer GET /windows, POST /windows/action, and POST /launch for app lifecycle tasks before falling back to /exec.
Avoid using /exec for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
Core workflow (mandatory)
- Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
- Identify likely target region and compute an initial confidence score.
- If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
- Before any click, verify target identity (OCR text/icon/location consistency).
- If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
- Execute one minimal action via `POST /action`.
- Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
- Repeat until objective is complete.
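The workflow can be sketched as one cycle of a loop with the HTTP calls injected as plain callables, which keeps the control flow testable without a server. Everything here is hypothetical scaffolding (`Candidate`, `run_step`, and the callable names); only the 0.85 threshold and the step order come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    target: dict             # a /action target object
    confidence: float        # 0.0 to 1.0, estimated by the agent
    identity_verified: bool  # OCR text/icon/location were checked and agree


def run_step(
    locate_coarse: Callable[[], Optional[Candidate]],
    locate_zoomed: Callable[[Candidate], Optional[Candidate]],
    act: Callable[[dict], None],
    verify: Callable[[], bool],
    threshold: float = 0.85,
) -> bool:
    """Run one observe-decide-act-verify cycle; return True if verified."""
    candidate = locate_coarse()               # coarse capture + localization
    if candidate is None:
        return False
    if candidate.confidence < threshold:
        candidate = locate_zoomed(candidate)  # denser grid, re-evaluate
    if (candidate is None
            or candidate.confidence < threshold
            or not candidate.identity_verified):
        return False                          # no "test clicks"
    act(candidate.target)                     # exactly one minimal action
    return verify()                           # fresh capture or wait
```

A caller would repeat `run_step` until the objective is met, switching strategy when it keeps returning `False` rather than lowering the threshold.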
Verify-before-click rules
- Never click if target identity is ambiguous.
- Require at least two matching signals before click.
- Good signal pairs include:
- OCR text + expected UI region
- OCR text + matching button shape/icon nearby
- dialog title text + expected button position within that dialog
- known app/window focus + expected control location
- If confidence is low, do not "test click"; zoom and re-localize first.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
- preview intended coordinate + reason
- execute only after explicit confirmation.
Precision rules
- Prefer grid targets first, then use `dx`/`dy` for subcell precision.
- Keep `dx`/`dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
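For previewing where a grid target is likely to land, the following sketch converts a grid target into an approximate global pixel. It rests on two assumptions that this document does not confirm: `row`/`col` are zero-based, and `dx`/`dy` scale from the cell center to the cell edge (so `dx=1` is the right edge). Treat the result as a sanity check, not as the server's exact arithmetic.

```python
from typing import Any, Dict, Tuple


def grid_to_pixel(target: Dict[str, Any]) -> Tuple[int, int]:
    """Estimate the global pixel addressed by a grid target.

    Assumes zero-based row/col and dx/dy in [-1, 1] measured from the
    cell center to the cell edge. Both are assumptions, not documented facts.
    """
    dx = target.get("dx", 0.0)
    dy = target.get("dy", 0.0)
    if not (-1.0 <= dx <= 1.0 and -1.0 <= dy <= 1.0):
        raise ValueError("dx/dy must stay within [-1, 1]")

    cell_w = target["region_width"] / target["cols"]
    cell_h = target["region_height"] / target["rows"]
    x = target["region_x"] + (target["col"] + 0.5 + dx / 2) * cell_w
    y = target["region_y"] + (target["row"] + 0.5 + dy / 2) * cell_h
    return round(x), round(y)


# The minimal grid click from earlier in this document: 12x12 over 1920x1080.
example = {
    "mode": "grid",
    "region_x": 0, "region_y": 0, "region_width": 1920, "region_height": 1080,
    "rows": 12, "cols": 12, "row": 6, "col": 8, "dx": 0.0, "dy": 0.0,
}
print(grid_to_pixel(example))  # (1360, 585)
```

On a 12x12 grid over 1920x1080 each cell is 160x90 pixels, so a full `dx` step moves at most 80 pixels. That is why zooming first, which shrinks the cells, is more reliable than guessing offsets.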
Safety rules
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use `/batch`.
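A pre-flight check against the `/health` flags might look like the sketch below. The flag names `dry_run` and `allowed_region` come from the rules above, but the shape of the response (a flat JSON object, with `allowed_region` holding `x`, `y`, `width`, `height` or being absent) is an assumption, as are the function names.

```python
from typing import Any, Dict, Optional


def point_allowed(health: Dict[str, Any], x: int, y: int) -> bool:
    """Return True if (x, y) falls inside the allowed region, if one is set."""
    region: Optional[Dict[str, int]] = health.get("allowed_region")
    if region is None:
        return True  # no restriction reported
    return (region["x"] <= x < region["x"] + region["width"]
            and region["y"] <= y < region["y"] + region["height"])


def actions_take_effect(health: Dict[str, Any]) -> bool:
    """Return False when the server reports dry-run mode."""
    return not health.get("dry_run", False)


# Hypothetical /health body restricting input to the left half of a 1080p display.
health = {"dry_run": False, "allowed_region": {"x": 0, "y": 0, "width": 960, "height": 1080}}
print(point_allowed(health, 100, 200))   # True
print(point_allowed(health, 1500, 200))  # False
```

Checking locally before sending an action lets the agent report "target is outside the allowed region" to the user instead of retrying a click that can never succeed.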
Reliability rules
- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
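The retry-then-switch rule can be sketched as a loop over strategies, each given a bounded number of attempts. The names here are hypothetical, and each attempt is assumed to include its own re-localization and fresh-screenshot verification, returning `True` only when the expected state was observed.

```python
from typing import Callable, Iterable, Optional, Tuple

# (name, attempt): attempt returns True once the result was verified.
Strategy = Tuple[str, Callable[[], bool]]


def run_with_fallback(strategies: Iterable[Strategy], max_retries: int = 2) -> Optional[str]:
    """Try each strategy up to 1 + max_retries times, then move to the next.

    Returns the name of the strategy that succeeded, or None if all failed.
    """
    for name, attempt in strategies:
        for _ in range(1 + max_retries):
            if attempt():
                return name
    return None


# Hypothetical usage: clicking keeps failing, the hotkey works.
attempts = {"click": 0}


def click_row() -> bool:
    attempts["click"] += 1
    return False


print(run_with_fallback([("click", click_row), ("hotkey", lambda: True)]))  # hotkey
print(attempts["click"])  # 3
```

Bounding the attempts per strategy is what prevents click spam: after one try and two retries the loop stops repeating the same action and moves on.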
Fallback ladder for uncertain targeting
- Full-screen capture with a coarse grid.
- Zoom into the candidate area with a denser grid.
- OCR the full screen or the tighter region.
- Re-anchor on a more reliable nearby control, title, or label.
- Try a keyboard-first flow if the app supports it.
- Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
Do not skip from "uncertain click" straight to random retries.
Concrete screenshot -> image -> action example
Example loop:
- `GET /screen?screen=0` to capture the current app state
- if the UI is text-heavy, try `POST /ocr` first
- if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
  - "In this save dialog, which visible button is the primary action?"
  - "Is there a dismiss/close button in the top-right of this modal?"
- map the answer back to a Clickthrough target using the returned grid/region metadata
- click once with `POST /action`
- recapture the screen
- optionally use `POST /wait` or another `image`/OCR check to confirm the result
The key rule is simple: screenshot first, interpret second, click third, verify fourth. Do not collapse those steps into fake certainty.
App-specific playbooks (recommended)
Build per-app routines for repetitive tasks instead of generic clicking.
Launcher / search / start app playbook
Use this when the goal is "open app X" or "bring up tool Y".
- check `GET /windows` first in case the app is already open
- if present, use `POST /windows/action` to focus or restore it
- if absent, prefer `POST /launch` when you know the executable path
- if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
  - open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
  - type exact app name
  - wait for stable results with `POST /wait` or recapture
  - verify the result text with OCR or the `image` tool
  - press Enter or click the exact result once
- verify the app window now exists or is focused
Do not keep relaunching if the window already exists; that’s sloppy.
Dialog confirmation playbook
Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
- capture the dialog region with `POST /zoom`
- use OCR first for title/body/button labels
- if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
- identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
- for destructive actions, require explicit user confirmation unless already requested
- click once and verify the dialog disappeared or changed state
Good verification targets:
- dialog title vanished
- expected next window appeared
- destructive side effect is visible and confirmed
File picker playbook
Use for open/save dialogs.
- verify the file picker window is focused
- OCR the visible breadcrumb/path area, filename field, and button row
- prefer keyboard-first entry when possible:
  - type or paste the target path/name into the focused field
  - use `tab`/`shift+tab` to move predictably between filename and action buttons
- if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
- verify the intended filename/path is visible before confirming
- activate `Open`/`Save` once and verify the picker closes
If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
Browser tab / window playbook
Use for browser navigation, tab targeting, or web app recovery.
- use `GET /windows` to find the correct browser window, then `POST /windows/action` to focus it
- prefer keyboard-first navigation:
  - `ctrl+l`/`cmd+l` to focus the address bar
  - `ctrl+tab`/`ctrl+shift+tab` for tab movement when order is known
  - `ctrl+w` only for explicitly requested close actions
- verify tab or page identity with OCR on the tab strip or page heading
- if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
- after navigation, wait for visual stability or expected text before taking the next action
Do not assume a page loaded just because the click landed. Verify it.
Settings / preferences navigation playbook
Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
- identify the current settings page with OCR on the heading/sidebar
- use OCR to find the specific section label before trying to toggle anything
- if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
- prefer small reversible actions: one toggle, one dropdown, one field edit at a time
- after each change, verify the control state changed visually or via visible text
- if a save/apply button exists, treat it as a separate confirmation step and verify completion
Settings UIs love hiding side effects. Assume nothing.
Spotify playbook
- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
  - `Ctrl+L` (search)
  - type exact query
  - Enter
  - verify exact song+artist text
  - click/double-click row
  - verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.