---
name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
---

# Clickthrough HTTP Control

Use a strict observe-decide-act-verify loop.

## Getting a computer instance (user-owned setup)

The **user/operator** is responsible for provisioning and exposing the target machine.
The agent should not assume it can self-install this stack.

### What the user must do

1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
3. Configure secrets on target machine:
   - `CLICKTHROUGH_TOKEN` for general API auth
   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
4. Share connection details with the agent through a secure channel:
   - `base_url`
   - `x-clickthrough-token`
   - `x-clickthrough-exec-secret` (only when `/exec` is needed)

### What the agent should do

1. Validate connection with `GET /health` using provided headers.
2. Refuse `/exec` attempts when exec secret is missing/invalid.
3. Ask user for missing setup inputs instead of guessing infrastructure.

## What the agent can actually see

The agent does **not** inherently see the remote desktop.
Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.

That means:
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected

Do not write or think as if the agent is directly watching the screen in real time.
Say what you actually have: screenshots, OCR output, and fresh verification captures.

## Mini API map

- `GET /health` → server status + safety flags
- `GET /displays` → detected displays in zero-based API order
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
- `GET /windows` → discover visible desktop windows and their handles/processes
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
- `POST /launch` → start an app/process without dropping to a shell
- `POST /wait?screen=0` → wait for text, window, or visual state changes
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /batch?screen=0` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)

### Display selection

- Use `GET /displays` before operating on multi-monitor systems.
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.

### OCR usage

- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.

### Screenshot + `image` tool usage

Use the OpenClaw `image` tool when OCR is not enough.
This is especially useful for:
- identifying which visible button looks like the primary confirm action
- understanding dialog layout or pane structure
- distinguishing similar nearby controls by icon, spacing, or emphasis
- checking whether a visual state changed after a click

Good pattern:
1. capture with `GET /screen` or `POST /zoom`
2. hand that screenshot to the `image` tool
3. ask a precise question about the visible UI
4. convert the answer into a concrete Clickthrough target
5. act once
6. recapture and verify again

Ask narrow questions.
Good:
- "Which button in this dialog is the primary confirmation action?"
- "Is the scan still running, or does this look complete?"
- "Which of these tabs appears selected?"

Bad:
- "What should I click?"
- "Use your eyes and do the task"
- anything that assumes the model has live continuity without a new screenshot

### Header requirements

- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.

## `POST /action` request shape (important)

`/action` always expects an `action` plus an optional `target` object.
Do **not** invent top-level `x` / `y` fields.

Minimal pixel click:

```json
{
  "action": "click",
  "target": {"mode": "pixel", "x": 100, "y": 200},
  "button": "left",
  "clicks": 1
}
```

Minimal grid click:

```json
{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 6,
    "col": 8,
    "dx": 0.0,
    "dy": 0.0
  }
}
```

Other canonical examples:

```json
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}
```

Rules:
- `dx` / `dy` belong inside `target`, not beside it.
- `type` and `hotkey` usually do not need a `target`.
- For pixel targets, `x` / `y` are global desktop coordinates.
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.

## When to use `/exec`

Prefer structured GUI control first:
- `/screen`, `/zoom`, `/ocr` to observe
- `/action` or `/batch` to interact

Use `/exec` only when it is the cleanest available tool for the job, for example:
- querying machine state that the GUI does not expose well
- performing an explicit user-requested shell/system task
- recovering from a blocked GUI flow when normal interaction failed

Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.

## Core workflow (mandatory)

1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
2. Identify likely target region and compute an initial confidence score.
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
6. Execute one minimal action via `POST /action`.
7. Re-capture with `GET /screen` or use `POST /wait` to verify the expected state change.
8. Repeat until objective is complete.

## Verify-before-click rules

- Never click if target identity is ambiguous.
- Require at least two matching signals before click.
- Good signal pairs include:
  - OCR text + expected UI region
  - OCR text + matching button shape/icon nearby
  - dialog title text + expected button position within that dialog
  - known app/window focus + expected control location
- If confidence is low, do not "test click"; zoom and re-localize first.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.

## Precision rules

- Prefer grid targets first, then use `dx/dy` for subcell precision.
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.

## Safety rules

- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use `/batch`.

## Reliability rules

- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.

## Fallback ladder for uncertain targeting

1. Full-screen capture with a coarse grid.
2. Zoom into the candidate area with a denser grid.
3. OCR the full screen or the tighter region.
4. Re-anchor on a more reliable nearby control, title, or label.
5. Try a keyboard-first flow if the app supports it.
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.

Do not skip from "uncertain click" straight to random retries.

## Concrete screenshot -> `image` -> action example

Example loop:
1. `GET /screen?screen=0` to capture the current app state
2. if the UI is text-heavy, try `POST /ocr` first
3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
   - "In this save dialog, which visible button is the primary action?"
   - "Is there a dismiss/close button in the top-right of this modal?"
4. map the answer back to a Clickthrough target using the returned grid/region metadata
5. click once with `POST /action`
6. recapture the screen
7. optionally use `POST /wait` or another `image`/OCR check to confirm the result

The key rule is simple: screenshot first, interpret second, click third, verify fourth.
Do not collapse those steps into fake certainty.

## App-specific playbooks (recommended)

Build per-app routines for repetitive tasks instead of generic clicking.

### Spotify playbook

- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
  1) `Ctrl+L` (search)
  2) type exact query
  3) Enter
  4) verify exact song+artist text
  5) click/double-click row
  6) verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.