# API Reference (v0.1)

Base URL: `http://127.0.0.1:8123`

If `CLICKTHROUGH_TOKEN` is set, include header:

```http
x-clickthrough-token: <token>
```
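
A minimal sketch of how a client might attach this header. The helper name `auth_headers` and the idea that the client reads the token from the same `CLICKTHROUGH_TOKEN` variable are assumptions for illustration, not part of the API.

```python
import os

TOKEN_HEADER = "x-clickthrough-token"


def auth_headers(token=None):
    """Return request headers, adding the token header only when a token is available."""
    # Assumption: the client reads the token from the same variable name as the server.
    if token is None:
        token = os.environ.get("CLICKTHROUGH_TOKEN", "")
    headers = {"Accept": "application/json"}
    if token:
        headers[TOKEN_HEADER] = token
    return headers
```

Pass the returned dict as the `headers` argument of whichever HTTP client you use.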

## `GET /health`

Returns status and runtime safety flags, including `exec` capability config.

## `GET /displays`

Returns detected displays in API screen order.

```json
{
  "ok": true,
  "default_screen": 0,
  "displays": [
    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
  ]
}
```

`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
Invalid `screen` values fall back to `0`.
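
The following sketch shows one way a client could work with this response: look up a display by its `screen` value, mirroring the documented fallback to `0`, and test whether a global point lies on it. The helper names and the `SAMPLE` dict (copied from the example above) are illustrative only.

```python
SAMPLE = {
    "ok": True,
    "default_screen": 0,
    "displays": [
        {"screen": 0, "mss_index": 1, "primary": True, "x": 0, "y": 0, "width": 1920, "height": 1080},
        {"screen": 1, "mss_index": 2, "primary": False, "x": 1920, "y": 0, "width": 1920, "height": 1080},
    ],
}


def find_display(displays_response, screen=0):
    """Return the display entry for `screen`, falling back to screen 0 when unknown."""
    by_screen = {d["screen"]: d for d in displays_response["displays"]}
    return by_screen.get(screen, by_screen[0])


def contains_point(display, x, y):
    """True when the global desktop point (x, y) lies inside the display rectangle."""
    return (display["x"] <= x < display["x"] + display["width"]
            and display["y"] <= y < display["y"] + display["height"])
```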

## `GET /screen`

Query params:

- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `with_grid` (bool, default `true`)
- `grid_rows` (int, default env or `12`)
- `grid_cols` (int, default env or `12`)
- `include_labels` (bool, default `true`)
- `image_format` (`png`|`jpeg`, default `png`)
- `jpeg_quality` (1-100, default `85`)
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)

Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
`meta.region` uses global desktop coordinates.
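
As a rough illustration, the query parameters above can be assembled with the standard library. The helper `screen_url` is hypothetical, and encoding booleans as lowercase `true`/`false` is an assumption about what the server accepts.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8123"


def screen_url(screen=0, with_grid=True, grid_rows=None, grid_cols=None,
               image_format="png", as_image=False):
    """Build a GET /screen URL; grid sizes are omitted so the server defaults apply."""
    params = {
        "screen": screen,
        # Assumption: booleans are sent as lowercase strings.
        "with_grid": str(with_grid).lower(),
        "image_format": image_format,
        "asImage": str(as_image).lower(),
    }
    if grid_rows is not None:
        params["grid_rows"] = grid_rows
    if grid_cols is not None:
        params["grid_cols"] = grid_cols
    return f"{BASE_URL}/screen?{urlencode(params)}"
```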

Image-returning endpoints such as `/screen` and `/zoom` return pixels only; they do not give the agent live vision.
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.

## `POST /zoom`

Body:

```json
{
  "center_x": 1200,
  "center_y": 700,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
}
```

Query params:

- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)

Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
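
A small sketch of the coordinate conversion described above: a point measured from a display's top-left corner is shifted by that display's region origin to obtain global `center_x`/`center_y`. The helper `zoom_body` is hypothetical, and it assumes `meta.region` carries `x` and `y` keys like the `region` object in the OCR response.

```python
def zoom_body(region, local_x, local_y, width=500, height=350):
    """Build a /zoom body centred on a point given relative to the display's top-left corner."""
    # Assumption: `region` is the meta.region dict from /screen?screen=X.
    return {
        "center_x": region["x"] + local_x,
        "center_y": region["y"] + local_y,
        "width": width,
        "height": height,
    }
```

For a second display whose region starts at `x=1920`, a local point `(100, 200)` becomes the global centre `(2020, 200)`.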

`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.

## `POST /action`

Body: one action.

Important:
- the request body uses `action` plus an optional `target`
- pixel coordinates live inside `target` when `target.mode="pixel"`
- do **not** send top-level `x` / `y` fields

Query params:

- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`

Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.

### Pointer target modes

#### Pixel target

```json
{
  "mode": "pixel",
  "x": 100,
  "y": 200,
  "dx": 0,
  "dy": 0
}
```

#### Grid target

```json
{
  "mode": "grid",
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "rows": 12,
  "cols": 12,
  "row": 5,
  "col": 9,
  "dx": 0.0,
  "dy": 0.0
}
```

For grid targets, `dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
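
To make the grid arithmetic concrete, here is a hedged sketch of how a grid target could resolve to a global pixel. It assumes zero-based `row`/`col`, offsets measured from the cell centre, and `±1` reaching the cell edge; the server's actual formula may differ, so treat `grid_to_pixel` as an estimate for client-side planning only.

```python
def grid_to_pixel(target):
    """Estimate the global pixel a grid target points at."""
    cell_w = target["region_width"] / target["cols"]
    cell_h = target["region_height"] / target["rows"]
    # Assumption: row/col are zero-based, so +0.5 lands on the cell centre.
    center_x = target["region_x"] + (target["col"] + 0.5) * cell_w
    center_y = target["region_y"] + (target["row"] + 0.5) * cell_h
    # Assumption: dx/dy of +1 or -1 moves from the centre to the cell edge.
    x = center_x + target.get("dx", 0.0) * cell_w / 2
    y = center_y + target.get("dy", 0.0) * cell_h / 2
    return round(x), round(y)
```

Under these assumptions, the example target above (1920x1080 region, 12x12 grid, `row=5`, `col=9`, zero offsets) resolves to `(1520, 495)`.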

### Action examples

Click:

```json
{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 7,
    "col": 3,
    "dx": 0.2,
    "dy": -0.1
  },
  "clicks": 1,
  "button": "left"
}
```

Scroll:

```json
{
  "action": "scroll",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "scroll_amount": -500
}
```

Type text:

```json
{
  "action": "type",
  "text": "hello world",
  "interval_ms": 20
}
```

Hotkey:

```json
{
  "action": "hotkey",
  "keys": ["ctrl", "l"]
}
```

Right click:

```json
{
  "action": "right_click",
  "target": {"mode": "pixel", "x": 1300, "y": 740}
}
```

Move only:

```json
{
  "action": "move",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "duration_ms": 150
}
```
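
The body rules stated for `POST /action` (an `action` field, coordinates only inside `target`, no top-level `x`/`y`) can be checked client-side before sending. The function `check_action` is a hypothetical helper and does not reproduce the server's own validation.

```python
def check_action(payload):
    """Return a list of problems found in an /action body; an empty list means none found."""
    problems = []
    if "action" not in payload:
        problems.append("missing 'action'")
    for key in ("x", "y"):
        if key in payload:
            problems.append(f"top-level '{key}' is not allowed; put it inside 'target'")
    target = payload.get("target")
    if target is not None and target.get("mode") not in ("pixel", "grid"):
        problems.append("target.mode must be 'pixel' or 'grid'")
    return problems
```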

## `GET /windows`

List desktop windows using structured filters instead of shelling out.

Query params:

- `title_contains` (optional substring match)
- `title_regex` (optional case-insensitive regex)
- `process_name` (optional exact process name, e.g. `explorer.exe`)
- `hwnd` (optional exact window handle)
- `visible_only` (bool, default `true`)

```json
{
  "ok": true,
  "count": 1,
  "windows": [
    {
      "hwnd": 132640,
      "title": "WinDirStat",
      "class_name": "WinDirStatMainWindow",
      "pid": 18420,
      "process_name": "windirstat.exe",
      "visible": true,
      "enabled": true,
      "minimized": false,
      "maximized": false,
      "foreground": true,
      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
    }
  ]
}
```

Notes:
- Currently supported on Windows hosts only.
- Mutation endpoints that reuse these filters, such as `POST /windows/action`, return `409` when the filters match more than one window.
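
A client can apply the same "exactly one window" rule to a `GET /windows` response before issuing a mutation. The helpers below are illustrative; `window_center` assumes `rect` is expressed in global desktop coordinates, which this reference does not state explicitly.

```python
def pick_single_window(response):
    """Return the only window in a /windows response, or raise if the match is ambiguous or empty."""
    windows = response.get("windows", [])
    if len(windows) != 1:
        raise LookupError(f"expected exactly one window, got {len(windows)}")
    return windows[0]


def window_center(window):
    """Centre point of a window's rect (assumed to be global desktop coordinates)."""
    rect = window["rect"]
    return rect["x"] + rect["width"] // 2, rect["y"] + rect["height"] // 2
```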

## `POST /windows/action`

Perform a structured window action against exactly one matched window.

```json
{
  "action": "focus",
  "title_contains": "WinDirStat",
  "visible_only": true,
  "timeout_ms": 3000
}
```

Supported actions:
- `focus`
- `restore`
- `minimize`
- `maximize`
- `close`

The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).
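
A short sketch of building this request body with the supported actions checked up front. The helper `window_action_body` is hypothetical, and it assumes the body accepts the same filter names as `GET /windows`; only `title_contains` and `visible_only` appear in the example above.

```python
WINDOW_ACTIONS = ("focus", "restore", "minimize", "maximize", "close")


def window_action_body(action, timeout_ms=3000, **filters):
    """Build a /windows/action body, rejecting actions outside the documented list."""
    if action not in WINDOW_ACTIONS:
        raise ValueError(f"unsupported window action: {action!r}")
    if not filters:
        raise ValueError("pass at least one filter such as title_contains")
    # Assumption: filter names match the GET /windows query parameters.
    return {"action": action, **filters, "timeout_ms": timeout_ms}
```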

## `POST /launch`

Start an app/process without invoking a shell.

```json
{
  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
  "args": [],
  "cwd": "C:/Program Files/WinDirStat",
  "wait_for_window": true,
  "match": {
    "title_contains": "WinDirStat",
    "visible_only": true
  },
  "timeout_ms": 8000
}
```

Notes:
- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
- `dry_run=true` returns the resolved argv/cwd without launching.
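
Because no shell is involved, arguments travel as a list rather than a single command string. The sketch below builds a launch body along those lines; `launch_body` is a hypothetical helper, and placing `dry_run` in the body (as `/exec` does) is an assumption, since the notes above do not say where it goes.

```python
def launch_body(executable, args=(), cwd=None, window_title=None,
                timeout_ms=8000, dry_run=False):
    """Build a /launch body with args as a list, since no shell will split a string."""
    if isinstance(args, str):
        raise TypeError("args must be a list of strings, not one command string")
    body = {"executable": executable, "args": list(args), "timeout_ms": timeout_ms}
    if cwd is not None:
        body["cwd"] = cwd
    if window_title is not None:
        body["wait_for_window"] = True
        body["match"] = {"title_contains": window_title, "visible_only": True}
    if dry_run:
        # Assumption: dry_run is a body field, mirroring the /exec request body.
        body["dry_run"] = True
    return body
```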

## `POST /wait`

Wait on a structured UI condition instead of guessing sleep durations.

Query params:

- `screen` (int, default `0`) - used for text and visual waits

### Wait for text to appear

```json
{
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Scan complete",
    "match": "contains",
    "present": true,
    "language_hint": "eng",
    "min_confidence": 0.4
  },
  "timeout_ms": 15000,
  "poll_interval_ms": 400
}
```

### Wait for a window state

```json
{
  "condition": {
    "kind": "window",
    "title_contains": "WinDirStat",
    "visible_only": true,
    "state": "focused"
  },
  "timeout_ms": 5000,
  "poll_interval_ms": 200
}
```

Window states:
- `exists`
- `focused`
- `closed`

### Wait for visual change or stability

```json
{
  "condition": {
    "kind": "visual",
    "state": "stable",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "diff_threshold": 0.005,
    "stable_for_ms": 1000
  },
  "timeout_ms": 12000,
  "poll_interval_ms": 300
}
```

Visual states:
- `change`: succeeds when the average pixel diff crosses `diff_threshold`
- `stable`: succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`

Notes:
- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
- Window waits build on the structured window discovery endpoint.
- Visual waits compare repeated captures of either the full selected display or an explicit region.
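
To illustrate the `stable` semantics, the sketch below computes an average pixel difference and finds the first moment a run of low-diff samples has lasted `stable_for_ms`. It is not the server's implementation: scaling the diff to `[0, 1]` by dividing by 255 and working on flat grayscale lists are assumptions made to keep the example small.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute difference of two equal-size 8-bit grayscale frames, scaled to [0, 1]."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return total / (255 * len(frame_a))


def first_stable_time(samples, diff_threshold, stable_for_ms):
    """Given (time_ms, diff) pairs in time order, return when the diff has stayed
    at or below the threshold for stable_for_ms, or None if that never happens."""
    run_start = None
    for time_ms, diff in samples:
        if diff <= diff_threshold:
            if run_start is None:
                run_start = time_ms
            if time_ms - run_start >= stable_for_ms:
                return time_ms
        else:
            run_start = None  # any spike restarts the stability window
    return None
```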

## `POST /ocr`

Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.

Query params:

- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`

Body:

```json
{
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
}
```

Modes:
- `screen` (default): OCR over full selected monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)

Region mode example:

```json
{
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
}
```

Image mode example:

```json
{
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
}
```

Response shape:

```json
{
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
  "result": {
    "mode": "screen",
    "language_hint": "eng",
    "min_confidence": 0.4,
    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
    "blocks": [
      {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
      }
    ]
  }
}
```

Notes:
- Output is deterministic JSON: blocks are ordered top-to-bottom, then left-to-right.
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
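
Since `bbox` values are global for `screen` and `region` modes, an OCR block can be turned directly into a pixel-target click. The helper names below are hypothetical; the sketch does not apply to `image` mode, where coordinates are image-local.

```python
def block_center(block):
    """Centre of an OCR block's bounding box."""
    bbox = block["bbox"]
    return bbox["x"] + bbox["width"] // 2, bbox["y"] + bbox["height"] // 2


def click_for_block(block):
    """Build a POST /action body that clicks the centre of an OCR block."""
    x, y = block_center(block)
    return {"action": "click", "target": {"mode": "pixel", "x": x, "y": y}}
```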

## `POST /ocr/find`

Search OCR output for matching text instead of post-processing raw OCR blocks client-side.

Query params:

- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`

```json
{
  "mode": "screen",
  "query": "Settings",
  "match": "contains",
  "group_lines": true,
  "max_results": 10,
  "language_hint": "eng",
  "min_confidence": 0.4
}
```

Modes:
- `screen`
- `region`
- `image`

Options:
- `match`: `contains`, `exact`, or `regex`
- `group_lines=true`: combine nearby OCR words into line-level candidates before matching
- `max_results`: result cap after confidence sorting

Response includes:
- `matches`: confidence-sorted candidate matches
- `match_count`
- `blocks_considered`
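
For readers who want a mental model of the three `match` modes, here is a client-side approximation over raw OCR blocks. It is a sketch only: case-insensitive comparison is an assumption, line grouping is omitted, and `find_matches` is not the server's code.

```python
import re


def find_matches(blocks, query, match="contains", max_results=10):
    """Filter OCR blocks by text, sort by confidence (highest first), then cap the result."""
    folded = query.casefold()

    def hit(text):
        # Assumption: matching ignores case in all three modes.
        if match == "contains":
            return folded in text.casefold()
        if match == "exact":
            return text.casefold() == folded
        if match == "regex":
            return re.search(query, text, re.IGNORECASE) is not None
        raise ValueError(f"unknown match mode: {match!r}")

    matches = [b for b in blocks if hit(b["text"])]
    matches.sort(key=lambda b: b["confidence"], reverse=True)
    return matches[:max_results]
```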

## `POST /exec`

Execute a shell command on the host running Clickthrough.

Requirements:
- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
- send header `x-clickthrough-exec-secret: <secret>`

```json
{
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
}
```

Notes:
- `shell` supports `powershell`, `bash`, `cmd`
- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)

Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
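
A hedged sketch of preparing an `/exec` request with both headers using only the standard library. The function `exec_request` is hypothetical, it defaults to `dry_run=True` as a cautious choice of this example, and it assumes the general token header also applies to this endpoint when `CLICKTHROUGH_TOKEN` is set.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8123"


def exec_request(command, exec_secret, token=None, shell=None, timeout_s=20, dry_run=True):
    """Prepare (but do not send) a POST /exec request."""
    body = {"command": command, "timeout_s": timeout_s, "dry_run": dry_run}
    if shell is not None:
        body["shell"] = shell  # omitted so the server's default shell applies
    headers = {
        "Content-Type": "application/json",
        "x-clickthrough-exec-secret": exec_secret,
    }
    if token:
        headers["x-clickthrough-token"] = token
    return urllib.request.Request(
        f"{BASE_URL}/exec",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Send the prepared request with `urllib.request.urlopen(req)` once the command has been reviewed.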

## `POST /batch`

Runs multiple `action` payloads sequentially.

Query params:

- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`

```json
{
  "actions": [
    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
  ],
  "stop_on_error": true
}
```
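
As an illustration, a batch body like the one above can be generated for any pixel target. The helper `move_then_click` is hypothetical and simply composes two documented action payloads.

```python
def move_then_click(x, y, duration_ms=150):
    """Build a /batch body that moves the pointer to (x, y) and then clicks there."""
    target = {"mode": "pixel", "x": x, "y": y}
    return {
        "actions": [
            {"action": "move", "target": target, "duration_ms": duration_ms},
            # Copy the target so the two actions do not share one dict.
            {"action": "click", "target": dict(target)},
        ],
        "stop_on_error": True,
    }
```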