Archived

This repository has been archived on 2026-05-20. You can view files and clone it. You cannot open issues or pull requests or push a commit.

Luna c66779d929

feat(verify): add compound action+verify flows

2026-05-01 16:26:57 +02:00

API Reference (v0.1)

Base URL: http://127.0.0.1:8123

If CLICKTHROUGH_TOKEN is set, include header:

x-clickthrough-token: <token>
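
A minimal sketch of attaching the token header from a Python client. It assumes the client reads the token from the same CLICKTHROUGH_TOKEN environment variable as the server; `build_headers` is a hypothetical helper name, not part of Clickthrough.

```python
import os

BASE_URL = "http://127.0.0.1:8123"


def build_headers(token=None):
    """Return request headers, adding the token header only when a token is set."""
    # Hypothetical helper: falls back to the CLICKTHROUGH_TOKEN environment variable.
    if token is None:
        token = os.environ.get("CLICKTHROUGH_TOKEN")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["x-clickthrough-token"] = token
    return headers
```

The returned dict can be passed to whichever HTTP client the caller already uses.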

`GET /health`

Returns status and runtime safety flags, including exec capability config.

`GET /displays`

Returns detected displays in API screen order.

{
  "ok": true,
  "default_screen": 0,
  "displays": [
    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
  ]
}

screen is zero-based. screen=0 is the primary display when detectable, falling back to the first monitor reported by the capture backend. Invalid screen values fall back to 0.
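
The fallback rule can be mirrored on the client when picking a display from the response. This is only a sketch: `select_display` is a hypothetical helper, and it assumes the response always contains an entry with screen 0.

```python
def select_display(displays_response, screen=0):
    """Return the display entry for `screen`, falling back to screen 0."""
    by_screen = {d["screen"]: d for d in displays_response["displays"]}
    # Mirror the documented fallback: invalid screen values resolve to 0.
    return by_screen.get(screen, by_screen[0])


sample_response = {
    "ok": True,
    "default_screen": 0,
    "displays": [
        {"screen": 0, "mss_index": 1, "primary": True,
         "x": 0, "y": 0, "width": 1920, "height": 1080},
        {"screen": 1, "mss_index": 2, "primary": False,
         "x": 1920, "y": 0, "width": 1920, "height": 1080},
    ],
}
```

For example, `select_display(sample_response, 1)["x"]` gives the global x offset of the second display.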

`GET /screen`

Query params:

screen (int, default 0) - zero-based display selector; invalid values fall back to 0
with_grid (bool, default true)
grid_rows (int, default env or 12)
grid_cols (int, default env or 12)
include_labels (bool, default true)
image_format (png|jpeg, default png)
jpeg_quality (1-100, default 85)
asImage (bool, default false) - if true, return raw image bytes only (image/png or image/jpeg)

Default response includes base64 image and metadata (meta.region, meta.screen, meta.displays, optional meta.grid). meta.region uses global desktop coordinates.

These endpoints only return image data; they do not give the agent live vision. If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's image tool and ask a narrow question about the visible UI.
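
A small sketch of building a `GET /screen` URL from the query parameters listed above. `screen_url` is a hypothetical helper, and rendering booleans as lowercase `true`/`false` is an assumption about what the server's query parser accepts.

```python
from urllib.parse import urlencode


def screen_url(base_url="http://127.0.0.1:8123", **params):
    """Build a GET /screen URL from keyword parameters."""
    # Assumption: the server accepts lowercase true/false for bool params.
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    query = urlencode(encoded)
    return f"{base_url}/screen?{query}" if query else f"{base_url}/screen"
```

Calling `screen_url(screen=1, asImage=True)` yields a URL that should return raw image bytes for the second display.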

`POST /zoom`

Body:

{
  "center_x": 1200,
  "center_y": 700,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
}

Query params:

screen (int, default 0) - zero-based display selector; invalid values fall back to 0
asImage (bool, default false) - if true, return raw image bytes only (image/png or image/jpeg)

The default response returns the cropped image plus region metadata in global pixel coordinates. center_x and center_y are also global coordinates; use the selected display's meta.region from /screen?screen=X as the coordinate base.

POST /zoom is often the best screenshot to hand to the image tool when the agent needs help judging a specific button, icon, or dialog layout.
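
One way to derive a zoom request is to center it on a bounding box that is already known in global pixels, such as an OCR block. The sketch below is illustrative: `zoom_body_for_bbox` and its `padding` argument are hypothetical, and only body fields shown in the example above are used.

```python
def zoom_body_for_bbox(bbox, padding=40, grid_rows=20, grid_cols=20):
    """Build a POST /zoom body centered on a bbox given in global pixels."""
    return {
        "center_x": bbox["x"] + bbox["width"] // 2,
        "center_y": bbox["y"] + bbox["height"] // 2,
        # Pad the crop on every side so surrounding UI stays visible.
        "width": bbox["width"] + 2 * padding,
        "height": bbox["height"] + 2 * padding,
        "with_grid": True,
        "grid_rows": grid_rows,
        "grid_cols": grid_cols,
        "include_labels": True,
        "image_format": "png",
    }
```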

`POST /action`

Body: one action.

Important:

the request body uses action plus an optional target
pixel coordinates live inside target when target.mode="pixel"
do not send top-level x / y fields

Query params:

screen (int, default 0) - zero-based display selector included in the response metadata; invalid values fall back to 0

Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture /screen?screen=X and use that response's meta.region or grid metadata to compute the target.

Pointer target modes

Pixel target

{
  "mode": "pixel",
  "x": 100,
  "y": 200,
  "dx": 0,
  "dy": 0
}

Grid target

{
  "mode": "grid",
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "rows": 12,
  "cols": 12,
  "row": 5,
  "col": 9,
  "dx": 0.0,
  "dy": 0.0
}

dx/dy are normalized offsets in [-1, 1] inside the selected cell.
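
To make the grid arithmetic concrete, here is a sketch of how a grid target could map to a pixel. It rests on two assumptions that the reference does not state: `row` and `col` are zero-based, and an offset of 0 means the cell center while -1 and 1 mean the cell edges. The server's actual conversion may differ.

```python
def grid_target_to_pixel(target):
    """Convert a grid target to (x, y) in global pixels (assumed semantics)."""
    cell_w = target["region_width"] / target["cols"]
    cell_h = target["region_height"] / target["rows"]
    # Assumption: zero-based indices; dx/dy of +/-1 reach the cell edges,
    # so each offset moves by at most half a cell from the center.
    x = target["region_x"] + (target["col"] + 0.5 + target.get("dx", 0.0) / 2) * cell_w
    y = target["region_y"] + (target["row"] + 0.5 + target.get("dy", 0.0) / 2) * cell_h
    return round(x), round(y)


example_target = {
    "mode": "grid",
    "region_x": 0, "region_y": 0,
    "region_width": 1920, "region_height": 1080,
    "rows": 12, "cols": 12,
    "row": 5, "col": 9,
    "dx": 0.0, "dy": 0.0,
}
```

Under these assumptions each cell of the example grid is 160 by 90 pixels, so the target above resolves to the center of the cell at column 9, row 5.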

Action examples

Click:

{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 7,
    "col": 3,
    "dx": 0.2,
    "dy": -0.1
  },
  "clicks": 1,
  "button": "left"
}

Scroll:

{
  "action": "scroll",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "scroll_amount": -500
}

Type text:

{
  "action": "type",
  "text": "hello world",
  "interval_ms": 20
}

Hotkey:

{
  "action": "hotkey",
  "keys": ["ctrl", "l"]
}

Right click:

{
  "action": "right_click",
  "target": {"mode": "pixel", "x": 1300, "y": 740}
}

Move only:

{
  "action": "move",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "duration_ms": 150
}
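
The payload shapes above can be wrapped in small builders so that coordinates always end up inside `target`. These helper names are hypothetical and only cover fields that appear in the examples.

```python
def pixel_target(x, y, dx=0, dy=0):
    """Return a pixel-mode target in global desktop coordinates."""
    return {"mode": "pixel", "x": x, "y": y, "dx": dx, "dy": dy}


def click_action(x, y, button="left", clicks=1):
    """Build a click action; x / y live inside target, never at the top level."""
    return {
        "action": "click",
        "target": pixel_target(x, y),
        "clicks": clicks,
        "button": button,
    }


def type_action(text, interval_ms=20):
    """Build a type action; no target is needed for keyboard input."""
    return {"action": "type", "text": text, "interval_ms": interval_ms}
```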

`GET /windows`

List desktop windows using structured filters instead of shelling out.

Query params:

title_contains (optional substring match)
title_regex (optional case-insensitive regex)
process_name (optional exact process name, e.g. explorer.exe)
hwnd (optional exact window handle)
visible_only (bool, default true)

{
  "ok": true,
  "count": 1,
  "windows": [
    {
      "hwnd": 132640,
      "title": "WinDirStat",
      "class_name": "WinDirStatMainWindow",
      "pid": 18420,
      "process_name": "windirstat.exe",
      "visible": true,
      "enabled": true,
      "minimized": false,
      "maximized": false,
      "foreground": true,
      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
    }
  ]
}

Notes:

Currently supported on Windows hosts only.
Mutation endpoints that target a window return 409 when the filters match more than one window.

`POST /windows/action`

Perform a structured window action against exactly one matched window.

{
  "action": "focus",
  "title_contains": "WinDirStat",
  "visible_only": true,
  "timeout_ms": 3000
}

Supported actions:

focus
restore
minimize
maximize
close

The response includes the matched pre-action window and the final observed window state (or closed=true if it disappeared).
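
A sketch of validating a window action on the client before sending it. `window_action_body` is a hypothetical helper; passing filters other than `title_contains` and `visible_only` in the body is an assumption based on the `GET /windows` filter names.

```python
SUPPORTED_WINDOW_ACTIONS = {"focus", "restore", "minimize", "maximize", "close"}


def window_action_body(action, timeout_ms=3000, **filters):
    """Build a POST /windows/action body, rejecting unsupported actions."""
    if action not in SUPPORTED_WINDOW_ACTIONS:
        raise ValueError(f"unsupported window action: {action!r}")
    if not filters:
        # Without a filter the request could not single out one window.
        raise ValueError("at least one window filter is required")
    return {"action": action, **filters, "timeout_ms": timeout_ms}
```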

`POST /launch`

Start an app/process without invoking a shell.

{
  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
  "args": [],
  "cwd": "C:/Program Files/WinDirStat",
  "wait_for_window": true,
  "match": {
    "title_contains": "WinDirStat",
    "visible_only": true
  },
  "timeout_ms": 8000
}

Notes:

Launch uses direct process execution (subprocess.Popen) rather than PowerShell/CMD.
If wait_for_window=true, the server polls for a matching window and returns window_found.
dry_run=true returns the resolved argv/cwd without launching.
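
The difference between direct execution and a shell can be sketched as follows. This is not the server's code: `launch` and the `argv`/`cwd`/`pid` result keys are hypothetical, and the sketch only illustrates passing an argument list to `subprocess.Popen` with the default `shell=False`.

```python
import subprocess


def resolve_launch(executable, args=(), cwd=None):
    """Return the argv list and working directory a launch would use."""
    return {"argv": [executable, *args], "cwd": cwd}


def launch(executable, args=(), cwd=None, dry_run=False):
    """Start a process without a shell, or only report what would run."""
    resolved = resolve_launch(executable, args, cwd)
    if dry_run:
        return resolved
    # An argv list with the default shell=False: no PowerShell/CMD parsing,
    # so arguments are passed to the process exactly as given.
    process = subprocess.Popen(resolved["argv"], cwd=resolved["cwd"])
    return {**resolved, "pid": process.pid}
```

Because the arguments never pass through a shell, quoting and metacharacters in `args` are not interpreted.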

`POST /vision/diff`

Measure whether a screen region changed meaningfully between two captures.

Query params:

screen (int, default 0) - used for mode=screen and mode=region

Compare live captures:

{
  "mode": "region",
  "region_x": 120,
  "region_y": 80,
  "region_width": 600,
  "region_height": 300,
  "delay_ms": 400,
  "diff_threshold": 0.01
}

Compare provided images:

{
  "mode": "image",
  "before_image_base64": "iVBORw0KGgoAAA...",
  "after_image_base64": "iVBORw0KGgoBBB...",
  "diff_threshold": 0.01
}

Response includes:

diff_ratio — average normalized pixel difference
changed — whether diff_ratio >= diff_threshold
region — compared region
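
As a worked illustration of `diff_ratio` and `changed`, the sketch below treats two captures as flat lists of 8-bit grayscale values. The exact formula the server uses is not documented here, so the mean absolute difference scaled by 255 is an assumption; only the `changed` rule (`diff_ratio >= diff_threshold`) comes from the description above.

```python
def diff_ratio(before, after):
    """Mean absolute difference of 8-bit grayscale pixels, scaled to [0, 1]."""
    if not before or len(before) != len(after):
        raise ValueError("captures must be non-empty and the same size")
    total = sum(abs(a - b) for a, b in zip(before, after))
    return total / (255 * len(before))


def compare(before, after, diff_threshold=0.01):
    """Return diff_ratio and whether it reaches the threshold."""
    ratio = diff_ratio(before, after)
    return {"diff_ratio": ratio, "changed": ratio >= diff_threshold}
```

With four pixels where one flips from black to white, the ratio is 0.25, well above the default threshold of 0.01.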

`POST /vision/stability`

Measure whether a screen region stays visually stable over a short interval.

Query params:

screen (int, default 0)

{
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "sample_interval_ms": 250,
  "duration_ms": 1200,
  "diff_threshold": 0.005
}

Response includes:

stable
sample_count
max_diff_ratio
avg_diff_ratio
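
A sketch of how the summary fields could be derived from the diff ratios between consecutive samples. Treating the region as stable when the largest ratio stays at or below `diff_threshold` is an assumption, and `sample_count` is left out because the reference does not say whether it counts captures or comparisons.

```python
def summarize_stability(diff_ratios, diff_threshold=0.005):
    """Summarize consecutive-sample diff ratios (assumed semantics)."""
    if not diff_ratios:
        raise ValueError("at least one diff ratio is required")
    max_diff = max(diff_ratios)
    return {
        # Assumption: stable means no sampled diff exceeded the threshold.
        "stable": max_diff <= diff_threshold,
        "max_diff_ratio": max_diff,
        "avg_diff_ratio": sum(diff_ratios) / len(diff_ratios),
    }
```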

`POST /wait`

Wait on a structured UI condition instead of guessing sleep durations.

Query params:

screen (int, default 0) - used for text and visual waits

Wait for text to appear

{
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Scan complete",
    "match": "contains",
    "present": true,
    "language_hint": "eng",
    "min_confidence": 0.4
  },
  "timeout_ms": 15000,
  "poll_interval_ms": 400
}

Wait for a window state

{
  "condition": {
    "kind": "window",
    "title_contains": "WinDirStat",
    "visible_only": true,
    "state": "focused"
  },
  "timeout_ms": 5000,
  "poll_interval_ms": 200
}

Window states:

exists
focused
closed

Wait for visual change or stability

{
  "condition": {
    "kind": "visual",
    "state": "stable",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "diff_threshold": 0.005,
    "stable_for_ms": 1000
  },
  "timeout_ms": 12000,
  "poll_interval_ms": 300
}

Visual states:

change — succeeds when the average pixel diff crosses diff_threshold
stable — succeeds when the diff stays at or below diff_threshold for stable_for_ms

Notes:

Text waits reuse the OCR pipeline and return matching OCR blocks on success.
Window waits build on the structured window discovery endpoint.
Visual waits compare repeated captures of either the full selected display or an explicit region.
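
The `timeout_ms` and `poll_interval_ms` fields describe a polling loop. The sketch below shows that pattern on the client side with a caller-supplied check; it is not the server's implementation, and `wait_for` with its injectable `clock` and `sleep` arguments is hypothetical.

```python
import time


def wait_for(check, timeout_ms, poll_interval_ms,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns True or the timeout elapses."""
    deadline = clock() + timeout_ms / 1000
    polls = 0
    while True:
        polls += 1
        if check():
            return {"ok": True, "polls": polls}
        if clock() >= deadline:
            return {"ok": False, "polls": polls}
        sleep(poll_interval_ms / 1000)
```

Injecting `clock` and `sleep` keeps the loop testable without real delays.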

`POST /action/verify`

Execute one action and wait for a structured success condition.

Query params:

screen (int, default 0)

{
  "action": {
    "action": "click",
    "target": {"mode": "pixel", "x": 1300, "y": 740}
  },
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Settings",
    "match": "contains",
    "present": true,
    "language_hint": "eng",
    "min_confidence": 0.4
  },
  "retries": 1,
  "timeout_ms": 4000,
  "poll_interval_ms": 250,
  "retry_delay_ms": 250
}

Condition kinds mirror POST /wait:

text
window
visual

The response returns per-attempt action output plus structured verification output.
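
The action-then-verify flow can be pictured as a retry loop. This is a sketch under assumptions: `act_and_verify` is hypothetical, `retries=1` is taken to mean one extra attempt after the first, and the result keys are illustrative rather than the endpoint's actual response fields.

```python
import time


def act_and_verify(perform_action, verify, retries=1,
                   retry_delay_ms=250, sleep=time.sleep):
    """Run an action, check a condition, and retry on failure."""
    attempts = []
    # Assumption: total attempts = 1 initial try + `retries` extra tries.
    for attempt in range(1, retries + 2):
        action_output = perform_action()
        verified = bool(verify())
        attempts.append(
            {"attempt": attempt, "action": action_output, "verified": verified}
        )
        if verified:
            break
        if attempt <= retries:
            sleep(retry_delay_ms / 1000)
    return {"ok": attempts[-1]["verified"], "attempts": attempts}
```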

`POST /ocr`

Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.

Query params:

screen (int, default 0) - zero-based display selector for mode=screen and mode=region; invalid values fall back to 0

Body:

{
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
}

Modes:

screen (default): OCR over full selected monitor
region: OCR over explicit region (region_x, region_y, region_width, region_height)
image: OCR over provided image_base64 (supports plain base64 or data URL)

Region mode example:

{
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
}

Image mode example:

{
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
}

Response shape:

{
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
  "result": {
    "mode": "screen",
    "language_hint": "eng",
    "min_confidence": 0.4,
    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
    "blocks": [
      {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
      }
    ]
  }
}

Notes:

Output is deterministic JSON: blocks are ordered top-to-bottom, then left-to-right.
bbox coordinates are in global screen space for screen/region, and image-local for image.
Requires tesseract executable plus Python package pytesseract.
If tesseract is not on PATH, set CLICKTHROUGH_TESSERACT_CMD to the full executable path.
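
Because bbox coordinates are global for screen and region modes, an OCR block can be turned into a pixel click target directly. A minimal sketch, with `click_action_for_text` as a hypothetical helper that matches block text exactly:

```python
def click_action_for_text(ocr_response, text):
    """Return a click action aimed at the center of the first matching block."""
    for block in ocr_response["result"]["blocks"]:
        if block["text"] == text:
            bbox = block["bbox"]
            return {
                "action": "click",
                "target": {
                    "mode": "pixel",
                    "x": bbox["x"] + bbox["width"] // 2,
                    "y": bbox["y"] + bbox["height"] // 2,
                },
            }
    return None


sample_ocr = {
    "ok": True,
    "result": {
        "mode": "screen",
        "blocks": [
            {"text": "Settings", "confidence": 0.9821,
             "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}},
        ],
    },
}
```

This shortcut does not apply to image mode, where bbox values are local to the supplied image.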

`POST /ocr/find`

Search OCR output for matching text instead of post-processing raw OCR blocks client-side.

Query params:

screen (int, default 0) - used for mode=screen and mode=region

{
  "mode": "screen",
  "query": "Settings",
  "match": "contains",
  "group_lines": true,
  "max_results": 10,
  "language_hint": "eng",
  "min_confidence": 0.4
}

Modes:

screen
region
image

Options:

match: contains, exact, or regex
group_lines=true: combine nearby OCR words into line-level candidates before matching
max_results: result cap after confidence sorting

Response includes:

matches — confidence-sorted candidate matches
match_count
blocks_considered
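
For intuition, the three match modes can be approximated over a list of OCR blocks as below. This is a client-side sketch, not the endpoint's logic: case-insensitive comparison and highest-confidence-first ordering are assumptions, and `find_matches` is a hypothetical name.

```python
import re


def find_matches(blocks, query, match="contains", max_results=10):
    """Filter OCR blocks by text and return the most confident matches."""
    if match == "contains":
        def is_hit(text):
            return query.lower() in text.lower()
    elif match == "exact":
        def is_hit(text):
            return text.lower() == query.lower()
    elif match == "regex":
        pattern = re.compile(query, re.IGNORECASE)

        def is_hit(text):
            return pattern.search(text) is not None
    else:
        raise ValueError(f"unknown match mode: {match!r}")
    hits = [block for block in blocks if is_hit(block["text"])]
    # Assumption: "confidence-sorted" means highest confidence first.
    hits.sort(key=lambda block: block["confidence"], reverse=True)
    return hits[:max_results]
```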

`POST /exec`

Execute a shell command on the host running Clickthrough.

Requirements:

CLICKTHROUGH_EXEC_SECRET must be configured on the server
send header x-clickthrough-exec-secret: <secret>

{
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
}

Notes:

shell supports powershell, bash, cmd
if shell is omitted, server uses CLICKTHROUGH_EXEC_DEFAULT_SHELL
output is truncated based on CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS
endpoint can be disabled with CLICKTHROUGH_EXEC_ENABLED=false
if CLICKTHROUGH_EXEC_SECRET is missing, /exec is blocked (403)

Response includes stdout, stderr, exit_code, timeout state, and execution metadata.
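
A sketch of preparing an `/exec` request with the secret header. `exec_request` is a hypothetical helper; it omits `shell` when not given so the server default applies, and it does not add `x-clickthrough-token`, which the caller may also need if CLICKTHROUGH_TOKEN is set.

```python
import json

SUPPORTED_SHELLS = {"powershell", "bash", "cmd"}


def exec_request(command, secret, shell=None, timeout_s=20,
                 cwd=None, dry_run=False):
    """Return (headers, body_bytes) for a POST /exec call."""
    headers = {
        "Content-Type": "application/json",
        "x-clickthrough-exec-secret": secret,
    }
    body = {"command": command, "timeout_s": timeout_s, "dry_run": dry_run}
    if shell is not None:
        if shell not in SUPPORTED_SHELLS:
            raise ValueError(f"unsupported shell: {shell!r}")
        body["shell"] = shell
    if cwd is not None:
        body["cwd"] = cwd
    return headers, json.dumps(body).encode("utf-8")
```

Setting `dry_run=True` is a reasonable first step when testing a new command.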

`POST /batch`

Runs multiple action payloads sequentially.

Query params:

screen (int, default 0) - zero-based display selector applied to each action response; invalid values fall back to 0

{
  "actions": [
    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
  ],
  "stop_on_error": true
}
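
A batch body like the one above can be generated from a list of points. The helper name `click_sequence` is hypothetical; the payload uses only the fields shown in the example.

```python
def click_sequence(points, stop_on_error=True):
    """Build a POST /batch body that moves to and clicks each (x, y) point."""
    actions = []
    for x, y in points:
        target = {"mode": "pixel", "x": x, "y": y}
        # Copy the target so each action owns an independent dict.
        actions.append({"action": "move", "target": dict(target)})
        actions.append({"action": "click", "target": dict(target)})
    return {"actions": actions, "stop_on_error": stop_on_error}
```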
