This repository has been archived on 2026-05-20. You can view files and clone it. You cannot open issues or pull requests or push a commit.
clickthrough/docs/API.md
Luna 5122d416e8
feat(wait): add structured wait endpoint
2026-05-01 15:55:29 +02:00


API Reference (v0.1)

Base URL: http://127.0.0.1:8123

If CLICKTHROUGH_TOKEN is set, include header:

x-clickthrough-token: <token>
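
A minimal sketch of attaching the token header from a Python client. It assumes the client reads the token from the same CLICKTHROUGH_TOKEN environment variable as the server; the helper name build_headers is hypothetical and not part of Clickthrough.

```python
import os

BASE_URL = "http://127.0.0.1:8123"


def build_headers(token=None):
    """Return request headers, adding the token header only when a token is set."""
    # Assumption: the client environment exposes CLICKTHROUGH_TOKEN like the server does.
    if token is None:
        token = os.environ.get("CLICKTHROUGH_TOKEN", "")
    headers = {"Accept": "application/json"}
    if token:
        headers["x-clickthrough-token"] = token
    return headers
```

The returned dict can be passed as the headers argument of whichever HTTP client is in use.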

GET /health

Returns status and runtime safety flags, including exec capability config.

GET /displays

Returns detected displays in API screen order.

{
  "ok": true,
  "default_screen": 0,
  "displays": [
    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
  ]
}

screen is zero-based. screen=0 is the primary display when detectable, falling back to the first monitor reported by the capture backend. Invalid screen values fall back to 0.
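
The fallback rule can be mirrored on the client when picking a display entry out of the /displays response. This is only a sketch: select_display is a hypothetical helper, and it assumes the response always contains an entry with screen 0, as in the example above.

```python
def select_display(displays_response, screen):
    """Return the display entry for `screen`, falling back to screen 0."""
    by_screen = {d["screen"]: d for d in displays_response["displays"]}
    # Invalid selectors fall back to 0, matching the documented server behavior.
    return by_screen.get(screen, by_screen[0])


sample = {
    "ok": True,
    "default_screen": 0,
    "displays": [
        {"screen": 0, "mss_index": 1, "primary": True,
         "x": 0, "y": 0, "width": 1920, "height": 1080},
        {"screen": 1, "mss_index": 2, "primary": False,
         "x": 1920, "y": 0, "width": 1920, "height": 1080},
    ],
}
second = select_display(sample, 1)
fallback = select_display(sample, 7)
```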

GET /screen

Query params:

  • screen (int, default 0) - zero-based display selector; invalid values fall back to 0
  • with_grid (bool, default true)
  • grid_rows (int, default env or 12)
  • grid_cols (int, default env or 12)
  • include_labels (bool, default true)
  • image_format (png|jpeg, default png)
  • jpeg_quality (1-100, default 85)
  • asImage (bool, default false) - if true, return raw image bytes only (image/png or image/jpeg)

Default response includes base64 image and metadata (meta.region, meta.screen, meta.displays, optional meta.grid). meta.region uses global desktop coordinates.
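
A small sketch of building the /screen URL from the query params listed above. The helper screen_url is hypothetical, and sending booleans as lowercase true/false is an assumption about what the server accepts.

```python
from urllib.parse import urlencode


def screen_url(base_url="http://127.0.0.1:8123", **params):
    """Build a /screen URL from keyword arguments named after the query params."""
    # Assumption: the server parses "true"/"false" for boolean query params.
    query = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in params.items()
    }
    if not query:
        return f"{base_url}/screen"
    return f"{base_url}/screen?{urlencode(query)}"


url = screen_url(screen=1, with_grid=False, image_format="jpeg")
```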

POST /zoom

Body:

{
  "center_x": 1200,
  "center_y": 700,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
}

Query params:

  • screen (int, default 0) - zero-based display selector; invalid values fall back to 0
  • asImage (bool, default false) - if true, return raw image bytes only (image/png or image/jpeg)

Default response returns cropped image + region metadata in global pixel coordinates. center_x and center_y are also global coordinates; use the selected display's meta.region from /screen?screen=X as the coordinate base.
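
To illustrate the coordinate base, the sketch below converts a point measured from the top-left corner of a display into the global center_x / center_y that /zoom expects. It assumes meta.region carries x, y, width and height keys (the shape used by region objects elsewhere in this reference); zoom_body_for_local_point is a hypothetical helper.

```python
def zoom_body_for_local_point(region, local_x, local_y, width=500, height=350):
    """Build a /zoom body centered on a display-local point."""
    # Assumption: `region` is meta.region from /screen?screen=X with x/y/width/height keys.
    return {
        "center_x": region["x"] + local_x,  # shift into global desktop coordinates
        "center_y": region["y"] + local_y,
        "width": width,
        "height": height,
    }


# Second display placed to the right of a 1920-pixel-wide primary display.
body = zoom_body_for_local_point(
    {"x": 1920, "y": 0, "width": 1920, "height": 1080}, 200, 300
)
```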

POST /action

Body: one action.

Important:

  • the request body uses action plus an optional target
  • pixel coordinates live inside target when target.mode="pixel"
  • do not send top-level x / y fields

Query params:

  • screen (int, default 0) - zero-based display selector included in the response metadata; invalid values fall back to 0

Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture /screen?screen=X and use that response's meta.region or grid metadata to compute the target.
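
As a sketch of the payload rules above, the helper below always nests pixel coordinates inside target, and a second helper flags payloads that carry top-level x / y by mistake. Both function names are hypothetical.

```python
def pixel_action(action, x, y, **fields):
    """Build an action payload whose coordinates live inside `target`."""
    payload = {"action": action, "target": {"mode": "pixel", "x": x, "y": y}}
    payload.update(fields)  # extra documented fields such as button or clicks
    return payload


def has_top_level_coords(payload):
    """Return True if the payload wrongly carries top-level x / y fields."""
    return "x" in payload or "y" in payload


click = pixel_action("click", 100, 200, button="left", clicks=1)
```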

Pointer target modes

Pixel target

{
  "mode": "pixel",
  "x": 100,
  "y": 200,
  "dx": 0,
  "dy": 0
}

Grid target

{
  "mode": "grid",
  "region_x": 0,
  "region_y": 0,
  "region_width": 1920,
  "region_height": 1080,
  "rows": 12,
  "cols": 12,
  "row": 5,
  "col": 9,
  "dx": 0.0,
  "dy": 0.0
}

For grid targets, dx/dy are normalized offsets in [-1, 1] inside the selected cell.
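
One plausible reading of the grid target, sketched in Python. The formula is an assumption, not confirmed by this reference: row and col are treated as zero-based, the base point is the cell center, and dx/dy of ±1 reach the cell edges. grid_target_to_pixel is a hypothetical helper.

```python
def grid_target_to_pixel(target):
    """Resolve a grid target to a global pixel under the assumptions stated above."""
    cell_w = target["region_width"] / target["cols"]
    cell_h = target["region_height"] / target["rows"]
    # Assumption: zero-based row/col, measured from the region's top-left corner.
    center_x = target["region_x"] + (target["col"] + 0.5) * cell_w
    center_y = target["region_y"] + (target["row"] + 0.5) * cell_h
    # Assumption: dx/dy scale by half a cell, so -1 and 1 land on the cell edges.
    x = center_x + target.get("dx", 0.0) * cell_w / 2
    y = center_y + target.get("dy", 0.0) * cell_h / 2
    return round(x), round(y)


point = grid_target_to_pixel({
    "mode": "grid",
    "region_x": 0, "region_y": 0,
    "region_width": 1920, "region_height": 1080,
    "rows": 12, "cols": 12,
    "row": 5, "col": 9,
    "dx": 0.0, "dy": 0.0,
})
```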

Action examples

Click:

{
  "action": "click",
  "target": {
    "mode": "grid",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "rows": 12,
    "cols": 12,
    "row": 7,
    "col": 3,
    "dx": 0.2,
    "dy": -0.1
  },
  "clicks": 1,
  "button": "left"
}

Scroll:

{
  "action": "scroll",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "scroll_amount": -500
}

Type text:

{
  "action": "type",
  "text": "hello world",
  "interval_ms": 20
}

Hotkey:

{
  "action": "hotkey",
  "keys": ["ctrl", "l"]
}

Right click:

{
  "action": "right_click",
  "target": {"mode": "pixel", "x": 1300, "y": 740}
}

Move only:

{
  "action": "move",
  "target": {"mode": "pixel", "x": 1300, "y": 740},
  "duration_ms": 150
}

GET /windows

List desktop windows using structured filters instead of shelling out.

Query params:

  • title_contains (optional substring match)
  • title_regex (optional case-insensitive regex)
  • process_name (optional exact process name, e.g. explorer.exe)
  • hwnd (optional exact window handle)
  • visible_only (bool, default true)

Example response:

{
  "ok": true,
  "count": 1,
  "windows": [
    {
      "hwnd": 132640,
      "title": "WinDirStat",
      "class_name": "WinDirStatMainWindow",
      "pid": 18420,
      "process_name": "windirstat.exe",
      "visible": true,
      "enabled": true,
      "minimized": false,
      "maximized": false,
      "foreground": true,
      "rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
    }
  ]
}

Notes:

  • Currently supported on Windows hosts only.
  • Mutation endpoints that act on a window return 409 when the filters match more than one window.
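
A client can check for ambiguity itself before calling a mutation endpoint. The sketch below works on the documented response shape; single_window is a hypothetical helper.

```python
def single_window(windows_response):
    """Return the only matched window, or raise if the match is empty or ambiguous."""
    windows = windows_response.get("windows", [])
    if len(windows) != 1:
        raise ValueError(f"expected exactly one window, got {len(windows)}")
    return windows[0]


response = {
    "ok": True,
    "count": 1,
    "windows": [{"hwnd": 132640, "title": "WinDirStat", "visible": True}],
}
window = single_window(response)
```

Passing the returned hwnd as the filter of a follow-up request narrows the match to that one window.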

POST /windows/action

Perform a structured window action against exactly one matched window.

{
  "action": "focus",
  "title_contains": "WinDirStat",
  "visible_only": true,
  "timeout_ms": 3000
}

Supported actions:

  • focus
  • restore
  • minimize
  • maximize
  • close

The response includes the matched pre-action window and the final observed window state (or closed=true if it disappeared).

POST /launch

Start an app/process without invoking a shell.

{
  "executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
  "args": [],
  "cwd": "C:/Program Files/WinDirStat",
  "wait_for_window": true,
  "match": {
    "title_contains": "WinDirStat",
    "visible_only": true
  },
  "timeout_ms": 8000
}

Notes:

  • Launch uses direct process execution (subprocess.Popen) rather than PowerShell/CMD.
  • If wait_for_window=true, the server polls for a matching window and returns window_found.
  • dry_run=true returns the resolved argv/cwd without launching.
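
A sketch of building the /launch body from Python. It assumes dry_run is a body field alongside the others, which the notes above suggest but do not show; launch_body is a hypothetical helper.

```python
def launch_body(executable, args=(), cwd=None, title_contains=None,
                timeout_ms=8000, dry_run=False):
    """Build a /launch request body using the documented field names."""
    body = {"executable": executable, "args": list(args), "timeout_ms": timeout_ms}
    if cwd is not None:
        body["cwd"] = cwd
    if title_contains is not None:
        # Only ask the server to poll for a window when a match filter is given.
        body["wait_for_window"] = True
        body["match"] = {"title_contains": title_contains, "visible_only": True}
    if dry_run:
        body["dry_run"] = True  # assumption: dry_run is sent in the body
    return body


preview = launch_body(
    "C:/Program Files/WinDirStat/WinDirStat.exe",
    cwd="C:/Program Files/WinDirStat",
    title_contains="WinDirStat",
    dry_run=True,
)
```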

POST /wait

Wait on a structured UI condition instead of guessing sleep durations.

Query params:

  • screen (int, default 0) - used for text and visual waits

Wait for text to appear

{
  "condition": {
    "kind": "text",
    "mode": "screen",
    "text": "Scan complete",
    "match": "contains",
    "present": true,
    "language_hint": "eng",
    "min_confidence": 0.4
  },
  "timeout_ms": 15000,
  "poll_interval_ms": 400
}

Wait for a window state

{
  "condition": {
    "kind": "window",
    "title_contains": "WinDirStat",
    "visible_only": true,
    "state": "focused"
  },
  "timeout_ms": 5000,
  "poll_interval_ms": 200
}

Window states:

  • exists
  • focused
  • closed

Wait for visual change or stability

{
  "condition": {
    "kind": "visual",
    "state": "stable",
    "region_x": 0,
    "region_y": 0,
    "region_width": 1920,
    "region_height": 1080,
    "diff_threshold": 0.005,
    "stable_for_ms": 1000
  },
  "timeout_ms": 12000,
  "poll_interval_ms": 300
}

Visual states:

  • change - succeeds when the average pixel diff crosses diff_threshold
  • stable - succeeds when the diff stays at or below diff_threshold for stable_for_ms

Notes:

  • Text waits reuse the OCR pipeline and return matching OCR blocks on success.
  • Window waits build on the structured window discovery endpoint.
  • Visual waits compare repeated captures of either the full selected display or an explicit region.
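
The visual conditions can be pictured with the sketch below. It is an illustration of the idea, not the server's implementation: it assumes the diff is the mean absolute difference of 8-bit channel values scaled to [0, 1], and that stability is judged over consecutive polls. Both function names are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute difference of two equal-length 0..255 sequences, scaled to [0, 1]."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same length")
    if not frame_a:
        return 0.0
    total = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return total / (255 * len(frame_a))


def is_stable(diffs, diff_threshold, stable_for_ms, poll_interval_ms):
    """True when the trailing run of diffs at or below the threshold spans stable_for_ms."""
    run = 0
    for diff in reversed(diffs):
        if diff > diff_threshold:
            break
        run += 1
    # Assumption: each poll accounts for one poll_interval_ms of elapsed time.
    return run * poll_interval_ms >= stable_for_ms


changed = mean_abs_diff([0, 0, 0, 0], [255, 255, 255, 255]) > 0.005
settled = is_stable([0.02, 0.001, 0.002, 0.0, 0.001], 0.005, 1000, 300)
```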

POST /ocr

Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.

Query params:

  • screen (int, default 0) - zero-based display selector for mode=screen and mode=region; invalid values fall back to 0

Body:

{
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
}

Modes:

  • screen (default): OCR over the full selected display
  • region: OCR over explicit region (region_x, region_y, region_width, region_height)
  • image: OCR over provided image_base64 (supports plain base64 or data URL)

Region mode example:

{
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
}

Image mode example:

{
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
}

Response shape:

{
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
  "result": {
    "mode": "screen",
    "language_hint": "eng",
    "min_confidence": 0.4,
    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
    "blocks": [
      {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
      }
    ]
  }
}

Notes:

  • Output is deterministic JSON: blocks are ordered top-to-bottom, then left-to-right.
  • bbox coordinates are in global screen space for screen/region, and image-local for image.
  • Requires tesseract executable plus Python package pytesseract.
  • If tesseract is not on PATH, set CLICKTHROUGH_TESSERACT_CMD to the full executable path.
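
Because bbox coordinates are global for screen and region modes, an OCR block can be turned directly into a pixel target for /action. A minimal sketch follows; pixel_target_for_text is a hypothetical helper, and the case-insensitive substring match is a choice made for the example.

```python
def pixel_target_for_text(ocr_response, needle):
    """Return a pixel target at the center of the first block containing `needle`, or None."""
    for block in ocr_response["result"]["blocks"]:
        if needle.lower() in block["text"].lower():
            bbox = block["bbox"]
            return {
                "mode": "pixel",
                "x": bbox["x"] + bbox["width"] // 2,
                "y": bbox["y"] + bbox["height"] // 2,
            }
    return None


ocr_response = {
    "ok": True,
    "result": {
        "mode": "screen",
        "blocks": [
            {"text": "Settings", "confidence": 0.9821,
             "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}},
        ],
    },
}
target = pixel_target_for_text(ocr_response, "settings")
```

The result should not be used this way for image mode, where bbox values are image-local.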

POST /exec

Execute a shell command on the host running Clickthrough.

Requirements:

  • CLICKTHROUGH_EXEC_SECRET must be configured on the server
  • send header x-clickthrough-exec-secret: <secret>

Body:

{
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
}

Notes:

  • shell supports powershell, bash, cmd
  • if shell is omitted, server uses CLICKTHROUGH_EXEC_DEFAULT_SHELL
  • output is truncated based on CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS
  • endpoint can be disabled with CLICKTHROUGH_EXEC_ENABLED=false
  • if CLICKTHROUGH_EXEC_SECRET is missing, /exec is blocked (403)

Response includes stdout, stderr, exit_code, timeout state, and execution metadata.
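
A sketch of preparing an /exec request with the standard library. It only builds the request object and does not send it; exec_request is a hypothetical helper, and omitting shell relies on the server default described in the notes.

```python
import json
from urllib.request import Request


def exec_request(command, secret, shell=None, timeout_s=20, cwd=None,
                 dry_run=False, base_url="http://127.0.0.1:8123"):
    """Build (but do not send) a POST /exec request carrying the exec secret header."""
    body = {"command": command, "timeout_s": timeout_s, "dry_run": dry_run}
    if shell is not None:
        body["shell"] = shell  # otherwise the server uses its configured default shell
    if cwd is not None:
        body["cwd"] = cwd
    return Request(
        f"{base_url}/exec",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-clickthrough-exec-secret": secret,
        },
        method="POST",
    )


request = exec_request("Get-Process | Select-Object -First 5", "s3cret",
                       shell="powershell", dry_run=True)
```

The request can then be passed to urllib.request.urlopen, with a 403 response handled as the blocked case described above.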

POST /batch

Runs multiple action payloads sequentially.

Query params:

  • screen (int, default 0) - zero-based display selector applied to each action response; invalid values fall back to 0

Body:

{
  "actions": [
    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
  ],
  "stop_on_error": true
}
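
A batch body like the one above can be generated from a list of points. This is a sketch with a hypothetical helper name, move_then_click; it uses only the fields shown in this reference.

```python
def move_then_click(points, stop_on_error=True):
    """Build a /batch body that moves to each point and then clicks it."""
    actions = []
    for x, y in points:
        target = {"mode": "pixel", "x": x, "y": y}
        # Copy the target so each action owns its own dict.
        actions.append({"action": "move", "target": dict(target)})
        actions.append({"action": "click", "target": dict(target)})
    return {"actions": actions, "stop_on_error": stop_on_error}


batch = move_then_click([(100, 100)])
```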