Visual endpoints: full-screen capture with optional grid overlay and labeled cells (asImage=true can return raw image bytes)
Zoom endpoint: crop around a point with denser grid for fine targeting (asImage=true supported)
Multi-display support: list displays with GET /displays and select one with ?screen=0, ?screen=1, ...
Action endpoints: move/click/right-click/double-click/middle-click/scroll/type/hotkey
Window lifecycle endpoints: list/focus/restore/minimize/maximize/close windows via GET /windows + POST /windows/action
Structured launch endpoint: start an app/process without dropping to a shell via POST /launch
Wait/sync endpoint: poll for text, window, or visual state changes via POST /wait
Vision helper endpoints: compare screenshots and measure stability via POST /vision/diff and POST /vision/stability
OCR endpoints: extract text blocks or search for matching text via POST /ocr and POST /ocr/find
Compound verify endpoint: execute an action and wait for a structured success condition via POST /action/verify
Command execution endpoint: run PowerShell/Bash/CMD commands via POST /exec
Coordinate transform metadata in visual responses so agents can map grid cells to real pixels
Safety knobs: token auth, dry-run mode, optional allowed-region restriction
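
To illustrate the coordinate transform idea, here is a minimal sketch of mapping a grid cell to the pixel at its center. The GridRegion class, its field names, and the zero-based row/column indexing are assumptions for illustration only; the real metadata fields are whatever the visual responses return (see docs/coordinate-system.md).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GridRegion:
    """Hypothetical captured region, in global desktop pixels."""
    x: int            # left edge of the captured region
    y: int            # top edge of the captured region
    width: int
    height: int
    rows: int = 12    # matches the CLICKTHROUGH_GRID_ROWS default
    cols: int = 12    # matches the CLICKTHROUGH_GRID_COLS default


def cell_center(region: GridRegion, row: int, col: int) -> Tuple[int, int]:
    """Return the global pixel at the center of a zero-based (row, col) cell."""
    if not (0 <= row < region.rows and 0 <= col < region.cols):
        raise ValueError(f"cell ({row}, {col}) is outside the grid")
    cell_w = region.width / region.cols
    cell_h = region.height / region.rows
    px = region.x + (col + 0.5) * cell_w
    py = region.y + (row + 0.5) * cell_h
    return round(px), round(py)
```

For a second monitor that starts at x=1920, the region origin shifts every result, which is why the same cell label can map to different global pixels on different displays.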

Quick start

cd /root/external-projects/clickthrough
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
CLICKTHROUGH_TOKEN=change-me python -m server.app

Server defaults to 127.0.0.1:8123.

For OCR support, install the native tesseract binary on the host (in addition to the Python deps), or point CLICKTHROUGH_TESSERACT_CMD at the executable if it lives in a non-standard location.

python-dotenv is enabled, so values from a repo-root .env file are loaded automatically.

Minimal API flow

GET /displays if you need a non-primary monitor
GET /screen?screen=0 to capture the display with the grid overlay
Decide which cell / target to act on
Optional POST /zoom?screen=0 for finer targeting
POST /action?screen=0 to execute (or POST /action/verify?screen=0 for a bundled action+wait flow)
GET /screen?screen=0 again to verify result, or use POST /wait, POST /vision/diff, or POST /ocr/find
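
The flow above could be driven from Python roughly as follows. This is a sketch using only the standard library: it builds requests without sending them. The build_request helper is hypothetical, and JSON request bodies are an assumption; docs/API.md remains the authority on the actual payloads.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8123"  # server default from this README


def build_request(method, path, token=None, screen=None, body=None):
    """Build (but do not send) a request for a Clickthrough endpoint."""
    url = BASE_URL + path
    if screen is not None:
        # Display selection uses the ?screen=N query parameter.
        url += "?" + urllib.parse.urlencode({"screen": screen})
    headers = {}
    data = None
    if token:
        # Only needed when CLICKTHROUGH_TOKEN is set on the server.
        headers["x-clickthrough-token"] = token
    if body is not None:
        # Assumption: POST endpoints accept a JSON body.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

Passing the result to urllib.request.urlopen sends it, for example urlopen(build_request("GET", "/displays", token="change-me")) against a running server.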

Important:

POST /action expects an action plus a target object; do not send raw top-level x / y fields.
Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
The agent does not inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
When OCR is not enough, pair Clickthrough screenshots with OpenClaw's image tool for explicit screenshot interpretation.
Prefer structured GUI interaction first; use /windows, /launch, /wait, and /action before reaching for /exec.
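
As a guard against the first pitfall above, a client could check a payload before posting it. This is a minimal sketch: the key names "action" and "target" are inferred from the wording of this README, the contents of the target object are not specified here, and check_action_payload is a hypothetical helper rather than part of Clickthrough.

```python
from typing import List


def check_action_payload(payload: dict) -> List[str]:
    """Return a list of problems found in a POST /action payload (empty if none)."""
    problems = []
    for key in ("x", "y"):
        if key in payload:
            # Raw top-level coordinates are explicitly discouraged.
            problems.append(f"top-level '{key}' is not allowed; put it in the target object")
    if "action" not in payload:
        problems.append("missing 'action'")
    if not isinstance(payload.get("target"), dict):
        problems.append("missing 'target' object")
    return problems
```

An empty list means the payload has the expected overall shape; it does not validate the fields inside the target object, which docs/API.md defines.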

See:

docs/API.md
docs/coordinate-system.md
skill/SKILL.md

Configuration

Environment variables:

CLICKTHROUGH_HOST (default 127.0.0.1)
CLICKTHROUGH_PORT (default 8123)
CLICKTHROUGH_TOKEN (optional; if set, requests must include a matching x-clickthrough-token header)
CLICKTHROUGH_DRY_RUN (true/false; default false)
CLICKTHROUGH_GRID_ROWS (default 12)
CLICKTHROUGH_GRID_COLS (default 12)
CLICKTHROUGH_ALLOWED_REGION (optional x,y,width,height)
CLICKTHROUGH_EXEC_ENABLED (default true)
CLICKTHROUGH_EXEC_SECRET (required for /exec to run)
CLICKTHROUGH_EXEC_DEFAULT_SHELL (default powershell; one of powershell, bash, cmd)
CLICKTHROUGH_EXEC_TIMEOUT_S (default 30)
CLICKTHROUGH_EXEC_MAX_TIMEOUT_S (default 120)
CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS (default 20000)
CLICKTHROUGH_TESSERACT_CMD (optional path to the tesseract executable)
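
To make the defaults concrete, here is a sketch of reading a few of these variables in Python. The function names are hypothetical, and the set of strings accepted as true is an assumption; the server's own parsing may differ.

```python
import os
from typing import Mapping, Optional, Tuple

Region = Tuple[int, int, int, int]  # x, y, width, height


def parse_bool(value: Optional[str], default: bool) -> bool:
    """Interpret an env string as a boolean (accepted spellings are an assumption)."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def parse_region(value: Optional[str]) -> Optional[Region]:
    """Parse 'x,y,width,height' into a tuple, or None when unset."""
    if not value or not value.strip():
        return None
    parts = [int(p.strip()) for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError("expected x,y,width,height")
    x, y, w, h = parts
    return (x, y, w, h)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect a subset of the documented settings, applying the README defaults."""
    return {
        "host": env.get("CLICKTHROUGH_HOST", "127.0.0.1"),
        "port": int(env.get("CLICKTHROUGH_PORT", "8123")),
        "token": env.get("CLICKTHROUGH_TOKEN") or None,
        "dry_run": parse_bool(env.get("CLICKTHROUGH_DRY_RUN"), False),
        "grid_rows": int(env.get("CLICKTHROUGH_GRID_ROWS", "12")),
        "grid_cols": int(env.get("CLICKTHROUGH_GRID_COLS", "12")),
        "allowed_region": parse_region(env.get("CLICKTHROUGH_ALLOWED_REGION")),
        "exec_enabled": parse_bool(env.get("CLICKTHROUGH_EXEC_ENABLED"), True),
    }
```

Passing a plain dict instead of os.environ makes the parsing easy to exercise without touching the real environment.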

Window management endpoints currently target Windows hosts. On non-Windows hosts they return HTTP 501 (Not Implemented) instead of guessing.

Gitea CI

A Gitea Actions workflow is included at .gitea/workflows/python-syntax.yml. It runs Python syntax checks (py_compile) on every push and pull request.
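
A comparable check can be run locally before pushing. The sketch below uses the standard py_compile module; it approximates what the workflow does rather than copying it, and the syntax_errors helper and the choice to skip .venv are assumptions.

```python
import py_compile
from pathlib import Path
from typing import List


def syntax_errors(root: str) -> List[str]:
    """Compile every .py file under root and return one message per failure."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        if ".venv" in path.parts:
            continue  # skip the virtualenv created in the quick start
        try:
            # doraise=True turns syntax errors into PyCompileError exceptions.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append(f"{path}: {exc.msg}")
    return failures
```

Calling syntax_errors(".") from the repo root returns an empty list when every file compiles.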