Compare commits

...

7 Commits

Author SHA1 Message Date
Space-Banane
22ca0097d1 Remove interact verify endpoint
All checks were successful
python-syntax / syntax-check (push) Successful in 31s
2026-05-04 15:59:43 +02:00
f05e0c56e6 Add AGENTS playbook and prioritize automation backlog
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
2026-05-04 12:27:31 +02:00
211f38003e Align interact verify test interval with schema minimum
All checks were successful
python-syntax / syntax-check (push) Successful in 10s
2026-05-03 21:04:39 +02:00
c1fc97e198 Fix auth headers in OCR/interact endpoint tests
All checks were successful
python-syntax / syntax-check (push) Successful in 1m7s
2026-05-03 21:01:29 +02:00
9e816e0417 Add pytesseract OCR, click_text interact action, and interact verify endpoint
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
2026-05-03 20:57:34 +02:00
1c03cab457 refactor: simplify to see/interact/exec and split server modules
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
2026-05-03 20:07:12 +02:00
aced5be25e feat: migrate to v2-only API and unified response envelope
All checks were successful
python-syntax / syntax-check (push) Successful in 7s
2026-05-03 19:11:11 +02:00
11 changed files with 1361 additions and 2611 deletions

48
AGENTS.md Normal file

@@ -0,0 +1,48 @@
# AGENTS
## Purpose
This file defines how agents and contributors should work in this repository.
It is a codebase playbook, not a product roadmap.
## Repository Map
- `server/app.py`: FastAPI routes, auth checks, response envelope, exception handling.
- `server/services.py`: screenshot/OCR/input/window/exec behavior and safety enforcement.
- `server/models.py`: request schemas and validation rules.
- `server/config.py`: environment loading and runtime settings.
- `tests/`: unit and API contract tests (monkeypatched where host behavior is nondeterministic).
- `docs/API.md`: public API request/response reference.
- `skill/SKILL.md`: operational method for agent usage of the API.
## Local Workflow
- Install dependencies: `pip install -r requirements.txt`
- Run server: `python -m server.app`
- Run tests: `pytest -q`
- Basic health check: `GET /health` with `x-clickthrough-token` when token auth is enabled.
## Non-Negotiable Contracts
- Keep the response envelope shape stable:
- `ok`, `request_id`, `time_ms`, `data`, `error`
- Preserve one-action-per-request semantics in `/interact`.
- Keep coordinate behavior in global desktop coordinates unless an explicit versioned change is introduced.
- All new request fields must be represented in `server/models.py` with explicit validation constraints.
## Safety and Security Rules
- Do not weaken `x-clickthrough-token` validation.
- Do not weaken `/exec` secret validation (`x-clickthrough-exec-secret`).
- Preserve `dry_run` behavior for non-destructive execution paths.
- Preserve allowed-region enforcement for pointer-target actions.
- Keep `/exec` constraints explicit:
- shell allowlist
- timeout limits
- output truncation limits
## Testing Expectations
- Add tests for each new behavior in services/models/routes.
- Cover success and failure paths, especially validation and ambiguity branches.
- Prefer deterministic tests with monkeypatches over host/UI-dependent flakiness.
- If API behavior changes, update tests in the same change.
## Documentation Policy
- When API behavior changes, update `docs/API.md` in the same change.
- Keep examples aligned with current schema in `docs/API.md` and `examples/quickstart.py`.
- Keep `skill/SKILL.md` aligned with current safe usage flow.


@@ -1,82 +1,37 @@
# Clickthrough # Clickthrough
Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions. Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer by repeatedly capturing structured screenshots with coordinate-aware grids (`see`), executing precise mouse/keyboard actions from those coordinates (`interact`), and optionally running authenticated shell commands for system-level tasks (`exec`) under a consistent response contract.
## What this provides ## Core Methods
- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes) - `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported) - `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ... - `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey - `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.
- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
- **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
- **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability`
- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
- **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
## Quick start ## Why this works for AI agents
```bash - Agents do not need live vision; they iterate on snapshots.
cd /root/external-projects/clickthrough - Grid metadata bridges image understanding to deterministic click coordinates.
python3 -m venv .venv - Interaction stays explicit and auditable (one action per request).
. .venv/bin/activate - A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.
pip install -r requirements.txt
CLICKTHROUGH_TOKEN=change-me python -m server.app
```
Server defaults to `127.0.0.1:8123`. ## Minimal Agent Loop
For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird. 1. Call `see` with a coarse grid.
2. If uncertain, call `see/zoom` with a denser grid.
3. Call `interact` once.
4. Call `see` again to verify state change.
5. Use `exec` only for explicit shell/system tasks.
`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically. ## Safety and Auth
## Minimal API flow - `x-clickthrough-token` protects API access when enabled.
- `x-clickthrough-exec-secret` is required for `/exec`.
- Optional dry-run and allowed-region constraints reduce accidental risk.
1. `GET /displays` if you need a non-primary monitor ## Docs
2. `GET /screen?screen=0` with grid
3. Decide cell / target
4. Optional `POST /zoom?screen=0` for finer targeting
5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow)
6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find`
Important: - API: `docs/API.md`
- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields. - Agent procedure: `skill/SKILL.md`
- Pixel coordinates and OCR bounding boxes are always global desktop coordinates. - Coordinate system details: `docs/coordinate-system.md`
- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
- Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
See:
- `docs/API.md`
- `docs/coordinate-system.md`
- `skill/SKILL.md`
## Configuration
Environment variables:
- `CLICKTHROUGH_HOST` (default `127.0.0.1`)
- `CLICKTHROUGH_PORT` (default `8123`)
- `CLICKTHROUGH_TOKEN` (optional; if set, require `x-clickthrough-token` header)
- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
- `CLICKTHROUGH_GRID_ROWS` (default `12`)
- `CLICKTHROUGH_GRID_COLS` (default `12`)
- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
- `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.
## Gitea CI
A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
It runs Python syntax checks (`py_compile`) on every push and pull request.


@@ -26,3 +26,10 @@
- [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook - [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook
- [x] Add top-level skill section for instance setup + mini API docs - [x] Add top-level skill section for instance setup + mini API docs
- [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs - [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs
## Deferred Backlog (Prioritized)
1. [ ] Higher-level automation macros composed from `see` + `interact`
2. [ ] Reusable workflow templates (for example: find text -> zoom fallback -> click -> verify)
3. [ ] Batch-safe orchestration primitives with explicit per-step results and auditability
4. [ ] Additional verify primitives for post-action validation (image diff region, window title/process state, color/pixel checks)
5. [ ] Broader API simplification pass to reduce payload overlap and consolidate shared OCR options


@@ -1,62 +1,96 @@
# API Reference (v0.1) # API Reference
Base URL: `http://127.0.0.1:8123` Base URL: `http://127.0.0.1:8123`
If `CLICKTHROUGH_TOKEN` is set, include header: Auth header when enabled:
```http ```http
x-clickthrough-token: <token> x-clickthrough-token: <token>
``` ```
## `GET /health` This API is intended for AI computer control through these methods:
- `see`
- `interact`
- `exec`
Returns status and runtime safety flags, including `exec` capability config. All responses use one envelope.
## `GET /displays` ## Response Envelope
Returns detected displays in API screen order. Success:
```json ```json
{ {
"ok": true, "ok": true,
"default_screen": 0, "request_id": "...",
"displays": [ "time_ms": 1710000000000,
{"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080}, "data": {},
{"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080} "error": null
]
} }
``` ```
`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend. Error:
Invalid `screen` values fall back to `0`.
## `GET /screen`
Query params:
- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
- `with_grid` (bool, default `true`)
- `grid_rows` (int, default env or `12`)
- `grid_cols` (int, default env or `12`)
- `include_labels` (bool, default `true`)
- `image_format` (`png`|`jpeg`, default `png`)
- `jpeg_quality` (1-100, default `85`)
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
`meta.region` uses global desktop coordinates.
These image-returning endpoints do not magically grant the agent live vision.
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
## `POST /zoom`
Body:
```json ```json
{ {
"ok": false,
"request_id": "...",
"time_ms": 1710000000000,
"data": null,
"error": {
"code": "validation_error",
"message": "request validation failed",
"details": []
}
}
```
## 1) See
### `POST /see`
Capture a full screen or a region. Optional grid overlay returns coordinate metadata for click mapping.
```json
{
"screen": 0,
"region_x": null,
"region_y": null,
"region_width": null,
"region_height": null,
"with_grid": true,
"grid_rows": 12,
"grid_cols": 12,
"include_labels": true,
"image_format": "png",
"jpeg_quality": 85,
"ocr": false,
"ocr_min_confidence": 0,
"ocr_lang": "eng",
"ocr_psm": null
}
```
Returns:
- `data.image.base64`
- `data.meta.region` (global desktop coords)
- `data.meta.grid` (rows/cols/cell size + formula)
- `data.meta.ocr` (when `ocr=true`)
OCR item shape:
- `text`
- `confidence`
- `bbox` (global coords)
- `center`
- `region_relative_bbox`
### `POST /see/zoom`
Capture a tighter crop around a global point and draw another grid over that crop.
```json
{
"screen": 0,
"center_x": 1200, "center_x": 1200,
"center_y": 700, "center_y": 720,
"width": 500, "width": 500,
"height": 350, "height": 350,
"with_grid": true, "with_grid": true,
@@ -68,70 +102,17 @@ Body:
} }
``` ```
Query params: Use this for precision before clicking tiny controls.
- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0` ## 2) Interact
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base. ### `POST /interact`
Mouse/keyboard action execution.
`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
## `POST /action`
Body: one action.
Important:
- the request body uses `action` plus an optional `target`
- pixel coordinates live inside `target` when `target.mode="pixel"`
- do **not** send top-level `x` / `y` fields
Query params:
- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
### Pointer target modes
#### Pixel target
```json
{
"mode": "pixel",
"x": 100,
"y": 200,
"dx": 0,
"dy": 0
}
```
#### Grid target
```json
{
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 5,
"col": 9,
"dx": 0.0,
"dy": 0.0
}
```
`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
### Action examples
Click:
```json ```json
{ {
"screen": 0,
"action": {
"action": "click", "action": "click",
"target": { "target": {
"mode": "grid", "mode": "grid",
@@ -143,438 +124,63 @@ Click:
"cols": 12, "cols": 12,
"row": 7, "row": 7,
"col": 3, "col": 3,
"dx": 0.2, "dx": 0.0,
"dy": -0.1 "dy": 0.0
}, },
"clicks": 1, "button": "left",
"button": "left" "clicks": 1
} }
```
Scroll:
```json
{
"action": "scroll",
"target": {"mode": "pixel", "x": 1300, "y": 740},
"scroll_amount": -500
}
```
Type text:
```json
{
"action": "type",
"text": "hello world",
"interval_ms": 20
}
```
Hotkey:
```json
{
"action": "hotkey",
"keys": ["ctrl", "l"]
}
```
Right click:
```json
{
"action": "right_click",
"target": {"mode": "pixel", "x": 1300, "y": 740}
}
```
Move only:
```json
{
"action": "move",
"target": {"mode": "pixel", "x": 1300, "y": 740},
"duration_ms": 150
}
```
## `GET /windows`
List desktop windows using structured filters instead of shelling out.
Query params:
- `title_contains` (optional substring match)
- `title_regex` (optional case-insensitive regex)
- `process_name` (optional exact process name, e.g. `explorer.exe`)
- `hwnd` (optional exact window handle)
- `visible_only` (bool, default `true`)
```json
{
"ok": true,
"count": 1,
"windows": [
{
"hwnd": 132640,
"title": "WinDirStat",
"class_name": "WinDirStatMainWindow",
"pid": 18420,
"process_name": "windirstat.exe",
"visible": true,
"enabled": true,
"minimized": false,
"maximized": false,
"foreground": true,
"rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
}
]
}
```
Notes:
- Currently supported on Windows hosts only.
- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
## `POST /windows/action`
Perform a structured window action against exactly one matched window.
```json
{
"action": "focus",
"title_contains": "WinDirStat",
"visible_only": true,
"timeout_ms": 3000
} }
``` ```
Supported actions: Supported actions:
- `focus` - `move`, `click`, `right_click`, `double_click`, `middle_click`
- `restore` - `scroll` (`scroll_amount`)
- `minimize` - `type` (`text`, `interval_ms`)
- `maximize` - `hotkey` (`keys`)
- `close` - `click_text` (OCR-driven text click with optional region)
The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared). Target modes:
- `pixel`: absolute global `x,y`
## `POST /launch` - `grid`: grid cell from a `see`/`see/zoom` response
Start an app/process without invoking a shell.
```json
{
"executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
"args": [],
"cwd": "C:/Program Files/WinDirStat",
"wait_for_window": true,
"match": {
"title_contains": "WinDirStat",
"visible_only": true
},
"timeout_ms": 8000
}
```
Notes:
- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
- `dry_run=true` returns the resolved argv/cwd without launching.
## `POST /vision/diff`
Measure whether a screen region changed meaningfully between two captures.
Query params:
- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
Compare live captures:
```json
{
"mode": "region",
"region_x": 120,
"region_y": 80,
"region_width": 600,
"region_height": 300,
"delay_ms": 400,
"diff_threshold": 0.01
}
```
Compare provided images:
```json
{
"mode": "image",
"before_image_base64": "iVBORw0KGgoAAA...",
"after_image_base64": "iVBORw0KGgoBBB...",
"diff_threshold": 0.01
}
```
Response includes:
- `diff_ratio` — average normalized pixel difference
- `changed` — whether `diff_ratio >= diff_threshold`
- `region` — compared region
## `POST /vision/stability`
Measure whether a screen region stays visually stable over a short interval.
Query params:
- `screen` (int, default `0`)
```json
{
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"sample_interval_ms": 250,
"duration_ms": 1200,
"diff_threshold": 0.005
}
```
Response includes:
- `stable`
- `sample_count`
- `max_diff_ratio`
- `avg_diff_ratio`
## `POST /wait`
Wait on a structured UI condition instead of guessing sleep durations.
Query params:
- `screen` (int, default `0`) - used for text and visual waits
### Wait for text to appear
```json
{
"condition": {
"kind": "text",
"mode": "screen",
"text": "Scan complete",
"match": "contains",
"present": true,
"language_hint": "eng",
"min_confidence": 0.4
},
"timeout_ms": 15000,
"poll_interval_ms": 400
}
```
### Wait for a window state
```json
{
"condition": {
"kind": "window",
"title_contains": "WinDirStat",
"visible_only": true,
"state": "focused"
},
"timeout_ms": 5000,
"poll_interval_ms": 200
}
```
Window states:
- `exists`
- `focused`
- `closed`
### Wait for visual change or stability
```json
{
"condition": {
"kind": "visual",
"state": "stable",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"diff_threshold": 0.005,
"stable_for_ms": 1000
},
"timeout_ms": 12000,
"poll_interval_ms": 300
}
```
Visual states:
- `change` — succeeds when the average pixel diff crosses `diff_threshold`
- `stable` — succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`
Notes:
- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
- Window waits build on the structured window discovery endpoint.
- Visual waits compare repeated captures of either the full selected display or an explicit region.
## `POST /action/verify`
Execute one action and wait for a structured success condition.
Query params:
- `screen` (int, default `0`)
### `click_text` example (full screen OCR)
```json ```json
{ {
"screen": 0,
"action": { "action": {
"action": "click", "action": "click_text",
"target": {"mode": "pixel", "x": 1300, "y": 740} "click_text": {
}, "text": "Sign in",
"condition": {
"kind": "text",
"mode": "screen",
"text": "Settings",
"match": "contains", "match": "contains",
"present": true, "case_sensitive": false,
"language_hint": "eng", "min_confidence": 45,
"min_confidence": 0.4 "occurrence": "best"
}, }
"retries": 1, }
"timeout_ms": 4000,
"poll_interval_ms": 250,
"retry_delay_ms": 250
} }
``` ```
Condition kinds mirror `POST /wait`: ### `click_text` example (region OCR)
- `text`
- `window`
- `visual`
The response returns per-attempt action output plus structured verification output.
## `POST /ocr`
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
Query params:
- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
Body:
```json ```json
{ {
"mode": "screen", "screen": 0,
"language_hint": "eng", "action": {
"min_confidence": 0.4 "action": "click_text",
"click_text": {
"text": "Continue",
"match": "exact",
"region": { "x": 940, "y": 520, "width": 400, "height": 260 },
"occurrence": "first"
} }
```
Modes:
- `screen` (default): OCR over full selected monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
Region mode example:
```json
{
"mode": "region",
"region_x": 220,
"region_y": 160,
"region_width": 900,
"region_height": 400,
"language_hint": "eng",
"min_confidence": 0.5
}
```
Image mode example:
```json
{
"mode": "image",
"image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
"language_hint": "eng"
}
```
Response shape:
```json
{
"ok": true,
"request_id": "...",
"time_ms": 1710000000000,
"result": {
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4,
"region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
"blocks": [
{
"text": "Settings",
"confidence": 0.9821,
"bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
}
]
} }
} }
``` ```
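The two `click_text` payloads above can also be assembled programmatically. The following is a hedged sketch: `build_click_text` is a hypothetical helper, and its checks mirror the `ClickTextAction` constraints in `server/models.py` from this change (allowed `match` and `occurrence` values, and `nth` only together with `occurrence="nth"`). The server remains the source of truth for validation.

```python
from typing import Any, Optional


def build_click_text(
    text: str,
    match: str = "contains",
    occurrence: str = "first",
    nth: Optional[int] = None,
    region: Optional[dict[str, int]] = None,
    min_confidence: float = 0.0,
    screen: int = 0,
) -> dict[str, Any]:
    """Build a POST /interact body for a click_text action (hypothetical helper)."""
    if not text:
        raise ValueError("text must not be empty")
    if match not in {"contains", "exact", "regex"}:
        raise ValueError("match must be contains, exact, or regex")
    if occurrence not in {"first", "best", "nth"}:
        raise ValueError("occurrence must be first, best, or nth")
    # Same rule as the schema: nth is required for occurrence="nth", rejected otherwise.
    if occurrence == "nth" and nth is None:
        raise ValueError("nth is required when occurrence=nth")
    if occurrence != "nth" and nth is not None:
        raise ValueError("nth is only allowed when occurrence=nth")

    click_text: dict[str, Any] = {
        "text": text,
        "match": match,
        "occurrence": occurrence,
        "min_confidence": min_confidence,
    }
    if nth is not None:
        click_text["nth"] = nth
    if region is not None:
        click_text["region"] = region
    return {"screen": screen, "action": {"action": "click_text", "click_text": click_text}}
```

Checking these rules client-side only saves a round trip; a rejected payload would still come back as a `validation_error` envelope.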
Notes: ## 3) Exec
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
## `POST /ocr/find` ### `POST /exec`
Run host shell commands (PowerShell/Bash/CMD).
Search OCR output for matching text instead of post-processing raw OCR blocks client-side.
Query params:
- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
```json
{
"mode": "screen",
"query": "Settings",
"match": "contains",
"group_lines": true,
"max_results": 10,
"language_hint": "eng",
"min_confidence": 0.4
}
```
Modes:
- `screen`
- `region`
- `image`
Options:
- `match`: `contains`, `exact`, or `regex`
- `group_lines=true`: combine nearby OCR words into line-level candidates before matching
- `max_results`: result cap after confidence sorting
Response includes:
- `matches` — confidence-sorted candidate matches
- `match_count`
- `blocks_considered`
## `POST /exec`
Execute a shell command on the host running Clickthrough.
Requirements:
- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
- send header `x-clickthrough-exec-secret: <secret>`
```json ```json
{ {
@@ -586,29 +192,16 @@ Requirements:
} }
``` ```
Notes: Required header:
- `shell` supports `powershell`, `bash`, `cmd`
- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)
Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata. ```http
x-clickthrough-exec-secret: <secret>
## `POST /batch`
Runs multiple `action` payloads sequentially.
Query params:
- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
```json
{
"actions": [
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
{"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
],
"stop_on_error": true
}
``` ```
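As a rough sketch of how a caller might assemble an `/exec` request, the helper below pairs the required secret header with a body limited to the `ExecRequest` fields in `server/models.py` (`command`, `shell`, `timeout_s`, `dry_run`). `build_exec_request` is a hypothetical name, and sending `x-clickthrough-token` alongside the exec secret is an assumption based on the auth header described at the top of this reference.

```python
from typing import Any, Optional

ALLOWED_SHELLS = {"powershell", "bash", "cmd"}


def build_exec_request(
    command: str,
    exec_secret: str,
    token: str = "",
    shell: Optional[str] = None,
    timeout_s: Optional[int] = None,
    dry_run: bool = False,
) -> tuple[dict[str, str], dict[str, Any]]:
    """Return (headers, json_body) for POST /exec (hypothetical helper)."""
    if not command:
        raise ValueError("command must not be empty")
    if shell is not None and shell not in ALLOWED_SHELLS:
        raise ValueError("shell must be powershell, bash, or cmd")

    headers = {"x-clickthrough-exec-secret": exec_secret}
    if token:
        # Assumption: the general API token is also sent when token auth is enabled.
        headers["x-clickthrough-token"] = token

    body: dict[str, Any] = {"command": command, "dry_run": dry_run}
    # Optional fields are omitted so the server can apply its own defaults.
    if shell is not None:
        body["shell"] = shell
    if timeout_s is not None:
        body["timeout_s"] = timeout_s
    return headers, body
```

Passing `dry_run=True` first is a cheap way to confirm that auth and validation succeed before anything runs on the host.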
## Minimal Procedure for Agents
1. `see` full screen with coarse grid.
2. If uncertain, `see/zoom` target area with denser grid.
3. `interact` one action.
4. `see` again to confirm state change.
5. Use `exec` only when GUI interaction is not the right tool.
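
The procedure above can be sketched as a small function. This is an illustration under stated assumptions rather than the project's own client: `act_and_verify` and the injected `post(path, body)` transport are hypothetical, and comparing the two `data.image.base64` strings is only a crude stand-in for real verification.

```python
from typing import Any, Callable

# Hypothetical transport signature: post(path, json_body) -> parsed envelope dict.
Post = Callable[[str, dict[str, Any]], dict[str, Any]]


def act_and_verify(post: Post, action: dict[str, Any], screen: int = 0) -> dict[str, Any]:
    """Run see -> interact (one action) -> see and report whether the capture changed."""
    see_body = {"screen": screen, "with_grid": True, "grid_rows": 12, "grid_cols": 12}

    before = post("/see", see_body)
    if not before["ok"]:
        return {"failed_step": "see", "error": before["error"], "changed": None}

    # Exactly one action per /interact request.
    result = post("/interact", {"screen": screen, "action": action})
    if not result["ok"]:
        return {"failed_step": "interact", "error": result["error"], "changed": None}

    after = post("/see", see_body)
    if not after["ok"]:
        return {"failed_step": "verify", "error": after["error"], "changed": None}

    # Crude change check (assumption): compare the two base64 payloads for equality.
    changed = before["data"]["image"]["base64"] != after["data"]["image"]["base64"]
    return {"failed_step": None, "error": None, "changed": changed}
```

Injecting `post` keeps the loop testable without a running server; in practice it could wrap `requests.post(...).json()` with the auth headers attached.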


@@ -13,23 +13,52 @@ if TOKEN:
def main(): def main():
r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10) health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
r.raise_for_status() health.raise_for_status()
print("health:", r.json()) print("health:", health.json()["data"])
d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10) see = requests.post(
d.raise_for_status() f"{BASE_URL}/see",
print("displays:", d.json().get("displays", []))
s = requests.get(
f"{BASE_URL}/screen",
headers=headers, headers=headers,
params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12}, json={
"screen": SCREEN,
"with_grid": True,
"grid_rows": 12,
"grid_cols": 12,
"image_format": "jpeg",
"jpeg_quality": 70,
},
timeout=30, timeout=30,
) )
s.raise_for_status() see.raise_for_status()
payload = s.json() payload = see.json()["data"]
print("screen meta:", payload.get("meta", {})) print("region:", payload["meta"]["region"])
print("grid:", payload["meta"].get("grid", {}))
see_ocr = requests.post(
f"{BASE_URL}/see",
headers=headers,
json={"screen": SCREEN, "ocr": True, "with_grid": False, "ocr_min_confidence": 40},
timeout=30,
)
see_ocr.raise_for_status()
ocr_items = see_ocr.json()["data"]["meta"].get("ocr", [])
print("ocr_items:", len(ocr_items))
if ocr_items:
label = ocr_items[0]["text"]
click_text = requests.post(
f"{BASE_URL}/interact",
headers=headers,
json={
"screen": SCREEN,
"action": {"action": "click_text", "click_text": {"text": label, "match": "exact", "occurrence": "first"}},
},
timeout=30,
)
click_text.raise_for_status()
click_data = click_text.json()["data"]
print("clicked:", click_data["resolved_target"])
if __name__ == "__main__": if __name__ == "__main__":

File diff suppressed because it is too large

42
server/config.py Normal file

@@ -0,0 +1,42 @@
import os
from typing import Optional

from dotenv import load_dotenv

load_dotenv(dotenv_path=".env", override=False)


def _env_bool(name: str, default: bool) -> bool:
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _parse_allowed_region() -> Optional[tuple[int, int, int, int]]:
    raw = os.getenv("CLICKTHROUGH_ALLOWED_REGION")
    if not raw:
        return None
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("CLICKTHROUGH_ALLOWED_REGION width/height must be > 0")
    return x, y, w, h


SETTINGS = {
    "host": os.getenv("CLICKTHROUGH_HOST", "127.0.0.1"),
    "port": int(os.getenv("CLICKTHROUGH_PORT", "8123")),
    "token": os.getenv("CLICKTHROUGH_TOKEN", "").strip(),
    "dry_run": _env_bool("CLICKTHROUGH_DRY_RUN", False),
    "allowed_region": _parse_allowed_region(),
    "exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
    "exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
    "exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
}
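
To illustrate how a parsed `allowed_region` tuple might be used, the sketch below repeats the parsing rules from `_parse_allowed_region` on a plain string and adds a point check. `parse_region` and `point_allowed` are hypothetical names, and the half-open bounds (`x <= px < x + width`) are an assumption; the actual enforcement lives in `server/services.py`, which is not shown in this diff.

```python
from typing import Optional

Region = tuple[int, int, int, int]  # x, y, width, height


def parse_region(raw: Optional[str]) -> Optional[Region]:
    """Parse "x,y,width,height" using the same rules as _parse_allowed_region."""
    if not raw:
        return None
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("region must be x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("region width/height must be > 0")
    return x, y, w, h


def point_allowed(region: Optional[Region], px: int, py: int) -> bool:
    """Return True when no region is configured or the point lies inside it."""
    if region is None:
        return True
    x, y, w, h = region
    # Assumption: the right and bottom edges are exclusive.
    return x <= px < x + w and y <= py < y + h
```

For example, with `CLICKTHROUGH_ALLOWED_REGION=100,100,800,600` a click at `(150, 150)` would pass this check and one at `(950, 150)` would not.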

167
server/models.py Normal file

@@ -0,0 +1,167 @@
from typing import Literal, Optional
from pydantic import BaseModel, Field, model_validator
class PixelTarget(BaseModel):
mode: Literal["pixel"]
x: int
y: int
dx: int = 0
dy: int = 0
class GridTarget(BaseModel):
mode: Literal["grid"]
region_x: int
region_y: int
region_width: int = Field(gt=0)
region_height: int = Field(gt=0)
rows: int = Field(gt=0)
cols: int = Field(gt=0)
row: int = Field(ge=0)
col: int = Field(ge=0)
dx: float = 0.0
dy: float = 0.0
@model_validator(mode="after")
def _validate_indices(self):
if self.row >= self.rows or self.col >= self.cols:
raise ValueError("row/col must be inside rows/cols")
if not -1.0 <= self.dx <= 1.0:
raise ValueError("dx must be in [-1, 1]")
if not -1.0 <= self.dy <= 1.0:
raise ValueError("dy must be in [-1, 1]")
return self
Target = PixelTarget | GridTarget
class ActionRequest(BaseModel):
action: Literal[
"move",
"click",
"right_click",
"double_click",
"middle_click",
"scroll",
"type",
"hotkey",
"click_text",
]
target: Optional[Target] = None
duration_ms: int = Field(default=0, ge=0, le=20000)
button: Literal["left", "right", "middle"] = "left"
clicks: int = Field(default=1, ge=1, le=10)
scroll_amount: int = 0
text: str = ""
keys: list[str] = Field(default_factory=list)
interval_ms: int = Field(default=20, ge=0, le=5000)
dry_run: bool = False
click_text: "ClickTextAction | None" = None
@model_validator(mode="after")
def _validate_click_text(self):
if self.action == "click_text" and self.click_text is None:
raise ValueError("click_text payload is required when action=click_text")
return self
class ExecRequest(BaseModel):
command: str = Field(min_length=1, max_length=10000)
shell: Literal["powershell", "bash", "cmd"] | None = None
timeout_s: int | None = Field(default=None, ge=1, le=600)
cwd: str | None = None
dry_run: bool = False
class WindowQuery(BaseModel):
title_contains: str | None = Field(default=None, max_length=512)
title_regex: str | None = Field(default=None, max_length=512)
process_name: str | None = Field(default=None, max_length=260)
hwnd: int | None = Field(default=None, ge=1)
visible_only: bool = True
class WindowActionRequest(WindowQuery):
action: Literal["focus", "restore", "minimize", "maximize", "close"]
timeout_ms: int = Field(default=3000, ge=0, le=60000)
class LaunchRequest(BaseModel):
executable: str = Field(min_length=1, max_length=2048)
args: list[str] = Field(default_factory=list, max_length=100)
cwd: str | None = None
wait_for_window: bool = False
match: WindowQuery | None = None
timeout_ms: int = Field(default=5000, ge=0, le=120000)
dry_run: bool = False


class SeeRequest(BaseModel):
    screen: int = 0
    region_x: int | None = Field(default=None, ge=0)
    region_y: int | None = Field(default=None, ge=0)
    region_width: int | None = Field(default=None, gt=0)
    region_height: int | None = Field(default=None, gt=0)
    with_grid: bool = True
    grid_rows: int = Field(default=12, ge=1, le=300)
    grid_cols: int = Field(default=12, ge=1, le=300)
    include_labels: bool = True
    image_format: Literal["png", "jpeg"] = "png"
    jpeg_quality: int = Field(default=85, ge=1, le=100)
    ocr: bool = False
    ocr_min_confidence: float = Field(default=0.0, ge=0.0, le=100.0)
    ocr_lang: str = Field(default="eng", min_length=1, max_length=64)
    ocr_psm: int | None = Field(default=None, ge=0, le=13)


class SeeZoomRequest(BaseModel):
    screen: int = 0
    center_x: int = Field(ge=0)
    center_y: int = Field(ge=0)
    width: int = Field(default=500, ge=10)
    height: int = Field(default=350, ge=10)
    with_grid: bool = True
    grid_rows: int = Field(default=20, ge=1, le=300)
    grid_cols: int = Field(default=20, ge=1, le=300)
    include_labels: bool = True
    image_format: Literal["png", "jpeg"] = "png"
    jpeg_quality: int = Field(default=90, ge=1, le=100)


class InteractRequest(BaseModel):
    screen: int = 0
    action: ActionRequest


class OCRRegion(BaseModel):
    x: int = Field(ge=0)
    y: int = Field(ge=0)
    width: int = Field(gt=0)
    height: int = Field(gt=0)


class ClickTextAction(BaseModel):
    text: str = Field(min_length=1, max_length=1000)
    match: Literal["contains", "exact", "regex"] = "contains"
    region: OCRRegion | None = None
    screen: int | None = None
    case_sensitive: bool = False
    min_confidence: float = Field(default=0.0, ge=0.0, le=100.0)
    occurrence: Literal["first", "best", "nth"] = "first"
    nth: int | None = Field(default=None, ge=1, le=10000)
    ocr_lang: str = Field(default="eng", min_length=1, max_length=64)
    ocr_psm: int | None = Field(default=None, ge=0, le=13)

    @model_validator(mode="after")
    def _validate_nth(self):
        if self.occurrence == "nth" and self.nth is None:
            raise ValueError("nth is required when occurrence=nth")
        if self.occurrence != "nth" and self.nth is not None:
            raise ValueError("nth is only allowed when occurrence=nth")
        return self


# Resolve the "ClickTextAction | None" forward reference declared on ActionRequest.
ActionRequest.model_rebuild()

602
server/services.py Normal file
View File

@@ -0,0 +1,602 @@
import ctypes
import ctypes.wintypes  # "import ctypes" alone does not load the wintypes submodule used by list_windows()
import io
import os
import re
import subprocess
import sys
import time
from typing import Literal

from fastapi import HTTPException
from PIL import ImageChops, ImageStat

from .config import SETTINGS
from .models import (
    ActionRequest,
    ClickTextAction,
    GridTarget,
    LaunchRequest,
    PixelTarget,
    Target,
    WindowActionRequest,
    WindowQuery,
)


def api_error(status_code: int, code: str, message: str, details=None):
    # Always raises; callers rely on this to stop execution at the call site.
    raise HTTPException(status_code=status_code, detail={"code": code, "message": message, "details": details})


def import_capture_libs():
    try:
        from PIL import Image, ImageDraw
        import mss

        return Image, ImageDraw, mss
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc


def display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
    return {
        "screen": screen,
        "mss_index": mss_index,
        "primary": primary,
        "x": mon["left"],
        "y": mon["top"],
        "width": mon["width"],
        "height": mon["height"],
    }


def ordered_displays(sct) -> list[dict]:
    # sct.monitors[0] is skipped; the remaining entries keep their 1-based mss index.
    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
    if not raw_monitors:
        raise HTTPException(status_code=500, detail="no displays detected")
    # The monitor whose origin is (0, 0) is treated as primary and becomes screen 0.
    primary_pos = next((idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0), 0)
    ordered = [raw_monitors[primary_pos]] + [item for idx, item in enumerate(raw_monitors) if idx != primary_pos]
    return [display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0)) for index, (mss_index, mon) in enumerate(ordered)]


def get_displays() -> list[dict]:
    _, _, mss = import_capture_libs()
    with mss.mss() as sct:
        return ordered_displays(sct)


def select_display(screen: int) -> tuple[dict, list[dict], dict]:
    displays = get_displays()
    # Out-of-range screen indexes fall back to screen 0 instead of failing.
    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
    return selected, displays, {"requested": screen, "selected": selected["screen"], "fallback": selected["screen"] != screen}
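
The ordering and fallback rules above can be exercised without a capture backend. The sketch below mirrors `ordered_displays` and `select_display` on plain dicts. `order_monitors` and `pick_screen` are hypothetical names used only for illustration, and the sample monitor dicts merely imitate the `left`/`top`/`width`/`height` keys the code above reads from `sct.monitors`.

```python
def order_monitors(monitors: list[dict]) -> list[dict]:
    """Put the monitor at (0, 0) first and keep the rest in their original order."""
    indexed = list(enumerate(monitors, start=1))  # 1-based, like sct.monitors[1:]
    primary_pos = next(
        (pos for pos, (_, mon) in enumerate(indexed) if mon["left"] == 0 and mon["top"] == 0),
        0,
    )
    ordered = [indexed[primary_pos]] + [item for pos, item in enumerate(indexed) if pos != primary_pos]
    return [
        {"screen": screen, "mss_index": mss_index, "primary": screen == 0, "x": mon["left"], "y": mon["top"]}
        for screen, (mss_index, mon) in enumerate(ordered)
    ]


def pick_screen(displays: list[dict], requested: int) -> dict:
    """Fall back to screen 0 when the requested index is out of range."""
    selected = displays[requested] if 0 <= requested < len(displays) else displays[0]
    return {"requested": requested, "selected": selected["screen"], "fallback": selected["screen"] != requested}


# Hypothetical two-monitor layout: the secondary monitor sits to the left of the primary.
monitors = [
    {"left": -1920, "top": 0, "width": 1920, "height": 1080},
    {"left": 0, "top": 0, "width": 2560, "height": 1440},
]
displays = order_monitors(monitors)
```

Callers that care whether the fallback happened can inspect the `fallback` flag rather than assuming the requested screen was used.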


def capture_screen(screen: int = 0):
    Image, _, mss = import_capture_libs()
    with mss.mss() as sct:
        displays = ordered_displays(sct)
        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
        shot = sct.grab({"left": mon["x"], "top": mon["y"], "width": mon["width"], "height": mon["height"]})
        image = Image.frombytes("RGB", shot.size, shot.rgb)
    selection = {"requested": screen, "selected": mon["screen"], "fallback": mon["screen"] != screen}
    return image, mon, displays, selection


def capture_region_image(screen: int, region_x: int | None, region_y: int | None, region_width: int | None, region_height: int | None):
    base_img, mon, displays, screen_selection = capture_screen(screen)
    # Any missing region field means "capture the whole monitor".
    if None in {region_x, region_y, region_width, region_height}:
        return base_img, {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}, mon, displays, screen_selection
    # Region coordinates are global; convert them to offsets inside the captured monitor image.
    left = region_x - mon["x"]
    top = region_y - mon["y"]
    right = left + region_width
    bottom = top + region_height
    if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
        raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
    crop = base_img.crop((left, top, right, bottom))
    return crop, {"x": region_x, "y": region_y, "width": region_width, "height": region_height}, mon, displays, screen_selection


def extract_ocr_items(image, origin_x: int, origin_y: int, min_confidence: float, lang: str, psm: int | None) -> list[dict]:
    try:
        import pytesseract
    except Exception as exc:
        api_error(503, "ocr_unavailable", f"pytesseract unavailable: {exc}")
    config = ""
    if psm is not None:
        config = f"--psm {psm}"
    try:
        data = pytesseract.image_to_data(image, lang=lang, config=config, output_type=pytesseract.Output.DICT)
    except Exception as exc:
        api_error(503, "ocr_failed", f"ocr failed: {exc}")
    out: list[dict] = []
    n = len(data.get("text", []))
    for i in range(n):
        text = (data["text"][i] or "").strip()
        if not text:
            continue
        try:
            confidence = float(data["conf"][i])
        except Exception:
            continue
        if confidence < min_confidence:
            continue
        left = int(data["left"][i])
        top = int(data["top"][i])
        width = int(data["width"][i])
        height = int(data["height"][i])
        # bbox and center are global desktop coordinates; region_relative_bbox stays local to the image.
        bbox = {"x": origin_x + left, "y": origin_y + top, "width": width, "height": height}
        center = {"x": bbox["x"] + (width // 2), "y": bbox["y"] + (height // 2)}
        out.append(
            {
                "text": text,
                "confidence": confidence,
                "bbox": bbox,
                "center": center,
                "region_relative_bbox": {"x": left, "y": top, "width": width, "height": height},
            }
        )
    return out
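
The coordinate translation in `extract_ocr_items` is easy to check by hand. The sketch below repeats only that arithmetic; `to_global` is a hypothetical helper name and the sample numbers are made up, assuming a capture whose origin is at global `(1920, 0)`.

```python
def to_global(origin_x: int, origin_y: int, left: int, top: int, width: int, height: int):
    """Translate an image-relative OCR box into global desktop coordinates."""
    bbox = {"x": origin_x + left, "y": origin_y + top, "width": width, "height": height}
    # Integer division matches the service code, so odd sizes round down.
    center = {"x": bbox["x"] + width // 2, "y": bbox["y"] + height // 2}
    return bbox, center


# A word found 100 px right and 50 px down inside a capture that starts at x=1920.
bbox, center = to_global(1920, 0, left=100, top=50, width=81, height=21)
```

Here `center` is the point a `click_text` action would click, which is why it has to be expressed in global rather than region-relative coordinates.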


def serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
    buf = io.BytesIO()
    if image_format == "jpeg":
        image.save(buf, format="JPEG", quality=jpeg_quality)
    else:
        image.save(buf, format="PNG")
    return buf.getvalue()


def encode_image(image, image_format: str, jpeg_quality: int) -> str:
    import base64

    return base64.b64encode(serialize_image(image, image_format, jpeg_quality)).decode("ascii")


def draw_grid(image, region_x: int, region_y: int, rows: int, cols: int, include_labels: bool):
    _, ImageDraw, _ = import_capture_libs()
    out = image.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    cell_w = w / cols
    cell_h = h / rows
    # Interior grid lines.
    for c in range(1, cols):
        x = int(round(c * cell_w))
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for r in range(1, rows):
        y = int(round(r * cell_h))
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
    # Outer border.
    draw.rectangle([(0, 0), (w - 1, h - 1)], outline=(255, 0, 0), width=2)
    if include_labels:
        # Label each cell with its zero-based "row,col" index near the cell center.
        for r in range(rows):
            for c in range(cols):
                cx = int((c + 0.5) * cell_w)
                cy = int((r + 0.5) * cell_h)
                draw.text((cx - 12, cy - 6), f"{r},{c}", fill=(255, 255, 0))
    meta = {
        "region": {"x": region_x, "y": region_y, "width": w, "height": h},
        "grid": {
            "rows": rows,
            "cols": cols,
            "cell_width": cell_w,
            "cell_height": cell_h,
            "indexing": "zero-based",
            "point_formula": {
                "pixel_x": "region.x + ((col + 0.5 + dx*0.5) * cell_width)",
                "pixel_y": "region.y + ((row + 0.5 + dy*0.5) * cell_height)",
                "dx_range": "[-1,1]",
                "dy_range": "[-1,1]",
            },
        },
    }
    return out, meta


def resolve_target(target: Target) -> tuple[int, int, dict]:
    if isinstance(target, PixelTarget):
        x = target.x + target.dx
        y = target.y + target.dy
        return x, y, {"mode": "pixel", "source": target.model_dump()}
    # Grid target: dx/dy in [-1, 1] shift the point by up to half a cell from the cell center.
    cell_w = target.region_width / target.cols
    cell_h = target.region_height / target.rows
    x = target.region_x + int(round((target.col + 0.5 + (target.dx * 0.5)) * cell_w))
    y = target.region_y + int(round((target.row + 0.5 + (target.dy * 0.5)) * cell_h))
    return x, y, {"mode": "grid", "source": target.model_dump(), "derived": {"cell_width": cell_w, "cell_height": cell_h}}
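
As a worked instance of the grid branch in `resolve_target`, the sketch below applies the same formula to a 12x12 grid over a 1920x1080 region. `grid_point` is a hypothetical standalone function written for this illustration; it takes plain numbers instead of a `GridTarget`.

```python
def grid_point(region_x, region_y, region_width, region_height, rows, cols, row, col, dx=0.0, dy=0.0):
    """Return the global pixel for a zero-based grid cell, offset by dx/dy in [-1, 1]."""
    cell_w = region_width / cols
    cell_h = region_height / rows
    x = region_x + int(round((col + 0.5 + dx * 0.5) * cell_w))
    y = region_y + int(round((row + 0.5 + dy * 0.5) * cell_h))
    return x, y


# Cells are 160 x 90 px here, so the center of cell (row=6, col=8) is (8.5 * 160, 6.5 * 90).
center = grid_point(0, 0, 1920, 1080, rows=12, cols=12, row=6, col=8)
# dx=1.0 moves the point half a cell to the right, onto the cell's right edge.
right_edge = grid_point(0, 0, 1920, 1080, rows=12, cols=12, row=6, col=8, dx=1.0)
```

With these inputs `center` is `(1360, 585)` and `right_edge` is `(1440, 585)`.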


def enforce_allowed_region(x: int, y: int):
    region = SETTINGS["allowed_region"]
    if region is None:
        return
    rx, ry, rw, rh = region
    if not (rx <= x < rx + rw and ry <= y < ry + rh):
        raise HTTPException(status_code=403, detail="point outside allowed region")


def _text_matches(candidate: str, needle: str, mode: str, case_sensitive: bool) -> bool:
    hay = candidate if case_sensitive else candidate.lower()
    ndl = needle if case_sensitive else needle.lower()
    if mode == "contains":
        return ndl in hay
    if mode == "exact":
        return hay == ndl
    # Remaining mode is "regex": match against the original strings and let the flag handle case.
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(needle, candidate, flags=flags) is not None


def _resolve_text_match(click_text: ClickTextAction, items: list[dict]) -> dict:
    matches = [item for item in items if _text_matches(item["text"], click_text.text, click_text.match, click_text.case_sensitive)]
    if not matches:
        # Report the highest-confidence OCR strings to help the caller adjust the query.
        candidates = [item["text"] for item in sorted(items, key=lambda v: v["confidence"], reverse=True)[:8]]
        api_error(404, "ocr_text_not_found", "no OCR text matched", {"query": click_text.text, "candidates": candidates})
    if click_text.occurrence == "best":
        return max(matches, key=lambda item: item["confidence"])
    if click_text.occurrence == "nth":
        idx = (click_text.nth or 1) - 1
        if idx >= len(matches):
            api_error(409, "ocr_nth_out_of_range", "requested nth match is out of range", {"match_count": len(matches), "nth": click_text.nth})
        return matches[idx]
    # occurrence == "first": ambiguity is only rejected for exact matches.
    if len(matches) > 1 and click_text.match == "exact":
        api_error(
            409,
            "ocr_text_ambiguous",
            "multiple OCR entries matched",
            {"match_count": len(matches), "candidates": [item["text"] for item in matches[:8]]},
        )
    return matches[0]


def import_input_lib():
    try:
        import pyautogui

        pyautogui.FAILSAFE = True
        return pyautogui
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc


def exec_action(req: ActionRequest, screen: int = 0) -> dict:
    run_dry = SETTINGS["dry_run"] or req.dry_run
    action_screen = screen
    # A screen set inside the click_text payload overrides the request-level screen.
    if req.action == "click_text" and req.click_text and req.click_text.screen is not None:
        action_screen = req.click_text.screen
    selected_display, _, screen_selection = select_display(action_screen)
    pyautogui = None if run_dry else import_input_lib()
    resolved_target = None
    if req.target is not None:
        x, y, info = resolve_target(req.target)
        enforce_allowed_region(x, y)
        resolved_target = {"x": x, "y": y, "target_info": info}
    duration_sec = req.duration_ms / 1000.0
    if req.action in {"move", "click", "right_click", "double_click", "middle_click"} and resolved_target is None:
        raise HTTPException(status_code=400, detail="target is required for pointer actions")
    if req.action == "scroll" and resolved_target is None:
        raise HTTPException(status_code=400, detail="target is required for scroll")
    click_text_match = None
    if req.action == "click_text":
        if req.click_text is None:
            api_error(400, "click_text_payload_required", "click_text payload is required")
        region = req.click_text.region
        img, captured_region, _, _, _ = capture_region_image(
            action_screen,
            None if region is None else region.x,
            None if region is None else region.y,
            None if region is None else region.width,
            None if region is None else region.height,
        )
        items = extract_ocr_items(
            img,
            captured_region["x"],
            captured_region["y"],
            req.click_text.min_confidence,
            req.click_text.ocr_lang,
            req.click_text.ocr_psm,
        )
        matched = _resolve_text_match(req.click_text, items)
        enforce_allowed_region(matched["center"]["x"], matched["center"]["y"])
        click_text_match = {
            "query": req.click_text.model_dump(),
            "matched": matched,
            "capture_region": captured_region,
            "screen": screen_selection,
        }
        resolved_target = {"x": matched["center"]["x"], "y": matched["center"]["y"], "target_info": {"mode": "ocr_text"}}
    if not run_dry:
        if req.action == "move":
            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
        elif req.action == "click":
            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], clicks=req.clicks, interval=req.interval_ms / 1000.0, button=req.button, duration=duration_sec)
        elif req.action == "right_click":
            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="right", duration=duration_sec)
        elif req.action == "double_click":
            pyautogui.doubleClick(x=resolved_target["x"], y=resolved_target["y"], interval=req.interval_ms / 1000.0)
        elif req.action == "middle_click":
            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="middle", duration=duration_sec)
        elif req.action == "scroll":
            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
            pyautogui.scroll(req.scroll_amount)
        elif req.action == "type":
            pyautogui.write(req.text, interval=req.interval_ms / 1000.0)
        elif req.action == "hotkey":
            if len(req.keys) < 1:
                raise HTTPException(status_code=400, detail="keys is required for hotkey")
            pyautogui.hotkey(*req.keys)
        elif req.action == "click_text":
            pyautogui.click(
                x=resolved_target["x"],
                y=resolved_target["y"],
                clicks=req.clicks,
                interval=req.interval_ms / 1000.0,
                button=req.button,
                duration=duration_sec,
            )
    return {
        "action": req.action,
        "executed": not run_dry,
        "dry_run": run_dry,
        "screen": screen_selection,
        "display": selected_display,
        "resolved_target": resolved_target,
        "click_text_match": click_text_match,
    }


def windows_only(feature: str):
    if sys.platform != "win32":
        raise HTTPException(status_code=501, detail=f"{feature} is currently supported on Windows hosts only")


def tasklist_process_name(pid: int) -> str | None:
    try:
        completed = subprocess.run(["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"], capture_output=True, text=True, timeout=5, check=False)
    except Exception:
        return None
    line = (completed.stdout or "").strip().splitlines()
    if not line:
        return None
    row = line[0].strip()
    if not row or row.startswith("INFO:"):
        return None
    # The first CSV column is the image name, e.g. "notepad.exe","1234",...
    if row.startswith('"') and '","' in row:
        return row.split('","', 1)[0].strip('"')
    return None


def list_windows(query: WindowQuery | None = None) -> list[dict]:
    windows_only("window endpoints")
    query = query or WindowQuery()
    user32 = ctypes.windll.user32
    kernel32 = ctypes.windll.kernel32
    psapi = ctypes.windll.psapi
    user32.GetWindowTextLengthW.argtypes = [ctypes.c_void_p]
    user32.GetWindowTextLengthW.restype = ctypes.c_int
    user32.GetWindowTextW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
    user32.GetWindowTextW.restype = ctypes.c_int
    user32.IsWindowVisible.argtypes = [ctypes.c_void_p]
    user32.IsWindowVisible.restype = ctypes.c_bool
    user32.IsWindowEnabled.argtypes = [ctypes.c_void_p]
    user32.IsWindowEnabled.restype = ctypes.c_bool
    user32.IsIconic.argtypes = [ctypes.c_void_p]
    user32.IsIconic.restype = ctypes.c_bool
    user32.IsZoomed.argtypes = [ctypes.c_void_p]
    user32.IsZoomed.restype = ctypes.c_bool
    user32.GetForegroundWindow.restype = ctypes.c_void_p
    user32.GetWindowRect.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.wintypes.RECT)]
    user32.GetWindowRect.restype = ctypes.c_bool
    user32.GetClassNameW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
    user32.GetClassNameW.restype = ctypes.c_int
    kernel32.OpenProcess.argtypes = [ctypes.wintypes.DWORD, ctypes.wintypes.BOOL, ctypes.wintypes.DWORD]
    kernel32.OpenProcess.restype = ctypes.wintypes.HANDLE
    kernel32.CloseHandle.argtypes = [ctypes.wintypes.HANDLE]
    kernel32.CloseHandle.restype = ctypes.wintypes.BOOL
    psapi.GetModuleBaseNameW.argtypes = [ctypes.wintypes.HANDLE, ctypes.wintypes.HMODULE, ctypes.c_wchar_p, ctypes.wintypes.DWORD]
    psapi.GetModuleBaseNameW.restype = ctypes.wintypes.DWORD
    foreground = int(user32.GetForegroundWindow() or 0)
    results: list[dict] = []

    def callback(hwnd, _lparam):
        # Returning True tells EnumWindows to continue with the next window.
        hwnd_int = int(hwnd)
        if query.hwnd and hwnd_int != query.hwnd:
            return True
        visible = bool(user32.IsWindowVisible(hwnd))
        if query.visible_only and not visible:
            return True
        length = user32.GetWindowTextLengthW(hwnd)
        title_buf = ctypes.create_unicode_buffer(max(1, length + 1))
        user32.GetWindowTextW(hwnd, title_buf, len(title_buf))
        title = title_buf.value or ""
        if query.title_contains and query.title_contains.lower() not in title.lower():
            return True
        if query.title_regex and re.search(query.title_regex, title, flags=re.IGNORECASE) is None:
            return True
        pid = ctypes.wintypes.DWORD(0)
        user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
        process_name = tasklist_process_name(pid.value)
        if query.process_name and (process_name or "").lower() != query.process_name.lower():
            return True
        class_buf = ctypes.create_unicode_buffer(256)
        user32.GetClassNameW(hwnd, class_buf, len(class_buf))
        rect = ctypes.wintypes.RECT()
        user32.GetWindowRect(hwnd, ctypes.byref(rect))
        results.append(
            {
                "hwnd": hwnd_int,
                "title": title,
                "class_name": class_buf.value,
                "pid": int(pid.value),
                "process_name": process_name,
                "visible": visible,
                "enabled": bool(user32.IsWindowEnabled(hwnd)),
                "minimized": bool(user32.IsIconic(hwnd)),
                "maximized": bool(user32.IsZoomed(hwnd)),
                "foreground": hwnd_int == foreground,
                "rect": {"x": int(rect.left), "y": int(rect.top), "width": int(rect.right - rect.left), "height": int(rect.bottom - rect.top)},
            }
        )
        return True

    enum_proc = ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)(callback)
    user32.EnumWindows(enum_proc, 0)
    # Foreground window first, then alphabetical by title, then by handle.
    results.sort(key=lambda item: (not item["foreground"], item["title"].lower(), item["hwnd"]))
    return results


def _pick_single_window(query: WindowQuery) -> dict:
    matches = list_windows(query)
    if not matches:
        raise HTTPException(status_code=404, detail="no window matched")
    if len(matches) > 1:
        raise HTTPException(status_code=409, detail={"message": "multiple windows matched", "matches": matches[:10]})
    return matches[0]


def apply_window_action(req: WindowActionRequest) -> dict:
    windows_only("window endpoints")
    match = _pick_single_window(req)
    hwnd = match["hwnd"]
    user32 = ctypes.windll.user32
    SW_RESTORE, SW_MINIMIZE, SW_MAXIMIZE = 9, 6, 3
    WM_CLOSE = 0x0010
    if req.action == "focus":
        user32.ShowWindow(hwnd, SW_RESTORE)
        ok = bool(user32.SetForegroundWindow(hwnd))
        if not ok:
            raise HTTPException(status_code=500, detail="failed to focus window")
    elif req.action == "restore":
        user32.ShowWindow(hwnd, SW_RESTORE)
    elif req.action == "minimize":
        user32.ShowWindow(hwnd, SW_MINIMIZE)
    elif req.action == "maximize":
        user32.ShowWindow(hwnd, SW_MAXIMIZE)
    elif req.action == "close":
        user32.PostMessageW(hwnd, WM_CLOSE, 0, 0)
    # Poll the window until the action is observable or the timeout expires.
    deadline = time.time() + (req.timeout_ms / 1000.0)
    final = None
    while time.time() <= deadline:
        current = list_windows(WindowQuery(hwnd=hwnd, visible_only=False))
        if not current:
            if req.action == "close":
                return {"matched": match, "closed": True, "final": None}
            time.sleep(0.05)
            continue
        final = current[0]
        if req.action == "focus" and final.get("foreground"):
            break
        if req.action in {"restore", "minimize", "maximize"}:
            break
        time.sleep(0.05)
    return {"matched": match, "closed": False, "final": final}


def launch_app(req: LaunchRequest) -> dict:
    if req.cwd and not os.path.isdir(req.cwd):
        raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
    argv = [req.executable, *req.args]
    cwd = req.cwd or None
    if req.dry_run or SETTINGS["dry_run"]:
        return {"executed": False, "dry_run": True, "argv": argv, "cwd": cwd}
    try:
        proc = subprocess.Popen(argv, cwd=cwd)
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"executable not found: {exc}") from exc
    except OSError as exc:
        raise HTTPException(status_code=400, detail=f"failed to launch process: {exc}") from exc
    result = {"executed": True, "dry_run": False, "argv": argv, "cwd": cwd, "pid": proc.pid}
    if req.wait_for_window:
        # Without an explicit match query, look for a visible window owned by the launched executable.
        query = req.match or WindowQuery(process_name=os.path.basename(req.executable), visible_only=True)
        deadline = time.time() + (req.timeout_ms / 1000.0)
        match = None
        while time.time() <= deadline:
            matches = list_windows(query)
            if matches:
                match = matches[0]
                break
            time.sleep(0.2)
        result["window"] = match
        result["window_found"] = match is not None
    return result


def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
    if len(text) <= limit:
        return text, False
    return text[:limit], True


def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
    if shell_name == "powershell":
        return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
    if shell_name == "bash":
        return ["bash", "-lc", command]
    if shell_name == "cmd":
        return ["cmd", "/c", command]
    raise HTTPException(status_code=400, detail="unsupported shell")


def exec_command(req):
    if not SETTINGS["exec_enabled"]:
        raise HTTPException(status_code=403, detail="exec endpoint disabled")
    if not SETTINGS["exec_secret"]:
        raise HTTPException(status_code=403, detail="exec secret not configured")
    shell_name = (req.shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
    if shell_name not in {"powershell", "bash", "cmd"}:
        raise HTTPException(status_code=400, detail="unsupported shell")
    run_dry = SETTINGS["dry_run"] or req.dry_run
    # The requested timeout is clamped to the configured maximum.
    timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
    timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])
    cwd = None
    if req.cwd:
        cwd = os.path.abspath(req.cwd)
        if not os.path.isdir(cwd):
            raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
    argv = _resolve_exec_program(shell_name, req.command)
    if run_dry:
        return {"executed": False, "dry_run": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd}
    start = time.time()
    try:
        completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired as exc:
        # TimeoutExpired can carry bytes even with text=True; decode them so str() does not emit a "b'...'" repr.
        partial_out = exc.stdout.decode(errors="replace") if isinstance(exc.stdout, bytes) else (exc.stdout or "")
        partial_err = exc.stderr.decode(errors="replace") if isinstance(exc.stderr, bytes) else (exc.stderr or "")
        stdout, stdout_truncated = _truncate_text(partial_out, SETTINGS["exec_max_output_chars"])
        stderr, stderr_truncated = _truncate_text(partial_err, SETTINGS["exec_max_output_chars"])
        return {"executed": True, "timed_out": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": None, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc
    stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
    stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
    return {"executed": True, "timed_out": False, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": completed.returncode, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}

View File

@@ -1,381 +1,64 @@
--- ---
name: clickthrough-http-control name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification. description: Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell).
--- ---
# Clickthrough HTTP Control # Clickthrough Computer Control
Use a strict observe-decide-act-verify loop. Use these methods:
- `see`
- `interact`
- `exec`
## Getting a computer instance (user-owned setup) ## Method 1: See
The **user/operator** is responsible for provisioning and exposing the target machine. Use `POST /see` to capture full screen or a region with a grid overlay.
The agent should not assume it can self-install this stack. Use `POST /see/zoom` to capture a tighter crop with a denser grid.
Use `POST /see` with `ocr=true` when text localization is needed.
### What the user must do
1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
3. Configure secrets on target machine:
- `CLICKTHROUGH_TOKEN` for general API auth
- `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
4. Share connection details with the agent through a secure channel:
- `base_url`
- `x-clickthrough-token`
- `x-clickthrough-exec-secret` (only when `/exec` is needed)
### What the agent should do
1. Validate connection with `GET /health` using provided headers.
2. Refuse `/exec` attempts when exec secret is missing/invalid.
3. Ask user for missing setup inputs instead of guessing infrastructure.
## What the agent can actually see
The agent does **not** inherently see the remote desktop.
Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
That means:
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
Do not write or think as if the agent is directly watching the screen in real time.
Say what you actually have: screenshots, OCR output, and fresh verification captures.
## Mini API map
- `GET /health` → server status + safety flags
- `GET /displays` → detected displays in zero-based API order
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
- `GET /windows` → discover visible desktop windows and their handles/processes
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
- `POST /launch` → start an app/process without dropping to a shell
- `POST /wait?screen=0` → wait for text, window, or visual state changes
- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
- `POST /vision/stability?screen=0` → measure short-interval visual stability
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /action/verify?screen=0` → execute one action plus structured success verification
- `POST /batch?screen=0` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
### Display selection
- Use `GET /displays` before operating on multi-monitor systems.
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
- Window rectangles from `GET /windows` are also in global desktop coordinates. Use them to sanity-check which monitor the app is really on before clicking.
### OCR usage
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.
- Current response shape is nested under `result.blocks`, not top-level `blocks`. Parse the real payload before assuming the endpoint failed.
- OCR can be noisy on dense shopping pages, streaming apps, and button-heavy sidebars. Re-crop tightly before escalating.
### Screenshot + `image` tool usage
Use the OpenClaw `image` tool when OCR is not enough.
This is especially useful for:
- identifying which visible button looks like the primary confirm action
- understanding dialog layout or pane structure
- distinguishing similar nearby controls by icon, spacing, or emphasis
- checking whether a visual state changed after a click
- telling you where something is and where to click when text alone is not reliable
Good pattern:
1. capture with `GET /screen` or `POST /zoom`
2. hand that screenshot to the `image` tool
3. ask a precise question about the visible UI
4. when click targeting matters, ask the model to describe **where the target is** or provide an approximate click point inside the crop
5. convert the answer into a concrete Clickthrough target
6. act once
7. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
Prefer vision over guessing.
If OCR is fragmented, partial, or ambiguous, stop inferring and ask the vision model where the control is.
The model should help answer things like:
- which visible button is the real primary action
- whether the target is left/right/top/bottom within the crop
- which of several similar buttons is the one to click
- an approximate click point inside the provided image bounds
Ask narrow questions.
Good:
- "Which button in this dialog is the primary confirmation action?"
- "Is the scan still running, or does this look complete?"
- "Which of these tabs appears selected?"
- "Where is the orange Buy Now button in this 620x890 crop? Return one x,y coordinate inside the image bounds."
- "Which visible control says Stop Recording, and where should I click?"
Bad:
- "What should I click?"
- "Use your eyes and do the task"
- anything that assumes the model has live continuity without a new screenshot
- requesting coordinates without telling the model the image bounds or expected output format
### Header requirements
- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.
## `POST /action` request shape (important)
`/action` always expects an `action` plus an optional `target` object.
Do **not** invent top-level `x` / `y` fields.
Minimal pixel click:
```json
{
"action": "click",
"target": {"mode": "pixel", "x": 100, "y": 200},
"button": "left",
"clicks": 1
}
```
Minimal grid click:
```json
{
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 6,
"col": 8,
"dx": 0.0,
"dy": 0.0
}
}
```
Other canonical examples:
```json
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
{"action": "type", "text": "hello world", "interval_ms": 20}
{"action": "hotkey", "keys": ["ctrl", "l"]}
```
Rules:
- `dx` / `dy` belong inside `target`, not beside it.
- `type` and `hotkey` usually do not need a `target`.
- For pixel targets, `x` / `y` are global desktop coordinates.
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
## When to use `/exec`
Prefer structured GUI control first:
- `/screen`, `/zoom`, `/ocr` to observe
- `/action` or `/batch` to interact
Use `/exec` only when it is the cleanest available tool for the job, for example:
- querying machine state that the GUI does not expose well
- performing an explicit user-requested shell/system task
- recovering from a blocked GUI flow when normal interaction failed
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
When a task can be completed with window focus/restore, keyboard shortcuts, screenshots, OCR, and normal actions, stay out of `/exec` entirely.
## Core workflow (mandatory)
1. Call `GET /windows` first when the task mentions a known app; focus/restore the right window before screen hunting.
2. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
3. Identify likely target region and compute an initial confidence score.
4. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
5. **Before any click**, verify target identity (OCR text/icon/location consistency).
6. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
7. Execute one minimal action via `POST /action`.
8. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
9. Repeat until objective is complete.
## Verify-before-click rules
- Never click if target identity is ambiguous.
- Require at least two matching signals before click.
- Good signal pairs include:
- OCR text + expected UI region
- OCR text + matching button shape/icon nearby
- dialog title text + expected button position within that dialog
- known app/window focus + expected control location
- OCR candidate + vision-model localization inside the same crop
- If confidence is low, do not "test click"; zoom and re-localize first.
- If OCR and layout disagree, trust neither blindly; recrop and ask vision a narrower localization question.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
1) preview intended coordinate + reason
2) execute only after explicit confirmation.
## Precision rules
- Prefer grid targets first, then use `dx/dy` for subcell precision.
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
## Safety rules
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless deterministic; then use `/batch`.
Rules:
- Start with coarse grid (`12x12`).
- For precision, zoom and use denser grid (`20x20` or higher).
- Always use returned `meta.region` and `meta.grid` when computing click targets.
- Coordinates are global desktop coordinates.
- OCR results are in `data.meta.ocr` and include confidence, bbox, and center.
## Method 2: Interact
Use `POST /interact` for one action at a time.
Mouse actions:
- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`
- `click_text` (OCR-driven click; optionally scope with `click_text.region`)
Keyboard actions:
- `type`, `hotkey`
Rules:
- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
- For text buttons/labels, prefer `click_text` and bound OCR with a region when possible.
- Use `pixel` only when you already have reliable coordinates.
- After each important action, call `see` again before continuing.
## Method 3: Exec
Use `POST /exec` only for shell/system tasks.
Rules:
- Requires `x-clickthrough-exec-secret`.
- Do not use exec for normal clicking/typing flows.
- Prefer GUI interaction first; exec is fallback or explicit shell task.
## Lightweight Procedure
1. `see` capture.
2. If needed, `see/zoom` refine.
3. `interact` one step (`click_text` for text UI targets).
4. `see` verify.
5. Repeat.
## Quick Safety Rules
- Never click with stale screenshots.
- Never send multiple uncertain clicks in a row.
- If localization is ambiguous, re-capture with a tighter zoom.
## Reliability rules
- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
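One way to read these rules as a bounded retry loop is sketched below. The callables `act`, `verify`, and `relocalize` are hypothetical stand-ins for a single action, a fresh-screenshot check, and a zoom plus re-localization pass; none of them are Clickthrough APIs.

```python
from typing import Callable


def act_and_verify(act: Callable[[], None],
                   verify: Callable[[], bool],
                   relocalize: Callable[[], None],
                   max_retries: int = 2) -> str:
    """Run one action, verify it, and retry a bounded number of times."""
    act()
    if verify():
        return "ok"
    for _ in range(max_retries):
        # Never repeat the same click blindly: re-localize before each retry.
        relocalize()
        act()
        if verify():
            return "ok"
    # Retries exhausted: the caller should change approach
    # (hotkey, window focus, search) instead of clicking again.
    return "switch_strategy"
```

Returning a status instead of raising leaves the choice of fallback strategy to the caller.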
## Fallback ladder for uncertain targeting
1. Full-screen capture with a coarse grid.
2. Zoom into the candidate area with a denser grid.
3. OCR the full screen or the tighter region.
4. Re-anchor on a more reliable nearby control, title, or label.
5. Try a keyboard-first flow if the app supports it.
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
Do not skip from "uncertain click" straight to random retries.
## Concrete screenshot -> `image` -> action example
Example loop:
1. `GET /screen?screen=0` to capture the current app state
2. if the UI is text-heavy, try `POST /ocr` first
3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
- "In this save dialog, which visible button is the primary action?"
- "Is there a dismiss/close button in the top-right of this modal?"
4. map the answer back to a Clickthrough target using the returned grid/region metadata
5. click once with `POST /action`
6. recapture the screen
7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
The key rule is simple: screenshot first, interpret second, click third, verify fourth.
Do not collapse those steps into fake certainty.
When in doubt about location, use vision to localize the target instead of inventing coordinates from vibes.
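Step 4 of the loop, mapping a vision answer back to a click target, might look like the sketch below. It assumes the crop is not scaled relative to desktop pixels and that the region metadata carries `x`, `y`, `width`, and `height` keys; both function names are hypothetical.

```python
from typing import Any, Dict, Tuple


def crop_point_to_global(region: Dict[str, int],
                         local_x: int, local_y: int) -> Tuple[int, int]:
    """Translate a point inside a crop into global desktop coordinates."""
    inside_x = 0 <= local_x < region["width"]
    inside_y = 0 <= local_y < region["height"]
    if not (inside_x and inside_y):
        # A coordinate outside the image bounds means the answer is unusable.
        raise ValueError("point lies outside the crop bounds")
    return region["x"] + local_x, region["y"] + local_y


def pixel_click(region: Dict[str, int], local_x: int, local_y: int) -> Dict[str, Any]:
    """Build a pixel-mode click payload from a crop-relative point."""
    x, y = crop_point_to_global(region, local_x, local_y)
    return {
        "action": "click",
        "target": {"mode": "pixel", "x": x, "y": y},
        "button": "left",
        "clicks": 1,
    }
```

Rejecting out-of-bounds points is the cheap guard against a model returning coordinates for an image size it was never told about.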
## App-specific playbooks (recommended)
Build per-app routines for repetitive tasks instead of generic clicking.
### Launcher / search / start app playbook
Use this when the goal is "open app X" or "bring up tool Y".
1. check `GET /windows` first in case the app is already open
2. if present, use `POST /windows/action` to focus or restore it
3. if absent, prefer `POST /launch` when you know the executable path
4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
- open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
- type exact app name
- wait for stable results with `POST /wait` or recapture
- verify the result text with OCR or the `image` tool
- press Enter or click the exact result once
5. verify the app window now exists or is focused
Do not keep relaunching if the window already exists; that's sloppy.
### Dialog confirmation playbook
Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
1. capture the dialog region with `POST /zoom`
2. use OCR first for title/body/button labels
3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
5. for destructive actions, require explicit user confirmation unless already requested
6. click once and verify the dialog disappeared or changed state
Good verification targets:
- dialog title vanished
- expected next window appeared
- destructive side effect is visible and confirmed
### File picker playbook
Use for open/save dialogs.
1. verify the file picker window is focused
2. OCR the visible breadcrumb/path area, filename field, and button row
3. prefer keyboard-first entry when possible:
- type or paste the target path/name into the focused field
- use `tab` / `shift+tab` to move predictably between filename and action buttons
4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
5. verify the intended filename/path is visible before confirming
6. activate `Open` / `Save` once and verify the picker closes
If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
### Browser tab / window playbook
Use for browser navigation, tab targeting, or web app recovery.
1. use `GET /windows` to focus the correct browser window first
2. prefer keyboard-first navigation:
- `ctrl+l` / `cmd+l` to focus the address bar
- `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
- `ctrl+w` only for explicitly requested close actions
3. verify tab or page identity with OCR on the tab strip or page heading
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
5. after navigation, wait for visual stability or expected text before taking the next action
6. on shopping/checkouts, tighten crops around the buy box or checkout panel before reading button text; full-page OCR often misses the one thing that matters
Do not assume a page loaded just because the click landed. Verify it.
### Settings / preferences navigation playbook
Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
1. identify the current settings page with OCR on the heading/sidebar
2. use OCR to find the specific section label before trying to toggle anything
3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
5. after each change, verify the control state changed visually or via visible text
6. if a save/apply button exists, treat it as a separate confirmation step and verify completion
Settings UIs love hiding side effects. Assume nothing.
### Dense app / control-strip playbook
Use for apps like OBS, IDEs, mixers, dashboards, or anything with tiny bottom-right control clusters.
1. focus the exact app window with `POST /windows/action`
2. capture the full target display once to confirm the window is actually frontmost
3. crop tightly around the suspected control strip with `POST /zoom`
4. run OCR on the crop, not the full screen
5. if labels are still ambiguous, ask the `image` tool a narrow question about the specific buttons
6. click once and immediately verify the control label changed (`Start Recording` -> `Stop Recording`, etc.)
Do not trust OCR taken from the wrong frontmost window. It will happily waste your time.
### Spotify playbook
- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
1) `Ctrl+L` (search)
2) type exact query
3) Enter
4) verify exact song+artist text
5) click/double-click row
6) verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
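The keyboard-first part of this playbook (steps 1 to 3) can be sketched as a list of `/action` payloads. The `hotkey` and `type` shapes follow the canonical examples earlier in this file; the `"enter"` key name and the helper `spotify_search_steps` are assumptions.

```python
from typing import Any, Dict, List


def spotify_search_steps(query: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: payloads for focus-search, type, submit."""
    return [
        {"action": "hotkey", "keys": ["ctrl", "l"]},   # 1) focus search
        {"action": "type", "text": query},             # 2) type exact query
        {"action": "hotkey", "keys": ["enter"]},       # 3) submit (key name assumed)
    ]
```

Each payload should still be sent one at a time, with a fresh capture before the song+artist check in step 4.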


@@ -0,0 +1,114 @@
import sys

from PIL import Image
from fastapi.testclient import TestClient

from server import services
from server.app import app
from server.config import SETTINGS
from server.models import ClickTextAction


def _auth_headers() -> dict:
    token = SETTINGS.get("token", "")
    if not token:
        return {}
    return {"x-clickthrough-token": token}


def test_extract_ocr_items_normalization(monkeypatch):
    class FakeOutput:
        DICT = "DICT"

    class FakeTesseract:
        Output = FakeOutput

        @staticmethod
        def image_to_data(_image, lang, config, output_type):
            assert lang == "eng"
            assert output_type == "DICT"
            return {
                "text": ["hello", " ", "world"],
                "conf": ["95.0", "-1", "62.5"],
                "left": [10, 12, 40],
                "top": [20, 25, 60],
                "width": [30, 10, 50],
                "height": [10, 10, 12],
            }

    monkeypatch.setitem(sys.modules, "pytesseract", FakeTesseract)
    items = services.extract_ocr_items(Image.new("RGB", (100, 100)), origin_x=100, origin_y=200, min_confidence=60, lang="eng", psm=None)
    assert len(items) == 2
    assert items[0]["text"] == "hello"
    assert items[0]["bbox"]["x"] == 110
    assert items[0]["center"]["y"] == 225
    assert items[1]["text"] == "world"


def test_resolve_text_match_contains_exact_regex_and_nth():
    items = [
        {"text": "Save", "confidence": 70},
        {"text": "Save as", "confidence": 96},
        {"text": "SAVE", "confidence": 88},
    ]
    contains = services._resolve_text_match(ClickTextAction(text="save", match="contains", occurrence="first"), items)
    assert contains["text"] == "Save"
    best = services._resolve_text_match(ClickTextAction(text="save", match="contains", occurrence="best"), items)
    assert best["text"] == "Save as"
    exact_case = services._resolve_text_match(
        ClickTextAction(text="SAVE", match="exact", case_sensitive=True, occurrence="first"),
        items,
    )
    assert exact_case["text"] == "SAVE"
    regex_nth = services._resolve_text_match(ClickTextAction(text="^Save", match="regex", occurrence="nth", nth=2), items)
    assert regex_nth["text"] == "Save as"


def test_interact_click_text_region_optional(monkeypatch):
    monkeypatch.setattr(services, "select_display", lambda screen: ({"screen": screen}, [], {"requested": screen, "selected": screen, "fallback": False}))
    monkeypatch.setattr(
        services,
        "capture_region_image",
        lambda screen, x, y, w, h: (Image.new("RGB", (20, 20)), {"x": x or 0, "y": y or 0, "width": w or 20, "height": h or 20}, {}, [], {}),
    )
    monkeypatch.setattr(
        services,
        "extract_ocr_items",
        lambda *args, **kwargs: [
            {
                "text": "Apply",
                "confidence": 93.0,
                "bbox": {"x": 10, "y": 20, "width": 20, "height": 10},
                "center": {"x": 20, "y": 25},
                "region_relative_bbox": {"x": 10, "y": 20, "width": 20, "height": 10},
            }
        ],
    )
    client = TestClient(app)
    response = client.post(
        "/interact",
        json={"screen": 0, "action": {"action": "click_text", "dry_run": True, "click_text": {"text": "Apply", "match": "contains"}}},
        headers=_auth_headers(),
    )
    assert response.status_code == 200
    body = response.json()["data"]
    assert body["resolved_target"]["x"] == 20
    assert body["click_text_match"]["matched"]["text"] == "Apply"


def test_see_ocr_off_on_contract(monkeypatch):
    monkeypatch.setattr(
        "server.app.capture_region_image",
        lambda *args, **kwargs: (Image.new("RGB", (10, 10)), {"x": 0, "y": 0, "width": 10, "height": 10}, {"screen": 0}, [], {}),
    )
    monkeypatch.setattr("server.app.encode_image", lambda *args, **kwargs: "abc")
    monkeypatch.setattr("server.app.extract_ocr_items", lambda *args, **kwargs: [{"text": "x"}])
    client = TestClient(app)
    off = client.post("/see", json={"ocr": False, "with_grid": False}, headers=_auth_headers())
    assert off.status_code == 200
    assert "ocr" not in off.json()["data"]["meta"]
    on = client.post("/see", json={"ocr": True, "with_grid": False}, headers=_auth_headers())
    assert on.status_code == 200
    assert on.json()["data"]["meta"]["ocr"][0]["text"] == "x"
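
For readers of this diff, the matching behaviour that `test_resolve_text_match_contains_exact_regex_and_nth` pins down can be sketched as below. This is not the code in `server/services.py`. It is a standalone illustration that assumes matching is case-insensitive by default and that `nth` is 1-based, both of which are consistent with the test's assertions but not confirmed by them.

```python
import re
from typing import Any, Dict, List, Optional


def resolve_text_match(items: List[Dict[str, Any]], text: str,
                       match: str = "contains", case_sensitive: bool = False,
                       occurrence: str = "first", nth: int = 1) -> Optional[Dict[str, Any]]:
    """Illustrative matcher: pick one OCR item by text and occurrence rule."""
    flags = 0 if case_sensitive else re.IGNORECASE

    def is_hit(candidate: str) -> bool:
        if match == "regex":
            return re.search(text, candidate, flags) is not None
        if case_sensitive:
            hay, needle = candidate, text
        else:
            hay, needle = candidate.lower(), text.lower()
        return hay == needle if match == "exact" else needle in hay

    # Keep OCR order so "first" and "nth" follow reading order.
    hits = [item for item in items if is_hit(item["text"])]
    if not hits:
        return None
    if occurrence == "best":
        # Highest OCR confidence wins.
        return max(hits, key=lambda item: item["confidence"])
    if occurrence == "nth":
        return hits[nth - 1] if 1 <= nth <= len(hits) else None
    return hits[0]
```

Returning `None` on no match is a choice made for this sketch; the real service may raise or return an error envelope instead.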