Compare commits
9 Commits
c66779d929...main
| Author | SHA1 | Date |
|---|---|---|
| | 22ca0097d1 | |
| | f05e0c56e6 | |
| | 211f38003e | |
| | c1fc97e198 | |
| | 9e816e0417 | |
| | 1c03cab457 | |
| | aced5be25e | |
| | 2585bc3a7c | |
| | 66615c8a81 | |
48
AGENTS.md
Normal file
@@ -0,0 +1,48 @@
# AGENTS

## Purpose
This file defines how agents and contributors should work in this repository.
It is a codebase playbook, not a product roadmap.

## Repository Map
- `server/app.py`: FastAPI routes, auth checks, response envelope, exception handling.
- `server/services.py`: screenshot/OCR/input/window/exec behavior and safety enforcement.
- `server/models.py`: request schemas and validation rules.
- `server/config.py`: environment loading and runtime settings.
- `tests/`: unit and API contract tests (monkeypatched where host behavior is nondeterministic).
- `docs/API.md`: public API request/response reference.
- `skill/SKILL.md`: operational method for agent usage of the API.

## Local Workflow
- Install dependencies: `pip install -r requirements.txt`
- Run server: `python -m server.app`
- Run tests: `pytest -q`
- Basic health check: `GET /health` with `x-clickthrough-token` when token auth is enabled.

## Non-Negotiable Contracts
- Keep the response envelope shape stable:
  - `ok`, `request_id`, `time_ms`, `data`, `error`
- Preserve one-action-per-request semantics in `/interact`.
- Keep coordinate behavior in global desktop coordinates unless an explicit versioned change is introduced.
- All new request fields must be represented in `server/models.py` with explicit validation constraints.

## Safety and Security Rules
- Do not weaken `x-clickthrough-token` validation.
- Do not weaken `/exec` secret validation (`x-clickthrough-exec-secret`).
- Preserve `dry_run` behavior for non-destructive execution paths.
- Preserve allowed-region enforcement for pointer-target actions.
- Keep `/exec` constraints explicit:
  - shell allowlist
  - timeout limits
  - output truncation limits
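
The three `/exec` constraints listed above could be expressed roughly as follows. This is a minimal sketch, not the code in `server/services.py`: the function names are hypothetical, the defaults mirror the `CLICKTHROUGH_EXEC_*` values listed in the README configuration section, and clamping an over-limit timeout (rather than rejecting it) is an assumption.

```python
# Hypothetical helpers illustrating the /exec constraints; not the repository's code.
ALLOWED_SHELLS = ("powershell", "bash", "cmd")  # shells named in the docs


def resolve_exec_limits(shell=None, timeout_s=None, *, default_shell="powershell",
                        default_timeout_s=30, max_timeout_s=120):
    """Return (shell, timeout_s) after applying the allowlist and timeout cap."""
    chosen = shell or default_shell
    if chosen not in ALLOWED_SHELLS:
        raise ValueError(f"shell not allowed: {chosen!r}")
    timeout = default_timeout_s if timeout_s is None else timeout_s
    if timeout <= 0:
        raise ValueError("timeout_s must be positive")
    # Assumption: requests above the cap are clamped instead of rejected.
    return chosen, min(timeout, max_timeout_s)


def truncate_output(text, max_chars=20000):
    """Return (possibly shortened text, truncated flag)."""
    if len(text) <= max_chars:
        return text, False
    return text[:max_chars], True
```

Keeping each limit in a small pure function makes it straightforward to cover with deterministic tests.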

## Testing Expectations
- Add tests for each new behavior in services/models/routes.
- Cover success and failure paths, especially validation and ambiguity branches.
- Prefer deterministic tests with monkeypatches over host/UI-dependent flakiness.
- If API behavior changes, update tests in the same change.

## Documentation Policy
- When API behavior changes, update `docs/API.md` in the same change.
- Keep examples aligned with current schema in `docs/API.md` and `examples/quickstart.py`.
- Keep `skill/SKILL.md` aligned with current safe usage flow.
95
README.md
@@ -1,82 +1,37 @@
|
|||||||
# Clickthrough
|
# Clickthrough
|
||||||
|
|
||||||
Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions.
|
Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer by repeatedly capturing structured screenshots with coordinate-aware grids (`see`), executing precise mouse/keyboard actions from those coordinates (`interact`), and optionally running authenticated shell commands for system-level tasks (`exec`) under a consistent response contract.
|
||||||
|
|
||||||
## What this provides
|
## Core Methods
|
||||||
|
|
||||||
- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
|
- `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
|
||||||
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
|
- `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
|
||||||
- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
|
- `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
|
||||||
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
|
- `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.
|
||||||
- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
|
|
||||||
- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
|
|
||||||
- **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
|
|
||||||
- **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability`
|
|
||||||
- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
|
|
||||||
- **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify`
|
|
||||||
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
|
|
||||||
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
|
|
||||||
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
|
|
||||||
|
|
||||||
## Quick start
|
## Why this works for AI agents
|
||||||
|
|
||||||
```bash
|
- Agents do not need live vision; they iterate on snapshots.
|
||||||
cd /root/external-projects/clickthrough
|
- Grid metadata bridges image understanding to deterministic click coordinates.
|
||||||
python3 -m venv .venv
|
- Interaction stays explicit and auditable (one action per request).
|
||||||
. .venv/bin/activate
|
- A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.
|
||||||
pip install -r requirements.txt
|
|
||||||
CLICKTHROUGH_TOKEN=change-me python -m server.app
|
|
||||||
```
|
|
||||||
|
|
||||||
Server defaults to `127.0.0.1:8123`.
|
## Minimal Agent Loop
|
||||||
|
|
||||||
For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird.
|
1. Call `see` with a coarse grid.
|
||||||
|
2. If uncertain, call `see/zoom` with a denser grid.
|
||||||
|
3. Call `interact` once.
|
||||||
|
4. Call `see` again to verify state change.
|
||||||
|
5. Use `exec` only for explicit shell/system tasks.
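
The loop above can be sketched as one function that takes an injected transport. This is an illustration only: `run_step` and the `post(path, body)` callable are hypothetical names, and the request bodies reuse only fields that appear in the `docs/API.md` examples in this diff.

```python
def run_step(post, action, *, screen=0, zoom_center=None):
    """Run one see -> (optional zoom) -> interact -> see cycle.

    `post(path, body)` is a caller-supplied function that sends the request
    and returns the parsed response envelope as a dict.
    Returns the first failing envelope, or the final `see` envelope.
    """
    before = post("/see", {"screen": screen, "with_grid": True})
    if not before["ok"]:
        return before

    if zoom_center is not None:
        x, y = zoom_center
        zoomed = post("/see/zoom", {
            "screen": screen,
            "center_x": x,
            "center_y": y,
            "width": 500,
            "height": 350,
            "with_grid": True,
        })
        if not zoomed["ok"]:
            return zoomed

    # Exactly one action per request, matching the /interact contract.
    result = post("/interact", {"screen": screen, "action": action})
    if not result["ok"]:
        return result

    # Capture again so the caller can verify the state change.
    return post("/see", {"screen": screen, "with_grid": True})
```

Injecting `post` keeps the sketch independent of any HTTP library and lets the loop be exercised without a running server.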
|
||||||
|
|
||||||
`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
|
## Safety and Auth
|
||||||
|
|
||||||
## Minimal API flow
|
- `x-clickthrough-token` protects API access when enabled.
|
||||||
|
- `x-clickthrough-exec-secret` is required for `/exec`.
|
||||||
|
- Optional dry-run and allowed-region constraints reduce accidental risk.
|
||||||
|
|
||||||
1. `GET /displays` if you need a non-primary monitor
|
## Docs
|
||||||
2. `GET /screen?screen=0` with grid
|
|
||||||
3. Decide cell / target
|
|
||||||
4. Optional `POST /zoom?screen=0` for finer targeting
|
|
||||||
5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow)
|
|
||||||
6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find`
|
|
||||||
|
|
||||||
Important:
|
- API: `docs/API.md`
|
||||||
- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
|
- Agent procedure: `skill/SKILL.md`
|
||||||
- Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
|
- Coordinate system details: `docs/coordinate-system.md`
|
||||||
- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
|
|
||||||
- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
|
|
||||||
- Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
|
|
||||||
|
|
||||||
See:
|
|
||||||
- `docs/API.md`
|
|
||||||
- `docs/coordinate-system.md`
|
|
||||||
- `skill/SKILL.md`
|
|
||||||
|
|
||||||
## Configuration
|
|
||||||
|
|
||||||
Environment variables:
|
|
||||||
|
|
||||||
- `CLICKTHROUGH_HOST` (default `127.0.0.1`)
|
|
||||||
- `CLICKTHROUGH_PORT` (default `8123`)
|
|
||||||
- `CLICKTHROUGH_TOKEN` (optional; if set, require `x-clickthrough-token` header)
|
|
||||||
- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
|
|
||||||
- `CLICKTHROUGH_GRID_ROWS` (default `12`)
|
|
||||||
- `CLICKTHROUGH_GRID_COLS` (default `12`)
|
|
||||||
- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
|
|
||||||
- `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
|
|
||||||
- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
|
|
||||||
- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
|
|
||||||
- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
|
|
||||||
- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
|
|
||||||
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
|
|
||||||
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
|
|
||||||
|
|
||||||
Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.
|
|
||||||
|
|
||||||
## Gitea CI
|
|
||||||
|
|
||||||
A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
|
|
||||||
It runs Python syntax checks (`py_compile`) on every push and pull request.
|
|
||||||
|
|||||||
7
TODO.md
@@ -26,3 +26,10 @@
- [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook
- [x] Add top-level skill section for instance setup + mini API docs
- [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs

## Deferred Backlog (Prioritized)
1. [ ] Higher-level automation macros composed from `see` + `interact`
2. [ ] Reusable workflow templates (for example: find text -> zoom fallback -> click -> verify)
3. [ ] Batch-safe orchestration primitives with explicit per-step results and auditability
4. [ ] Additional verify primitives for post-action validation (image diff region, window title/process state, color/pixel checks)
5. [ ] Broader API simplification pass to reduce payload overlap and consolidate shared OCR options
651
docs/API.md
@@ -1,62 +1,96 @@
|
|||||||
# API Reference (v0.1)
|
# API Reference
|
||||||
|
|
||||||
Base URL: `http://127.0.0.1:8123`
|
Base URL: `http://127.0.0.1:8123`
|
||||||
|
|
||||||
If `CLICKTHROUGH_TOKEN` is set, include header:
|
Auth header when enabled:
|
||||||
|
|
||||||
```http
|
```http
|
||||||
x-clickthrough-token: <token>
|
x-clickthrough-token: <token>
|
||||||
```
|
```
|
||||||
|
|
||||||
## `GET /health`
|
This API is intended for AI computer control through these methods:
|
||||||
|
- `see`
|
||||||
|
- `interact`
|
||||||
|
- `exec`
|
||||||
|
|
||||||
Returns status and runtime safety flags, including `exec` capability config.
|
All responses use one envelope.
|
||||||
|
|
||||||
## `GET /displays`
|
## Response Envelope
|
||||||
|
|
||||||
Returns detected displays in API screen order.
|
Success:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"ok": true,
|
"ok": true,
|
||||||
"default_screen": 0,
|
"request_id": "...",
|
||||||
"displays": [
|
"time_ms": 1710000000000,
|
||||||
{"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
|
"data": {},
|
||||||
{"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
|
"error": null
|
||||||
]
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
|
Error:
|
||||||
Invalid `screen` values fall back to `0`.
|
|
||||||
|
|
||||||
## `GET /screen`
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
|
|
||||||
- `with_grid` (bool, default `true`)
|
|
||||||
- `grid_rows` (int, default env or `12`)
|
|
||||||
- `grid_cols` (int, default env or `12`)
|
|
||||||
- `include_labels` (bool, default `true`)
|
|
||||||
- `image_format` (`png`|`jpeg`, default `png`)
|
|
||||||
- `jpeg_quality` (1-100, default `85`)
|
|
||||||
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
|
|
||||||
|
|
||||||
Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
|
|
||||||
`meta.region` uses global desktop coordinates.
|
|
||||||
|
|
||||||
These image-returning endpoints do not magically grant the agent live vision.
|
|
||||||
If the caller needs visual interpretation beyond OCR, pass the returned screenshot to OpenClaw's `image` tool and ask a narrow question about the visible UI.
|
|
||||||
|
|
||||||
## `POST /zoom`
|
|
||||||
|
|
||||||
Body:
|
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
"ok": false,
|
||||||
|
"request_id": "...",
|
||||||
|
"time_ms": 1710000000000,
|
||||||
|
"data": null,
|
||||||
|
"error": {
|
||||||
|
"code": "validation_error",
|
||||||
|
"message": "request validation failed",
|
||||||
|
"details": []
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
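
A client can handle both envelope shapes in one place. The sketch below is a hedged example: `unwrap` and `ApiError` are hypothetical names, and the only assumption about the server is the five-key envelope shown above.

```python
import json

ENVELOPE_KEYS = {"ok", "request_id", "time_ms", "data", "error"}


class ApiError(Exception):
    """Raised when the envelope reports ok=false."""

    def __init__(self, code, message, details):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details


def unwrap(raw):
    """Parse a JSON envelope string and return its `data` on success."""
    envelope = json.loads(raw)
    missing = ENVELOPE_KEYS - envelope.keys()
    if missing:
        raise ValueError(f"envelope missing keys: {sorted(missing)}")
    if envelope["ok"]:
        return envelope["data"]
    err = envelope["error"] or {}
    raise ApiError(err.get("code", "unknown"), err.get("message", ""),
                   err.get("details", []))
```

With this helper, agent code branches once on success versus `ApiError` instead of inspecting every response by hand.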
|
||||||
|
|
||||||
|
## 1) See
|
||||||
|
|
||||||
|
### `POST /see`
|
||||||
|
Capture a full screen or a region. Optional grid overlay returns coordinate metadata for click mapping.
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"screen": 0,
|
||||||
|
"region_x": null,
|
||||||
|
"region_y": null,
|
||||||
|
"region_width": null,
|
||||||
|
"region_height": null,
|
||||||
|
"with_grid": true,
|
||||||
|
"grid_rows": 12,
|
||||||
|
"grid_cols": 12,
|
||||||
|
"include_labels": true,
|
||||||
|
"image_format": "png",
|
||||||
|
"jpeg_quality": 85,
|
||||||
|
"ocr": false,
|
||||||
|
"ocr_min_confidence": 0,
|
||||||
|
"ocr_lang": "eng",
|
||||||
|
"ocr_psm": null
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
- `data.image.base64`
|
||||||
|
- `data.meta.region` (global desktop coords)
|
||||||
|
- `data.meta.grid` (rows/cols/cell size + formula)
|
||||||
|
- `data.meta.ocr` (when `ocr=true`)
|
||||||
|
|
||||||
|
OCR item shape:
|
||||||
|
- `text`
|
||||||
|
- `confidence`
|
||||||
|
- `bbox` (global coords)
|
||||||
|
- `center`
|
||||||
|
- `region_relative_bbox`
|
||||||
|
|
||||||
|
### `POST /see/zoom`
|
||||||
|
Capture a tighter crop around a global point and draw another grid over that crop.
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"screen": 0,
|
||||||
"center_x": 1200,
|
"center_x": 1200,
|
||||||
"center_y": 700,
|
"center_y": 720,
|
||||||
"width": 500,
|
"width": 500,
|
||||||
"height": 350,
|
"height": 350,
|
||||||
"with_grid": true,
|
"with_grid": true,
|
||||||
@@ -68,70 +102,17 @@ Body:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Query params:
|
Use this for precision before clicking tiny controls.
|
||||||
|
|
||||||
- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
|
## 2) Interact
|
||||||
- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
|
|
||||||
|
|
||||||
Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
|
### `POST /interact`
|
||||||
|
Mouse/keyboard action execution.
|
||||||
`POST /zoom` is often the best screenshot to hand to the `image` tool when the agent needs help judging a specific button, icon, or dialog layout.
|
|
||||||
|
|
||||||
## `POST /action`
|
|
||||||
|
|
||||||
Body: one action.
|
|
||||||
|
|
||||||
Important:
|
|
||||||
- the request body uses `action` plus an optional `target`
|
|
||||||
- pixel coordinates live inside `target` when `target.mode="pixel"`
|
|
||||||
- do **not** send top-level `x` / `y` fields
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
|
|
||||||
|
|
||||||
Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
|
|
||||||
|
|
||||||
### Pointer target modes
|
|
||||||
|
|
||||||
#### Pixel target
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "pixel",
|
|
||||||
"x": 100,
|
|
||||||
"y": 200,
|
|
||||||
"dx": 0,
|
|
||||||
"dy": 0
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Grid target
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "grid",
|
|
||||||
"region_x": 0,
|
|
||||||
"region_y": 0,
|
|
||||||
"region_width": 1920,
|
|
||||||
"region_height": 1080,
|
|
||||||
"rows": 12,
|
|
||||||
"cols": 12,
|
|
||||||
"row": 5,
|
|
||||||
"col": 9,
|
|
||||||
"dx": 0.0,
|
|
||||||
"dy": 0.0
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
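
To make the grid target concrete, here is one way a grid cell plus offsets could map to a global pixel. This is a sketch under stated assumptions, not the server's formula: it assumes `row` and `col` are zero-based, that `dx`/`dy` are measured from the cell center, and that an offset of 1 reaches the cell edge. The function name is hypothetical; the authoritative formula is whatever `meta.grid` returns.

```python
def grid_target_to_pixel(target):
    """Map a grid target dict to an (x, y) pixel in global desktop coordinates."""
    cell_w = target["region_width"] / target["cols"]
    cell_h = target["region_height"] / target["rows"]
    dx = target.get("dx", 0.0)
    dy = target.get("dy", 0.0)
    if not (-1 <= dx <= 1 and -1 <= dy <= 1):
        raise ValueError("dx and dy must be within [-1, 1]")
    # Assumption: zero-based indices, so the cell center sits at index + 0.5.
    center_x = target["region_x"] + (target["col"] + 0.5) * cell_w
    center_y = target["region_y"] + (target["row"] + 0.5) * cell_h
    # Assumption: an offset of 1 moves half a cell, i.e. to the cell edge.
    return round(center_x + dx * cell_w / 2), round(center_y + dy * cell_h / 2)
```

Under those assumptions, the grid target example above (a 1920x1080 region, 12x12 grid, `row` 5, `col` 9, zero offsets) resolves to the pixel `(1520, 495)`.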
|
|
||||||
|
|
||||||
### Action examples
|
|
||||||
|
|
||||||
Click:
|
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
"screen": 0,
|
||||||
|
"action": {
|
||||||
"action": "click",
|
"action": "click",
|
||||||
"target": {
|
"target": {
|
||||||
"mode": "grid",
|
"mode": "grid",
|
||||||
@@ -143,438 +124,63 @@ Click:
|
|||||||
"cols": 12,
|
"cols": 12,
|
||||||
"row": 7,
|
"row": 7,
|
||||||
"col": 3,
|
"col": 3,
|
||||||
"dx": 0.2,
|
"dx": 0.0,
|
||||||
"dy": -0.1
|
"dy": 0.0
|
||||||
},
|
},
|
||||||
"clicks": 1,
|
"button": "left",
|
||||||
"button": "left"
|
"clicks": 1
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Scroll:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "scroll",
|
|
||||||
"target": {"mode": "pixel", "x": 1300, "y": 740},
|
|
||||||
"scroll_amount": -500
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Type text:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "type",
|
|
||||||
"text": "hello world",
|
|
||||||
"interval_ms": 20
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Hotkey:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "hotkey",
|
|
||||||
"keys": ["ctrl", "l"]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Right click:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "right_click",
|
|
||||||
"target": {"mode": "pixel", "x": 1300, "y": 740}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Move only:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "move",
|
|
||||||
"target": {"mode": "pixel", "x": 1300, "y": 740},
|
|
||||||
"duration_ms": 150
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## `GET /windows`
|
|
||||||
|
|
||||||
List desktop windows using structured filters instead of shelling out.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `title_contains` (optional substring match)
|
|
||||||
- `title_regex` (optional case-insensitive regex)
|
|
||||||
- `process_name` (optional exact process name, e.g. `explorer.exe`)
|
|
||||||
- `hwnd` (optional exact window handle)
|
|
||||||
- `visible_only` (bool, default `true`)
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ok": true,
|
|
||||||
"count": 1,
|
|
||||||
"windows": [
|
|
||||||
{
|
|
||||||
"hwnd": 132640,
|
|
||||||
"title": "WinDirStat",
|
|
||||||
"class_name": "WinDirStatMainWindow",
|
|
||||||
"pid": 18420,
|
|
||||||
"process_name": "windirstat.exe",
|
|
||||||
"visible": true,
|
|
||||||
"enabled": true,
|
|
||||||
"minimized": false,
|
|
||||||
"maximized": false,
|
|
||||||
"foreground": true,
|
|
||||||
"rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
|
|
||||||
}
|
}
|
||||||
]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Notes:
|
|
||||||
- Currently supported on Windows hosts only.
|
|
||||||
- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
|
|
||||||
|
|
||||||
## `POST /windows/action`
|
|
||||||
|
|
||||||
Perform a structured window action against exactly one matched window.
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "focus",
|
|
||||||
"title_contains": "WinDirStat",
|
|
||||||
"visible_only": true,
|
|
||||||
"timeout_ms": 3000
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Supported actions:
|
Supported actions:
|
||||||
- `focus`
|
- `move`, `click`, `right_click`, `double_click`, `middle_click`
|
||||||
- `restore`
|
- `scroll` (`scroll_amount`)
|
||||||
- `minimize`
|
- `type` (`text`, `interval_ms`)
|
||||||
- `maximize`
|
- `hotkey` (`keys`)
|
||||||
- `close`
|
- `click_text` (OCR-driven text click with optional region)
|
||||||
|
|
||||||
The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).
|
Target modes:
|
||||||
|
- `pixel`: absolute global `x,y`
|
||||||
## `POST /launch`
|
- `grid`: grid cell from a `see`/`see/zoom` response
|
||||||
|
|
||||||
Start an app/process without invoking a shell.
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
|
|
||||||
"args": [],
|
|
||||||
"cwd": "C:/Program Files/WinDirStat",
|
|
||||||
"wait_for_window": true,
|
|
||||||
"match": {
|
|
||||||
"title_contains": "WinDirStat",
|
|
||||||
"visible_only": true
|
|
||||||
},
|
|
||||||
"timeout_ms": 8000
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Notes:
|
|
||||||
- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
|
|
||||||
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
|
|
||||||
- `dry_run=true` returns the resolved argv/cwd without launching.
|
|
||||||
|
|
||||||
## `POST /vision/diff`
|
|
||||||
|
|
||||||
Measure whether a screen region changed meaningfully between two captures.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
|
|
||||||
|
|
||||||
Compare live captures:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "region",
|
|
||||||
"region_x": 120,
|
|
||||||
"region_y": 80,
|
|
||||||
"region_width": 600,
|
|
||||||
"region_height": 300,
|
|
||||||
"delay_ms": 400,
|
|
||||||
"diff_threshold": 0.01
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Compare provided images:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "image",
|
|
||||||
"before_image_base64": "iVBORw0KGgoAAA...",
|
|
||||||
"after_image_base64": "iVBORw0KGgoBBB...",
|
|
||||||
"diff_threshold": 0.01
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Response includes:
|
|
||||||
- `diff_ratio` — average normalized pixel difference
|
|
||||||
- `changed` — whether `diff_ratio >= diff_threshold`
|
|
||||||
- `region` — compared region
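
As an illustration of what "average normalized pixel difference" can mean, the sketch below compares two flat sequences of 8-bit channel values. The function names and the flat-list input are assumptions made for the example; the server's image handling is not shown in this diff.

```python
def diff_ratio(before, after):
    """Mean absolute difference of two equal-length 8-bit sequences, scaled to [0, 1]."""
    if len(before) != len(after):
        raise ValueError("images must have the same number of samples")
    if not before:
        return 0.0
    total = sum(abs(a - b) for a, b in zip(before, after))
    return total / (255 * len(before))


def changed(before, after, diff_threshold=0.01):
    """Mirror the documented rule: changed when diff_ratio >= diff_threshold."""
    return diff_ratio(before, after) >= diff_threshold
```

Because the ratio is an average, a small but bright change in a large region can fall below the threshold, which is one reason to compare a tight region rather than the full screen.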
|
|
||||||
|
|
||||||
## `POST /vision/stability`
|
|
||||||
|
|
||||||
Measure whether a screen region stays visually stable over a short interval.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`)
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"region_x": 0,
|
|
||||||
"region_y": 0,
|
|
||||||
"region_width": 1920,
|
|
||||||
"region_height": 1080,
|
|
||||||
"sample_interval_ms": 250,
|
|
||||||
"duration_ms": 1200,
|
|
||||||
"diff_threshold": 0.005
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Response includes:
|
|
||||||
- `stable`
|
|
||||||
- `sample_count`
|
|
||||||
- `max_diff_ratio`
|
|
||||||
- `avg_diff_ratio`
|
|
||||||
|
|
||||||
## `POST /wait`
|
|
||||||
|
|
||||||
Wait on a structured UI condition instead of guessing sleep durations.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) - used for text and visual waits
|
|
||||||
|
|
||||||
### Wait for text to appear
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"condition": {
|
|
||||||
"kind": "text",
|
|
||||||
"mode": "screen",
|
|
||||||
"text": "Scan complete",
|
|
||||||
"match": "contains",
|
|
||||||
"present": true,
|
|
||||||
"language_hint": "eng",
|
|
||||||
"min_confidence": 0.4
|
|
||||||
},
|
|
||||||
"timeout_ms": 15000,
|
|
||||||
"poll_interval_ms": 400
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Wait for a window state
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"condition": {
|
|
||||||
"kind": "window",
|
|
||||||
"title_contains": "WinDirStat",
|
|
||||||
"visible_only": true,
|
|
||||||
"state": "focused"
|
|
||||||
},
|
|
||||||
"timeout_ms": 5000,
|
|
||||||
"poll_interval_ms": 200
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Window states:
|
|
||||||
- `exists`
|
|
||||||
- `focused`
|
|
||||||
- `closed`
|
|
||||||
|
|
||||||
### Wait for visual change or stability
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"condition": {
|
|
||||||
"kind": "visual",
|
|
||||||
"state": "stable",
|
|
||||||
"region_x": 0,
|
|
||||||
"region_y": 0,
|
|
||||||
"region_width": 1920,
|
|
||||||
"region_height": 1080,
|
|
||||||
"diff_threshold": 0.005,
|
|
||||||
"stable_for_ms": 1000
|
|
||||||
},
|
|
||||||
"timeout_ms": 12000,
|
|
||||||
"poll_interval_ms": 300
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Visual states:
|
|
||||||
- `change` — succeeds when the average pixel diff crosses `diff_threshold`
|
|
||||||
- `stable` — succeeds when the diff stays at or below `diff_threshold` for `stable_for_ms`
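
The `stable` rule can be read as "an unbroken run of low-diff samples lasting at least `stable_for_ms`". A minimal sketch of that reading follows; the function name and the `(t_ms, diff_ratio)` sample format are hypothetical, and the server's actual polling logic may differ.

```python
def first_stable_time(samples, diff_threshold, stable_for_ms):
    """Return the timestamp at which stability is first reached, or None.

    `samples` is an iterable of (t_ms, diff_ratio) pairs in time order.
    """
    run_start = None
    for t_ms, ratio in samples:
        if ratio <= diff_threshold:
            if run_start is None:
                run_start = t_ms  # a new quiet run begins here
            if t_ms - run_start >= stable_for_ms:
                return t_ms
        else:
            run_start = None  # any change above the threshold resets the run
    return None
```

A `None` result corresponds to the wait timing out without the region ever settling.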
|
|
||||||
|
|
||||||
Notes:
|
|
||||||
- Text waits reuse the OCR pipeline and return matching OCR blocks on success.
|
|
||||||
- Window waits build on the structured window discovery endpoint.
|
|
||||||
- Visual waits compare repeated captures of either the full selected display or an explicit region.
|
|
||||||
|
|
||||||
## `POST /action/verify`
|
|
||||||
|
|
||||||
Execute one action and wait for a structured success condition.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`)
|
|
||||||
|
|
||||||
|
### `click_text` example (full screen OCR)
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
"screen": 0,
|
||||||
"action": {
|
"action": {
|
||||||
"action": "click",
|
"action": "click_text",
|
||||||
"target": {"mode": "pixel", "x": 1300, "y": 740}
|
"click_text": {
|
||||||
},
|
"text": "Sign in",
|
||||||
"condition": {
|
|
||||||
"kind": "text",
|
|
||||||
"mode": "screen",
|
|
||||||
"text": "Settings",
|
|
||||||
"match": "contains",
|
"match": "contains",
|
||||||
"present": true,
|
"case_sensitive": false,
|
||||||
"language_hint": "eng",
|
"min_confidence": 45,
|
||||||
"min_confidence": 0.4
|
"occurrence": "best"
|
||||||
},
|
|
||||||
"retries": 1,
|
|
||||||
"timeout_ms": 4000,
|
|
||||||
"poll_interval_ms": 250,
|
|
||||||
"retry_delay_ms": 250
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Condition kinds mirror `POST /wait`:
|
|
||||||
- `text`
|
|
||||||
- `window`
|
|
||||||
- `visual`
|
|
||||||
|
|
||||||
The response returns per-attempt action output plus structured verification output.
|
|
||||||
|
|
||||||
## `POST /ocr`
|
|
||||||
|
|
||||||
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
|
|
||||||
|
|
||||||
Body:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "screen",
|
|
||||||
"language_hint": "eng",
|
|
||||||
"min_confidence": 0.4
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Modes:
|
|
||||||
- `screen` (default): OCR over full selected monitor
|
|
||||||
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
|
|
||||||
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
|
|
||||||
|
|
||||||
Region mode example:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "region",
|
|
||||||
"region_x": 220,
|
|
||||||
"region_y": 160,
|
|
||||||
"region_width": 900,
|
|
||||||
"region_height": 400,
|
|
||||||
"language_hint": "eng",
|
|
||||||
"min_confidence": 0.5
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Image mode example:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"mode": "image",
|
|
||||||
"image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
|
|
||||||
"language_hint": "eng"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Response shape:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ok": true,
|
|
||||||
"request_id": "...",
|
|
||||||
"time_ms": 1710000000000,
|
|
||||||
"result": {
|
|
||||||
"mode": "screen",
|
|
||||||
"language_hint": "eng",
|
|
||||||
"min_confidence": 0.4,
|
|
||||||
"region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
|
|
||||||
"blocks": [
|
|
||||||
{
|
|
||||||
"text": "Settings",
|
|
||||||
"confidence": 0.9821,
|
|
||||||
"bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
|
|
||||||
}
|
}
|
||||||
]
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Notes:
|
### `click_text` example (region OCR)
|
||||||
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
|
|
||||||
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
|
|
||||||
- Requires `tesseract` executable plus Python package `pytesseract`.
|
|
||||||
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
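
The ordering note above (top-to-bottom, then left-to-right) can be reproduced client-side with a plain sort on each block's `bbox`. This sketch assumes the block shape from the response example and sorts on exact `y` then `x`; whether the server groups nearly aligned rows with a tolerance is not stated here.

```python
def order_blocks(blocks):
    """Sort OCR blocks top-to-bottom, then left-to-right, by bbox origin."""
    return sorted(blocks, key=lambda b: (b["bbox"]["y"], b["bbox"]["x"]))
```

Python's sort is stable, so blocks with identical origins keep their input order.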
|
|
||||||
|
|
||||||
## `POST /ocr/find`
|
|
||||||
|
|
||||||
Search OCR output for matching text instead of post-processing raw OCR blocks client-side.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) - used for `mode=screen` and `mode=region`
|
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"mode": "screen",
|
"screen": 0,
|
||||||
"query": "Settings",
|
"action": {
|
||||||
"match": "contains",
|
"action": "click_text",
|
||||||
"group_lines": true,
|
"click_text": {
|
||||||
"max_results": 10,
|
"text": "Continue",
|
||||||
"language_hint": "eng",
|
"match": "exact",
|
||||||
"min_confidence": 0.4
|
"region": { "x": 940, "y": 520, "width": 400, "height": 260 },
|
||||||
|
"occurrence": "first"
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Modes:
|
## 3) Exec
|
||||||
- `screen`
|
|
||||||
- `region`
|
|
||||||
- `image`
|
|
||||||
|
|
||||||
Options:
|
### `POST /exec`
|
||||||
- `match`: `contains`, `exact`, or `regex`
|
Run host shell commands (PowerShell/Bash/CMD).
|
||||||
- `group_lines=true`: combine nearby OCR words into line-level candidates before matching
|
|
||||||
- `max_results`: result cap after confidence sorting
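
The options above can be illustrated with a small client-side matcher over OCR blocks. This is a sketch only: `find_matches` is a hypothetical name, case-insensitive comparison is an assumption, and line grouping (`group_lines`) is left out.

```python
import re


def find_matches(blocks, query, match="contains", max_results=10, min_confidence=0.4):
    """Filter OCR blocks by text, then cap the results after sorting by confidence."""
    if match == "contains":
        def matches(text):
            return query.lower() in text.lower()
    elif match == "exact":
        def matches(text):
            return text.lower() == query.lower()
    elif match == "regex":
        pattern = re.compile(query, re.IGNORECASE)

        def matches(text):
            return pattern.search(text) is not None
    else:
        raise ValueError(f"unknown match mode: {match!r}")

    hits = [b for b in blocks
            if b["confidence"] >= min_confidence and matches(b["text"])]
    hits.sort(key=lambda b: b["confidence"], reverse=True)
    return hits[:max_results]
```

Applying the cap after the confidence sort follows the `max_results` description, so the strongest candidates survive truncation.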
|
|
||||||
|
|
||||||
Response includes:
|
|
||||||
- `matches` — confidence-sorted candidate matches
|
|
||||||
- `match_count`
|
|
||||||
- `blocks_considered`
|
|
||||||
|
|
||||||
## `POST /exec`
|
|
||||||
|
|
||||||
Execute a shell command on the host running Clickthrough.
|
|
||||||
|
|
||||||
Requirements:
|
|
||||||
- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
|
|
||||||
- send header `x-clickthrough-exec-secret: <secret>`
|
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
@@ -586,29 +192,16 @@ Requirements:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Notes:
|
Required header:
|
||||||
- `shell` supports `powershell`, `bash`, `cmd`
|
|
||||||
- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
|
|
||||||
- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
|
|
||||||
- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
|
|
||||||
- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)
|
|
||||||
|
|
||||||
Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
|
```http
|
||||||
|
x-clickthrough-exec-secret: <secret>
|
||||||
## `POST /batch`
|
|
||||||
|
|
||||||
Runs multiple `action` payloads sequentially.
|
|
||||||
|
|
||||||
Query params:
|
|
||||||
|
|
||||||
- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"actions": [
|
|
||||||
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
|
|
||||||
{"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
|
|
||||||
],
|
|
||||||
"stop_on_error": true
|
|
||||||
}
|
|
||||||
```
|
```
|
||||||
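The `/exec` requirements above (a JSON body plus the `x-clickthrough-exec-secret` header) can be sketched as a small request builder. This is a minimal sketch, not the project's client: `build_exec_request` is a hypothetical helper, the base URL assumes the default host and port from `server/config.py`, and the body fields (`command`, `shell`, `timeout_s`) follow `ExecRequest` in `server/models.py`. It uses only the standard library and builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional


def build_exec_request(
    base_url: str,
    command: str,
    exec_secret: str,
    token: Optional[str] = None,
    shell: Optional[str] = None,
    timeout_s: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /exec request. Hypothetical helper."""
    body = {"command": command}
    # Optional fields are omitted so the server applies its own defaults.
    if shell is not None:
        body["shell"] = shell
    if timeout_s is not None:
        body["timeout_s"] = timeout_s

    headers = {
        "Content-Type": "application/json",
        # /exec needs its own secret in addition to the regular token.
        "x-clickthrough-exec-secret": exec_secret,
    }
    if token:
        headers["x-clickthrough-token"] = token

    return urllib.request.Request(
        f"{base_url}/exec",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder values for illustration only.
example = build_exec_request("http://127.0.0.1:8123", "echo hi", "<secret>", shell="bash")
print(example.get_method(), example.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen(example, timeout=30)` and read `stdout`, `stderr`, and `exit_code` from the `data` field of the response envelope.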
|
|
||||||
|
## Minimal Procedure for Agents
|
||||||
|
|
||||||
|
1. `see` full screen with coarse grid.
|
||||||
|
2. If uncertain, `see/zoom` target area with denser grid.
|
||||||
|
3. `interact` one action.
|
||||||
|
4. `see` again to confirm state change.
|
||||||
|
5. Use `exec` only when GUI interaction is not the right tool.
|
||||||
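The procedure above can be sketched as one function that takes a transport callable, so the ordering of calls is visible without any network code. This is a sketch under stated assumptions: `click_and_confirm` and the `post` callable are hypothetical, the `/see/zoom` path is inferred from the step text, and the body fields follow `SeeRequest`, `SeeZoomRequest`, and `InteractRequest` in `server/models.py`.

```python
from typing import Callable

# (path, json_body) -> the "data" part of the response envelope.
Post = Callable[[str, dict], dict]


def click_and_confirm(post: Post, x: int, y: int, screen: int = 0, uncertain: bool = False) -> dict:
    """Run see -> (optional zoom) -> interact -> see for a single click."""
    steps = {}

    # 1. Coarse look at the full screen.
    steps["before"] = post("/see", {"screen": screen, "with_grid": True, "grid_rows": 12, "grid_cols": 12})

    # 2. Only zoom when the coarse grid is not enough to pick a target.
    if uncertain:
        steps["zoom"] = post(
            "/see/zoom",
            {"screen": screen, "center_x": x, "center_y": y, "grid_rows": 20, "grid_cols": 20},
        )

    # 3. Exactly one action per /interact request.
    steps["result"] = post(
        "/interact",
        {"screen": screen, "action": {"action": "click", "target": {"mode": "pixel", "x": x, "y": y}}},
    )

    # 4. Look again to confirm the state change.
    steps["after"] = post("/see", {"screen": screen, "with_grid": False})
    return steps
```

Injecting `post` keeps the sketch independent of any HTTP library; a real caller would wrap `requests.post` (as the example script does) and return `response.json()["data"]`.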
|
|||||||
@@ -13,23 +13,52 @@ if TOKEN:
|
|||||||
|
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
|
health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
|
||||||
r.raise_for_status()
|
health.raise_for_status()
|
||||||
print("health:", r.json())
|
print("health:", health.json()["data"])
|
||||||
|
|
||||||
d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
|
see = requests.post(
|
||||||
d.raise_for_status()
|
f"{BASE_URL}/see",
|
||||||
print("displays:", d.json().get("displays", []))
|
|
||||||
|
|
||||||
s = requests.get(
|
|
||||||
f"{BASE_URL}/screen",
|
|
||||||
headers=headers,
|
headers=headers,
|
||||||
params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
|
json={
|
||||||
|
"screen": SCREEN,
|
||||||
|
"with_grid": True,
|
||||||
|
"grid_rows": 12,
|
||||||
|
"grid_cols": 12,
|
||||||
|
"image_format": "jpeg",
|
||||||
|
"jpeg_quality": 70,
|
||||||
|
},
|
||||||
timeout=30,
|
timeout=30,
|
||||||
)
|
)
|
||||||
s.raise_for_status()
|
see.raise_for_status()
|
||||||
payload = s.json()
|
payload = see.json()["data"]
|
||||||
print("screen meta:", payload.get("meta", {}))
|
print("region:", payload["meta"]["region"])
|
||||||
|
print("grid:", payload["meta"].get("grid", {}))
|
||||||
|
|
||||||
|
see_ocr = requests.post(
|
||||||
|
f"{BASE_URL}/see",
|
||||||
|
headers=headers,
|
||||||
|
json={"screen": SCREEN, "ocr": True, "with_grid": False, "ocr_min_confidence": 40},
|
||||||
|
timeout=30,
|
||||||
|
)
|
||||||
|
see_ocr.raise_for_status()
|
||||||
|
ocr_items = see_ocr.json()["data"]["meta"].get("ocr", [])
|
||||||
|
print("ocr_items:", len(ocr_items))
|
||||||
|
|
||||||
|
if ocr_items:
|
||||||
|
label = ocr_items[0]["text"]
|
||||||
|
click_text = requests.post(
|
||||||
|
f"{BASE_URL}/interact",
|
||||||
|
headers=headers,
|
||||||
|
json={
|
||||||
|
"screen": SCREEN,
|
||||||
|
"action": {"action": "click_text", "click_text": {"text": label, "match": "exact", "occurrence": "first"}},
|
||||||
|
},
|
||||||
|
timeout=30,
|
||||||
|
)
|
||||||
|
click_text.raise_for_status()
|
||||||
|
click_data = click_text.json()["data"]
|
||||||
|
print("clicked:", click_data["resolved_target"])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||
1746
server/app.py
File diff suppressed because it is too large
42
server/config.py
Normal file
@@ -0,0 +1,42 @@
|
|||||||
|
import os
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
|
||||||
|
|
||||||
|
load_dotenv(dotenv_path=".env", override=False)
|
||||||
|
|
||||||
|
|
||||||
|
def _env_bool(name: str, default: bool) -> bool:
|
||||||
|
raw = os.getenv(name)
|
||||||
|
if raw is None:
|
||||||
|
return default
|
||||||
|
return raw.strip().lower() in {"1", "true", "yes", "on"}
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_allowed_region() -> Optional[tuple[int, int, int, int]]:
|
||||||
|
raw = os.getenv("CLICKTHROUGH_ALLOWED_REGION")
|
||||||
|
if not raw:
|
||||||
|
return None
|
||||||
|
parts = [p.strip() for p in raw.split(",")]
|
||||||
|
if len(parts) != 4:
|
||||||
|
raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
|
||||||
|
x, y, w, h = (int(p) for p in parts)
|
||||||
|
if w <= 0 or h <= 0:
|
||||||
|
raise ValueError("CLICKTHROUGH_ALLOWED_REGION width/height must be > 0")
|
||||||
|
return x, y, w, h
|
||||||
|
|
||||||
|
|
||||||
|
SETTINGS = {
|
||||||
|
"host": os.getenv("CLICKTHROUGH_HOST", "127.0.0.1"),
|
||||||
|
"port": int(os.getenv("CLICKTHROUGH_PORT", "8123")),
|
||||||
|
"token": os.getenv("CLICKTHROUGH_TOKEN", "").strip(),
|
||||||
|
"dry_run": _env_bool("CLICKTHROUGH_DRY_RUN", False),
|
||||||
|
"allowed_region": _parse_allowed_region(),
|
||||||
|
"exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
|
||||||
|
"exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
|
||||||
|
"exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
|
||||||
|
"exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
|
||||||
|
"exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
|
||||||
|
"exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
|
||||||
|
}
|
||||||
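To illustrate how `CLICKTHROUGH_ALLOWED_REGION` is interpreted, the sketch below mirrors the parsing in `_parse_allowed_region` and the bounds check that `enforce_allowed_region` applies in `server/services.py`. The function names `parse_allowed_region` and `point_allowed` are hypothetical stand-ins that take explicit arguments instead of reading the environment or raising HTTP errors.

```python
from typing import Optional

Region = tuple[int, int, int, int]  # x, y, width, height


def parse_allowed_region(raw: Optional[str]) -> Optional[Region]:
    """Parse "x,y,width,height"; an empty or missing value means no restriction."""
    if not raw:
        return None
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("allowed region must be x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("allowed region width/height must be > 0")
    return x, y, w, h


def point_allowed(region: Optional[Region], x: int, y: int) -> bool:
    """True when (x, y) lies inside the region; the right and bottom edges are exclusive."""
    if region is None:
        return True
    rx, ry, rw, rh = region
    return rx <= x < rx + rw and ry <= y < ry + rh


region = parse_allowed_region("100, 50, 800, 600")
print(region, point_allowed(region, 100, 50), point_allowed(region, 900, 50))
```

Note the half-open interval: with `x=100` and `width=800`, the last allowed column is `899`, so a click at `x=900` is rejected.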
167
server/models.py
Normal file
@@ -0,0 +1,167 @@
|
|||||||
|
from typing import Literal, Optional
|
||||||
|
|
||||||
|
from pydantic import BaseModel, Field, model_validator
|
||||||
|
|
||||||
|
|
||||||
|
class PixelTarget(BaseModel):
|
||||||
|
mode: Literal["pixel"]
|
||||||
|
x: int
|
||||||
|
y: int
|
||||||
|
dx: int = 0
|
||||||
|
dy: int = 0
|
||||||
|
|
||||||
|
|
||||||
|
class GridTarget(BaseModel):
|
||||||
|
mode: Literal["grid"]
|
||||||
|
region_x: int
|
||||||
|
region_y: int
|
||||||
|
region_width: int = Field(gt=0)
|
||||||
|
region_height: int = Field(gt=0)
|
||||||
|
rows: int = Field(gt=0)
|
||||||
|
cols: int = Field(gt=0)
|
||||||
|
row: int = Field(ge=0)
|
||||||
|
col: int = Field(ge=0)
|
||||||
|
dx: float = 0.0
|
||||||
|
dy: float = 0.0
|
||||||
|
|
||||||
|
@model_validator(mode="after")
|
||||||
|
def _validate_indices(self):
|
||||||
|
if self.row >= self.rows or self.col >= self.cols:
|
||||||
|
raise ValueError("row/col must be inside rows/cols")
|
||||||
|
if not -1.0 <= self.dx <= 1.0:
|
||||||
|
raise ValueError("dx must be in [-1, 1]")
|
||||||
|
if not -1.0 <= self.dy <= 1.0:
|
||||||
|
raise ValueError("dy must be in [-1, 1]")
|
||||||
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
Target = PixelTarget | GridTarget
|
||||||
|
|
||||||
|
|
||||||
|
class ActionRequest(BaseModel):
|
||||||
|
action: Literal[
|
||||||
|
"move",
|
||||||
|
"click",
|
||||||
|
"right_click",
|
||||||
|
"double_click",
|
||||||
|
"middle_click",
|
||||||
|
"scroll",
|
||||||
|
"type",
|
||||||
|
"hotkey",
|
||||||
|
"click_text",
|
||||||
|
]
|
||||||
|
target: Optional[Target] = None
|
||||||
|
duration_ms: int = Field(default=0, ge=0, le=20000)
|
||||||
|
button: Literal["left", "right", "middle"] = "left"
|
||||||
|
clicks: int = Field(default=1, ge=1, le=10)
|
||||||
|
scroll_amount: int = 0
|
||||||
|
text: str = ""
|
||||||
|
keys: list[str] = Field(default_factory=list)
|
||||||
|
interval_ms: int = Field(default=20, ge=0, le=5000)
|
||||||
|
dry_run: bool = False
|
||||||
|
click_text: "ClickTextAction | None" = None
|
||||||
|
|
||||||
|
@model_validator(mode="after")
|
||||||
|
def _validate_click_text(self):
|
||||||
|
if self.action == "click_text" and self.click_text is None:
|
||||||
|
raise ValueError("click_text payload is required when action=click_text")
|
||||||
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
class ExecRequest(BaseModel):
|
||||||
|
command: str = Field(min_length=1, max_length=10000)
|
||||||
|
shell: Literal["powershell", "bash", "cmd"] | None = None
|
||||||
|
timeout_s: int | None = Field(default=None, ge=1, le=600)
|
||||||
|
cwd: str | None = None
|
||||||
|
dry_run: bool = False
|
||||||
|
|
||||||
|
|
||||||
|
class WindowQuery(BaseModel):
|
||||||
|
title_contains: str | None = Field(default=None, max_length=512)
|
||||||
|
title_regex: str | None = Field(default=None, max_length=512)
|
||||||
|
process_name: str | None = Field(default=None, max_length=260)
|
||||||
|
hwnd: int | None = Field(default=None, ge=1)
|
||||||
|
visible_only: bool = True
|
||||||
|
|
||||||
|
|
||||||
|
class WindowActionRequest(WindowQuery):
|
||||||
|
action: Literal["focus", "restore", "minimize", "maximize", "close"]
|
||||||
|
timeout_ms: int = Field(default=3000, ge=0, le=60000)
|
||||||
|
|
||||||
|
|
||||||
|
class LaunchRequest(BaseModel):
|
||||||
|
executable: str = Field(min_length=1, max_length=2048)
|
||||||
|
args: list[str] = Field(default_factory=list, max_length=100)
|
||||||
|
cwd: str | None = None
|
||||||
|
wait_for_window: bool = False
|
||||||
|
match: WindowQuery | None = None
|
||||||
|
timeout_ms: int = Field(default=5000, ge=0, le=120000)
|
||||||
|
dry_run: bool = False
|
||||||
|
|
||||||
|
|
||||||
|
class SeeRequest(BaseModel):
|
||||||
|
screen: int = 0
|
||||||
|
region_x: int | None = Field(default=None, ge=0)
|
||||||
|
region_y: int | None = Field(default=None, ge=0)
|
||||||
|
region_width: int | None = Field(default=None, gt=0)
|
||||||
|
region_height: int | None = Field(default=None, gt=0)
|
||||||
|
with_grid: bool = True
|
||||||
|
grid_rows: int = Field(default=12, ge=1, le=300)
|
||||||
|
grid_cols: int = Field(default=12, ge=1, le=300)
|
||||||
|
include_labels: bool = True
|
||||||
|
image_format: Literal["png", "jpeg"] = "png"
|
||||||
|
jpeg_quality: int = Field(default=85, ge=1, le=100)
|
||||||
|
ocr: bool = False
|
||||||
|
ocr_min_confidence: float = Field(default=0.0, ge=0.0, le=100.0)
|
||||||
|
ocr_lang: str = Field(default="eng", min_length=1, max_length=64)
|
||||||
|
ocr_psm: int | None = Field(default=None, ge=0, le=13)
|
||||||
|
|
||||||
|
|
||||||
|
class SeeZoomRequest(BaseModel):
|
||||||
|
screen: int = 0
|
||||||
|
center_x: int = Field(ge=0)
|
||||||
|
center_y: int = Field(ge=0)
|
||||||
|
width: int = Field(default=500, ge=10)
|
||||||
|
height: int = Field(default=350, ge=10)
|
||||||
|
with_grid: bool = True
|
||||||
|
grid_rows: int = Field(default=20, ge=1, le=300)
|
||||||
|
grid_cols: int = Field(default=20, ge=1, le=300)
|
||||||
|
include_labels: bool = True
|
||||||
|
image_format: Literal["png", "jpeg"] = "png"
|
||||||
|
jpeg_quality: int = Field(default=90, ge=1, le=100)
|
||||||
|
|
||||||
|
|
||||||
|
class InteractRequest(BaseModel):
|
||||||
|
screen: int = 0
|
||||||
|
action: ActionRequest
|
||||||
|
|
||||||
|
|
||||||
|
class OCRRegion(BaseModel):
|
||||||
|
x: int = Field(ge=0)
|
||||||
|
y: int = Field(ge=0)
|
||||||
|
width: int = Field(gt=0)
|
||||||
|
height: int = Field(gt=0)
|
||||||
|
|
||||||
|
|
||||||
|
class ClickTextAction(BaseModel):
|
||||||
|
text: str = Field(min_length=1, max_length=1000)
|
||||||
|
match: Literal["contains", "exact", "regex"] = "contains"
|
||||||
|
region: OCRRegion | None = None
|
||||||
|
screen: int | None = None
|
||||||
|
case_sensitive: bool = False
|
||||||
|
min_confidence: float = Field(default=0.0, ge=0.0, le=100.0)
|
||||||
|
occurrence: Literal["first", "best", "nth"] = "first"
|
||||||
|
nth: int | None = Field(default=None, ge=1, le=10000)
|
||||||
|
ocr_lang: str = Field(default="eng", min_length=1, max_length=64)
|
||||||
|
ocr_psm: int | None = Field(default=None, ge=0, le=13)
|
||||||
|
|
||||||
|
@model_validator(mode="after")
|
||||||
|
def _validate_nth(self):
|
||||||
|
if self.occurrence == "nth" and self.nth is None:
|
||||||
|
raise ValueError("nth is required when occurrence=nth")
|
||||||
|
if self.occurrence != "nth" and self.nth is not None:
|
||||||
|
raise ValueError("nth is only allowed when occurrence=nth")
|
||||||
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
ActionRequest.model_rebuild()
|
||||||
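The `ClickTextAction` rules above (`nth` is required when `occurrence` is `nth` and rejected otherwise) can be checked on the client before a request is sent. The following is a minimal sketch with a hypothetical `build_click_text_action` helper; it uses plain dictionaries rather than the pydantic models, and the server remains the source of truth for validation.

```python
from typing import Optional

MATCH_MODES = {"contains", "exact", "regex"}
OCCURRENCES = {"first", "best", "nth"}


def build_click_text_action(
    text: str,
    match: str = "contains",
    occurrence: str = "first",
    nth: Optional[int] = None,
    region: Optional[dict] = None,
) -> dict:
    """Build the `action` object for an /interact click_text request."""
    if not 1 <= len(text) <= 1000:
        raise ValueError("text must be 1..1000 characters")
    if match not in MATCH_MODES:
        raise ValueError(f"match must be one of {sorted(MATCH_MODES)}")
    if occurrence not in OCCURRENCES:
        raise ValueError(f"occurrence must be one of {sorted(OCCURRENCES)}")
    # Same pairing rule as the model validator: nth goes with occurrence="nth" only.
    if occurrence == "nth" and nth is None:
        raise ValueError("nth is required when occurrence=nth")
    if occurrence != "nth" and nth is not None:
        raise ValueError("nth is only allowed when occurrence=nth")

    click_text = {"text": text, "match": match, "occurrence": occurrence}
    if nth is not None:
        click_text["nth"] = nth
    if region is not None:
        click_text["region"] = dict(region)
    return {"action": "click_text", "click_text": click_text}


print(build_click_text_action("Continue", match="exact"))
```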
602
server/services.py
Normal file
@@ -0,0 +1,602 @@
|
|||||||
|
import ctypes
|
||||||
|
import ctypes.wintypes
|
||||||
|
import io
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from typing import Literal
|
||||||
|
|
||||||
|
from fastapi import HTTPException
|
||||||
|
from PIL import ImageChops, ImageStat
|
||||||
|
|
||||||
|
from .config import SETTINGS
|
||||||
|
from .models import (
|
||||||
|
ActionRequest,
|
||||||
|
ClickTextAction,
|
||||||
|
GridTarget,
|
||||||
|
LaunchRequest,
|
||||||
|
PixelTarget,
|
||||||
|
Target,
|
||||||
|
WindowActionRequest,
|
||||||
|
WindowQuery,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def api_error(status_code: int, code: str, message: str, details=None):
|
||||||
|
raise HTTPException(status_code=status_code, detail={"code": code, "message": message, "details": details})
|
||||||
|
|
||||||
|
|
||||||
|
def import_capture_libs():
|
||||||
|
try:
|
||||||
|
from PIL import Image, ImageDraw
|
||||||
|
import mss
|
||||||
|
|
||||||
|
return Image, ImageDraw, mss
|
||||||
|
except Exception as exc:
|
||||||
|
raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
|
||||||
|
|
||||||
|
|
||||||
|
def display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
|
||||||
|
return {
|
||||||
|
"screen": screen,
|
||||||
|
"mss_index": mss_index,
|
||||||
|
"primary": primary,
|
||||||
|
"x": mon["left"],
|
||||||
|
"y": mon["top"],
|
||||||
|
"width": mon["width"],
|
||||||
|
"height": mon["height"],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def ordered_displays(sct) -> list[dict]:
|
||||||
|
raw_monitors = list(enumerate(sct.monitors[1:], start=1))
|
||||||
|
if not raw_monitors:
|
||||||
|
raise HTTPException(status_code=500, detail="no displays detected")
|
||||||
|
|
||||||
|
primary_pos = next((idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0), 0)
|
||||||
|
ordered = [raw_monitors[primary_pos]] + [item for idx, item in enumerate(raw_monitors) if idx != primary_pos]
|
||||||
|
return [display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0)) for index, (mss_index, mon) in enumerate(ordered)]
|
||||||
|
|
||||||
|
|
||||||
|
def get_displays() -> list[dict]:
|
||||||
|
_, _, mss = import_capture_libs()
|
||||||
|
with mss.mss() as sct:
|
||||||
|
return ordered_displays(sct)
|
||||||
|
|
||||||
|
|
||||||
|
def select_display(screen: int) -> tuple[dict, list[dict], dict]:
|
||||||
|
displays = get_displays()
|
||||||
|
selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
|
||||||
|
return selected, displays, {"requested": screen, "selected": selected["screen"], "fallback": selected["screen"] != screen}
|
||||||
|
|
||||||
|
|
||||||
|
def capture_screen(screen: int = 0):
|
||||||
|
Image, _, mss = import_capture_libs()
|
||||||
|
with mss.mss() as sct:
|
||||||
|
displays = ordered_displays(sct)
|
||||||
|
mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
|
||||||
|
shot = sct.grab({"left": mon["x"], "top": mon["y"], "width": mon["width"], "height": mon["height"]})
|
||||||
|
image = Image.frombytes("RGB", shot.size, shot.rgb)
|
||||||
|
selection = {"requested": screen, "selected": mon["screen"], "fallback": mon["screen"] != screen}
|
||||||
|
return image, mon, displays, selection
|
||||||
|
|
||||||
|
|
||||||
|
def capture_region_image(screen: int, region_x: int | None, region_y: int | None, region_width: int | None, region_height: int | None):
|
||||||
|
base_img, mon, displays, screen_selection = capture_screen(screen)
|
||||||
|
if None in {region_x, region_y, region_width, region_height}:
|
||||||
|
return base_img, {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}, mon, displays, screen_selection
|
||||||
|
|
||||||
|
left = region_x - mon["x"]
|
||||||
|
top = region_y - mon["y"]
|
||||||
|
right = left + region_width
|
||||||
|
bottom = top + region_height
|
||||||
|
if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
|
||||||
|
raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
|
||||||
|
|
||||||
|
crop = base_img.crop((left, top, right, bottom))
|
||||||
|
return crop, {"x": region_x, "y": region_y, "width": region_width, "height": region_height}, mon, displays, screen_selection
|
||||||
|
|
||||||
|
|
||||||
|
def extract_ocr_items(image, origin_x: int, origin_y: int, min_confidence: float, lang: str, psm: int | None) -> list[dict]:
|
||||||
|
try:
|
||||||
|
import pytesseract
|
||||||
|
except Exception as exc:
|
||||||
|
api_error(503, "ocr_unavailable", f"pytesseract unavailable: {exc}")
|
||||||
|
|
||||||
|
config = ""
|
||||||
|
if psm is not None:
|
||||||
|
config = f"--psm {psm}"
|
||||||
|
try:
|
||||||
|
data = pytesseract.image_to_data(image, lang=lang, config=config, output_type=pytesseract.Output.DICT)
|
||||||
|
except Exception as exc:
|
||||||
|
api_error(503, "ocr_failed", f"ocr failed: {exc}")
|
||||||
|
|
||||||
|
out: list[dict] = []
|
||||||
|
n = len(data.get("text", []))
|
||||||
|
for i in range(n):
|
||||||
|
text = (data["text"][i] or "").strip()
|
||||||
|
if not text:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
confidence = float(data["conf"][i])
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if confidence < min_confidence:
|
||||||
|
continue
|
||||||
|
left = int(data["left"][i])
|
||||||
|
top = int(data["top"][i])
|
||||||
|
width = int(data["width"][i])
|
||||||
|
height = int(data["height"][i])
|
||||||
|
bbox = {"x": origin_x + left, "y": origin_y + top, "width": width, "height": height}
|
||||||
|
center = {"x": bbox["x"] + (width // 2), "y": bbox["y"] + (height // 2)}
|
||||||
|
out.append(
|
||||||
|
{
|
||||||
|
"text": text,
|
||||||
|
"confidence": confidence,
|
||||||
|
"bbox": bbox,
|
||||||
|
"center": center,
|
||||||
|
"region_relative_bbox": {"x": left, "y": top, "width": width, "height": height},
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
|
||||||
|
buf = io.BytesIO()
|
||||||
|
if image_format == "jpeg":
|
||||||
|
image.save(buf, format="JPEG", quality=jpeg_quality)
|
||||||
|
else:
|
||||||
|
image.save(buf, format="PNG")
|
||||||
|
return buf.getvalue()
|
||||||
|
|
||||||
|
|
||||||
|
def encode_image(image, image_format: str, jpeg_quality: int) -> str:
|
||||||
|
import base64
|
||||||
|
|
||||||
|
return base64.b64encode(serialize_image(image, image_format, jpeg_quality)).decode("ascii")
|
||||||
|
|
||||||
|
|
||||||
|
def draw_grid(image, region_x: int, region_y: int, rows: int, cols: int, include_labels: bool):
|
||||||
|
_, ImageDraw, _ = import_capture_libs()
|
||||||
|
out = image.copy()
|
||||||
|
draw = ImageDraw.Draw(out)
|
||||||
|
w, h = out.size
|
||||||
|
cell_w = w / cols
|
||||||
|
cell_h = h / rows
|
||||||
|
|
||||||
|
for c in range(1, cols):
|
||||||
|
x = int(round(c * cell_w))
|
||||||
|
draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
|
||||||
|
for r in range(1, rows):
|
||||||
|
y = int(round(r * cell_h))
|
||||||
|
draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
|
||||||
|
|
||||||
|
draw.rectangle([(0, 0), (w - 1, h - 1)], outline=(255, 0, 0), width=2)
|
||||||
|
if include_labels:
|
||||||
|
for r in range(rows):
|
||||||
|
for c in range(cols):
|
||||||
|
cx = int((c + 0.5) * cell_w)
|
||||||
|
cy = int((r + 0.5) * cell_h)
|
||||||
|
draw.text((cx - 12, cy - 6), f"{r},{c}", fill=(255, 255, 0))
|
||||||
|
|
||||||
|
meta = {
|
||||||
|
"region": {"x": region_x, "y": region_y, "width": w, "height": h},
|
||||||
|
"grid": {
|
||||||
|
"rows": rows,
|
||||||
|
"cols": cols,
|
||||||
|
"cell_width": cell_w,
|
||||||
|
"cell_height": cell_h,
|
||||||
|
"indexing": "zero-based",
|
||||||
|
"point_formula": {
|
||||||
|
"pixel_x": "region.x + ((col + 0.5 + dx*0.5) * cell_width)",
|
||||||
|
"pixel_y": "region.y + ((row + 0.5 + dy*0.5) * cell_height)",
|
||||||
|
"dx_range": "[-1,1]",
|
||||||
|
"dy_range": "[-1,1]",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}
|
||||||
|
return out, meta
|
||||||
|
|
||||||
|
|
||||||
|
def resolve_target(target: Target) -> tuple[int, int, dict]:
|
||||||
|
if isinstance(target, PixelTarget):
|
||||||
|
x = target.x + target.dx
|
||||||
|
y = target.y + target.dy
|
||||||
|
return x, y, {"mode": "pixel", "source": target.model_dump()}
|
||||||
|
|
||||||
|
cell_w = target.region_width / target.cols
|
||||||
|
cell_h = target.region_height / target.rows
|
||||||
|
x = target.region_x + int(round((target.col + 0.5 + (target.dx * 0.5)) * cell_w))
|
||||||
|
y = target.region_y + int(round((target.row + 0.5 + (target.dy * 0.5)) * cell_h))
|
||||||
|
return x, y, {"mode": "grid", "source": target.model_dump(), "derived": {"cell_width": cell_w, "cell_height": cell_h}}
|
||||||
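A worked example may help with the grid formula used by `resolve_target` and reported in `point_formula`. The sketch below restates the same arithmetic as a standalone function (`grid_to_pixel` is a hypothetical name) and evaluates it for a 1920x1080 region with a 12x12 grid, where each cell is 160 by 90 pixels.

```python
def grid_to_pixel(
    region_x: int,
    region_y: int,
    region_width: int,
    region_height: int,
    rows: int,
    cols: int,
    row: int,
    col: int,
    dx: float = 0.0,
    dy: float = 0.0,
) -> tuple[int, int]:
    """Map a zero-based grid cell plus an offset in [-1, 1] to a global pixel."""
    cell_w = region_width / cols
    cell_h = region_height / rows
    # dx/dy = 0 is the cell center; -1 and 1 reach the cell's edges.
    x = region_x + int(round((col + 0.5 + dx * 0.5) * cell_w))
    y = region_y + int(round((row + 0.5 + dy * 0.5) * cell_h))
    return x, y


# Center of cell (row 3, col 5): (5.5 * 160, 3.5 * 90)
print(grid_to_pixel(0, 0, 1920, 1080, 12, 12, row=3, col=5))  # (880, 315)
# Top-right corner of the same cell: dx=1 moves right by half a cell, dy=-1 moves up.
print(grid_to_pixel(0, 0, 1920, 1080, 12, 12, row=3, col=5, dx=1.0, dy=-1.0))  # (960, 270)
```

Because `region_x` and `region_y` are added last, the same cell on a second monitor whose region starts at `x=1920` resolves to `x=2800`, which keeps results in global desktop coordinates.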
|
|
||||||
|
|
||||||
|
def enforce_allowed_region(x: int, y: int):
|
||||||
|
region = SETTINGS["allowed_region"]
|
||||||
|
if region is None:
|
||||||
|
return
|
||||||
|
rx, ry, rw, rh = region
|
||||||
|
if not (rx <= x < rx + rw and ry <= y < ry + rh):
|
||||||
|
raise HTTPException(status_code=403, detail="point outside allowed region")
|
||||||
|
|
||||||
|
|
||||||
|
def _text_matches(candidate: str, needle: str, mode: str, case_sensitive: bool) -> bool:
|
||||||
|
hay = candidate if case_sensitive else candidate.lower()
|
||||||
|
ndl = needle if case_sensitive else needle.lower()
|
||||||
|
if mode == "contains":
|
||||||
|
return ndl in hay
|
||||||
|
if mode == "exact":
|
||||||
|
return hay == ndl
|
||||||
|
flags = 0 if case_sensitive else re.IGNORECASE
|
||||||
|
return re.search(needle, candidate, flags=flags) is not None
|
||||||
|
|
||||||
|
|
||||||
|
def _resolve_text_match(click_text: ClickTextAction, items: list[dict]) -> dict:
|
||||||
|
matches = [item for item in items if _text_matches(item["text"], click_text.text, click_text.match, click_text.case_sensitive)]
|
||||||
|
if not matches:
|
||||||
|
candidates = [item["text"] for item in sorted(items, key=lambda v: v["confidence"], reverse=True)[:8]]
|
||||||
|
api_error(404, "ocr_text_not_found", "no OCR text matched", {"query": click_text.text, "candidates": candidates})
|
||||||
|
if click_text.occurrence == "best":
|
||||||
|
return max(matches, key=lambda item: item["confidence"])
|
||||||
|
if click_text.occurrence == "nth":
|
||||||
|
idx = (click_text.nth or 1) - 1
|
||||||
|
if idx >= len(matches):
|
||||||
|
api_error(409, "ocr_nth_out_of_range", "requested nth match is out of range", {"match_count": len(matches), "nth": click_text.nth})
|
||||||
|
return matches[idx]
|
||||||
|
if len(matches) > 1 and click_text.match == "exact":
|
||||||
|
api_error(
|
||||||
|
409,
|
||||||
|
"ocr_text_ambiguous",
|
||||||
|
"multiple OCR entries matched",
|
||||||
|
{"match_count": len(matches), "candidates": [item["text"] for item in matches[:8]]},
|
||||||
|
)
|
||||||
|
return matches[0]
|
||||||
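The selection rules in `_resolve_text_match` are easy to misread, in particular that an `exact` query with several hits is treated as ambiguous unless `occurrence` is `best` or `nth`. The sketch below restates that logic with plain exceptions in place of `api_error`; `pick_match` is a hypothetical name and the sample items are made up for illustration.

```python
from typing import Optional


def pick_match(matches: list[dict], occurrence: str = "first", nth: Optional[int] = None, match_mode: str = "contains") -> dict:
    """Choose one OCR item from the items that already matched the query text."""
    if not matches:
        raise LookupError("ocr_text_not_found")
    if occurrence == "best":
        # Highest OCR confidence wins.
        return max(matches, key=lambda item: item["confidence"])
    if occurrence == "nth":
        idx = (nth or 1) - 1  # nth is one-based
        if idx >= len(matches):
            raise IndexError("ocr_nth_out_of_range")
        return matches[idx]
    # occurrence == "first": refuse to guess between several exact hits.
    if len(matches) > 1 and match_mode == "exact":
        raise ValueError("ocr_text_ambiguous")
    return matches[0]


items = [
    {"text": "Continue", "confidence": 71.0},
    {"text": "Continue", "confidence": 93.5},
]
print(pick_match(items, occurrence="best")["confidence"])
```

A caller that hits the ambiguous case can retry with a narrower `region`, or switch to `occurrence="best"` or `occurrence="nth"`.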
|
|
||||||
|
|
||||||
|
def import_input_lib():
|
||||||
|
try:
|
||||||
|
import pyautogui
|
||||||
|
|
||||||
|
pyautogui.FAILSAFE = True
|
||||||
|
return pyautogui
|
||||||
|
except Exception as exc:
|
||||||
|
raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
|
||||||
|
|
||||||
|
|
||||||
|
def exec_action(req: ActionRequest, screen: int = 0) -> dict:
|
||||||
|
run_dry = SETTINGS["dry_run"] or req.dry_run
|
||||||
|
action_screen = screen
|
||||||
|
if req.action == "click_text" and req.click_text and req.click_text.screen is not None:
|
||||||
|
action_screen = req.click_text.screen
|
||||||
|
selected_display, _, screen_selection = select_display(action_screen)
|
||||||
|
pyautogui = None if run_dry else import_input_lib()
|
||||||
|
resolved_target = None
|
||||||
|
|
||||||
|
if req.target is not None:
|
||||||
|
x, y, info = resolve_target(req.target)
|
||||||
|
enforce_allowed_region(x, y)
|
||||||
|
resolved_target = {"x": x, "y": y, "target_info": info}
|
||||||
|
|
||||||
|
duration_sec = req.duration_ms / 1000.0
|
||||||
|
if req.action in {"move", "click", "right_click", "double_click", "middle_click"} and resolved_target is None:
|
||||||
|
raise HTTPException(status_code=400, detail="target is required for pointer actions")
|
||||||
|
if req.action == "scroll" and resolved_target is None:
|
||||||
|
raise HTTPException(status_code=400, detail="target is required for scroll")
|
||||||
|
|
||||||
|
click_text_match = None
|
||||||
|
if req.action == "click_text":
|
||||||
|
if req.click_text is None:
|
||||||
|
api_error(400, "click_text_payload_required", "click_text payload is required")
|
||||||
|
region = req.click_text.region
|
||||||
|
img, captured_region, _, _, _ = capture_region_image(
|
||||||
|
action_screen,
|
||||||
|
None if region is None else region.x,
|
||||||
|
None if region is None else region.y,
|
||||||
|
None if region is None else region.width,
|
||||||
|
None if region is None else region.height,
|
||||||
|
)
|
||||||
|
items = extract_ocr_items(
|
||||||
|
img,
|
||||||
|
captured_region["x"],
|
||||||
|
captured_region["y"],
|
||||||
|
req.click_text.min_confidence,
|
||||||
|
req.click_text.ocr_lang,
|
||||||
|
req.click_text.ocr_psm,
|
||||||
|
)
|
||||||
|
matched = _resolve_text_match(req.click_text, items)
|
||||||
|
enforce_allowed_region(matched["center"]["x"], matched["center"]["y"])
|
||||||
|
click_text_match = {
|
||||||
|
"query": req.click_text.model_dump(),
|
||||||
|
"matched": matched,
|
||||||
|
"capture_region": captured_region,
|
||||||
|
"screen": screen_selection,
|
||||||
|
}
|
||||||
|
resolved_target = {"x": matched["center"]["x"], "y": matched["center"]["y"], "target_info": {"mode": "ocr_text"}}
|
||||||
|
|
||||||
|
if not run_dry:
|
||||||
|
if req.action == "move":
|
||||||
|
pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
|
||||||
|
elif req.action == "click":
|
||||||
|
pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], clicks=req.clicks, interval=req.interval_ms / 1000.0, button=req.button, duration=duration_sec)
|
||||||
|
elif req.action == "right_click":
|
||||||
|
pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="right", duration=duration_sec)
|
||||||
|
elif req.action == "double_click":
|
||||||
|
pyautogui.doubleClick(x=resolved_target["x"], y=resolved_target["y"], interval=req.interval_ms / 1000.0)
|
||||||
|
elif req.action == "middle_click":
|
||||||
|
pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="middle", duration=duration_sec)
|
||||||
|
elif req.action == "scroll":
|
||||||
|
pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
|
||||||
|
pyautogui.scroll(req.scroll_amount)
|
||||||
|
elif req.action == "type":
|
||||||
|
pyautogui.write(req.text, interval=req.interval_ms / 1000.0)
|
||||||
|
elif req.action == "hotkey":
|
||||||
|
if len(req.keys) < 1:
|
||||||
|
raise HTTPException(status_code=400, detail="keys is required for hotkey")
|
||||||
|
pyautogui.hotkey(*req.keys)
|
||||||
|
elif req.action == "click_text":
|
||||||
|
pyautogui.click(
|
||||||
|
x=resolved_target["x"],
|
||||||
|
y=resolved_target["y"],
|
||||||
|
clicks=req.clicks,
|
||||||
|
interval=req.interval_ms / 1000.0,
|
||||||
|
button=req.button,
|
||||||
|
duration=duration_sec,
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"action": req.action,
|
||||||
|
"executed": not run_dry,
|
||||||
|
"dry_run": run_dry,
|
||||||
|
"screen": screen_selection,
|
||||||
|
"display": selected_display,
|
||||||
|
"resolved_target": resolved_target,
|
||||||
|
"click_text_match": click_text_match,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def windows_only(feature: str):
|
||||||
|
if sys.platform != "win32":
|
||||||
|
raise HTTPException(status_code=501, detail=f"{feature} is currently supported on Windows hosts only")
|
||||||
|
|
||||||
|
|
||||||
|
def tasklist_process_name(pid: int) -> str | None:
|
||||||
|
try:
|
||||||
|
completed = subprocess.run(["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"], capture_output=True, text=True, timeout=5, check=False)
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
line = (completed.stdout or "").strip().splitlines()
|
||||||
|
if not line:
|
||||||
|
return None
|
||||||
|
row = line[0].strip()
|
||||||
|
if not row or row.startswith("INFO:"):
|
||||||
|
return None
|
||||||
|
if row.startswith('"') and '","' in row:
|
||||||
|
return row.split('","', 1)[0].strip('"')
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def list_windows(query: WindowQuery | None = None) -> list[dict]:
|
||||||
|
windows_only("window endpoints")
|
||||||
|
query = query or WindowQuery()
|
||||||
|
|
||||||
|
user32 = ctypes.windll.user32
|
||||||
|
kernel32 = ctypes.windll.kernel32
|
||||||
|
psapi = ctypes.windll.psapi
|
||||||
|
|
||||||
|
user32.GetWindowTextLengthW.argtypes = [ctypes.c_void_p]
|
||||||
|
user32.GetWindowTextLengthW.restype = ctypes.c_int
|
||||||
|
user32.GetWindowTextW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
|
||||||
|
user32.GetWindowTextW.restype = ctypes.c_int
|
||||||
|
user32.IsWindowVisible.argtypes = [ctypes.c_void_p]
|
||||||
|
user32.IsWindowVisible.restype = ctypes.c_bool
|
||||||
|
user32.IsWindowEnabled.argtypes = [ctypes.c_void_p]
|
||||||
|
user32.IsWindowEnabled.restype = ctypes.c_bool
|
||||||
|
user32.IsIconic.argtypes = [ctypes.c_void_p]
|
||||||
|
user32.IsIconic.restype = ctypes.c_bool
|
||||||
|
user32.IsZoomed.argtypes = [ctypes.c_void_p]
|
||||||
|
user32.IsZoomed.restype = ctypes.c_bool
|
||||||
|
user32.GetForegroundWindow.restype = ctypes.c_void_p
|
||||||
|
user32.GetWindowRect.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.wintypes.RECT)]
|
||||||
|
user32.GetWindowRect.restype = ctypes.c_bool
|
||||||
|
user32.GetClassNameW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
|
||||||
|
user32.GetClassNameW.restype = ctypes.c_int
|
||||||
|
|
||||||
|
kernel32.OpenProcess.argtypes = [ctypes.wintypes.DWORD, ctypes.wintypes.BOOL, ctypes.wintypes.DWORD]
|
||||||
|
kernel32.OpenProcess.restype = ctypes.wintypes.HANDLE
|
||||||
|
kernel32.CloseHandle.argtypes = [ctypes.wintypes.HANDLE]
|
||||||
|
kernel32.CloseHandle.restype = ctypes.wintypes.BOOL
|
||||||
|
psapi.GetModuleBaseNameW.argtypes = [ctypes.wintypes.HANDLE, ctypes.wintypes.HMODULE, ctypes.c_wchar_p, ctypes.wintypes.DWORD]
|
||||||
|
psapi.GetModuleBaseNameW.restype = ctypes.wintypes.DWORD
|
||||||
|
|
||||||
|
foreground = int(user32.GetForegroundWindow() or 0)
|
||||||
|
results: list[dict] = []
|
||||||
|
|
||||||
|
def callback(hwnd, _lparam):
|
||||||
|
hwnd_int = int(hwnd)
|
||||||
|
if query.hwnd and hwnd_int != query.hwnd:
|
||||||
|
return True
|
||||||
|
visible = bool(user32.IsWindowVisible(hwnd))
|
||||||
|
if query.visible_only and not visible:
|
||||||
|
return True
|
||||||
|
|
||||||
|
length = user32.GetWindowTextLengthW(hwnd)
|
||||||
|
title_buf = ctypes.create_unicode_buffer(max(1, length + 1))
|
||||||
|
user32.GetWindowTextW(hwnd, title_buf, len(title_buf))
|
||||||
|
title = title_buf.value or ""
|
||||||
|
|
||||||
|
if query.title_contains and query.title_contains.lower() not in title.lower():
|
||||||
|
return True
|
||||||
|
if query.title_regex and re.search(query.title_regex, title, flags=re.IGNORECASE) is None:
|
||||||
|
return True
|
||||||
|
|
||||||
|
pid = ctypes.wintypes.DWORD(0)
|
||||||
|
user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
|
||||||
|
process_name = tasklist_process_name(pid.value)
|
||||||
|
if query.process_name and (process_name or "").lower() != query.process_name.lower():
|
||||||
|
return True
|
||||||
|
|
||||||
|
class_buf = ctypes.create_unicode_buffer(256)
|
||||||
|
user32.GetClassNameW(hwnd, class_buf, len(class_buf))
|
||||||
|
rect = ctypes.wintypes.RECT()
|
||||||
|
user32.GetWindowRect(hwnd, ctypes.byref(rect))
|
||||||
|
|
||||||
|
results.append(
|
||||||
|
{
|
||||||
|
"hwnd": hwnd_int,
|
||||||
|
"title": title,
|
||||||
|
"class_name": class_buf.value,
|
||||||
|
"pid": int(pid.value),
|
||||||
|
"process_name": process_name,
|
||||||
|
"visible": visible,
|
||||||
|
"enabled": bool(user32.IsWindowEnabled(hwnd)),
|
||||||
|
"minimized": bool(user32.IsIconic(hwnd)),
|
||||||
|
"maximized": bool(user32.IsZoomed(hwnd)),
|
||||||
|
"foreground": hwnd_int == foreground,
|
||||||
|
"rect": {"x": int(rect.left), "y": int(rect.top), "width": int(rect.right - rect.left), "height": int(rect.bottom - rect.top)},
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return True
|
||||||
|
|
||||||
|
enum_proc = ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)(callback)
|
||||||
|
user32.EnumWindows(enum_proc, 0)
|
||||||
|
results.sort(key=lambda item: (not item["foreground"], item["title"].lower(), item["hwnd"]))
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
def _pick_single_window(query: WindowQuery) -> dict:
|
||||||
|
matches = list_windows(query)
|
||||||
|
if not matches:
|
||||||
|
raise HTTPException(status_code=404, detail="no window matched")
|
||||||
|
if len(matches) > 1:
|
||||||
|
raise HTTPException(status_code=409, detail={"message": "multiple windows matched", "matches": matches[:10]})
|
||||||
|
return matches[0]
|
||||||
|
|
||||||
|
|
||||||
|
def apply_window_action(req: WindowActionRequest) -> dict:
|
||||||
|
windows_only("window endpoints")
|
||||||
|
match = _pick_single_window(req)
|
||||||
|
hwnd = match["hwnd"]
|
||||||
|
user32 = ctypes.windll.user32
|
||||||
|
|
||||||
|
SW_RESTORE, SW_MINIMIZE, SW_MAXIMIZE = 9, 6, 3
|
||||||
|
WM_CLOSE = 0x0010
|
||||||
|
|
||||||
|
if req.action == "focus":
|
||||||
|
user32.ShowWindow(hwnd, SW_RESTORE)
|
||||||
|
ok = bool(user32.SetForegroundWindow(hwnd))
|
||||||
|
if not ok:
|
||||||
|
raise HTTPException(status_code=500, detail="failed to focus window")
|
||||||
|
elif req.action == "restore":
|
||||||
|
user32.ShowWindow(hwnd, SW_RESTORE)
|
||||||
|
elif req.action == "minimize":
|
||||||
|
user32.ShowWindow(hwnd, SW_MINIMIZE)
|
||||||
|
elif req.action == "maximize":
|
||||||
|
user32.ShowWindow(hwnd, SW_MAXIMIZE)
|
||||||
|
elif req.action == "close":
|
||||||
|
user32.PostMessageW(hwnd, WM_CLOSE, 0, 0)
|
||||||
|
|
||||||
|
deadline = time.time() + (req.timeout_ms / 1000.0)
|
||||||
|
final = None
|
||||||
|
while time.time() <= deadline:
|
||||||
|
current = list_windows(WindowQuery(hwnd=hwnd, visible_only=False))
|
||||||
|
if not current:
|
||||||
|
if req.action == "close":
|
||||||
|
return {"matched": match, "closed": True, "final": None}
|
||||||
|
time.sleep(0.05)
|
||||||
|
continue
|
||||||
|
final = current[0]
|
||||||
|
if req.action == "focus" and final.get("foreground"):
|
||||||
|
break
|
||||||
|
if req.action in {"restore", "minimize", "maximize"}:
|
||||||
|
break
|
||||||
|
time.sleep(0.05)
|
||||||
|
|
||||||
|
return {"matched": match, "closed": False, "final": final}
|
||||||
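The wait loop in `apply_window_action` follows a common poll-until-deadline pattern. A minimal sketch of that pattern, separate from the Win32 calls: the helper name `poll_until` and the fake probe are hypothetical, and the sketch uses `time.monotonic()` where the diff uses `time.time()`, so that wall-clock adjustments cannot stretch or shrink the wait.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(probe: Callable[[], Optional[T]], timeout_s: float, interval_s: float = 0.05) -> Optional[T]:
    """Call probe() until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = probe()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            # Deadline reached without a result; the caller decides what that means.
            return None
        time.sleep(interval_s)


# Hypothetical probe: reports "ready" on the third call.
attempts = {"n": 0}


def probe() -> Optional[str]:
    attempts["n"] += 1
    return "ready" if attempts["n"] >= 3 else None


result = poll_until(probe, timeout_s=1.0, interval_s=0.01)
print(result, attempts["n"])
```

The probe always runs at least once, even with a zero timeout, which differs slightly from a `while time <= deadline` loop.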
|
|
||||||
|
|
||||||
|
def launch_app(req: LaunchRequest) -> dict:
|
||||||
|
if req.cwd and not os.path.isdir(req.cwd):
|
||||||
|
raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
|
||||||
|
argv = [req.executable, *req.args]
|
||||||
|
cwd = req.cwd or None
|
||||||
|
|
||||||
|
if req.dry_run or SETTINGS["dry_run"]:
|
||||||
|
return {"executed": False, "dry_run": True, "argv": argv, "cwd": cwd}
|
||||||
|
|
||||||
|
try:
|
||||||
|
proc = subprocess.Popen(argv, cwd=cwd)
|
||||||
|
except FileNotFoundError as exc:
|
||||||
|
raise HTTPException(status_code=400, detail=f"executable not found: {exc}") from exc
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=400, detail=f"failed to launch process: {exc}") from exc
|
||||||
|
|
||||||
|
result = {"executed": True, "dry_run": False, "argv": argv, "cwd": cwd, "pid": proc.pid}
|
||||||
|
if req.wait_for_window:
|
||||||
|
query = req.match or WindowQuery(process_name=os.path.basename(req.executable), visible_only=True)
|
||||||
|
deadline = time.time() + (req.timeout_ms / 1000.0)
|
||||||
|
match = None
|
||||||
|
while time.time() <= deadline:
|
||||||
|
matches = list_windows(query)
|
||||||
|
if matches:
|
||||||
|
match = matches[0]
|
||||||
|
break
|
||||||
|
time.sleep(0.2)
|
||||||
|
result["window"] = match
|
||||||
|
result["window_found"] = match is not None
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
|
||||||
|
if len(text) <= limit:
|
||||||
|
return text, False
|
||||||
|
return text[:limit], True
|
||||||
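A quick check of the truncation helper's contract. The function body mirrors the diff; the sample strings and the limit of 10 are made up for illustration.

```python
def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
    # Return the (possibly shortened) text plus a flag saying whether it was cut.
    if len(text) <= limit:
        return text, False
    return text[:limit], True


short, short_cut = _truncate_text("ok", 10)
long_, long_cut = _truncate_text("x" * 25, 10)
print(short, short_cut)
print(len(long_), long_cut)
```

Returning the flag alongside the text lets the response report `stdout_truncated` / `stderr_truncated` without the caller re-measuring the output.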
|
|
||||||
|
|
||||||
|
def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
|
||||||
|
if shell_name == "powershell":
|
||||||
|
return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
|
||||||
|
if shell_name == "bash":
|
||||||
|
return ["bash", "-lc", command]
|
||||||
|
if shell_name == "cmd":
|
||||||
|
return ["cmd", "/c", command]
|
||||||
|
raise HTTPException(status_code=400, detail="unsupported shell")
|
||||||
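The same shell-to-argv mapping can be written as a lookup table, which may be easier to extend. This is only a sketch: `SHELL_PREFIXES` and `build_argv` are hypothetical names, and it raises `ValueError` rather than `HTTPException` so that it runs without FastAPI.

```python
SHELL_PREFIXES = {
    "powershell": ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command"],
    "bash": ["bash", "-lc"],
    "cmd": ["cmd", "/c"],
}


def build_argv(shell_name: str, command: str) -> list[str]:
    """Return the argv list for running `command` under the named shell."""
    try:
        prefix = SHELL_PREFIXES[shell_name]
    except KeyError:
        raise ValueError("unsupported shell") from None
    # The command stays a single argv element; the chosen shell parses it.
    return [*prefix, command]


print(build_argv("bash", "echo hi"))
```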
|
|
||||||
|
|
||||||
|
def exec_command(req):
|
||||||
|
if not SETTINGS["exec_enabled"]:
|
||||||
|
raise HTTPException(status_code=403, detail="exec endpoint disabled")
|
||||||
|
if not SETTINGS["exec_secret"]:
|
||||||
|
raise HTTPException(status_code=403, detail="exec secret not configured")
|
||||||
|
|
||||||
|
shell_name = (req.shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
|
||||||
|
if shell_name not in {"powershell", "bash", "cmd"}:
|
||||||
|
raise HTTPException(status_code=400, detail="unsupported shell")
|
||||||
|
|
||||||
|
run_dry = SETTINGS["dry_run"] or req.dry_run
|
||||||
|
timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
|
||||||
|
timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])
|
||||||
|
|
||||||
|
cwd = None
|
||||||
|
if req.cwd:
|
||||||
|
cwd = os.path.abspath(req.cwd)
|
||||||
|
if not os.path.isdir(cwd):
|
||||||
|
raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
|
||||||
|
|
||||||
|
argv = _resolve_exec_program(shell_name, req.command)
|
||||||
|
if run_dry:
|
||||||
|
return {"executed": False, "dry_run": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd}
|
||||||
|
|
||||||
|
start = time.time()
|
||||||
|
try:
|
||||||
|
completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s, check=False)
|
||||||
|
except subprocess.TimeoutExpired as exc:
|
||||||
|
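stdout, stdout_truncated = _truncate_text((exc.stdout.decode(errors="replace") if isinstance(exc.stdout, bytes) else (exc.stdout or "")), SETTINGS["exec_max_output_chars"])  # TimeoutExpired output may be bytes even with text=True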
|
||||||
|
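stderr, stderr_truncated = _truncate_text((exc.stderr.decode(errors="replace") if isinstance(exc.stderr, bytes) else (exc.stderr or "")), SETTINGS["exec_max_output_chars"])  # TimeoutExpired output may be bytes even with text=True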
|
||||||
|
return {"executed": True, "timed_out": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": None, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
|
||||||
|
except FileNotFoundError as exc:
|
||||||
|
raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc
|
||||||
|
|
||||||
|
stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
|
||||||
|
stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
|
||||||
|
return {"executed": True, "timed_out": False, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": completed.returncode, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
|
||||||
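One detail worth keeping in mind for the timeout branch: the Python documentation notes that the output captured on `subprocess.TimeoutExpired` is bytes whenever any output was captured, regardless of `text=True`, and may be `None` when nothing was observed. Calling `str()` on bytes yields a `b'...'` literal rather than decoded text. A small normalizer along these lines could handle all three cases; `as_text` is a hypothetical helper, not part of the diff.

```python
from typing import Optional, Union


def as_text(value: Optional[Union[str, bytes]]) -> str:
    """Normalize captured process output to str."""
    if value is None:
        return ""
    if isinstance(value, bytes):
        # Replace undecodable bytes instead of raising on partial output.
        return value.decode("utf-8", errors="replace")
    return value


print(repr(as_text(b"partial output")))
print(repr(as_text(None)))
print(repr(str(b"partial output")))  # what a plain str() call would produce
```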
368
skill/SKILL.md
@@ -1,346 +1,64 @@
|
|||||||
---
|
---
|
||||||
name: clickthrough-http-control
|
name: clickthrough-http-control
|
||||||
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
|
description: Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell).
|
||||||
---
|
---
|
||||||
|
|
||||||
# Clickthrough HTTP Control
|
# Clickthrough Computer Control
|
||||||
|
|
||||||
Use a strict observe-decide-act-verify loop.
|
Use these methods:
|
||||||
|
- `see`
|
||||||
|
- `interact`
|
||||||
|
- `exec`
|
||||||
|
|
||||||
## Getting a computer instance (user-owned setup)
|
## Method 1: See
|
||||||
|
|
||||||
The **user/operator** is responsible for provisioning and exposing the target machine.
|
Use `POST /see` to capture full screen or a region with a grid overlay.
|
||||||
The agent should not assume it can self-install this stack.
|
Use `POST /see/zoom` to capture a tighter crop with a denser grid.
|
||||||
|
Use `POST /see` with `ocr=true` when text localization is needed.
|
||||||
### What the user must do
|
|
||||||
|
|
||||||
1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
|
|
||||||
2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
|
|
||||||
3. Configure secrets on target machine:
|
|
||||||
- `CLICKTHROUGH_TOKEN` for general API auth
|
|
||||||
- `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
|
|
||||||
4. Share connection details with the agent through a secure channel:
|
|
||||||
- `base_url`
|
|
||||||
- `x-clickthrough-token`
|
|
||||||
- `x-clickthrough-exec-secret` (only when `/exec` is needed)
|
|
||||||
|
|
||||||
### What the agent should do
|
|
||||||
|
|
||||||
1. Validate connection with `GET /health` using provided headers.
|
|
||||||
2. Refuse `/exec` attempts when exec secret is missing/invalid.
|
|
||||||
3. Ask user for missing setup inputs instead of guessing infrastructure.
|
|
||||||
|
|
||||||
## What the agent can actually see
|
|
||||||
|
|
||||||
The agent does **not** inherently see the remote desktop.
|
|
||||||
Clickthrough provides screenshots, OCR data, window metadata, and input endpoints — not native live vision.
|
|
||||||
|
|
||||||
That means:
|
|
||||||
- `GET /screen` and `POST /zoom` return image data the agent may need to inspect explicitly
|
|
||||||
- `POST /ocr` returns machine-readable text blocks when text extraction is enough
|
|
||||||
- the OpenClaw `image` tool is the right fallback when the agent needs judgment about visual layout, icons, button styling, dialog structure, or other non-OCR cues
|
|
||||||
- every visual conclusion is only as fresh as the last screenshot; after an action, recapture before assuming the UI changed as expected
|
|
||||||
|
|
||||||
Do not write or think as if the agent is directly watching the screen in real time.
|
|
||||||
Say what you actually have: screenshots, OCR output, and fresh verification captures.
|
|
||||||
|
|
||||||
## Mini API map
|
|
||||||
|
|
||||||
- `GET /health` → server status + safety flags
|
|
||||||
- `GET /displays` → detected displays in zero-based API order
|
|
||||||
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
|
|
||||||
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
|
|
||||||
- `GET /windows` → discover visible desktop windows and their handles/processes
|
|
||||||
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
|
|
||||||
- `POST /launch` → start an app/process without dropping to a shell
|
|
||||||
- `POST /wait?screen=0` → wait for text, window, or visual state changes
|
|
||||||
- `POST /vision/diff?screen=0` → compare screenshots or regions for meaningful visual change
|
|
||||||
- `POST /vision/stability?screen=0` → measure short-interval visual stability
|
|
||||||
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
|
|
||||||
- `POST /ocr/find?screen=0` → search OCR output for matching text candidates
|
|
||||||
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
|
|
||||||
- `POST /action/verify?screen=0` → execute one action plus structured success verification
|
|
||||||
- `POST /batch?screen=0` → sequential action list
|
|
||||||
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
|
|
||||||
|
|
||||||
### Display selection
|
|
||||||
|
|
||||||
- Use `GET /displays` before operating on multi-monitor systems.
|
|
||||||
- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
|
|
||||||
- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
|
|
||||||
- Do not assume `screen=1` starts at `(0,0)`; it may start at `(1920,0)`, `(-1920,0)`, or another global offset.
|
|
||||||
- If a screenshot came from `/screen?screen=1`, keep using that response's `region` metadata when forming later `/action` targets.
|
|
||||||
|
|
||||||
### OCR usage
|
|
||||||
|
|
||||||
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
|
|
||||||
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
|
|
||||||
- Use `language_hint` when known (for example `eng`) to improve consistency.
|
|
||||||
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
|
|
||||||
- Treat OCR as one signal, not the only signal, before high-impact clicks.
|
|
||||||
|
|
||||||
### Screenshot + `image` tool usage
|
|
||||||
|
|
||||||
Use the OpenClaw `image` tool when OCR is not enough.
|
|
||||||
This is especially useful for:
|
|
||||||
- identifying which visible button looks like the primary confirm action
|
|
||||||
- understanding dialog layout or pane structure
|
|
||||||
- distinguishing similar nearby controls by icon, spacing, or emphasis
|
|
||||||
- checking whether a visual state changed after a click
|
|
||||||
|
|
||||||
Good pattern:
|
|
||||||
1. capture with `GET /screen` or `POST /zoom`
|
|
||||||
2. hand that screenshot to the `image` tool
|
|
||||||
3. ask a precise question about the visible UI
|
|
||||||
4. convert the answer into a concrete Clickthrough target
|
|
||||||
5. act once
|
|
||||||
6. recapture and verify again, or use `POST /action/verify` when the action+postcondition loop is simple enough to bundle cleanly
|
|
||||||
|
|
||||||
Ask narrow questions.
|
|
||||||
Good:
|
|
||||||
- "Which button in this dialog is the primary confirmation action?"
|
|
||||||
- "Is the scan still running, or does this look complete?"
|
|
||||||
- "Which of these tabs appears selected?"
|
|
||||||
|
|
||||||
Bad:
|
|
||||||
- "What should I click?"
|
|
||||||
- "Use your eyes and do the task"
|
|
||||||
- anything that assumes the model has live continuity without a new screenshot
|
|
||||||
|
|
||||||
### Header requirements
|
|
||||||
|
|
||||||
- Always send `x-clickthrough-token` when token auth is enabled.
|
|
||||||
- For `/exec`, also send `x-clickthrough-exec-secret`.
|
|
||||||
|
|
||||||
## `POST /action` request shape (important)
|
|
||||||
|
|
||||||
`/action` always expects an `action` plus an optional `target` object.
|
|
||||||
Do **not** invent top-level `x` / `y` fields.
|
|
||||||
|
|
||||||
Minimal pixel click:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "click",
|
|
||||||
"target": {"mode": "pixel", "x": 100, "y": 200},
|
|
||||||
"button": "left",
|
|
||||||
"clicks": 1
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Minimal grid click:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"action": "click",
|
|
||||||
"target": {
|
|
||||||
"mode": "grid",
|
|
||||||
"region_x": 0,
|
|
||||||
"region_y": 0,
|
|
||||||
"region_width": 1920,
|
|
||||||
"region_height": 1080,
|
|
||||||
"rows": 12,
|
|
||||||
"cols": 12,
|
|
||||||
"row": 6,
|
|
||||||
"col": 8,
|
|
||||||
"dx": 0.0,
|
|
||||||
"dy": 0.0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Other canonical examples:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{"action": "move", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
|
||||||
{"action": "double_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
|
||||||
{"action": "right_click", "target": {"mode": "pixel", "x": 100, "y": 200}}
|
|
||||||
{"action": "scroll", "target": {"mode": "pixel", "x": 100, "y": 200}, "scroll_amount": -500}
|
|
||||||
{"action": "type", "text": "hello world", "interval_ms": 20}
|
|
||||||
{"action": "hotkey", "keys": ["ctrl", "l"]}
|
|
||||||
```
|
|
||||||
|
|
||||||
Rules:
|
Rules:
|
||||||
- `dx` / `dy` belong inside `target`, not beside it.
|
- Start with a coarse grid (`12x12`).
|
||||||
- `type` and `hotkey` usually do not need a `target`.
|
- For precision, zoom and use a denser grid (`20x20` or higher).
|
||||||
- For pixel targets, `x` / `y` are global desktop coordinates.
|
- Always use returned `meta.region` and `meta.grid` when computing click targets.
|
||||||
- For grid targets, copy the exact `region_*`, `rows`, and `cols` basis from the screenshot/zoom you actually used.
|
- Coordinates are global desktop coordinates.
|
||||||
|
- OCR results are in `data.meta.ocr` and include confidence, bbox, and center.
|
||||||
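To make the grid rules concrete, here is a rough sketch of turning a grid cell into a global pixel coordinate. It rests on assumptions that the skill text does not confirm: zero-based `row` / `col`, and `dx` / `dy` in `[-1, 1]` interpreted as offsets from the cell center scaled by half a cell. The function name `grid_cell_to_pixel` is hypothetical; check the server's actual resolution logic before relying on it.

```python
def grid_cell_to_pixel(region: dict, rows: int, cols: int, row: int, col: int,
                       dx: float = 0.0, dy: float = 0.0) -> tuple[int, int]:
    """Map a grid cell (plus optional sub-cell offset) to global desktop pixels."""
    cell_w = region["width"] / cols
    cell_h = region["height"] / rows
    # Start at the cell center, then shift by up to half a cell (assumed dx/dy semantics).
    x = region["x"] + (col + 0.5) * cell_w + dx * cell_w / 2
    y = region["y"] + (row + 0.5) * cell_h + dy * cell_h / 2
    return round(x), round(y)


# A second monitor whose region starts at a global x offset of 1920.
region = {"x": 1920, "y": 0, "width": 1920, "height": 1080}
target = grid_cell_to_pixel(region, rows=12, cols=12, row=6, col=8)
print(target)
```

The region origin is added before anything else, which is why the result stays in global desktop coordinates even for a display that does not start at `(0, 0)`.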
|
|
||||||
## When to use `/exec`
|
## Method 2: Interact
|
||||||
|
|
||||||
Prefer structured GUI control first:
|
Use `POST /interact` for one action at a time.
|
||||||
- `/screen`, `/zoom`, `/ocr` to observe
|
|
||||||
- `/action` or `/batch` to interact
|
|
||||||
|
|
||||||
Use `/exec` only when it is the cleanest available tool for the job, for example:
|
Mouse actions:
|
||||||
- querying machine state that the GUI does not expose well
|
- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`
|
||||||
- performing an explicit user-requested shell/system task
|
- `click_text` (OCR-driven click; optionally scope with `click_text.region`)
|
||||||
- recovering from a blocked GUI flow when normal interaction failed
|
|
||||||
|
|
||||||
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
|
Keyboard actions:
|
||||||
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
|
- `type`, `hotkey`
|
||||||
|
|
||||||
## Core workflow (mandatory)
|
Rules:
|
||||||
|
- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
|
||||||
|
- For text buttons/labels, prefer `click_text` and bound OCR with a region when possible.
|
||||||
|
- Use `pixel` only when you already have reliable coordinates.
|
||||||
|
- After each important action, call `see` again before continuing.
|
||||||
|
|
||||||
1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
|
## Method 3: Exec
|
||||||
2. Identify likely target region and compute an initial confidence score.
|
|
||||||
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
|
||||||
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
|
||||||
5. If OCR is insufficient, inspect the screenshot explicitly with the OpenClaw `image` tool instead of pretending you can already see enough.
|
|
||||||
6. Execute one minimal action via `POST /action`.
|
|
||||||
7. Re-capture with `GET /screen` or use `POST /wait`, `POST /vision/diff`, `POST /vision/stability`, or `POST /action/verify` to verify the expected state change.
|
|
||||||
8. Repeat until objective is complete.
|
|
||||||
|
|
||||||
## Verify-before-click rules
|
Use `POST /exec` only for shell/system tasks.
|
||||||
|
|
||||||
- Never click if target identity is ambiguous.
|
Rules:
|
||||||
- Require at least two matching signals before click.
|
- Requires `x-clickthrough-exec-secret`.
|
||||||
- Good signal pairs include:
|
- Do not use exec for normal clicking/typing flows.
|
||||||
- OCR text + expected UI region
|
- Prefer GUI interaction first; use exec only as a fallback or for an explicit shell task.
|
||||||
- OCR text + matching button shape/icon nearby
|
|
||||||
- dialog title text + expected button position within that dialog
|
|
||||||
- known app/window focus + expected control location
|
|
||||||
- If confidence is low, do not "test click"; zoom and re-localize first.
|
|
||||||
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
|
||||||
1) preview intended coordinate + reason
|
|
||||||
2) execute only after explicit confirmation.
|
|
||||||
|
|
||||||
## Precision rules
|
## Lightweight Procedure
|
||||||
|
|
||||||
- Prefer grid targets first, then use `dx/dy` for subcell precision.
|
1. `see` capture.
|
||||||
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
|
2. If needed, `see/zoom` refine.
|
||||||
- Use zoom before guessing offsets.
|
3. `interact` one step (`click_text` for text UI targets).
|
||||||
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
|
4. `see` verify.
|
||||||
|
5. Repeat.
|
||||||
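The procedure above can be sketched as a small loop. The callables `see`, `choose_action`, and `interact` are placeholders for whatever client code wraps the HTTP endpoints; nothing here is the project's actual client API, and the fakes at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Optional


def run_loop(see: Callable[[], dict],
             choose_action: Callable[[dict], Optional[dict]],
             interact: Callable[[dict], None],
             max_steps: int = 10) -> list[dict]:
    """Capture, decide, act once, then capture again; stop when nothing is left to do."""
    history: list[dict] = []
    for _ in range(max_steps):
        capture = see()                  # fresh capture doubles as verification of the last step
        action = choose_action(capture)  # None means the goal is reached or the target is unclear
        if action is None:
            break
        interact(action)                 # exactly one action per iteration
        history.append(action)
    return history


# Fake environment: the goal is reached after two clicks.
state = {"clicks": 0}


def fake_see() -> dict:
    return dict(state)


def fake_choose(capture: dict) -> Optional[dict]:
    return {"action": "click"} if capture["clicks"] < 2 else None


def fake_interact(action: dict) -> None:
    state["clicks"] += 1


history = run_loop(fake_see, fake_choose, fake_interact)
print(len(history), state["clicks"])
```

Capping the loop with `max_steps` keeps a confused agent from clicking indefinitely.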
|
|
||||||
## Safety rules
|
## Quick Safety Rules
|
||||||
|
|
||||||
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
|
- Never click with stale screenshots.
|
||||||
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
|
- Never send multiple uncertain clicks in a row.
|
||||||
- Avoid destructive shortcuts unless explicitly requested.
|
- If localization is ambiguous, re-capture with a tighter zoom.
|
||||||
- Send one action at a time unless deterministic; then use `/batch`.
|
|
||||||
|
|
||||||
## Reliability rules
|
|
||||||
|
|
||||||
- After every meaningful action, verify with a fresh screenshot.
|
|
||||||
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
|
|
||||||
- Prefer short, reversible actions over long macros.
|
|
||||||
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
|
|
||||||
|
|
||||||
## Fallback ladder for uncertain targeting
|
|
||||||
|
|
||||||
1. Full-screen capture with a coarse grid.
|
|
||||||
2. Zoom into the candidate area with a denser grid.
|
|
||||||
3. OCR the full screen or the tighter region.
|
|
||||||
4. Re-anchor on a more reliable nearby control, title, or label.
|
|
||||||
5. Try a keyboard-first flow if the app supports it.
|
|
||||||
6. Use `/exec` only if GUI control is blocked and shell-level intervention is genuinely cleaner.
|
|
||||||
|
|
||||||
Do not skip from "uncertain click" straight to random retries.
|
|
||||||
|
|
||||||
## Concrete screenshot -> `image` -> action example
|
|
||||||
|
|
||||||
Example loop:
|
|
||||||
1. `GET /screen?screen=0` to capture the current app state
|
|
||||||
2. if the UI is text-heavy, try `POST /ocr` first
|
|
||||||
3. if OCR does not answer the real question, pass the screenshot to the OpenClaw `image` tool with a narrow prompt like:
|
|
||||||
- "In this save dialog, which visible button is the primary action?"
|
|
||||||
- "Is there a dismiss/close button in the top-right of this modal?"
|
|
||||||
4. map the answer back to a Clickthrough target using the returned grid/region metadata
|
|
||||||
5. click once with `POST /action`
|
|
||||||
6. recapture the screen
|
|
||||||
7. optionally use `POST /wait` or another `image`/OCR check to confirm the result
|
|
||||||
|
|
||||||
The key rule is simple: screenshot first, interpret second, click third, verify fourth.
|
|
||||||
Do not collapse those steps into fake certainty.
|
|
||||||
|
|
||||||
## App-specific playbooks (recommended)
|
|
||||||
|
|
||||||
Build per-app routines for repetitive tasks instead of generic clicking.
|
|
||||||
|
|
||||||
### Launcher / search / start app playbook
|
|
||||||
|
|
||||||
Use this when the goal is "open app X" or "bring up tool Y".
|
|
||||||
|
|
||||||
1. check `GET /windows` first in case the app is already open
|
|
||||||
2. if present, use `POST /windows/action` to focus or restore it
|
|
||||||
3. if absent, prefer `POST /launch` when you know the executable path
|
|
||||||
4. if launch path is unknown but the OS launcher/search UI is available, use a keyboard-first flow:
|
|
||||||
- open launcher (`win`, `cmd+space`, or app-specific shortcut depending on host)
|
|
||||||
- type exact app name
|
|
||||||
- wait for stable results with `POST /wait` or recapture
|
|
||||||
- verify the result text with OCR or the `image` tool
|
|
||||||
- press Enter or click the exact result once
|
|
||||||
5. verify the app window now exists or is focused
|
|
||||||
|
|
||||||
Do not keep relaunching if the window already exists; that’s sloppy.
|
|
||||||
|
|
||||||
### Dialog confirmation playbook
|
|
||||||
|
|
||||||
Use for modals like save/discard, delete confirmation, permission prompts, and installer dialogs.
|
|
||||||
|
|
||||||
1. capture the dialog region with `POST /zoom`
|
|
||||||
2. use OCR first for title/body/button labels
|
|
||||||
3. if button hierarchy or emphasis matters, inspect the zoomed screenshot with the `image` tool
|
|
||||||
4. identify the exact intended action (`Cancel`, `Save`, `Allow`, `Delete`, etc.)
|
|
||||||
5. for destructive actions, require explicit user confirmation unless already requested
|
|
||||||
6. click once and verify the dialog disappeared or changed state
|
|
||||||
|
|
||||||
Good verification targets:
|
|
||||||
- dialog title vanished
|
|
||||||
- expected next window appeared
|
|
||||||
- destructive side effect is visible and confirmed
|
|
||||||
|
|
||||||
### File picker playbook
|
|
||||||
|
|
||||||
Use for open/save dialogs.
|
|
||||||
|
|
||||||
1. verify the file picker window is focused
|
|
||||||
2. OCR the visible breadcrumb/path area, filename field, and button row
|
|
||||||
3. prefer keyboard-first entry when possible:
|
|
||||||
- type or paste the target path/name into the focused field
|
|
||||||
- use `tab` / `shift+tab` to move predictably between filename and action buttons
|
|
||||||
4. if the target path is uncertain, use OCR plus the `image` tool to identify the active field and selected folder/file row
|
|
||||||
5. verify the intended filename/path is visible before confirming
|
|
||||||
6. activate `Open` / `Save` once and verify the picker closes
|
|
||||||
|
|
||||||
If the picker stays open, stop and inspect why instead of hammering Enter like a maniac.
|
|
||||||
|
|
||||||
### Browser tab / window playbook
|
|
||||||
|
|
||||||
Use for browser navigation, tab targeting, or web app recovery.
|
|
||||||
|
|
||||||
1. use `GET /windows` to focus the correct browser window first
|
|
||||||
2. prefer keyboard-first navigation:
|
|
||||||
- `ctrl+l` / `cmd+l` to focus the address bar
|
|
||||||
- `ctrl+tab` / `ctrl+shift+tab` for tab movement when order is known
|
|
||||||
- `ctrl+w` only for explicitly requested close actions
|
|
||||||
3. verify tab or page identity with OCR on the tab strip or page heading
|
|
||||||
4. if multiple similar tabs are open, zoom into the tab strip and use the `image` tool to distinguish active vs inactive tabs
|
|
||||||
5. after navigation, wait for visual stability or expected text before taking the next action
|
|
||||||
|
|
||||||
Do not assume a page loaded just because the click landed. Verify it.
|
|
||||||
|
|
||||||
### Settings / preferences navigation playbook
|
|
||||||
|
|
||||||
Use when the task involves toggles, dropdowns, sidebars, or nested settings panels.
|
|
||||||
|
|
||||||
1. identify the current settings page with OCR on the heading/sidebar
|
|
||||||
2. use OCR to find the specific section label before trying to toggle anything
|
|
||||||
3. if the layout is dense, zoom into the relevant pane and use the `image` tool to distinguish labels from controls
|
|
||||||
4. prefer small reversible actions: one toggle, one dropdown, one field edit at a time
|
|
||||||
5. after each change, verify the control state changed visually or via visible text
|
|
||||||
6. if a save/apply button exists, treat it as a separate confirmation step and verify completion
|
|
||||||
|
|
||||||
Settings UIs love hiding side effects. Assume nothing.
|
|
||||||
|
|
||||||
### Spotify playbook
|
|
||||||
|
|
||||||
- Focus app window before search/navigation.
|
|
||||||
- Prefer keyboard-first flow for song start:
|
|
||||||
1) `Ctrl+L` (search)
|
|
||||||
2) type exact query
|
|
||||||
3) Enter
|
|
||||||
4) verify exact song+artist text
|
|
||||||
5) click/double-click row
|
|
||||||
6) verify now-playing bar
|
|
||||||
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
|
|
||||||
|
|||||||
114
tests/test_ocr_and_interact.py
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
import sys
|
||||||
|
|
||||||
|
from PIL import Image
|
||||||
|
from fastapi.testclient import TestClient
|
||||||
|
|
||||||
|
from server import services
|
||||||
|
from server.app import app
|
||||||
|
from server.config import SETTINGS
|
||||||
|
from server.models import ClickTextAction
|
||||||
|
|
||||||
|
|
||||||
|
def _auth_headers() -> dict:
|
||||||
|
token = SETTINGS.get("token", "")
|
||||||
|
if not token:
|
||||||
|
return {}
|
||||||
|
return {"x-clickthrough-token": token}
|
||||||
|
|
||||||
|
|
||||||
def test_extract_ocr_items_normalization(monkeypatch):
    class FakeOutput:
        DICT = "DICT"

    class FakeTesseract:
        Output = FakeOutput

        @staticmethod
        def image_to_data(_image, lang, config, output_type):
            assert lang == "eng"
            assert output_type == "DICT"
            return {
                "text": ["hello", " ", "world"],
                "conf": ["95.0", "-1", "62.5"],
                "left": [10, 12, 40],
                "top": [20, 25, 60],
                "width": [30, 10, 50],
                "height": [10, 10, 12],
            }

    monkeypatch.setitem(sys.modules, "pytesseract", FakeTesseract)
    items = services.extract_ocr_items(Image.new("RGB", (100, 100)), origin_x=100, origin_y=200, min_confidence=60, lang="eng", psm=None)
    assert len(items) == 2
    assert items[0]["text"] == "hello"
    assert items[0]["bbox"]["x"] == 110
    assert items[0]["center"]["y"] == 225
    assert items[1]["text"] == "world"


def test_resolve_text_match_contains_exact_regex_and_nth():
    items = [
        {"text": "Save", "confidence": 70},
        {"text": "Save as", "confidence": 96},
        {"text": "SAVE", "confidence": 88},
    ]
    contains = services._resolve_text_match(ClickTextAction(text="save", match="contains", occurrence="first"), items)
    assert contains["text"] == "Save"
    best = services._resolve_text_match(ClickTextAction(text="save", match="contains", occurrence="best"), items)
    assert best["text"] == "Save as"
    exact_case = services._resolve_text_match(
        ClickTextAction(text="SAVE", match="exact", case_sensitive=True, occurrence="first"),
        items,
    )
    assert exact_case["text"] == "SAVE"
    regex_nth = services._resolve_text_match(ClickTextAction(text="^Save", match="regex", occurrence="nth", nth=2), items)
    assert regex_nth["text"] == "Save as"


def test_interact_click_text_region_optional(monkeypatch):
    monkeypatch.setattr(services, "select_display", lambda screen: ({"screen": screen}, [], {"requested": screen, "selected": screen, "fallback": False}))
    monkeypatch.setattr(
        services,
        "capture_region_image",
        lambda screen, x, y, w, h: (Image.new("RGB", (20, 20)), {"x": x or 0, "y": y or 0, "width": w or 20, "height": h or 20}, {}, [], {}),
    )
    monkeypatch.setattr(
        services,
        "extract_ocr_items",
        lambda *args, **kwargs: [
            {
                "text": "Apply",
                "confidence": 93.0,
                "bbox": {"x": 10, "y": 20, "width": 20, "height": 10},
                "center": {"x": 20, "y": 25},
                "region_relative_bbox": {"x": 10, "y": 20, "width": 20, "height": 10},
            }
        ],
    )

    client = TestClient(app)
    response = client.post(
        "/interact",
        json={"screen": 0, "action": {"action": "click_text", "dry_run": True, "click_text": {"text": "Apply", "match": "contains"}}},
        headers=_auth_headers(),
    )
    assert response.status_code == 200
    body = response.json()["data"]
    assert body["resolved_target"]["x"] == 20
    assert body["click_text_match"]["matched"]["text"] == "Apply"


def test_see_ocr_off_on_contract(monkeypatch):
    monkeypatch.setattr(
        "server.app.capture_region_image",
        lambda *args, **kwargs: (Image.new("RGB", (10, 10)), {"x": 0, "y": 0, "width": 10, "height": 10}, {"screen": 0}, [], {}),
    )
    monkeypatch.setattr("server.app.encode_image", lambda *args, **kwargs: "abc")
    monkeypatch.setattr("server.app.extract_ocr_items", lambda *args, **kwargs: [{"text": "x"}])

    client = TestClient(app)
    off = client.post("/see", json={"ocr": False, "with_grid": False}, headers=_auth_headers())
    assert off.status_code == 200
    assert "ocr" not in off.json()["data"]["meta"]
    on = client.post("/see", json={"ocr": True, "with_grid": False}, headers=_auth_headers())
    assert on.status_code == 200
    assert on.json()["data"]["meta"]["ocr"][0]["text"] == "x"