feat: migrate to v2-only API and unified response envelope
All checks were successful
python-syntax / syntax-check (push) Successful in 7s
85
README.md
@@ -1,22 +1,25 @@
# Clickthrough

Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions.
Let an agent interact with a computer over HTTP.

## Primary mode (v2)

Use the v2 contract for faster, less OCR-heavy control loops:
- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`

This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.
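
As a rough illustration of calling the v2 endpoints listed above, the sketch below builds (but does not send) a JSON `POST` request using only the standard library. The endpoint paths, the default address, and the `x-clickthrough-token` header come from this README; the helper name `build_v2_request` and the `region` payload field are assumptions made for the example, so check `docs/API.md` for the real request schema.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8123"  # default host and port stated in this README
V2_ENDPOINTS = {"observe", "localize", "act", "act-verify"}


def build_v2_request(endpoint: str, payload: dict,
                     token: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST request for a v2 endpoint without sending it."""
    if endpoint not in V2_ENDPOINTS:
        raise ValueError(f"unknown v2 endpoint: {endpoint}")
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when the server was started with CLICKTHROUGH_TOKEN set.
        headers["x-clickthrough-token"] = token
    return urllib.request.Request(
        f"{BASE_URL}/v2/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical payload shape: "region" is an assumed field name.
request = build_v2_request(
    "observe",
    {"region": {"x": 0, "y": 0, "width": 800, "height": 600}},
    token="change-me",
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running server; keeping construction separate from sending makes the request easy to inspect first.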

## What this provides

- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
- **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
- **Vision helper endpoints**: compare screenshots and measure stability via `POST /vision/diff` and `POST /vision/stability`
- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
- **Compound verify endpoint**: execute an action and wait for a structured success condition via `POST /action/verify`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
- Screen/region capture with optional OCR and timing stats
- Observation IDs for deterministic follow-up localization
- Text localization and image-tool coordinate localization
- Action execution with resolved target IDs
- Risk-aware action+verification defaults
- Unified response envelope across all endpoints
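
To make the "map grid cells to real pixels" item in the list above concrete, here is a minimal sketch of the arithmetic involved. It assumes a uniform grid with zero-based `row`/`col` indices laid over a capture region given as `x, y, width, height`; the function name `cell_center` and that layout are assumptions, and the server's own transform metadata should take precedence where it differs.

```python
from typing import Tuple

Region = Tuple[int, int, int, int]  # x, y, width, height in desktop pixels


def cell_center(row: int, col: int, region: Region,
                rows: int = 12, cols: int = 12) -> Tuple[int, int]:
    """Return the pixel at the center of a grid cell (assumed uniform grid)."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"cell ({row}, {col}) is outside a {rows}x{cols} grid")
    x, y, width, height = region
    cell_w = width / cols   # width of one cell in pixels
    cell_h = height / rows  # height of one cell in pixels
    return int(x + (col + 0.5) * cell_w), int(y + (row + 0.5) * cell_h)
```

The defaults of 12 rows and 12 columns mirror the `CLICKTHROUGH_GRID_ROWS` and `CLICKTHROUGH_GRID_COLS` defaults listed under Configuration.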

## Quick start

@@ -30,53 +33,17 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app

Server defaults to `127.0.0.1:8123`.

For OCR support, install the native `tesseract` binary on the host (in addition to the Python dependencies), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it is installed in a non-standard location.
## Fast control loop

`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
1. `POST /v2/observe` on a tight region
2. If OCR is enough, `POST /v2/localize` with `text_query`
3. If the match is ambiguous, ask the image tool for a single x,y point within the observation bounds
4. `POST /v2/localize` with `image_tool_point`
5. `POST /v2/act` or `POST /v2/act-verify`
6. Re-observe only the changed region
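
The numbered loop above can be sketched as a single function with the HTTP transport injected, which keeps the control flow testable without a running server. Only the endpoint paths and the `text_query` and `image_tool_point` field names come from this README; `observation_id`, `target_id`, `region`, `action`, and the function and parameter names are hypothetical placeholders for whatever `docs/API.md` actually defines.

```python
from typing import Callable, Optional, Tuple

Post = Callable[[str, dict], dict]  # (path, JSON payload) -> parsed JSON response
ImageTool = Callable[[dict], Tuple[int, int]]  # observation -> (x, y)


def control_step(post: Post, region: dict, text_query: str,
                 ask_image_tool: Optional[ImageTool] = None,
                 verify: bool = False) -> dict:
    """Run steps 1-5 of the loop once; the caller re-observes afterwards."""
    # Step 1: observe a tight region (field names are assumed).
    observation = post("/v2/observe", {"region": region})
    obs_id = observation["observation_id"]

    # Step 2: try OCR-based localization first.
    located = post("/v2/localize",
                   {"observation_id": obs_id, "text_query": text_query})

    # Steps 3-4: fall back to a point supplied by an image tool.
    if not located.get("target_id") and ask_image_tool is not None:
        x, y = ask_image_tool(observation)
        located = post("/v2/localize",
                       {"observation_id": obs_id,
                        "image_tool_point": {"x": x, "y": y}})

    target_id = located.get("target_id")
    if target_id is None:
        raise LookupError(f"could not localize {text_query!r}")

    # Step 5: act, optionally with verification.
    path = "/v2/act-verify" if verify else "/v2/act"
    return post(path, {"action": "click", "target_id": target_id})
```

Step 6 is left to the caller: after `control_step` returns, observe again with a region limited to the area the action was expected to change.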

## Minimal API flow
## See docs

1. `GET /displays` if you need a non-primary monitor
2. `GET /screen?screen=0` with grid
3. Decide cell / target
4. Optional `POST /zoom?screen=0` for finer targeting
5. `POST /action?screen=0` to execute (or `POST /action/verify?screen=0` for a bundled action+wait flow)
6. `GET /screen?screen=0` again to verify result, or use `POST /wait`, `POST /vision/diff`, or `POST /ocr/find`

Important:
- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
- Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
- The agent does **not** inherently see the remote desktop; it reasons from screenshots, OCR, and window metadata.
- When OCR is not enough, pair Clickthrough screenshots with OpenClaw's `image` tool for explicit screenshot interpretation.
- Prefer structured GUI interaction first; use `/windows`, `/launch`, `/wait`, and `/action` before reaching for `/exec`.
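
The first note above, about the shape of a `POST /action` body, can be checked on the client side before sending. This is only a sketch of that rule: the function name `check_action_payload` is made up, and the contents of the `target` object in the usage line are a hypothetical example rather than the documented schema.

```python
def check_action_payload(payload: dict) -> dict:
    """Reject bodies that use top-level x/y instead of a target object."""
    stray = sorted(payload.keys() & {"x", "y"})
    if stray:
        raise ValueError(f"move {stray} into the 'target' object")
    if not isinstance(payload.get("action"), str):
        raise ValueError("payload needs an 'action' string")
    if not isinstance(payload.get("target"), dict):
        raise ValueError("payload needs a 'target' object")
    return payload


# Hypothetical target contents; see docs/API.md for the real fields.
body = check_action_payload({"action": "click", "target": {"x": 100, "y": 200}})
```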

See:
- `docs/API.md`
- `docs/coordinate-system.md`
- `skill/SKILL.md`

## Configuration

Environment variables:

- `CLICKTHROUGH_HOST` (default `127.0.0.1`)
- `CLICKTHROUGH_PORT` (default `8123`)
- `CLICKTHROUGH_TOKEN` (optional; if set, requests must send the `x-clickthrough-token` header)
- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
- `CLICKTHROUGH_GRID_ROWS` (default `12`)
- `CLICKTHROUGH_GRID_COLS` (default `12`)
- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
- `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
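
For readers who want to see how a few of these variables and their defaults might be read, here is a minimal sketch. The variable names, defaults, and the `x,y,width,height` region format are taken from the list above; the `Settings` dataclass and `load_settings` function are illustrative names, not the server's actual configuration code, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    token: Optional[str]
    dry_run: bool
    grid_rows: int
    grid_cols: int
    allowed_region: Optional[Tuple[int, ...]]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read a subset of the CLICKTHROUGH_* variables with README defaults."""
    env = os.environ if env is None else env
    region = None
    raw_region = env.get("CLICKTHROUGH_ALLOWED_REGION")
    if raw_region:
        parts = [int(p) for p in raw_region.split(",")]
        if len(parts) != 4:
            raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
        region = tuple(parts)
    return Settings(
        host=env.get("CLICKTHROUGH_HOST", "127.0.0.1"),
        port=int(env.get("CLICKTHROUGH_PORT", "8123")),
        token=env.get("CLICKTHROUGH_TOKEN") or None,
        dry_run=env.get("CLICKTHROUGH_DRY_RUN", "false").strip().lower() == "true",
        grid_rows=int(env.get("CLICKTHROUGH_GRID_ROWS", "12")),
        grid_cols=int(env.get("CLICKTHROUGH_GRID_COLS", "12")),
        allowed_region=region,
    )
```

Accepting a mapping instead of reading `os.environ` directly makes the parsing easy to exercise with a plain dictionary.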

Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.

## Gitea CI

A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
It runs Python syntax checks (`py_compile`) on every push and pull request.
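
The workflow file itself is not shown here, but a `py_compile` syntax check of the kind it describes can be reproduced locally with a few lines of Python. The function name `syntax_errors` and the choice to scan every `*.py` file under a root directory are assumptions for this sketch, not a copy of what the workflow runs.

```python
import py_compile
from pathlib import Path
from typing import List


def syntax_errors(root: str) -> List[str]:
    """Byte-compile every .py file under root and collect syntax failures."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # doraise=True turns compile problems into PyCompileError.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append(f"{path}: {exc.msg}")
    return failures
```

Calling `syntax_errors(".")` from the repository root returns an empty list when every file compiles; a CI step could fail the build whenever the list is non-empty.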
- `docs/coordinate-system.md`