refactor: simplify to see/interact/exec and split server modules
All checks were successful
python-syntax / syntax-check (push) Successful in 6s

This commit is contained in:
2026-05-03 20:07:12 +02:00
parent aced5be25e
commit 1c03cab457
8 changed files with 911 additions and 1928 deletions

View File

@@ -1,116 +1,21 @@
# API Reference (v2)
# API Reference
Base URL: `http://127.0.0.1:8123`
If `CLICKTHROUGH_TOKEN` is set, include:
Auth header when enabled:
```http
x-clickthrough-token: <token>
```
## Endpoints
This API is intended for AI computer control through 3 methods only:
- `see`
- `interact`
- `exec`
- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`
- `GET /health`
- `GET /displays`
- `GET /windows`
- `POST /windows/action`
- `POST /launch`
- `POST /exec`
All responses use one envelope.
No v1 endpoints are supported.
## `POST /v2/observe`
```json
{
"mode": "region",
"region_x": 800,
"region_y": 420,
"region_width": 700,
"region_height": 420,
"include_image": true,
"image_format": "jpeg",
"jpeg_quality": 75,
"ocr_mode": "region",
"language_hint": "eng",
"min_confidence": 0.45,
"max_ocr_area_px": 1500000,
"group_lines": true
}
```
Returns observation metadata, optional image, OCR blocks/lines, and timing fields.
## `POST /v2/localize`
Text localization:
```json
{
"observation_id": "...",
"text_query": "Save",
"text_match": "exact",
"candidate_index": 0
}
```
Image-tool point localization:
```json
{
"observation_id": "...",
"image_tool_point": {"x": 312, "y": 188}
}
```
Returns `resolved_target_id`, global pixel, and `localization_confidence`.
## `POST /v2/act`
```json
{
"action": {
"action": "click",
"target": {"resolved_target_id": "..."},
"button": "left",
"clicks": 1
}
}
```
## `POST /v2/act-verify`
```json
{
"action": {
"action": "click",
"target": {"resolved_target_id": "..."}
},
"condition": {
"kind": "text",
"mode": "region",
"text": "Saved",
"match": "contains",
"present": true,
"region_x": 820,
"region_y": 420,
"region_width": 500,
"region_height": 140,
"min_confidence": 0.4
},
"risk_level": "low"
}
```
Risk defaults:
- `low`: retries `0`, timeout `2500ms`
- `high`: retries `1`, timeout `6000ms`
## Response envelope
## Response Envelope
Success:
@@ -119,7 +24,7 @@ Success:
"ok": true,
"request_id": "...",
"time_ms": 1710000000000,
"data": { },
"data": {},
"error": null
}
```
@@ -133,9 +38,124 @@ Error:
"time_ms": 1710000000000,
"data": null,
"error": {
"code": "http_error",
"message": "...",
"details": {}
"code": "validation_error",
"message": "request validation failed",
"details": []
}
}
```
## 1) See
### `POST /see`
Capture a full screen or a region. Optional grid overlay returns coordinate metadata for click mapping.
```json
{
"screen": 0,
"region_x": null,
"region_y": null,
"region_width": null,
"region_height": null,
"with_grid": true,
"grid_rows": 12,
"grid_cols": 12,
"include_labels": true,
"image_format": "png",
"jpeg_quality": 85
}
```
Returns:
- `data.image.base64`
- `data.meta.region` (global desktop coords)
- `data.meta.grid` (rows/cols/cell size + formula)
### `POST /see/zoom`
Capture a tighter crop around a global point and draw another grid over that crop.
```json
{
"screen": 0,
"center_x": 1200,
"center_y": 720,
"width": 500,
"height": 350,
"with_grid": true,
"grid_rows": 20,
"grid_cols": 20,
"include_labels": true,
"image_format": "png",
"jpeg_quality": 90
}
```
Use this for precision before clicking tiny controls.
## 2) Interact
### `POST /interact`
Mouse/keyboard action execution.
```json
{
"screen": 0,
"action": {
"action": "click",
"target": {
"mode": "grid",
"region_x": 0,
"region_y": 0,
"region_width": 1920,
"region_height": 1080,
"rows": 12,
"cols": 12,
"row": 7,
"col": 3,
"dx": 0.0,
"dy": 0.0
},
"button": "left",
"clicks": 1
}
}
```
Supported actions:
- `move`, `click`, `right_click`, `double_click`, `middle_click`
- `scroll` (`scroll_amount`)
- `type` (`text`, `interval_ms`)
- `hotkey` (`keys`)
Target modes:
- `pixel`: absolute global `x,y`
- `grid`: grid cell from a `see`/`see/zoom` response
## 3) Exec
### `POST /exec`
Run host shell commands (PowerShell/Bash/CMD).
```json
{
"command": "Get-Process | Select-Object -First 5",
"shell": "powershell",
"timeout_s": 20,
"cwd": "C:/Users/Paul",
"dry_run": false
}
```
Required header:
```http
x-clickthrough-exec-secret: <secret>
```
## Minimal Procedure for Agents
1. `see` full screen with coarse grid.
2. If uncertain, `see/zoom` target area with denser grid.
3. `interact` one action.
4. `see` again to confirm state change.
5. Use `exec` only when GUI interaction is not the right tool.