refactor: simplify to see/interact/exec and split server modules
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
This commit is contained in:
115
skill/SKILL.md
115
skill/SKILL.md
@@ -1,97 +1,60 @@
|
||||
---
|
||||
name: clickthrough-http-control
|
||||
description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
|
||||
description: Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell).
|
||||
---
|
||||
|
||||
# Clickthrough HTTP Control (v2)
|
||||
# Clickthrough Computer Control
|
||||
|
||||
Agents do not see live desktop video. They operate on snapshots.
|
||||
Use this loop: **observe -> localize -> act -> verify**.
|
||||
Use exactly 3 methods:
|
||||
- `see`
|
||||
- `interact`
|
||||
- `exec`
|
||||
|
||||
## Fast defaults
|
||||
## Method 1: See
|
||||
|
||||
- Start with `POST /v2/observe` on a tight region, not full screen.
|
||||
- Set `ocr_mode` to `none` unless text is required immediately.
|
||||
- Use `image` tool localization for icon-heavy or dense controls.
|
||||
- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
|
||||
|
||||
## Mandatory image-tool click localization
|
||||
|
||||
When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
|
||||
|
||||
Prompt template:
|
||||
- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
|
||||
Use `POST /see` to capture full screen or a region with a grid overlay.
|
||||
Use `POST /see/zoom` to capture a tighter crop with a denser grid.
|
||||
|
||||
Rules:
|
||||
- Ask for one point only.
|
||||
- Include bounds in the prompt.
|
||||
- If answer is not parseable `x,y`, re-ask once with stricter format.
|
||||
- Send returned point to `POST /v2/localize` via `image_tool_point`.
|
||||
- Start with coarse grid (`12x12`).
|
||||
- For precision, zoom and use denser grid (`20x20` or higher).
|
||||
- Always use returned `meta.region` and `meta.grid` when computing click targets.
|
||||
- Coordinates are global desktop coordinates.
|
||||
|
||||
## API playbook
|
||||
## Method 2: Interact
|
||||
|
||||
1. **Observe**
|
||||
Use `POST /interact` for one action at a time.
|
||||
|
||||
```json
|
||||
POST /v2/observe?screen=0
|
||||
{
|
||||
"mode": "region",
|
||||
"region_x": 820,
|
||||
"region_y": 420,
|
||||
"region_width": 700,
|
||||
"region_height": 420,
|
||||
"include_image": true,
|
||||
"ocr_mode": "none"
|
||||
}
|
||||
```
|
||||
Mouse actions:
|
||||
- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`
|
||||
|
||||
2. **Localize** (choose one)
|
||||
Keyboard actions:
|
||||
- `type`, `hotkey`
|
||||
|
||||
Text:
|
||||
```json
|
||||
POST /v2/localize
|
||||
{"observation_id":"...","text_query":"Save","text_match":"exact"}
|
||||
```
|
||||
Rules:
|
||||
- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
|
||||
- Use `pixel` only when you already have reliable coordinates.
|
||||
- After each important action, call `see` again before continuing.
|
||||
|
||||
Image-tool point:
|
||||
```json
|
||||
POST /v2/localize
|
||||
{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
|
||||
```
|
||||
## Method 3: Exec
|
||||
|
||||
3. **Act**
|
||||
Use `POST /exec` only for shell/system tasks.
|
||||
|
||||
```json
|
||||
POST /v2/act?screen=0
|
||||
{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
|
||||
```
|
||||
Rules:
|
||||
- Requires `x-clickthrough-exec-secret`.
|
||||
- Do not use exec for normal clicking/typing flows.
|
||||
- Prefer GUI interaction first; exec is fallback or explicit shell task.
|
||||
|
||||
4. **Verify**
|
||||
## Lightweight Procedure
|
||||
|
||||
```json
|
||||
POST /v2/act-verify?screen=0
|
||||
{
|
||||
"action":{"action":"click","target":{"resolved_target_id":"..."}},
|
||||
"condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
|
||||
"risk_level":"low"
|
||||
}
|
||||
```
|
||||
1. `see` capture.
|
||||
2. If needed, `see/zoom` refine.
|
||||
3. `interact` one step.
|
||||
4. `see` verify.
|
||||
5. Repeat.
|
||||
|
||||
## Risk policy
|
||||
## Quick Safety Rules
|
||||
|
||||
- Low risk (navigation, focus, benign clicks): single verification signal.
|
||||
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
|
||||
- Never do speculative repeat clicks; switch strategy after one failed verify.
|
||||
|
||||
## Anti-latency rules
|
||||
|
||||
- Never repeat full-screen OCR by default.
|
||||
- Re-observe only the active pane/region.
|
||||
- Prefer keyboard + window APIs for app switching.
|
||||
- Use OCR on region only and cap area with `max_ocr_area_px`.
|
||||
|
||||
## Setup and auth
|
||||
|
||||
- Include `x-clickthrough-token` when token auth is enabled.
|
||||
- `/exec` additionally requires `x-clickthrough-exec-secret`.
|
||||
- Validate server first: `GET /health`.
|
||||
- Never click with stale screenshots.
|
||||
- Never send multiple uncertain clicks in a row.
|
||||
- If localization is ambiguous, re-capture with a tighter zoom.
|
||||
|
||||
Reference in New Issue
Block a user