refactor: simplify to see/interact/exec and split server modules
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
62
README.md
@@ -1,49 +1,37 @@
 # Clickthrough

-Let an agent interact with a computer over HTTP.
+Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer by repeatedly capturing structured screenshots with coordinate-aware grids (`see`), executing precise mouse/keyboard actions from those coordinates (`interact`), and optionally running authenticated shell commands for system-level tasks (`exec`) under a consistent response contract.

-## Primary mode (v2)
+## Core Methods

-Use the v2 contract for faster, less OCR-heavy control loops:
-- `POST /v2/observe`
-- `POST /v2/localize`
-- `POST /v2/act`
-- `POST /v2/act-verify`
+- `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
+- `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
+- `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
+- `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.

-This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.
+## Why this works for AI agents

-## What this provides
+- Agents do not need live vision; they iterate on snapshots.
+- Grid metadata bridges image understanding to deterministic click coordinates.
+- Interaction stays explicit and auditable (one action per request).
+- A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.

-- Screen/region capture with optional OCR and timing stats
-- Observation IDs for deterministic follow-up localization
-- Text localization and image-tool coordinate localization
-- Action execution with resolved target IDs
-- Risk-aware action+verification defaults
-- Unified response envelope across all endpoints
+## Minimal Agent Loop

-## Quick start
+1. Call `see` with a coarse grid.
+2. If uncertain, call `see/zoom` with a denser grid.
+3. Call `interact` once.
+4. Call `see` again to verify state change.
+5. Use `exec` only for explicit shell/system tasks.

-```bash
-cd /root/external-projects/clickthrough
-python3 -m venv .venv
-. .venv/bin/activate
-pip install -r requirements.txt
-CLICKTHROUGH_TOKEN=change-me python -m server.app
-```
+## Safety and Auth

-Server defaults to `127.0.0.1:8123`.
+- `x-clickthrough-token` protects API access when enabled.
+- `x-clickthrough-exec-secret` is required for `/exec`.
+- Optional dry-run and allowed-region constraints reduce accidental risk.

-## Fast control loop
+## Docs

-1. `POST /v2/observe` on a tight region
-2. If OCR is enough, `POST /v2/localize` with `text_query`
-3. If ambiguous, ask image tool for one x,y in observation bounds
-4. `POST /v2/localize` with `image_tool_point`
-5. `POST /v2/act` or `POST /v2/act-verify`
-6. Re-observe only changed region
-
-## See docs
-
-- `docs/API.md`
-- `skill/SKILL.md`
-- `docs/coordinate-system.md`
+- API: `docs/API.md`
+- Agent procedure: `skill/SKILL.md`
+- Coordinate system details: `docs/coordinate-system.md`
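The "Minimal Agent Loop" in the new README can be sketched as one see → interact → see cycle. This is a rough illustration, not the project's client: `Post`, `unwrap`, and `click_cell` are hypothetical names, the transport is injected so the sketch runs without a server, and the `/interact` response shape is assumed to follow the `ok`/`data`/`error` envelope described in the README.

```python
from typing import Any, Callable

# Hypothetical transport: (path, json_body) -> decoded JSON envelope.
# In practice this could wrap an HTTP client; here it is injected for testing.
Post = Callable[[str, dict], dict]


def unwrap(envelope: dict) -> Any:
    """Return `data` from an {ok, data, error} envelope, or raise on failure."""
    if not envelope.get("ok"):
        raise RuntimeError(f"request failed: {envelope.get('error')}")
    return envelope.get("data")


def click_cell(post: Post, row: int, col: int, rows: int = 12, cols: int = 12) -> dict:
    """Run one see -> interact -> see cycle that clicks a single grid cell."""
    see_body = {"screen": 0, "with_grid": True, "grid_rows": rows, "grid_cols": cols}

    # 1. Capture the screen with a coarse grid and read back the region.
    before = unwrap(post("/see", see_body))
    region = before["meta"]["region"]

    # 2. Perform exactly one action, addressed by grid cell.
    target = {
        "mode": "grid",
        "region_x": region["x"],
        "region_y": region["y"],
        "region_width": region["width"],
        "region_height": region["height"],
        "rows": rows,
        "cols": cols,
        "row": row,
        "col": col,
    }
    action = {"action": "click", "target": target}
    result = unwrap(post("/interact", {"screen": 0, "action": action}))

    # 3. Capture again so the caller can verify the state change.
    after = unwrap(post("/see", see_body))
    return {"result": result, "after": after}
```

Keeping the transport as a plain callable means the same loop can be exercised against a stub in tests and against the real server in use.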
232
docs/API.md
@@ -1,116 +1,21 @@
-# API Reference (v2)
+# API Reference

 Base URL: `http://127.0.0.1:8123`

-If `CLICKTHROUGH_TOKEN` is set, include:
+Auth header when enabled:

 ```http
 x-clickthrough-token: <token>
 ```

-## Endpoints
+This API is intended for AI computer control through 3 methods only:
+- `see`
+- `interact`
+- `exec`

-- `POST /v2/observe`
-- `POST /v2/localize`
-- `POST /v2/act`
-- `POST /v2/act-verify`
-- `GET /health`
-- `GET /displays`
-- `GET /windows`
-- `POST /windows/action`
-- `POST /launch`
-- `POST /exec`
+All responses use one envelope.

-No v1 endpoints are supported.
-
-## `POST /v2/observe`
-
-```json
-{
-  "mode": "region",
-  "region_x": 800,
-  "region_y": 420,
-  "region_width": 700,
-  "region_height": 420,
-  "include_image": true,
-  "image_format": "jpeg",
-  "jpeg_quality": 75,
-  "ocr_mode": "region",
-  "language_hint": "eng",
-  "min_confidence": 0.45,
-  "max_ocr_area_px": 1500000,
-  "group_lines": true
-}
-```
-
-Returns observation metadata, optional image, OCR blocks/lines, and timing fields.
-
-## `POST /v2/localize`
-
-Text localization:
-
-```json
-{
-  "observation_id": "...",
-  "text_query": "Save",
-  "text_match": "exact",
-  "candidate_index": 0
-}
-```
-
-Image-tool point localization:
-
-```json
-{
-  "observation_id": "...",
-  "image_tool_point": {"x": 312, "y": 188}
-}
-```
-
-Returns `resolved_target_id`, global pixel, and `localization_confidence`.
-
-## `POST /v2/act`
-
-```json
-{
-  "action": {
-    "action": "click",
-    "target": {"resolved_target_id": "..."},
-    "button": "left",
-    "clicks": 1
-  }
-}
-```
-
-## `POST /v2/act-verify`
-
-```json
-{
-  "action": {
-    "action": "click",
-    "target": {"resolved_target_id": "..."}
-  },
-  "condition": {
-    "kind": "text",
-    "mode": "region",
-    "text": "Saved",
-    "match": "contains",
-    "present": true,
-    "region_x": 820,
-    "region_y": 420,
-    "region_width": 500,
-    "region_height": 140,
-    "min_confidence": 0.4
-  },
-  "risk_level": "low"
-}
-```
-
-Risk defaults:
-- `low`: retries `0`, timeout `2500ms`
-- `high`: retries `1`, timeout `6000ms`
-
-## Response envelope
+## Response Envelope

 Success:

@@ -133,9 +38,124 @@ Error:
   "time_ms": 1710000000000,
   "data": null,
   "error": {
-    "code": "http_error",
-    "message": "...",
-    "details": {}
+    "code": "validation_error",
+    "message": "request validation failed",
+    "details": []
   }
 }
 ```
+
+## 1) See
+
+### `POST /see`
+Capture a full screen or a region. Optional grid overlay returns coordinate metadata for click mapping.
+
+```json
+{
+  "screen": 0,
+  "region_x": null,
+  "region_y": null,
+  "region_width": null,
+  "region_height": null,
+  "with_grid": true,
+  "grid_rows": 12,
+  "grid_cols": 12,
+  "include_labels": true,
+  "image_format": "png",
+  "jpeg_quality": 85
+}
+```
+
+Returns:
+- `data.image.base64`
+- `data.meta.region` (global desktop coords)
+- `data.meta.grid` (rows/cols/cell size + formula)
+
+### `POST /see/zoom`
+Capture a tighter crop around a global point and draw another grid over that crop.
+
+```json
+{
+  "screen": 0,
+  "center_x": 1200,
+  "center_y": 720,
+  "width": 500,
+  "height": 350,
+  "with_grid": true,
+  "grid_rows": 20,
+  "grid_cols": 20,
+  "include_labels": true,
+  "image_format": "png",
+  "jpeg_quality": 90
+}
+```
+
+Use this for precision before clicking tiny controls.
+
+## 2) Interact
+
+### `POST /interact`
+Mouse/keyboard action execution.
+
+```json
+{
+  "screen": 0,
+  "action": {
+    "action": "click",
+    "target": {
+      "mode": "grid",
+      "region_x": 0,
+      "region_y": 0,
+      "region_width": 1920,
+      "region_height": 1080,
+      "rows": 12,
+      "cols": 12,
+      "row": 7,
+      "col": 3,
+      "dx": 0.0,
+      "dy": 0.0
+    },
+    "button": "left",
+    "clicks": 1
+  }
+}
+```
+
+Supported actions:
+- `move`, `click`, `right_click`, `double_click`, `middle_click`
+- `scroll` (`scroll_amount`)
+- `type` (`text`, `interval_ms`)
+- `hotkey` (`keys`)
+
+Target modes:
+- `pixel`: absolute global `x,y`
+- `grid`: grid cell from a `see`/`see/zoom` response
+
+## 3) Exec
+
+### `POST /exec`
+Run host shell commands (PowerShell/Bash/CMD).
+
+```json
+{
+  "command": "Get-Process | Select-Object -First 5",
+  "shell": "powershell",
+  "timeout_s": 20,
+  "cwd": "C:/Users/Paul",
+  "dry_run": false
+}
+```
+
+Required header:
+
+```http
+x-clickthrough-exec-secret: <secret>
+```
+
+## Minimal Procedure for Agents
+
+1. `see` full screen with coarse grid.
+2. If uncertain, `see/zoom` target area with denser grid.
+3. `interact` one action.
+4. `see` again to confirm state change.
+5. Use `exec` only when GUI interaction is not the right tool.
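The `grid` target mode maps a cell index to a global pixel. As a worked example, the sketch below mirrors the arithmetic in `resolve_target` from `server/services.py` in this commit (cell center plus a fractional `dx`/`dy` offset in `[-1, 1]`). The function name `grid_to_pixel` and the plain-dict `region` argument are illustrative assumptions, not part of the API.

```python
def grid_to_pixel(region: dict, rows: int, cols: int, row: int, col: int,
                  dx: float = 0.0, dy: float = 0.0) -> tuple[int, int]:
    """Map a zero-based grid cell to a global pixel.

    dx/dy shift the point inside the cell: 0 is the center,
    -1 and +1 are the cell's left/top and right/bottom edges.
    """
    cell_w = region["width"] / cols
    cell_h = region["height"] / rows
    x = region["x"] + int(round((col + 0.5 + dx * 0.5) * cell_w))
    y = region["y"] + int(round((row + 0.5 + dy * 0.5) * cell_h))
    return x, y


# The request shown above: 1920x1080 region, 12x12 grid, row 7, col 3.
full_hd = {"x": 0, "y": 0, "width": 1920, "height": 1080}
print(grid_to_pixel(full_hd, 12, 12, row=7, col=3))  # (560, 675)
```

With a 12x12 grid on 1920x1080, each cell is 160x90 pixels, so the center of cell (7, 3) is at x = 3.5 * 160 = 560 and y = 7.5 * 90 = 675.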
@@ -15,24 +15,25 @@ if TOKEN:
 def main():
     health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
     health.raise_for_status()
-    print("health ok:", health.json().get("ok"))
+    print("health:", health.json()["data"])

-    observe = requests.post(
-        f"{BASE_URL}/v2/observe",
+    see = requests.post(
+        f"{BASE_URL}/see",
         headers=headers,
-        params={"screen": SCREEN},
         json={
-            "mode": "screen",
-            "include_image": False,
-            "ocr_mode": "none",
+            "screen": SCREEN,
+            "with_grid": True,
+            "grid_rows": 12,
+            "grid_cols": 12,
+            "image_format": "jpeg",
+            "jpeg_quality": 70,
         },
-        timeout=20,
+        timeout=30,
     )
-    observe.raise_for_status()
-    payload = observe.json()["data"]
-    print("observation_id:", payload["observation_id"])
-    print("region:", payload["region"])
-    print("timing_ms:", payload["timing_ms"])
+    see.raise_for_status()
+    payload = see.json()["data"]
+    print("region:", payload["meta"]["region"])
+    print("grid:", payload["meta"].get("grid", {}))


 if __name__ == "__main__":
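The script above builds `headers` conditionally (the hunk context shows `if TOKEN:`). A minimal sketch of that idea, extended to the `/exec` secret header named in the docs, might look like the following. `build_headers` is a hypothetical helper, not code from this commit; only the two header names come from the source.

```python
def build_headers(token: str = "", exec_secret: str = "") -> dict:
    """Build request headers, adding each auth header only when a value is set."""
    headers = {}
    if token:
        # Protects general API access when the server has a token configured.
        headers["x-clickthrough-token"] = token
    if exec_secret:
        # Required by the server for POST /exec.
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers
```

Omitting empty values keeps unauthenticated local setups working without sending blank headers.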
1773
server/app.py
File diff suppressed because it is too large
42
server/config.py
Normal file
@@ -0,0 +1,42 @@
+import os
+from typing import Optional
+
+from dotenv import load_dotenv
+
+
+load_dotenv(dotenv_path=".env", override=False)
+
+
+def _env_bool(name: str, default: bool) -> bool:
+    raw = os.getenv(name)
+    if raw is None:
+        return default
+    return raw.strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _parse_allowed_region() -> Optional[tuple[int, int, int, int]]:
+    raw = os.getenv("CLICKTHROUGH_ALLOWED_REGION")
+    if not raw:
+        return None
+    parts = [p.strip() for p in raw.split(",")]
+    if len(parts) != 4:
+        raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
+    x, y, w, h = (int(p) for p in parts)
+    if w <= 0 or h <= 0:
+        raise ValueError("CLICKTHROUGH_ALLOWED_REGION width/height must be > 0")
+    return x, y, w, h
+
+
+SETTINGS = {
+    "host": os.getenv("CLICKTHROUGH_HOST", "127.0.0.1"),
+    "port": int(os.getenv("CLICKTHROUGH_PORT", "8123")),
+    "token": os.getenv("CLICKTHROUGH_TOKEN", "").strip(),
+    "dry_run": _env_bool("CLICKTHROUGH_DRY_RUN", False),
+    "allowed_region": _parse_allowed_region(),
+    "exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
+    "exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
+    "exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
+    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
+    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
+    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
+}
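To see how these settings parse in practice, the sketch below restates the two helpers as pure functions that take the raw string instead of reading the environment. `parse_bool` and `parse_region` are illustrative stand-ins for `_env_bool` and `_parse_allowed_region`, written under the assumption that the logic is exactly as shown in the diff.

```python
from typing import Optional

TRUTHY = {"1", "true", "yes", "on"}


def parse_bool(raw: Optional[str], default: bool) -> bool:
    """Unset (None) falls back to the default; any set value must be truthy."""
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def parse_region(raw: Optional[str]) -> Optional[tuple[int, int, int, int]]:
    """Parse 'x,y,width,height'; empty or unset means no restriction."""
    if not raw:
        return None
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("region must be x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("region width/height must be > 0")
    return x, y, w, h


print(parse_bool(" YES ", False))        # True
print(parse_bool("", True))              # False: set-but-empty is not truthy
print(parse_region("0, 0, 1920, 1080"))  # (0, 0, 1920, 1080)
```

One consequence worth noting: a variable that is set to an empty string yields `False` even when the default is `True`, so `CLICKTHROUGH_EXEC_ENABLED=` would disable exec rather than leave it at its default.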
124
server/models.py
Normal file
@@ -0,0 +1,124 @@
+from typing import Literal, Optional
+
+from pydantic import BaseModel, Field, model_validator
+
+
+class PixelTarget(BaseModel):
+    mode: Literal["pixel"]
+    x: int
+    y: int
+    dx: int = 0
+    dy: int = 0
+
+
+class GridTarget(BaseModel):
+    mode: Literal["grid"]
+    region_x: int
+    region_y: int
+    region_width: int = Field(gt=0)
+    region_height: int = Field(gt=0)
+    rows: int = Field(gt=0)
+    cols: int = Field(gt=0)
+    row: int = Field(ge=0)
+    col: int = Field(ge=0)
+    dx: float = 0.0
+    dy: float = 0.0
+
+    @model_validator(mode="after")
+    def _validate_indices(self):
+        if self.row >= self.rows or self.col >= self.cols:
+            raise ValueError("row/col must be inside rows/cols")
+        if not -1.0 <= self.dx <= 1.0:
+            raise ValueError("dx must be in [-1, 1]")
+        if not -1.0 <= self.dy <= 1.0:
+            raise ValueError("dy must be in [-1, 1]")
+        return self
+
+
+Target = PixelTarget | GridTarget
+
+
+class ActionRequest(BaseModel):
+    action: Literal[
+        "move",
+        "click",
+        "right_click",
+        "double_click",
+        "middle_click",
+        "scroll",
+        "type",
+        "hotkey",
+    ]
+    target: Optional[Target] = None
+    duration_ms: int = Field(default=0, ge=0, le=20000)
+    button: Literal["left", "right", "middle"] = "left"
+    clicks: int = Field(default=1, ge=1, le=10)
+    scroll_amount: int = 0
+    text: str = ""
+    keys: list[str] = Field(default_factory=list)
+    interval_ms: int = Field(default=20, ge=0, le=5000)
+    dry_run: bool = False
+
+
+class ExecRequest(BaseModel):
+    command: str = Field(min_length=1, max_length=10000)
+    shell: Literal["powershell", "bash", "cmd"] | None = None
+    timeout_s: int | None = Field(default=None, ge=1, le=600)
+    cwd: str | None = None
+    dry_run: bool = False
+
+
+class WindowQuery(BaseModel):
+    title_contains: str | None = Field(default=None, max_length=512)
+    title_regex: str | None = Field(default=None, max_length=512)
+    process_name: str | None = Field(default=None, max_length=260)
+    hwnd: int | None = Field(default=None, ge=1)
+    visible_only: bool = True
+
+
+class WindowActionRequest(WindowQuery):
+    action: Literal["focus", "restore", "minimize", "maximize", "close"]
+    timeout_ms: int = Field(default=3000, ge=0, le=60000)
+
+
+class LaunchRequest(BaseModel):
+    executable: str = Field(min_length=1, max_length=2048)
+    args: list[str] = Field(default_factory=list, max_length=100)
+    cwd: str | None = None
+    wait_for_window: bool = False
+    match: WindowQuery | None = None
+    timeout_ms: int = Field(default=5000, ge=0, le=120000)
+    dry_run: bool = False
+
+
+class SeeRequest(BaseModel):
+    screen: int = 0
+    region_x: int | None = Field(default=None, ge=0)
+    region_y: int | None = Field(default=None, ge=0)
+    region_width: int | None = Field(default=None, gt=0)
+    region_height: int | None = Field(default=None, gt=0)
+    with_grid: bool = True
+    grid_rows: int = Field(default=12, ge=1, le=300)
+    grid_cols: int = Field(default=12, ge=1, le=300)
+    include_labels: bool = True
+    image_format: Literal["png", "jpeg"] = "png"
+    jpeg_quality: int = Field(default=85, ge=1, le=100)
+
+
+class SeeZoomRequest(BaseModel):
+    screen: int = 0
+    center_x: int = Field(ge=0)
+    center_y: int = Field(ge=0)
+    width: int = Field(default=500, ge=10)
+    height: int = Field(default=350, ge=10)
+    with_grid: bool = True
+    grid_rows: int = Field(default=20, ge=1, le=300)
+    grid_cols: int = Field(default=20, ge=1, le=300)
+    include_labels: bool = True
+    image_format: Literal["png", "jpeg"] = "png"
+    jpeg_quality: int = Field(default=90, ge=1, le=100)
+
+
+class InteractRequest(BaseModel):
+    screen: int = 0
+    action: ActionRequest
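The bounds that `GridTarget` enforces can be illustrated without pydantic. The following stdlib-only sketch reproduces the same checks (positive grid size, zero-based `row`/`col` inside the grid, `dx`/`dy` in `[-1, 1]`). `GridCell` is a hypothetical stand-in for illustration; the real model also carries the region fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridCell:
    """Stdlib stand-in for the index and offset checks in GridTarget."""
    rows: int
    cols: int
    row: int
    col: int
    dx: float = 0.0
    dy: float = 0.0

    def __post_init__(self) -> None:
        if self.rows <= 0 or self.cols <= 0:
            raise ValueError("rows/cols must be > 0")
        # Indices are zero-based, so the last valid row is rows - 1.
        if not (0 <= self.row < self.rows and 0 <= self.col < self.cols):
            raise ValueError("row/col must be inside rows/cols")
        if not -1.0 <= self.dx <= 1.0:
            raise ValueError("dx must be in [-1, 1]")
        if not -1.0 <= self.dy <= 1.0:
            raise ValueError("dy must be in [-1, 1]")


GridCell(rows=12, cols=12, row=7, col=3)       # accepted
try:
    GridCell(rows=12, cols=12, row=12, col=0)  # row 12 is out of range
except ValueError as exc:
    print(exc)  # row/col must be inside rows/cols
```

A common agent mistake is to send a one-based index from the label overlay; because labels are drawn as zero-based `r,c`, the label text can be passed through unchanged.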
462
server/services.py
Normal file
@@ -0,0 +1,462 @@
+import ctypes
+import io
+import os
+import re
+import subprocess
+import sys
+import time
+from typing import Literal
+
+from fastapi import HTTPException
+from PIL import ImageChops, ImageStat
+
+from .config import SETTINGS
+from .models import ActionRequest, GridTarget, LaunchRequest, PixelTarget, Target, WindowActionRequest, WindowQuery
+
+
+def import_capture_libs():
+    try:
+        from PIL import Image, ImageDraw
+        import mss
+
+        return Image, ImageDraw, mss
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
+
+
+def display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
+    return {
+        "screen": screen,
+        "mss_index": mss_index,
+        "primary": primary,
+        "x": mon["left"],
+        "y": mon["top"],
+        "width": mon["width"],
+        "height": mon["height"],
+    }
+
+
+def ordered_displays(sct) -> list[dict]:
+    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
+    if not raw_monitors:
+        raise HTTPException(status_code=500, detail="no displays detected")
+
+    primary_pos = next((idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0), 0)
+    ordered = [raw_monitors[primary_pos]] + [item for idx, item in enumerate(raw_monitors) if idx != primary_pos]
+    return [display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0)) for index, (mss_index, mon) in enumerate(ordered)]
+
+
+def get_displays() -> list[dict]:
+    _, _, mss = import_capture_libs()
+    with mss.mss() as sct:
+        return ordered_displays(sct)
+
+
+def select_display(screen: int) -> tuple[dict, list[dict], dict]:
+    displays = get_displays()
+    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
+    return selected, displays, {"requested": screen, "selected": selected["screen"], "fallback": selected["screen"] != screen}
+
+
+def capture_screen(screen: int = 0):
+    Image, _, mss = import_capture_libs()
+    with mss.mss() as sct:
+        displays = ordered_displays(sct)
+        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
+        shot = sct.grab({"left": mon["x"], "top": mon["y"], "width": mon["width"], "height": mon["height"]})
+        image = Image.frombytes("RGB", shot.size, shot.rgb)
+    selection = {"requested": screen, "selected": mon["screen"], "fallback": mon["screen"] != screen}
+    return image, mon, displays, selection
+
+
+def capture_region_image(screen: int, region_x: int | None, region_y: int | None, region_width: int | None, region_height: int | None):
+    base_img, mon, displays, screen_selection = capture_screen(screen)
+    if None in {region_x, region_y, region_width, region_height}:
+        return base_img, {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}, mon, displays, screen_selection
+
+    left = region_x - mon["x"]
+    top = region_y - mon["y"]
+    right = left + region_width
+    bottom = top + region_height
+    if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
+        raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
+
+    crop = base_img.crop((left, top, right, bottom))
+    return crop, {"x": region_x, "y": region_y, "width": region_width, "height": region_height}, mon, displays, screen_selection
+
+
+def serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
+    buf = io.BytesIO()
+    if image_format == "jpeg":
+        image.save(buf, format="JPEG", quality=jpeg_quality)
+    else:
+        image.save(buf, format="PNG")
+    return buf.getvalue()
+
+
+def encode_image(image, image_format: str, jpeg_quality: int) -> str:
+    import base64
+
+    return base64.b64encode(serialize_image(image, image_format, jpeg_quality)).decode("ascii")
+
+
+def draw_grid(image, region_x: int, region_y: int, rows: int, cols: int, include_labels: bool):
+    _, ImageDraw, _ = import_capture_libs()
+    out = image.copy()
+    draw = ImageDraw.Draw(out)
+    w, h = out.size
+    cell_w = w / cols
+    cell_h = h / rows
+
+    for c in range(1, cols):
+        x = int(round(c * cell_w))
+        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
+    for r in range(1, rows):
+        y = int(round(r * cell_h))
+        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
+
+    draw.rectangle([(0, 0), (w - 1, h - 1)], outline=(255, 0, 0), width=2)
+    if include_labels:
+        for r in range(rows):
+            for c in range(cols):
+                cx = int((c + 0.5) * cell_w)
+                cy = int((r + 0.5) * cell_h)
+                draw.text((cx - 12, cy - 6), f"{r},{c}", fill=(255, 255, 0))
+
+    meta = {
+        "region": {"x": region_x, "y": region_y, "width": w, "height": h},
+        "grid": {
+            "rows": rows,
+            "cols": cols,
+            "cell_width": cell_w,
+            "cell_height": cell_h,
+            "indexing": "zero-based",
+            "point_formula": {
+                "pixel_x": "region.x + ((col + 0.5 + dx*0.5) * cell_width)",
+                "pixel_y": "region.y + ((row + 0.5 + dy*0.5) * cell_height)",
+                "dx_range": "[-1,1]",
+                "dy_range": "[-1,1]",
+            },
+        },
+    }
+    return out, meta
+
+
+def resolve_target(target: Target) -> tuple[int, int, dict]:
+    if isinstance(target, PixelTarget):
+        x = target.x + target.dx
+        y = target.y + target.dy
+        return x, y, {"mode": "pixel", "source": target.model_dump()}
+
+    cell_w = target.region_width / target.cols
+    cell_h = target.region_height / target.rows
+    x = target.region_x + int(round((target.col + 0.5 + (target.dx * 0.5)) * cell_w))
+    y = target.region_y + int(round((target.row + 0.5 + (target.dy * 0.5)) * cell_h))
+    return x, y, {"mode": "grid", "source": target.model_dump(), "derived": {"cell_width": cell_w, "cell_height": cell_h}}
+
+
+def enforce_allowed_region(x: int, y: int):
+    region = SETTINGS["allowed_region"]
+    if region is None:
+        return
+    rx, ry, rw, rh = region
+    if not (rx <= x < rx + rw and ry <= y < ry + rh):
+        raise HTTPException(status_code=403, detail="point outside allowed region")
+
+
+def import_input_lib():
+    try:
+        import pyautogui
+
+        pyautogui.FAILSAFE = True
+        return pyautogui
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
+
+
+def exec_action(req: ActionRequest, screen: int = 0) -> dict:
+    run_dry = SETTINGS["dry_run"] or req.dry_run
+    selected_display, _, screen_selection = select_display(screen)
+    pyautogui = None if run_dry else import_input_lib()
+    resolved_target = None
+
+    if req.target is not None:
+        x, y, info = resolve_target(req.target)
+        enforce_allowed_region(x, y)
+        resolved_target = {"x": x, "y": y, "target_info": info}
+
+    duration_sec = req.duration_ms / 1000.0
+    if req.action in {"move", "click", "right_click", "double_click", "middle_click"} and resolved_target is None:
+        raise HTTPException(status_code=400, detail="target is required for pointer actions")
+    if req.action == "scroll" and resolved_target is None:
+        raise HTTPException(status_code=400, detail="target is required for scroll")
+
+    if not run_dry:
+        if req.action == "move":
+            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
+        elif req.action == "click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], clicks=req.clicks, interval=req.interval_ms / 1000.0, button=req.button, duration=duration_sec)
+        elif req.action == "right_click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="right", duration=duration_sec)
+        elif req.action == "double_click":
+            pyautogui.doubleClick(x=resolved_target["x"], y=resolved_target["y"], interval=req.interval_ms / 1000.0)
+        elif req.action == "middle_click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="middle", duration=duration_sec)
+        elif req.action == "scroll":
+            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
+            pyautogui.scroll(req.scroll_amount)
+        elif req.action == "type":
+            pyautogui.write(req.text, interval=req.interval_ms / 1000.0)
+        elif req.action == "hotkey":
+            if len(req.keys) < 1:
+                raise HTTPException(status_code=400, detail="keys is required for hotkey")
+            pyautogui.hotkey(*req.keys)
+
+    return {"action": req.action, "executed": not run_dry, "dry_run": run_dry, "screen": screen_selection, "display": selected_display, "resolved_target": resolved_target}
+
+
+def windows_only(feature: str):
+    if sys.platform != "win32":
+        raise HTTPException(status_code=501, detail=f"{feature} is currently supported on Windows hosts only")
+
+
+def tasklist_process_name(pid: int) -> str | None:
+    try:
+        completed = subprocess.run(["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"], capture_output=True, text=True, timeout=5, check=False)
+    except Exception:
+        return None
+    line = (completed.stdout or "").strip().splitlines()
+    if not line:
+        return None
+    row = line[0].strip()
+    if not row or row.startswith("INFO:"):
+        return None
+    if row.startswith('"') and '","' in row:
+        return row.split('","', 1)[0].strip('"')
+    return None
+
+
+def list_windows(query: WindowQuery | None = None) -> list[dict]:
+    windows_only("window endpoints")
+    query = query or WindowQuery()
+
+    user32 = ctypes.windll.user32
+    kernel32 = ctypes.windll.kernel32
+    psapi = ctypes.windll.psapi
+
+    user32.GetWindowTextLengthW.argtypes = [ctypes.c_void_p]
+    user32.GetWindowTextLengthW.restype = ctypes.c_int
+    user32.GetWindowTextW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
+    user32.GetWindowTextW.restype = ctypes.c_int
+    user32.IsWindowVisible.argtypes = [ctypes.c_void_p]
+    user32.IsWindowVisible.restype = ctypes.c_bool
+    user32.IsWindowEnabled.argtypes = [ctypes.c_void_p]
+    user32.IsWindowEnabled.restype = ctypes.c_bool
+    user32.IsIconic.argtypes = [ctypes.c_void_p]
+    user32.IsIconic.restype = ctypes.c_bool
+    user32.IsZoomed.argtypes = [ctypes.c_void_p]
+    user32.IsZoomed.restype = ctypes.c_bool
+    user32.GetForegroundWindow.restype = ctypes.c_void_p
+    user32.GetWindowRect.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.wintypes.RECT)]
+    user32.GetWindowRect.restype = ctypes.c_bool
+    user32.GetClassNameW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
+    user32.GetClassNameW.restype = ctypes.c_int
+
+    kernel32.OpenProcess.argtypes = [ctypes.wintypes.DWORD, ctypes.wintypes.BOOL, ctypes.wintypes.DWORD]
+    kernel32.OpenProcess.restype = ctypes.wintypes.HANDLE
+    kernel32.CloseHandle.argtypes = [ctypes.wintypes.HANDLE]
+    kernel32.CloseHandle.restype = ctypes.wintypes.BOOL
+    psapi.GetModuleBaseNameW.argtypes = [ctypes.wintypes.HANDLE, ctypes.wintypes.HMODULE, ctypes.c_wchar_p, ctypes.wintypes.DWORD]
+    psapi.GetModuleBaseNameW.restype = ctypes.wintypes.DWORD
+
+    foreground = int(user32.GetForegroundWindow() or 0)
+    results: list[dict] = []
+
+    def callback(hwnd, _lparam):
+        hwnd_int = int(hwnd)
+        if query.hwnd and hwnd_int != query.hwnd:
+            return True
+        visible = bool(user32.IsWindowVisible(hwnd))
+        if query.visible_only and not visible:
+            return True
+
+        length = user32.GetWindowTextLengthW(hwnd)
+        title_buf = ctypes.create_unicode_buffer(max(1, length + 1))
+        user32.GetWindowTextW(hwnd, title_buf, len(title_buf))
+        title = title_buf.value or ""
+
+        if query.title_contains and query.title_contains.lower() not in title.lower():
+            return True
+        if query.title_regex and re.search(query.title_regex, title, flags=re.IGNORECASE) is None:
+            return True
+
+        pid = ctypes.wintypes.DWORD(0)
+        user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
+        process_name = tasklist_process_name(pid.value)
+        if query.process_name and (process_name or "").lower() != query.process_name.lower():
+            return True
+
+        class_buf = ctypes.create_unicode_buffer(256)
+        user32.GetClassNameW(hwnd, class_buf, len(class_buf))
+        rect = ctypes.wintypes.RECT()
+        user32.GetWindowRect(hwnd, ctypes.byref(rect))
+
+        results.append(
+            {
+                "hwnd": hwnd_int,
+                "title": title,
+                "class_name": class_buf.value,
+                "pid": int(pid.value),
+                "process_name": process_name,
+                "visible": visible,
+                "enabled": bool(user32.IsWindowEnabled(hwnd)),
+                "minimized": bool(user32.IsIconic(hwnd)),
+                "maximized": bool(user32.IsZoomed(hwnd)),
+                "foreground": hwnd_int == foreground,
+                "rect": {"x": int(rect.left), "y": int(rect.top), "width": int(rect.right - rect.left), "height": int(rect.bottom - rect.top)},
+            }
+        )
+        return True
+
+    enum_proc = ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)(callback)
+    user32.EnumWindows(enum_proc, 0)
+    results.sort(key=lambda item: (not item["foreground"], item["title"].lower(), item["hwnd"]))
+    return results
+
+
|
||||
def _pick_single_window(query: WindowQuery) -> dict:
|
||||
matches = list_windows(query)
|
||||
if not matches:
|
||||
raise HTTPException(status_code=404, detail="no window matched")
|
||||
if len(matches) > 1:
|
||||
raise HTTPException(status_code=409, detail={"message": "multiple windows matched", "matches": matches[:10]})
|
||||
return matches[0]
|
||||
|
||||
|
def apply_window_action(req: WindowActionRequest) -> dict:
    windows_only("window endpoints")
    match = _pick_single_window(req)
    hwnd = match["hwnd"]
    user32 = ctypes.windll.user32

    SW_RESTORE, SW_MINIMIZE, SW_MAXIMIZE = 9, 6, 3
    WM_CLOSE = 0x0010

    if req.action == "focus":
        user32.ShowWindow(hwnd, SW_RESTORE)
        ok = bool(user32.SetForegroundWindow(hwnd))
        if not ok:
            raise HTTPException(status_code=500, detail="failed to focus window")
    elif req.action == "restore":
        user32.ShowWindow(hwnd, SW_RESTORE)
    elif req.action == "minimize":
        user32.ShowWindow(hwnd, SW_MINIMIZE)
    elif req.action == "maximize":
        user32.ShowWindow(hwnd, SW_MAXIMIZE)
    elif req.action == "close":
        user32.PostMessageW(hwnd, WM_CLOSE, 0, 0)

    deadline = time.time() + (req.timeout_ms / 1000.0)
    final = None
    while time.time() <= deadline:
        current = list_windows(WindowQuery(hwnd=hwnd, visible_only=False))
        if not current:
            if req.action == "close":
                return {"matched": match, "closed": True, "final": None}
            time.sleep(0.05)
            continue
        final = current[0]
        if req.action == "focus" and final.get("foreground"):
            break
        if req.action in {"restore", "minimize", "maximize"}:
            break
        time.sleep(0.05)

    return {"matched": match, "closed": False, "final": final}


def launch_app(req: LaunchRequest) -> dict:
    if req.cwd and not os.path.isdir(req.cwd):
        raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
    argv = [req.executable, *req.args]
    cwd = req.cwd or None

    if req.dry_run or SETTINGS["dry_run"]:
        return {"executed": False, "dry_run": True, "argv": argv, "cwd": cwd}

    try:
        proc = subprocess.Popen(argv, cwd=cwd)
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"executable not found: {exc}") from exc
    except OSError as exc:
        raise HTTPException(status_code=400, detail=f"failed to launch process: {exc}") from exc

    result = {"executed": True, "dry_run": False, "argv": argv, "cwd": cwd, "pid": proc.pid}
    if req.wait_for_window:
        query = req.match or WindowQuery(process_name=os.path.basename(req.executable), visible_only=True)
        deadline = time.time() + (req.timeout_ms / 1000.0)
        match = None
        while time.time() <= deadline:
            matches = list_windows(query)
            if matches:
                match = matches[0]
                break
            time.sleep(0.2)
        result["window"] = match
        result["window_found"] = match is not None
    return result


def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
    if len(text) <= limit:
        return text, False
    return text[:limit], True


def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
    if shell_name == "powershell":
        return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
    if shell_name == "bash":
        return ["bash", "-lc", command]
    if shell_name == "cmd":
        return ["cmd", "/c", command]
    raise HTTPException(status_code=400, detail="unsupported shell")


def exec_command(req):
    if not SETTINGS["exec_enabled"]:
        raise HTTPException(status_code=403, detail="exec endpoint disabled")
    if not SETTINGS["exec_secret"]:
        raise HTTPException(status_code=403, detail="exec secret not configured")

    shell_name = (req.shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
    if shell_name not in {"powershell", "bash", "cmd"}:
        raise HTTPException(status_code=400, detail="unsupported shell")

    run_dry = SETTINGS["dry_run"] or req.dry_run
    timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
    timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])

    cwd = None
    if req.cwd:
        cwd = os.path.abspath(req.cwd)
        if not os.path.isdir(cwd):
            raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")

    argv = _resolve_exec_program(shell_name, req.command)
    if run_dry:
        return {"executed": False, "dry_run": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd}

    start = time.time()
    try:
        completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired as exc:
        # The partial output attached to TimeoutExpired can be bytes even with text=True
        # (notably on POSIX), so decode it instead of calling str() on a bytes object.
        raw_stdout = exc.stdout.decode(errors="replace") if isinstance(exc.stdout, bytes) else (exc.stdout or "")
        raw_stderr = exc.stderr.decode(errors="replace") if isinstance(exc.stderr, bytes) else (exc.stderr or "")
        stdout, stdout_truncated = _truncate_text(raw_stdout, SETTINGS["exec_max_output_chars"])
        stderr, stderr_truncated = _truncate_text(raw_stderr, SETTINGS["exec_max_output_chars"])
        return {"executed": True, "timed_out": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": None, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc

    stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
    stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
    return {"executed": True, "timed_out": False, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": completed.returncode, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}

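The timeout clamp and output truncation used by `exec_command` can be tried in isolation. The following is a minimal sketch, not the server's code: `run_limited`, `MAX_TIMEOUT_S`, and `MAX_OUTPUT_CHARS` are hypothetical stand-ins for the request handler and the `SETTINGS` entries, and the child process is the current Python interpreter rather than a shell.

```python
import subprocess
import sys

# Hypothetical stand-ins for SETTINGS["exec_max_timeout_s"] and
# SETTINGS["exec_max_output_chars"]; the real values come from server config.
MAX_TIMEOUT_S = 30.0
MAX_OUTPUT_CHARS = 10


def truncate_text(text: str, limit: int) -> tuple[str, bool]:
    """Return the text cut to `limit` characters and whether it was cut."""
    if len(text) <= limit:
        return text, False
    return text[:limit], True


def run_limited(argv: list[str], timeout_s: float) -> dict:
    """Run argv with a clamped timeout and report truncated stdout."""
    # A caller may request any timeout; the configured maximum always wins.
    timeout_s = min(timeout_s, MAX_TIMEOUT_S)
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
    stdout, stdout_truncated = truncate_text(completed.stdout or "", MAX_OUTPUT_CHARS)
    return {
        "timeout_s": timeout_s,
        "exit_code": completed.returncode,
        "stdout": stdout,
        "stdout_truncated": stdout_truncated,
    }


# The child prints 50 characters; only the first 10 are kept.
result = run_limited([sys.executable, "-c", "print('x' * 50)"], timeout_s=60.0)
print(result)
```

Reporting a `stdout_truncated` flag next to the text lets a caller tell a short output from a clipped one without comparing lengths.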
115
skill/SKILL.md
@@ -1,97 +1,60 @@
 ---
 name: clickthrough-http-control
-description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
+description: Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell).
 ---
 
-# Clickthrough HTTP Control (v2)
+# Clickthrough Computer Control
 
-Agents do not see live desktop video. They operate on snapshots.
-Use this loop: **observe -> localize -> act -> verify**.
+Use exactly 3 methods:
+- `see`
+- `interact`
+- `exec`
 
-## Fast defaults
+## Method 1: See
 
-- Start with `POST /v2/observe` on a tight region, not full screen.
-- Set `ocr_mode` to `none` unless text is required immediately.
-- Use `image` tool localization for icon-heavy or dense controls.
-- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
-
-## Mandatory image-tool click localization
-
-When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
-
-Prompt template:
-- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
+Use `POST /see` to capture full screen or a region with a grid overlay.
+Use `POST /see/zoom` to capture a tighter crop with a denser grid.
 
 Rules:
-- Ask for one point only.
-- Include bounds in the prompt.
-- If answer is not parseable `x,y`, re-ask once with stricter format.
-- Send returned point to `POST /v2/localize` via `image_tool_point`.
+- Start with coarse grid (`12x12`).
+- For precision, zoom and use denser grid (`20x20` or higher).
+- Always use returned `meta.region` and `meta.grid` when computing click targets.
+- Coordinates are global desktop coordinates.
 
-## API playbook
+## Method 2: Interact
 
-1. **Observe**
+Use `POST /interact` for one action at a time.
 
-```json
-POST /v2/observe?screen=0
-{
-"mode": "region",
-"region_x": 820,
-"region_y": 420,
-"region_width": 700,
-"region_height": 420,
-"include_image": true,
-"ocr_mode": "none"
-}
-```
+Mouse actions:
+- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`
 
-2. **Localize** (choose one)
+Keyboard actions:
+- `type`, `hotkey`
 
-Text:
-```json
-POST /v2/localize
-{"observation_id":"...","text_query":"Save","text_match":"exact"}
-```
+Rules:
+- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
+- Use `pixel` only when you already have reliable coordinates.
+- After each important action, call `see` again before continuing.
 
-Image-tool point:
-```json
-POST /v2/localize
-{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
-```
+## Method 3: Exec
 
-3. **Act**
+Use `POST /exec` only for shell/system tasks.
 
-```json
-POST /v2/act?screen=0
-{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
-```
+Rules:
+- Requires `x-clickthrough-exec-secret`.
+- Do not use exec for normal clicking/typing flows.
+- Prefer GUI interaction first; exec is fallback or explicit shell task.
 
-4. **Verify**
+## Lightweight Procedure
 
-```json
-POST /v2/act-verify?screen=0
-{
-"action":{"action":"click","target":{"resolved_target_id":"..."}},
-"condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
-"risk_level":"low"
-}
-```
+1. `see` capture.
+2. If needed, `see/zoom` refine.
+3. `interact` one step.
+4. `see` verify.
+5. Repeat.
 
-## Risk policy
+## Quick Safety Rules
 
-- Low risk (navigation, focus, benign clicks): single verification signal.
-- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
-- Never do speculative repeat clicks; switch strategy after one failed verify.
-
-## Anti-latency rules
-
-- Never repeat full-screen OCR by default.
-- Re-observe only the active pane/region.
-- Prefer keyboard + window APIs for app switching.
-- Use OCR on region only and cap area with `max_ocr_area_px`.
-
-## Setup and auth
-
-- Include `x-clickthrough-token` when token auth is enabled.
-- `/exec` additionally requires `x-clickthrough-exec-secret`.
-- Validate server first: `GET /health`.
+- Never click with stale screenshots.
+- Never send multiple uncertain clicks in a row.
+- If localization is ambiguous, re-capture with a tighter zoom.
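The new skill text asks agents to derive click targets from the returned `meta.region` and `meta.grid` and notes that coordinates are global desktop coordinates. A minimal sketch of that arithmetic follows. The field layout is an assumption: the diff does not show the shape of `meta.region` or `meta.grid`, so `Region` (with `x`, `y`, `width`, `height`), the `rows`/`cols` counts, the zero-based cell indexing, and the `grid_cell_center` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Captured area in global desktop pixels (assumed shape of meta.region)."""
    x: int
    y: int
    width: int
    height: int


def grid_cell_center(region: Region, rows: int, cols: int, row: int, col: int) -> tuple[int, int]:
    """Return the global desktop pixel at the center of a zero-based grid cell."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("cell is outside the grid")
    cell_w = region.width / cols
    cell_h = region.height / rows
    # Offset by half a cell to land in the middle, then shift by the region
    # origin so the result is a desktop coordinate, not an image coordinate.
    x = region.x + int((col + 0.5) * cell_w)
    y = region.y + int((row + 0.5) * cell_h)
    return x, y


# A 720x480 region at (820, 420) with a coarse 12x12 grid has 60x40 pixel cells.
region = Region(x=820, y=420, width=720, height=480)
top_left = grid_cell_center(region, rows=12, cols=12, row=0, col=0)
bottom_right = grid_cell_center(region, rows=12, cols=12, row=11, col=11)
print(top_left, bottom_right)
```

Adding the region origin is the step that matters after a `see/zoom` capture: a cell index is only meaningful relative to the crop it came from, so the same index maps to a different desktop pixel for every capture.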