Compare commits
11 Commits
bb247aaad2...fix/ocr-te
| Author | SHA1 | Date | |
|---|---|---|---|
| | a8f2e01bb9 | | |
| | dccf7b209a | | |
| | 89cf228d13 | | |
| | a6d7e37beb | | |
| | 097c6a095c | | |
| | 2955426f14 | | |
| | 3a49560e82 | | |
| | 2b84bf95f1 | | |
| | 1efe999331 | | |
| | 38c1127347 | | |
| | 930cdd2887 | | |
@@ -7,3 +7,11 @@ CLICKTHROUGH_DRY_RUN=false
 CLICKTHROUGH_GRID_ROWS=12
 CLICKTHROUGH_GRID_COLS=12
 # CLICKTHROUGH_ALLOWED_REGION=0,0,1920,1080
+
+CLICKTHROUGH_EXEC_ENABLED=true
+CLICKTHROUGH_EXEC_SECRET=replace-with-a-strong-random-secret
+CLICKTHROUGH_EXEC_DEFAULT_SHELL=powershell
+CLICKTHROUGH_EXEC_TIMEOUT_S=30
+CLICKTHROUGH_EXEC_MAX_TIMEOUT_S=120
+CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS=20000
+# CLICKTHROUGH_TESSERACT_CMD=/usr/bin/tesseract
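Review note: the `replace-with-a-strong-random-secret` placeholder has to be replaced before `/exec` is used. One way to produce a value is sketched below with Python's standard `secrets` module. The helper name `make_exec_secret` and the 32-byte default are illustrative assumptions, not something this branch defines.

```python
import secrets


def make_exec_secret(num_bytes: int = 32) -> str:
    """Return a URL-safe random string usable as CLICKTHROUGH_EXEC_SECRET.

    Hypothetical helper: the name and the 32-byte default are assumptions
    made for this example only.
    """
    return secrets.token_urlsafe(num_bytes)


# Build the line that would go into the repo-root .env file.
env_line = f"CLICKTHROUGH_EXEC_SECRET={make_exec_secret()}"
```

The resulting `env_line` can be pasted into `.env`, which `python-dotenv` loads at startup according to the README.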
21
LICENSE
Normal file
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Paul W.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
11
README.md
@@ -7,6 +7,8 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
+- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
+- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
 - **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
 - **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
 
@@ -22,6 +24,8 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
 
 Server defaults to `127.0.0.1:8123`.
 
+For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it lives somewhere weird.
+
 `python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
 
 ## Minimal API flow
@@ -48,6 +52,13 @@ Environment variables:
 - `CLICKTHROUGH_GRID_ROWS` (default `12`)
 - `CLICKTHROUGH_GRID_COLS` (default `12`)
 - `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
+- `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
+- `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
+- `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
+- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
+- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
+- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
+- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
 
 ## Gitea CI
 
11
TODO.md
@@ -17,5 +17,12 @@
 - CI workflow runs syntax checks on push + PR
 
 ## Next
-- Manual runtime test on a desktop session (capture + click loop)
-- Optional: add monitor selection and OCR helper endpoint
+- [x] Add `POST /exec` endpoint (PowerShell/Bash/CMD) with timeout + stdout/stderr
+- [x] Add exec configuration via env (`CLICKTHROUGH_EXEC_*`)
+- [x] Document exec API + config
+- [x] Create backlog issues for OCR/find/window/input/session-state improvements
+- [x] Open PR for exec feature branch and review/merge
+- [x] Require configured exec secret + per-request exec secret header
+- [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook
+- [x] Add top-level skill section for instance setup + mini API docs
+- [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs
101
docs/API.md
@@ -10,7 +10,7 @@ x-clickthrough-token: <token>
 
 ## `GET /health`
 
-Returns status and runtime safety flags.
+Returns status and runtime safety flags, including `exec` capability config.
 
 ## `GET /screen`
 
@@ -143,6 +143,105 @@ Hotkey:
 }
 ```
 
+## `POST /ocr`
+
+Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
+
+Body:
+
+```json
+{
+  "mode": "screen",
+  "language_hint": "eng",
+  "min_confidence": 0.4
+}
+```
+
+Modes:
+- `screen` (default): OCR over full captured monitor
+- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
+- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
+
+Region mode example:
+
+```json
+{
+  "mode": "region",
+  "region_x": 220,
+  "region_y": 160,
+  "region_width": 900,
+  "region_height": 400,
+  "language_hint": "eng",
+  "min_confidence": 0.5
+}
+```
+
+Image mode example:
+
+```json
+{
+  "mode": "image",
+  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
+  "language_hint": "eng"
+}
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "request_id": "...",
+  "time_ms": 1710000000000,
+  "result": {
+    "mode": "screen",
+    "language_hint": "eng",
+    "min_confidence": 0.4,
+    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
+    "blocks": [
+      {
+        "text": "Settings",
+        "confidence": 0.9821,
+        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
+      }
+    ]
+  }
+}
+```
+
+Notes:
+- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
+- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
+- Requires `tesseract` executable plus Python package `pytesseract`.
+- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
+
+## `POST /exec`
+
+Execute a shell command on the host running Clickthrough.
+
+Requirements:
+- `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
+- send header `x-clickthrough-exec-secret: <secret>`
+
+```json
+{
+  "command": "Get-Process | Select-Object -First 5",
+  "shell": "powershell",
+  "timeout_s": 20,
+  "cwd": "C:/Users/Paul",
+  "dry_run": false
+}
+```
+
+Notes:
+- `shell` supports `powershell`, `bash`, `cmd`
+- if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
+- output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
+- endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
+- if `CLICKTHROUGH_EXEC_SECRET` is missing, `/exec` is blocked (`403`)
+
+Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
+
 ## `POST /batch`
 
 Runs multiple `action` payloads sequentially.
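Review note: the `/exec` docs above require two headers per call (`x-clickthrough-token` and `x-clickthrough-exec-secret`) plus a JSON body. A minimal client-side sketch of assembling such a request with the standard library follows. `build_exec_request` is a hypothetical helper, and it defaults to `dry_run: true` as a cautious assumption; nothing is sent until the caller passes the request to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_exec_request(
    base_url: str,
    token: str,
    exec_secret: str,
    command: str,
    shell: str = "powershell",
    timeout_s: int = 20,
    dry_run: bool = True,
) -> urllib.request.Request:
    """Assemble (but do not send) a POST /exec request.

    Hypothetical helper for illustration; field names follow docs/API.md.
    """
    body = json.dumps(
        {
            "command": command,
            "shell": shell,
            "timeout_s": timeout_s,
            "dry_run": dry_run,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/exec",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # General API auth header.
            "x-clickthrough-token": token,
            # Extra per-request secret that only /exec requires.
            "x-clickthrough-exec-secret": exec_secret,
        },
    )
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before anything runs on the target host.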
@@ -4,3 +4,4 @@ python-dotenv>=1.0.1
 mss>=9.0.1
 pillow>=10.4.0
 pyautogui>=0.9.54
+pytesseract>=0.3.10
321
server/app.py
@@ -1,6 +1,8 @@
 import base64
+import hmac
 import io
 import os
+import subprocess
 import time
 import uuid
 from typing import Literal, Optional
@@ -43,6 +45,13 @@ SETTINGS = {
     "default_grid_rows": int(os.getenv("CLICKTHROUGH_GRID_ROWS", "12")),
     "default_grid_cols": int(os.getenv("CLICKTHROUGH_GRID_COLS", "12")),
     "allowed_region": _parse_allowed_region(),
+    "exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
+    "exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
+    "exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
+    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
+    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
+    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
+    "tesseract_cmd": os.getenv("CLICKTHROUGH_TESSERACT_CMD", "").strip(),
 }
 
 
@@ -130,6 +139,35 @@ class BatchRequest(BaseModel):
     stop_on_error: bool = True
 
 
+class ExecRequest(BaseModel):
+    command: str = Field(min_length=1, max_length=10000)
+    shell: Literal["powershell", "bash", "cmd"] | None = None
+    timeout_s: int | None = Field(default=None, ge=1, le=600)
+    cwd: str | None = None
+    dry_run: bool = False
+
+
+class OCRRequest(BaseModel):
+    mode: Literal["screen", "region", "image"] = "screen"
+    region_x: int | None = Field(default=None, ge=0)
+    region_y: int | None = Field(default=None, ge=0)
+    region_width: int | None = Field(default=None, gt=0)
+    region_height: int | None = Field(default=None, gt=0)
+    image_base64: str | None = None
+    language_hint: str | None = Field(default=None, min_length=1, max_length=64)
+    min_confidence: float = Field(default=0.0, ge=0.0, le=1.0)
+
+    @model_validator(mode="after")
+    def _validate_mode_inputs(self):
+        if self.mode == "region":
+            required = [self.region_x, self.region_y, self.region_width, self.region_height]
+            if any(v is None for v in required):
+                raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
+        if self.mode == "image" and not self.image_base64:
+            raise ValueError("image_base64 is required for mode=image")
+        return self
+
+
 def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
     token = SETTINGS["token"]
     if token and x_clickthrough_token != token:
@@ -259,6 +297,212 @@ def _import_input_lib():
         raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
 
 
+def _import_ocr_libs():
+    try:
+        import pytesseract
+        from pytesseract import Output
+
+        tesseract_cmd = SETTINGS["tesseract_cmd"]
+        if tesseract_cmd:
+            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
+
+        return pytesseract, Output
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"ocr backend unavailable: {exc}") from exc
+
+
+def _decode_image_base64(value: str):
+    Image, _, _ = _import_capture_libs()
+    payload = value.strip()
+    if payload.startswith("data:"):
+        parts = payload.split(",", 1)
+        if len(parts) != 2:
+            raise HTTPException(status_code=400, detail="invalid data URL image payload")
+        payload = parts[1]
+
+    try:
+        image_bytes = base64.b64decode(payload, validate=True)
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail="invalid image_base64 payload") from exc
+
+    try:
+        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail="unsupported or unreadable image bytes") from exc
+
+    return image
+
+
+def _run_ocr(image, language_hint: str | None, min_confidence: float, offset_x: int = 0, offset_y: int = 0) -> list[dict]:
+    pytesseract, Output = _import_ocr_libs()
+
+    config = "--oem 3 --psm 6"
+    kwargs = {
+        "image": image,
+        "output_type": Output.DICT,
+        "config": config,
+    }
+    if language_hint:
+        kwargs["lang"] = language_hint
+
+    try:
+        data = pytesseract.image_to_data(**kwargs)
+    except pytesseract.TesseractNotFoundError as exc:
+        raise HTTPException(status_code=500, detail="tesseract executable not found") from exc
+    except pytesseract.TesseractError as exc:
+        raise HTTPException(status_code=400, detail=f"ocr failed: {exc}") from exc
+
+    blocks = []
+    count = len(data.get("text", []))
+    for idx in range(count):
+        text = (data["text"][idx] or "").strip()
+        if not text:
+            continue
+
+        raw_conf = str(data["conf"][idx]).strip()
+        try:
+            conf_0_100 = float(raw_conf)
+        except ValueError:
+            conf_0_100 = -1.0
+        if conf_0_100 < 0:
+            continue
+
+        confidence = round(conf_0_100 / 100.0, 4)
+        if confidence < min_confidence:
+            continue
+
+        left = int(data["left"][idx])
+        top = int(data["top"][idx])
+        width = int(data["width"][idx])
+        height = int(data["height"][idx])
+
+        blocks.append(
+            {
+                "text": text,
+                "confidence": confidence,
+                "bbox": {
+                    "x": left + offset_x,
+                    "y": top + offset_y,
+                    "width": width,
+                    "height": height,
+                },
+                "_sort": [top + offset_y, left + offset_x, idx],
+            }
+        )
+
+    blocks.sort(key=lambda b: (b["_sort"][0], b["_sort"][1], b["_sort"][2]))
+    for block in blocks:
+        block.pop("_sort", None)
+    return blocks
+
+
+def _pick_shell(explicit_shell: str | None) -> str:
+    shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
+    if shell_name not in {"powershell", "bash", "cmd"}:
+        raise HTTPException(status_code=400, detail="unsupported shell")
+    return shell_name
+
+
+def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
+    if len(text) <= limit:
+        return text, False
+    return text[:limit], True
+
+
+def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
+    if shell_name == "powershell":
+        return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
+    if shell_name == "bash":
+        return ["bash", "-lc", command]
+    if shell_name == "cmd":
+        return ["cmd", "/c", command]
+    raise HTTPException(status_code=400, detail="unsupported shell")
+
+
+def _exec_command(req: ExecRequest) -> dict:
+    if not SETTINGS["exec_enabled"]:
+        raise HTTPException(status_code=403, detail="exec endpoint disabled")
+    if not SETTINGS["exec_secret"]:
+        raise HTTPException(status_code=403, detail="exec secret not configured")
+
+    run_dry = SETTINGS["dry_run"] or req.dry_run
+    shell_name = _pick_shell(req.shell)
+
+    timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
+    timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])
+
+    cwd = None
+    if req.cwd:
+        cwd = os.path.abspath(req.cwd)
+        if not os.path.isdir(cwd):
+            raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
+
+    argv = _resolve_exec_program(shell_name, req.command)
+
+    if run_dry:
+        return {
+            "executed": False,
+            "dry_run": True,
+            "shell": shell_name,
+            "command": req.command,
+            "argv": argv,
+            "timeout_s": timeout_s,
+            "cwd": cwd,
+        }
+
+    start = time.time()
+    try:
+        completed = subprocess.run(
+            argv,
+            cwd=cwd,
+            capture_output=True,
+            text=True,
+            timeout=timeout_s,
+            check=False,
+        )
+    except subprocess.TimeoutExpired as exc:
+        stdout = exc.stdout or ""
+        stderr = exc.stderr or ""
+        stdout, stdout_truncated = _truncate_text(str(stdout), SETTINGS["exec_max_output_chars"])
+        stderr, stderr_truncated = _truncate_text(str(stderr), SETTINGS["exec_max_output_chars"])
+        return {
+            "executed": True,
+            "timed_out": True,
+            "shell": shell_name,
+            "command": req.command,
+            "argv": argv,
+            "timeout_s": timeout_s,
+            "cwd": cwd,
+            "duration_ms": int((time.time() - start) * 1000),
+            "exit_code": None,
+            "stdout": stdout,
+            "stderr": stderr,
+            "stdout_truncated": stdout_truncated,
+            "stderr_truncated": stderr_truncated,
+        }
+    except FileNotFoundError as exc:
+        raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc
+
+    stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
+    stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
+
+    return {
+        "executed": True,
+        "timed_out": False,
+        "shell": shell_name,
+        "command": req.command,
+        "argv": argv,
+        "timeout_s": timeout_s,
+        "cwd": cwd,
+        "duration_ms": int((time.time() - start) * 1000),
+        "exit_code": completed.returncode,
+        "stdout": stdout,
+        "stderr": stderr,
+        "stdout_truncated": stdout_truncated,
+        "stderr_truncated": stderr_truncated,
+    }
+
+
 def _exec_action(req: ActionRequest) -> dict:
     run_dry = SETTINGS["dry_run"] or req.dry_run
 
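Review note: `ExecRequest` accepts `timeout_s` up to 600, but `_exec_command` then clamps it to `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default 120), so a caller can ask for more than it will get. The standalone sketch below restates that clamp and the output truncation so the behaviour can be checked in isolation. The function names and default arguments are illustrative stand-ins for the `SETTINGS` lookups in the diff.

```python
from typing import Optional


def effective_timeout(requested_s: Optional[int], default_s: int = 30, max_s: int = 120) -> int:
    """Mirror the timeout selection in _exec_command.

    default_s and max_s stand in for CLICKTHROUGH_EXEC_TIMEOUT_S and
    CLICKTHROUGH_EXEC_MAX_TIMEOUT_S (assumed defaults from the README).
    """
    timeout_s = requested_s if requested_s is not None else default_s
    return min(timeout_s, max_s)


def truncate_text(text: str, limit: int) -> tuple[str, bool]:
    """Return the text cut to `limit` characters and whether it was cut."""
    if len(text) <= limit:
        return text, False
    return text[:limit], True
```

With the defaults, a request for 600 seconds runs with a 120 second limit, and the response's `timeout_s` field reports the clamped value.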
@@ -331,6 +575,13 @@ def health(_: None = Depends(_auth)):
         "request_id": _request_id(),
         "dry_run": SETTINGS["dry_run"],
         "allowed_region": SETTINGS["allowed_region"],
+        "exec": {
+            "enabled": SETTINGS["exec_enabled"],
+            "secret_configured": bool(SETTINGS["exec_secret"]),
+            "default_shell": SETTINGS["exec_default_shell"],
+            "default_timeout_s": SETTINGS["exec_default_timeout_s"],
+            "max_timeout_s": SETTINGS["exec_max_timeout_s"],
+        },
     }
 
 
@@ -449,6 +700,76 @@ def action(req: ActionRequest, _: None = Depends(_auth)):
     }
 
 
+@app.post("/exec")
+def exec_command(
+    req: ExecRequest,
+    x_clickthrough_exec_secret: Optional[str] = Header(default=None),
+    _: None = Depends(_auth),
+):
+    expected = SETTINGS["exec_secret"]
+    if not expected:
+        raise HTTPException(status_code=403, detail="exec secret not configured")
+    if not x_clickthrough_exec_secret or not hmac.compare_digest(x_clickthrough_exec_secret, expected):
+        raise HTTPException(status_code=401, detail="invalid exec secret")
+
+    result = _exec_command(req)
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "result": result,
+    }
+
+
+@app.post("/ocr")
+def ocr(req: OCRRequest, _: None = Depends(_auth)):
+    source = req.mode
+    if source == "image":
+        image = _decode_image_base64(req.image_base64 or "")
+        region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
+        blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
+    else:
+        base_img, mon = _capture_screen()
+        if source == "screen":
+            image = base_img
+            region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
+            offset_x = mon["x"]
+            offset_y = mon["y"]
+        else:
+            left = req.region_x - mon["x"]
+            top = req.region_y - mon["y"]
+            right = left + req.region_width
+            bottom = top + req.region_height
+
+            if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
+                raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
+
+            image = base_img.crop((left, top, right, bottom))
+            region = {
+                "x": req.region_x,
+                "y": req.region_y,
+                "width": req.region_width,
+                "height": req.region_height,
+            }
+            offset_x = req.region_x
+            offset_y = req.region_y
+
+        blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
+
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "result": {
+            "mode": source,
+            "language_hint": req.language_hint,
+            "min_confidence": req.min_confidence,
+            "region": region,
+            "blocks": blocks,
+        },
+    }
+
+
 @app.post("/batch")
 def batch(req: BatchRequest, _: None = Depends(_auth)):
     results = []
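Review note: because `/ocr` returns `bbox` values in global screen space for `screen` and `region` modes, a client can turn a matched block directly into a click coordinate. The sketch below shows one way to do that on the client side. `block_center` and `find_text` are hypothetical helpers, and the sample block is the one from the documented response shape.

```python
def block_center(block: dict) -> tuple[int, int]:
    """Return the integer pixel centre of an OCR block's bounding box."""
    bbox = block["bbox"]
    return bbox["x"] + bbox["width"] // 2, bbox["y"] + bbox["height"] // 2


def find_text(blocks: list[dict], needle: str, min_confidence: float = 0.4) -> list[dict]:
    """Return blocks whose text contains `needle` (case-insensitive)
    and whose confidence is at least `min_confidence`."""
    wanted = needle.casefold()
    return [
        b
        for b in blocks
        if b["confidence"] >= min_confidence and wanted in b["text"].casefold()
    ]


# Sample block copied from the documented /ocr response shape.
sample_blocks = [
    {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21},
    }
]
matches = find_text(sample_blocks, "settings")
target = block_center(matches[0])
```

For `image` mode the same arithmetic gives image-local coordinates, which would need the caller's own offset before being used as a screen position.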
@@ -1,30 +1,88 @@
 ---
 name: clickthrough-http-control
-description: Control a local computer through the Clickthrough HTTP server using screenshot grids, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
+description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
 ---
 
 # Clickthrough HTTP Control
 
 Use a strict observe-decide-act-verify loop.
 
-## Workflow
+## Getting a computer instance (user-owned setup)
 
+The **user/operator** is responsible for provisioning and exposing the target machine.
+The agent should not assume it can self-install this stack.
+
+### What the user must do
+
+1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
+2. Expose access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
+3. Configure secrets on target machine:
+   - `CLICKTHROUGH_TOKEN` for general API auth
+   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
+4. Share connection details with the agent through a secure channel:
+   - `base_url`
+   - `x-clickthrough-token`
+   - `x-clickthrough-exec-secret` (only when `/exec` is needed)
+
+### What the agent should do
+
+1. Validate connection with `GET /health` using provided headers.
+2. Refuse `/exec` attempts when exec secret is missing/invalid.
+3. Ask user for missing setup inputs instead of guessing infrastructure.
+
+## Mini API map
+
+- `GET /health` → server status + safety flags
+- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
+- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
+- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
+- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
+- `POST /batch` → sequential action list
+- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
+
+### OCR usage
+
+- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
+- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
+- Use `language_hint` when known (for example `eng`) to improve consistency.
+- Filter noise with `min_confidence` (start around `0.4` and tune per app).
+- Treat OCR as one signal, not the only signal, before high-impact clicks.
+
+### Header requirements
+
+- Always send `x-clickthrough-token` when token auth is enabled.
+- For `/exec`, also send `x-clickthrough-exec-secret`.
+
+## Core workflow (mandatory)
+
 1. Call `GET /screen` with coarse grid (e.g., 12x12).
-2. Identify likely cell/region for the target UI element.
-3. If confidence is low, call `POST /zoom` centered on the candidate and use denser grid (e.g., 20x20).
-4. Execute one minimal action via `POST /action`.
-5. Re-capture with `GET /screen` and verify the expected state change.
-6. Repeat until objective is complete.
+2. Identify likely target region and compute an initial confidence score.
+3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
+4. **Before any click**, verify target identity (OCR text/icon/location consistency).
+5. Execute one minimal action via `POST /action`.
+6. Re-capture with `GET /screen` and verify the expected state change.
+7. Repeat until objective is complete.
+
+## Verify-before-click rules
+
+- Never click if target identity is ambiguous.
+- Require at least two matching signals before click (example: OCR text + expected UI region).
+- If confidence is low, do not "test click"; zoom and re-localize first.
+- For high-impact actions (close/delete/send/purchase), use two-phase flow:
+  1) preview intended coordinate + reason
+  2) execute only after explicit confirmation.
 
 ## Precision rules
 
 - Prefer grid targets first, then use `dx/dy` for subcell precision.
 - Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
 - Use zoom before guessing offsets.
+- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
 
 ## Safety rules
 
 - Respect `dry_run` and `allowed_region` restrictions from `/health`.
+- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
 - Avoid destructive shortcuts unless explicitly requested.
 - Send one action at a time unless deterministic; then use `/batch`.
 
@@ -33,3 +91,20 @@ Use a strict observe-decide-act-verify loop.
 - After every meaningful action, verify with a fresh screenshot.
 - On mismatch, do not spam clicks: zoom, re-localize, and retry once.
 - Prefer short, reversible actions over long macros.
+- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
+
+## App-specific playbooks (recommended)
+
+Build per-app routines for repetitive tasks instead of generic clicking.
+
+### Spotify playbook
+
+- Focus app window before search/navigation.
+- Prefer keyboard-first flow for song start:
+  1) `Ctrl+L` (search)
+  2) type exact query
+  3) Enter
+  4) verify exact song+artist text
+  5) click/double-click row
+  6) verify now-playing bar
+- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
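Review note: the skill's core workflow and verify-before-click rules amount to a small decision gate: zoom below 0.85 confidence, require two matching signals, and preview high-impact actions until confirmed. A minimal sketch of that gate follows. The `Candidate` type, the signal names, and the returned step labels are all hypothetical; only the 0.85 threshold and the two-signal rule come from the skill text.

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.85  # skill text: zoom and re-evaluate below this
MIN_SIGNALS = 2              # skill text: at least two matching signals


@dataclass
class Candidate:
    """Hypothetical description of a click target under consideration."""
    confidence: float
    signals: set = field(default_factory=set)  # e.g. {"ocr_text", "expected_region"}
    high_impact: bool = False                  # close/delete/send/purchase
    confirmed: bool = False                    # explicit confirmation received


def next_step(candidate: Candidate) -> str:
    """Return the next workflow step for a candidate target."""
    if candidate.confidence < CONFIDENCE_THRESHOLD:
        return "zoom"            # re-localize instead of test-clicking
    if len(candidate.signals) < MIN_SIGNALS:
        return "gather_signals"  # identity is still ambiguous
    if candidate.high_impact and not candidate.confirmed:
        return "preview"         # phase one of the two-phase flow
    return "click"
```

Ordering the checks this way means a high-impact target is never previewed, let alone clicked, before it has been localized with enough confidence and corroborating signals.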