Compare commits

...

8 Commits

Author SHA1 Message Date
a8f2e01bb9 fix(ocr): allow configuring tesseract path
All checks were successful
python-syntax / syntax-check (pull_request) Successful in 9s
python-syntax / syntax-check (push) Successful in 8s
2026-04-06 19:02:50 +02:00
dccf7b209a docs: add MIT license
All checks were successful
python-syntax / syntax-check (push) Successful in 4s
2026-04-06 18:31:48 +02:00
89cf228d13 feat(ocr): add /ocr endpoint for text extraction
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
Merge PR #7: add OCR endpoint and skill/docs updates
2026-04-06 13:53:01 +02:00
a6d7e37beb docs(skill): include OCR endpoint workflow guidance
All checks were successful
python-syntax / syntax-check (push) Successful in 4s
python-syntax / syntax-check (pull_request) Successful in 10s
2026-04-06 13:50:34 +02:00
097c6a095c feat(ocr): add /ocr endpoint for screen, region, and image input
All checks were successful
python-syntax / syntax-check (push) Successful in 5s
python-syntax / syntax-check (pull_request) Successful in 4s
2026-04-06 13:48:33 +02:00
2955426f14 docs(skill): clarify user-owned instance setup responsibilities
All checks were successful
python-syntax / syntax-check (push) Successful in 4s
2026-04-05 20:35:35 +02:00
3a49560e82 docs(skill): add instance setup and mini API quick reference
All checks were successful
python-syntax / syntax-check (push) Successful in 4s
2026-04-05 20:34:14 +02:00
2b84bf95f1 docs(skill): add verify-first workflow and app-specific playbooks
All checks were successful
python-syntax / syntax-check (push) Successful in 9s
2026-04-05 20:32:19 +02:00
8 changed files with 355 additions and 8 deletions

View File

@@ -14,3 +14,4 @@ CLICKTHROUGH_EXEC_DEFAULT_SHELL=powershell
CLICKTHROUGH_EXEC_TIMEOUT_S=30
CLICKTHROUGH_EXEC_MAX_TIMEOUT_S=120
CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS=20000
# CLICKTHROUGH_TESSERACT_CMD=/usr/bin/tesseract

LICENSE (21 lines, new file)
View File

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 Paul W.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@@ -7,6 +7,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
@@ -23,6 +24,8 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
Server defaults to `127.0.0.1:8123`.
For OCR support, install the native `tesseract` binary on the host (in addition to the Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it is installed in a non-standard location.
`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
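A quick way to confirm the native binary is reachable before starting the server (this check is a suggestion, not part of the project's tooling):

```shell
# Print the tesseract version if it is on PATH; otherwise hint at the env var.
if command -v tesseract >/dev/null 2>&1; then
  tesseract --version | head -n 1
else
  echo "tesseract not found; set CLICKTHROUGH_TESSERACT_CMD to its full path"
fi
```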
## Minimal API flow
@@ -55,6 +58,7 @@ Environment variables:
- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
## Gitea CI

View File

@@ -21,5 +21,8 @@
- [x] Add exec configuration via env (`CLICKTHROUGH_EXEC_*`)
- [x] Document exec API + config
- [x] Create backlog issues for OCR/find/window/input/session-state improvements
- [ ] Open PR for exec feature branch and review/merge
- [x] Open PR for exec feature branch and review/merge
- [x] Require configured exec secret + per-request exec secret header
- [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook
- [x] Add top-level skill section for instance setup + mini API docs
- [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs

View File

@@ -143,6 +143,78 @@ Hotkey:
}
```
## `POST /ocr`
Extract visible text from a full screenshot, a region crop, or caller-provided image bytes.
Body:
```json
{
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4
}
```
Modes:
- `screen` (default): OCR over full captured monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
Region mode example:
```json
{
"mode": "region",
"region_x": 220,
"region_y": 160,
"region_width": 900,
"region_height": 400,
"language_hint": "eng",
"min_confidence": 0.5
}
```
Image mode example:
```json
{
"mode": "image",
"image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
"language_hint": "eng"
}
```
Response shape:
```json
{
"ok": true,
"request_id": "...",
"time_ms": 1710000000000,
"result": {
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4,
"region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
"blocks": [
{
"text": "Settings",
"confidence": 0.9821,
"bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
}
]
}
}
```
Notes:
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
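A minimal client-side sketch of consuming the response shape above: find a block by its text and derive a click point from its `bbox`. The helper names and the inline sample payload are illustrative, not part of the server API.

```python
# Locate a target label in an /ocr response and compute a click point.

def find_block(blocks, target_text):
    """Return the first OCR block whose text matches target_text exactly."""
    for block in blocks:
        if block["text"] == target_text:
            return block
    return None

def click_center(bbox):
    """Center of a bbox, in the same coordinate space the server reported."""
    return (bbox["x"] + bbox["width"] // 2, bbox["y"] + bbox["height"] // 2)

# Sample payload mirroring the documented response shape.
response = {
    "ok": True,
    "result": {
        "blocks": [
            {"text": "Settings", "confidence": 0.9821,
             "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}},
        ]
    },
}

block = find_block(response["result"]["blocks"], "Settings")
if block is not None:
    print(click_center(block["bbox"]))  # → (192, 102)
```

Because `bbox` is in global screen space for `screen`/`region` modes, the resulting point can be passed to `POST /action` directly.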
## `POST /exec`
Execute a shell command on the host running Clickthrough.

View File

@@ -4,3 +4,4 @@ python-dotenv>=1.0.1
mss>=9.0.1
pillow>=10.4.0
pyautogui>=0.9.54
pytesseract>=0.3.10

View File

@@ -51,6 +51,7 @@ SETTINGS = {
"exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
"exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
"exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
"tesseract_cmd": os.getenv("CLICKTHROUGH_TESSERACT_CMD", "").strip(),
}
@@ -146,6 +147,27 @@ class ExecRequest(BaseModel):
dry_run: bool = False
class OCRRequest(BaseModel):
mode: Literal["screen", "region", "image"] = "screen"
region_x: int | None = Field(default=None, ge=0)
region_y: int | None = Field(default=None, ge=0)
region_width: int | None = Field(default=None, gt=0)
region_height: int | None = Field(default=None, gt=0)
image_base64: str | None = None
language_hint: str | None = Field(default=None, min_length=1, max_length=64)
min_confidence: float = Field(default=0.0, ge=0.0, le=1.0)
@model_validator(mode="after")
def _validate_mode_inputs(self):
if self.mode == "region":
required = [self.region_x, self.region_y, self.region_width, self.region_height]
if any(v is None for v in required):
raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
if self.mode == "image" and not self.image_base64:
raise ValueError("image_base64 is required for mode=image")
return self
def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
token = SETTINGS["token"]
if token and x_clickthrough_token != token:
@@ -275,6 +297,105 @@ def _import_input_lib():
raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
def _import_ocr_libs():
try:
import pytesseract
from pytesseract import Output
tesseract_cmd = SETTINGS["tesseract_cmd"]
if tesseract_cmd:
pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
return pytesseract, Output
except Exception as exc:
raise HTTPException(status_code=500, detail=f"ocr backend unavailable: {exc}") from exc
def _decode_image_base64(value: str):
Image, _, _ = _import_capture_libs()
payload = value.strip()
if payload.startswith("data:"):
parts = payload.split(",", 1)
if len(parts) != 2:
raise HTTPException(status_code=400, detail="invalid data URL image payload")
payload = parts[1]
try:
image_bytes = base64.b64decode(payload, validate=True)
except Exception as exc:
raise HTTPException(status_code=400, detail="invalid image_base64 payload") from exc
try:
image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
except Exception as exc:
raise HTTPException(status_code=400, detail="unsupported or unreadable image bytes") from exc
return image
def _run_ocr(image, language_hint: str | None, min_confidence: float, offset_x: int = 0, offset_y: int = 0) -> list[dict]:
pytesseract, Output = _import_ocr_libs()
config = "--oem 3 --psm 6"
kwargs = {
"image": image,
"output_type": Output.DICT,
"config": config,
}
if language_hint:
kwargs["lang"] = language_hint
try:
data = pytesseract.image_to_data(**kwargs)
except pytesseract.TesseractNotFoundError as exc:
raise HTTPException(status_code=500, detail="tesseract executable not found") from exc
except pytesseract.TesseractError as exc:
raise HTTPException(status_code=400, detail=f"ocr failed: {exc}") from exc
blocks = []
count = len(data.get("text", []))
for idx in range(count):
text = (data["text"][idx] or "").strip()
if not text:
continue
raw_conf = str(data["conf"][idx]).strip()
try:
conf_0_100 = float(raw_conf)
except ValueError:
conf_0_100 = -1.0
if conf_0_100 < 0:
continue
confidence = round(conf_0_100 / 100.0, 4)
if confidence < min_confidence:
continue
left = int(data["left"][idx])
top = int(data["top"][idx])
width = int(data["width"][idx])
height = int(data["height"][idx])
blocks.append(
{
"text": text,
"confidence": confidence,
"bbox": {
"x": left + offset_x,
"y": top + offset_y,
"width": width,
"height": height,
},
"_sort": [top + offset_y, left + offset_x, idx],
}
)
blocks.sort(key=lambda b: (b["_sort"][0], b["_sort"][1], b["_sort"][2]))
for block in blocks:
block.pop("_sort", None)
return blocks
def _pick_shell(explicit_shell: str | None) -> str:
shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
if shell_name not in {"powershell", "bash", "cmd"}:
@@ -600,6 +721,55 @@ def exec_command(
}
@app.post("/ocr")
def ocr(req: OCRRequest, _: None = Depends(_auth)):
source = req.mode
if source == "image":
image = _decode_image_base64(req.image_base64 or "")
region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
else:
base_img, mon = _capture_screen()
if source == "screen":
image = base_img
region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
offset_x = mon["x"]
offset_y = mon["y"]
else:
left = req.region_x - mon["x"]
top = req.region_y - mon["y"]
right = left + req.region_width
bottom = top + req.region_height
if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
image = base_img.crop((left, top, right, bottom))
region = {
"x": req.region_x,
"y": req.region_y,
"width": req.region_width,
"height": req.region_height,
}
offset_x = req.region_x
offset_y = req.region_y
blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
return {
"ok": True,
"request_id": _request_id(),
"time_ms": _now_ms(),
"result": {
"mode": source,
"language_hint": req.language_hint,
"min_confidence": req.min_confidence,
"region": region,
"blocks": blocks,
},
}
@app.post("/batch")
def batch(req: BatchRequest, _: None = Depends(_auth)):
results = []

View File

@@ -1,30 +1,88 @@
---
name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
---
# Clickthrough HTTP Control
Use a strict observe-decide-act-verify loop.
## Workflow
## Getting a computer instance (user-owned setup)
The **user/operator** is responsible for provisioning and exposing the target machine.
The agent should not assume it can self-install this stack.
### What the user must do
1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
2. Expose an access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
3. Configure secrets on the target machine:
- `CLICKTHROUGH_TOKEN` for general API auth
- `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
4. Share connection details with the agent through a secure channel:
- `base_url`
- `x-clickthrough-token`
- `x-clickthrough-exec-secret` (only when `/exec` is needed)
### What the agent should do
1. Validate connection with `GET /health` using provided headers.
2. Refuse `/exec` attempts when exec secret is missing/invalid.
3. Ask the user for missing setup inputs instead of guessing infrastructure.
## Mini API map
- `GET /health` → server status + safety flags
- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /batch` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
### OCR usage
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.
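The screen-then-region pattern above can be sketched as a request builder: after a `mode=screen` discovery pass finds a block, build a padded `mode=region` request around its bbox for a tighter second pass. The `pad` value is an illustrative choice, not a server requirement.

```python
# Build a mode=region /ocr request body around a block found in screen mode.

def region_request_around(bbox, pad=40, language_hint="eng", min_confidence=0.5):
    return {
        "mode": "region",
        "region_x": max(0, bbox["x"] - pad),
        "region_y": max(0, bbox["y"] - pad),
        "region_width": bbox["width"] + 2 * pad,
        "region_height": bbox["height"] + 2 * pad,
        "language_hint": language_hint,
        "min_confidence": min_confidence,
    }

req = region_request_around({"x": 144, "y": 92, "width": 96, "height": 21})
print(req["region_x"], req["region_y"], req["region_width"], req["region_height"])
# → 104 52 176 101
```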
### Header requirements
- Always send `x-clickthrough-token` when token auth is enabled.
- For `/exec`, also send `x-clickthrough-exec-secret`.
## Core workflow (mandatory)
1. Call `GET /screen` with coarse grid (e.g., 12x12).
2. Identify likely cell/region for the target UI element.
3. If confidence is low, call `POST /zoom` centered on the candidate and use denser grid (e.g., 20x20).
4. Execute one minimal action via `POST /action`.
5. Re-capture with `GET /screen` and verify the expected state change.
6. Repeat until objective is complete.
2. Identify likely target region and compute an initial confidence score.
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
5. Execute one minimal action via `POST /action`.
6. Re-capture with `GET /screen` and verify the expected state change.
7. Repeat until objective is complete.
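The loop above can be sketched as a skeleton with injected callables; `capture`, `locate`, `act`, and `verify` stand in for `GET /screen`, target localization, `POST /action`, and the post-action check, and the `0.85` threshold matches step 3. This is a shape sketch, not shipped code.

```python
# Observe-decide-act-verify loop skeleton with an optional zoom refinement.

def run_loop(capture, locate, act, verify, zoom=None, max_steps=10):
    for _ in range(max_steps):
        screen = capture()
        target, confidence = locate(screen)
        if confidence < 0.85 and zoom is not None:
            # Refine with a denser-grid crop before trusting the coordinate.
            target, confidence = locate(zoom(target))
        if confidence < 0.85:
            continue  # re-observe rather than guess
        act(target)
        if verify(capture()):
            return True
    return False
```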
## Verify-before-click rules
- Never click if target identity is ambiguous.
- Require at least two matching signals before click (example: OCR text + expected UI region).
- If confidence is low, do not "test click"; zoom and re-localize first.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
1) preview intended coordinate + reason
2) execute only after explicit confirmation.
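The two-signal rule reduces to a simple gate; each signal is a boolean check (OCR text match, expected region, icon match, ...), and the names below are illustrative:

```python
# Allow a click only when enough independent signals agree.

def may_click(signals, required=2):
    return sum(1 for ok in signals if ok) >= required

ocr_text_matches = True
in_expected_region = True
icon_matches = False
print(may_click([ocr_text_matches, in_expected_region, icon_matches]))  # → True
```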
## Precision rules
- Prefer grid targets first, then use `dx/dy` for subcell precision.
- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
- Use zoom before guessing offsets.
- Avoid stale coordinates: re-capture before action if UI moved/scrolled.
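One way to apply these rules is to clamp `dx/dy` before mapping a grid cell to pixels. The cell-to-pixel math below is an assumption about the grid layout (cell center at offset `0,0`, cell edge at `±1`), not taken from the server source; real mappings should use the coordinate transform metadata from visual responses.

```python
# Clamp subcell offsets to [-1, 1] and map a grid cell plus dx/dy to pixels.

def clamp(v, lo=-1.0, hi=1.0):
    return max(lo, min(hi, v))

def cell_to_pixel(col, row, cell_w, cell_h, dx=0.0, dy=0.0):
    # dx/dy = 0 targets the cell center; ±1 reaches the cell edge (assumed).
    x = int((col + 0.5 + clamp(dx) / 2) * cell_w)
    y = int((row + 0.5 + clamp(dy) / 2) * cell_h)
    return x, y

print(cell_to_pixel(3, 2, 160, 90))  # → (560, 225)
```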
## Safety rules
- Respect `dry_run` and `allowed_region` restrictions from `/health`.
- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
- Avoid destructive shortcuts unless explicitly requested.
- Send one action at a time unless the sequence is deterministic; only then use `/batch`.
@@ -33,3 +91,20 @@ Use a strict observe-decide-act-verify loop.
- After every meaningful action, verify with a fresh screenshot.
- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
- Prefer short, reversible actions over long macros.
- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
## App-specific playbooks (recommended)
Build per-app routines for repetitive tasks instead of generic clicking.
### Spotify playbook
- Focus app window before search/navigation.
- Prefer keyboard-first flow for song start:
1) `Ctrl+L` (search)
2) type exact query
3) Enter
4) verify exact song+artist text
5) click/double-click row
6) verify now-playing bar
- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.