Compare commits

...

5 Commits

Author SHA1 Message Date
a8f2e01bb9 fix(ocr): allow configuring tesseract path
All checks were successful
python-syntax / syntax-check (pull_request) Successful in 9s
python-syntax / syntax-check (push) Successful in 8s
2026-04-06 19:02:50 +02:00
dccf7b209a docs: add MIT license
All checks were successful
python-syntax / syntax-check (push) Successful in 4s
2026-04-06 18:31:48 +02:00
89cf228d13 feat(ocr): add /ocr endpoint for text extraction
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
Merge PR #7: add OCR endpoint and skill/docs updates
2026-04-06 13:53:01 +02:00
a6d7e37beb docs(skill): include OCR endpoint workflow guidance
All checks were successful
python-syntax / syntax-check (push) Successful in 4s
python-syntax / syntax-check (pull_request) Successful in 10s
2026-04-06 13:50:34 +02:00
097c6a095c feat(ocr): add /ocr endpoint for screen, region, and image input
All checks were successful
python-syntax / syntax-check (push) Successful in 5s
python-syntax / syntax-check (pull_request) Successful in 4s
2026-04-06 13:48:33 +02:00
7 changed files with 281 additions and 3 deletions

View File

@@ -14,3 +14,4 @@ CLICKTHROUGH_EXEC_DEFAULT_SHELL=powershell
CLICKTHROUGH_EXEC_TIMEOUT_S=30
CLICKTHROUGH_EXEC_MAX_TIMEOUT_S=120
CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS=20000
# CLICKTHROUGH_TESSERACT_CMD=/usr/bin/tesseract

LICENSE (new file, 21 lines)
View File

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 Paul W.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@@ -7,6 +7,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
@@ -23,6 +24,8 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
Server defaults to `127.0.0.1:8123`.
For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it is installed in a non-standard location.
`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
## Minimal API flow
@@ -55,6 +58,7 @@ Environment variables:
- `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
- `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
## Gitea CI

View File

@@ -143,6 +143,78 @@ Hotkey:
}
```
## `POST /ocr`
Extract visible text from a full screenshot, a region crop, or caller-provided image bytes.
Body:
```json
{
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4
}
```
Modes:
- `screen` (default): OCR over full captured monitor
- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
Region mode example:
```json
{
"mode": "region",
"region_x": 220,
"region_y": 160,
"region_width": 900,
"region_height": 400,
"language_hint": "eng",
"min_confidence": 0.5
}
```
Image mode example:
```json
{
"mode": "image",
"image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
"language_hint": "eng"
}
```
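Since the endpoint accepts either plain base64 or a data URL, a client can build the `image_base64` field with the standard library alone. A minimal sketch (the helper name `to_image_base64` is illustrative, not part of the server):

```python
import base64

def to_image_base64(image_bytes: bytes, as_data_url: bool = False) -> str:
    """Encode raw image bytes for the `image_base64` field (plain base64 or data URL)."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    if as_data_url:
        # The server strips everything up to the first comma in a data URL.
        return "data:image/png;base64," + payload
    return payload

# Both forms decode back to the same bytes on the server side.
plain = to_image_base64(b"\x89PNG\r\n\x1a\n")
data_url = to_image_base64(b"\x89PNG\r\n\x1a\n", as_data_url=True)
```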
Response shape:
```json
{
"ok": true,
"request_id": "...",
"time_ms": 1710000000000,
"result": {
"mode": "screen",
"language_hint": "eng",
"min_confidence": 0.4,
"region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
"blocks": [
{
"text": "Settings",
"confidence": 0.9821,
"bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
}
]
}
}
```
Notes:
- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
- Requires `tesseract` executable plus Python package `pytesseract`.
- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
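The three request shapes above can be assembled by a small client-side payload builder. A hedged sketch (the `build_ocr_request` helper is illustrative; the HTTP call itself is omitted so the snippet stays self-contained):

```python
def build_ocr_request(mode="screen", language_hint=None, min_confidence=0.0, **extra):
    """Assemble a JSON body for POST /ocr, dropping unset optional fields."""
    if mode not in {"screen", "region", "image"}:
        raise ValueError(f"unknown mode: {mode}")
    body = {"mode": mode, "min_confidence": min_confidence}
    if language_hint:
        body["language_hint"] = language_hint
    if mode == "region":
        # Region mode requires all four geometry fields, mirroring the server validator.
        for key in ("region_x", "region_y", "region_width", "region_height"):
            if key not in extra:
                raise ValueError(f"{key} is required for mode=region")
            body[key] = extra[key]
    if mode == "image":
        if "image_base64" not in extra:
            raise ValueError("image_base64 is required for mode=image")
        body["image_base64"] = extra["image_base64"]
    return body

payload = build_ocr_request(
    "region", language_hint="eng", min_confidence=0.5,
    region_x=220, region_y=160, region_width=900, region_height=400,
)
```

Send the resulting dict as the JSON body, with the `x-clickthrough-token` header when token auth is enabled.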
## `POST /exec`
Execute a shell command on the host running Clickthrough.

View File

@@ -4,3 +4,4 @@ python-dotenv>=1.0.1
mss>=9.0.1
pillow>=10.4.0
pyautogui>=0.9.54
pytesseract>=0.3.10

View File

@@ -51,6 +51,7 @@ SETTINGS = {
    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
    "tesseract_cmd": os.getenv("CLICKTHROUGH_TESSERACT_CMD", "").strip(),
}
@@ -146,6 +147,27 @@ class ExecRequest(BaseModel):
    dry_run: bool = False

class OCRRequest(BaseModel):
    mode: Literal["screen", "region", "image"] = "screen"
    region_x: int | None = Field(default=None, ge=0)
    region_y: int | None = Field(default=None, ge=0)
    region_width: int | None = Field(default=None, gt=0)
    region_height: int | None = Field(default=None, gt=0)
    image_base64: str | None = None
    language_hint: str | None = Field(default=None, min_length=1, max_length=64)
    min_confidence: float = Field(default=0.0, ge=0.0, le=1.0)

    @model_validator(mode="after")
    def _validate_mode_inputs(self):
        if self.mode == "region":
            required = [self.region_x, self.region_y, self.region_width, self.region_height]
            if any(v is None for v in required):
                raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
        if self.mode == "image" and not self.image_base64:
            raise ValueError("image_base64 is required for mode=image")
        return self
def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
    token = SETTINGS["token"]
    if token and x_clickthrough_token != token:
@@ -275,6 +297,105 @@ def _import_input_lib():
        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
def _import_ocr_libs():
    try:
        import pytesseract
        from pytesseract import Output

        tesseract_cmd = SETTINGS["tesseract_cmd"]
        if tesseract_cmd:
            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
        return pytesseract, Output
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"ocr backend unavailable: {exc}") from exc

def _decode_image_base64(value: str):
    Image, _, _ = _import_capture_libs()
    payload = value.strip()
    if payload.startswith("data:"):
        parts = payload.split(",", 1)
        if len(parts) != 2:
            raise HTTPException(status_code=400, detail="invalid data URL image payload")
        payload = parts[1]
    try:
        image_bytes = base64.b64decode(payload, validate=True)
    except Exception as exc:
        raise HTTPException(status_code=400, detail="invalid image_base64 payload") from exc
    try:
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    except Exception as exc:
        raise HTTPException(status_code=400, detail="unsupported or unreadable image bytes") from exc
    return image
def _run_ocr(image, language_hint: str | None, min_confidence: float, offset_x: int = 0, offset_y: int = 0) -> list[dict]:
    pytesseract, Output = _import_ocr_libs()
    config = "--oem 3 --psm 6"
    kwargs = {
        "image": image,
        "output_type": Output.DICT,
        "config": config,
    }
    if language_hint:
        kwargs["lang"] = language_hint
    try:
        data = pytesseract.image_to_data(**kwargs)
    except pytesseract.TesseractNotFoundError as exc:
        raise HTTPException(status_code=500, detail="tesseract executable not found") from exc
    except pytesseract.TesseractError as exc:
        raise HTTPException(status_code=400, detail=f"ocr failed: {exc}") from exc
    blocks = []
    count = len(data.get("text", []))
    for idx in range(count):
        text = (data["text"][idx] or "").strip()
        if not text:
            continue
        raw_conf = str(data["conf"][idx]).strip()
        try:
            conf_0_100 = float(raw_conf)
        except ValueError:
            conf_0_100 = -1.0
        if conf_0_100 < 0:
            continue
        confidence = round(conf_0_100 / 100.0, 4)
        if confidence < min_confidence:
            continue
        left = int(data["left"][idx])
        top = int(data["top"][idx])
        width = int(data["width"][idx])
        height = int(data["height"][idx])
        blocks.append(
            {
                "text": text,
                "confidence": confidence,
                "bbox": {
                    "x": left + offset_x,
                    "y": top + offset_y,
                    "width": width,
                    "height": height,
                },
                "_sort": [top + offset_y, left + offset_x, idx],
            }
        )
    blocks.sort(key=lambda b: (b["_sort"][0], b["_sort"][1], b["_sort"][2]))
    for block in blocks:
        block.pop("_sort", None)
    return blocks
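Tesseract reports word confidence on a 0-100 scale and uses -1 for structural (non-text) boxes; the filtering, normalization, and deterministic sort above reduce to a small pure function. A sketch of the same logic with illustrative helper names, operating on already-extracted word tuples:

```python
def normalize_blocks(words, min_confidence=0.0, offset_x=0, offset_y=0):
    """Filter (text, conf_0_100, left, top, width, height) tuples into sorted OCR blocks."""
    keyed = []
    for idx, (text, conf, left, top, w, h) in enumerate(words):
        text = (text or "").strip()
        if not text or conf < 0:  # conf == -1 marks non-text layout boxes
            continue
        confidence = round(conf / 100.0, 4)
        if confidence < min_confidence:
            continue
        keyed.append((top + offset_y, left + offset_x, idx, {
            "text": text,
            "confidence": confidence,
            "bbox": {"x": left + offset_x, "y": top + offset_y, "width": w, "height": h},
        }))
    # Deterministic order: top-to-bottom, then left-to-right, then input index.
    return [item[-1] for item in sorted(keyed, key=lambda item: item[:3])]

words = [
    ("OK", 96.0, 300, 200, 40, 20),
    ("", -1.0, 0, 0, 10, 10),        # dropped: empty text, structural conf
    ("Cancel", 88.5, 100, 200, 80, 20),
    ("Title", 99.0, 50, 40, 120, 24),
]
result = normalize_blocks(words, min_confidence=0.4)
```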
def _pick_shell(explicit_shell: str | None) -> str:
    shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
    if shell_name not in {"powershell", "bash", "cmd"}:
@@ -600,6 +721,55 @@ def exec_command(
    }
@app.post("/ocr")
def ocr(req: OCRRequest, _: None = Depends(_auth)):
    source = req.mode
    if source == "image":
        image = _decode_image_base64(req.image_base64 or "")
        region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
        blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
    else:
        base_img, mon = _capture_screen()
        if source == "screen":
            image = base_img
            region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
            offset_x = mon["x"]
            offset_y = mon["y"]
        else:
            left = req.region_x - mon["x"]
            top = req.region_y - mon["y"]
            right = left + req.region_width
            bottom = top + req.region_height
            if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
                raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
            image = base_img.crop((left, top, right, bottom))
            region = {
                "x": req.region_x,
                "y": req.region_y,
                "width": req.region_width,
                "height": req.region_height,
            }
            offset_x = req.region_x
            offset_y = req.region_y
        blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": {
            "mode": source,
            "language_hint": req.language_hint,
            "min_confidence": req.min_confidence,
            "region": region,
            "blocks": blocks,
        },
    }
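The region branch above maps a global-coordinate region into the captured monitor's local pixel space before cropping. The coordinate math in isolation, as a sketch mirroring the endpoint's bounds check (function name illustrative):

```python
def region_to_crop_box(region, monitor, image_size):
    """Translate a global region dict into an image-local (left, top, right, bottom) crop box."""
    left = region["x"] - monitor["x"]
    top = region["y"] - monitor["y"]
    right = left + region["width"]
    bottom = top + region["height"]
    img_w, img_h = image_size
    if left < 0 or top < 0 or right > img_w or bottom > img_h:
        raise ValueError("requested region is outside the captured monitor")
    return (left, top, right, bottom)

# Primary monitor at origin: global and local coordinates coincide.
monitor = {"x": 0, "y": 0, "width": 1920, "height": 1080}
box = region_to_crop_box(
    {"x": 220, "y": 160, "width": 900, "height": 400}, monitor, (1920, 1080)
)
```

For a secondary monitor with a non-zero origin, the subtraction of `monitor["x"]`/`monitor["y"]` is what shifts global request coordinates into the captured image's pixel grid.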
@app.post("/batch")
def batch(req: BatchRequest, _: None = Depends(_auth)):
    results = []

View File

@@ -1,6 +1,6 @@
---
name: clickthrough-http-control
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
---

# Clickthrough HTTP Control
@@ -35,10 +35,19 @@ The agent should not assume it can self-install this stack.
- `GET /health` → server status + safety flags
- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /batch` → sequential action list
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
### OCR usage
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
- Use `language_hint` when known (for example `eng`) to improve consistency.
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
- Treat OCR as one signal, not the only signal, before high-impact clicks.
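The two-signal rule above can be sketched as a filter over `/ocr` blocks: accept a click target only when the OCR text matches and the block's center falls inside the expected UI region. An illustrative helper, not part of the skill or server:

```python
def find_verified_target(blocks, expected_text, expected_region, min_confidence=0.4):
    """Return the center of the first block matching both text and region signals, else None."""
    ex, ey, ew, eh = expected_region  # x, y, width, height of the expected UI area
    for block in blocks:
        if block["confidence"] < min_confidence:
            continue
        if block["text"].lower() != expected_text.lower():
            continue
        bbox = block["bbox"]
        cx = bbox["x"] + bbox["width"] // 2
        cy = bbox["y"] + bbox["height"] // 2
        if ex <= cx <= ex + ew and ey <= cy <= ey + eh:
            return (cx, cy)
    return None  # ambiguous or missing target: zoom and re-localize instead of clicking

blocks = [
    {"text": "Settings", "confidence": 0.98, "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}},
    {"text": "Settings", "confidence": 0.30, "bbox": {"x": 900, "y": 700, "width": 96, "height": 21}},
]
target = find_verified_target(blocks, "settings", (100, 50, 300, 200))
```

A `None` result is the signal to fall back to `POST /zoom` rather than guessing.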
### Header requirements

- Always send `x-clickthrough-token` when token auth is enabled.
@@ -49,7 +58,7 @@ The agent should not assume it can self-install this stack.
1. Call `GET /screen` with coarse grid (e.g., 12x12).
2. Identify likely target region and compute an initial confidence score.
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
5. Execute one minimal action via `POST /action`.
6. Re-capture with `GET /screen` and verify the expected state change.
7. Repeat until objective is complete.
@@ -57,7 +66,7 @@ The agent should not assume it can self-install this stack.
## Verify-before-click rules

- Never click if target identity is ambiguous.
- Require at least two matching signals before click (example: OCR text + expected UI region).
- If confidence is low, do not "test click"; zoom and re-localize first.
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
1) preview intended coordinate + reason 1) preview intended coordinate + reason