fix(ocr): allow configuring tesseract path

docs: add MIT license
feat(ocr): add /ocr endpoint for text extraction
8 changed files with 355 additions and 8 deletions
--- a/.env.example
+++ b/.env.example
@@ -14,3 +14,4 @@ CLICKTHROUGH_EXEC_DEFAULT_SHELL=powershell
 CLICKTHROUGH_EXEC_TIMEOUT_S=30
 CLICKTHROUGH_EXEC_MAX_TIMEOUT_S=120
 CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS=20000
+# CLICKTHROUGH_TESSERACT_CMD=/usr/bin/tesseract
--- a/21
+++ b/21
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Paul W.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
+- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
 - **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
 - **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
@@ -23,6 +24,8 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app

 Server defaults to `127.0.0.1:8123`.

+For OCR support, install the native `tesseract` binary on the host (in addition to the Python dependencies). If the executable is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to its full path.
+
 `python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.

 ## Minimal API flow
@@ -55,6 +58,7 @@ Environment variables:
 - `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
 - `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
 - `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
+- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)

 ## Gitea CI

--- a/TODO.md
+++ b/TODO.md
@@ -21,5 +21,8 @@
 - [x] Add exec configuration via env (`CLICKTHROUGH_EXEC_*`)
 - [x] Document exec API + config
 - [x] Create backlog issues for OCR/find/window/input/session-state improvements
-- [ ] Open PR for exec feature branch and review/merge
+- [x] Open PR for exec feature branch and review/merge
 - [x] Require configured exec secret + per-request exec secret header
+- [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook
+- [x] Add top-level skill section for instance setup + mini API docs
+- [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs
--- a/docs/API.md
+++ b/docs/API.md
@@ -143,6 +143,78 @@ Hotkey:
 }
 ```

+## `POST /ocr`
+
+Extract visible text from a full screenshot, a region crop, or caller-provided image bytes.
+
+Body:
+
+```json
+{
+  "mode": "screen",
+  "language_hint": "eng",
+  "min_confidence": 0.4
+}
+```
+
+Modes:
+- `screen` (default): OCR over full captured monitor
+- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
+- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
+
+Region mode example:
+
+```json
+{
+  "mode": "region",
+  "region_x": 220,
+  "region_y": 160,
+  "region_width": 900,
+  "region_height": 400,
+  "language_hint": "eng",
+  "min_confidence": 0.5
+}
+```
+
+Image mode example:
+
+```json
+{
+  "mode": "image",
+  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
+  "language_hint": "eng"
+}
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "request_id": "...",
+  "time_ms": 1710000000000,
+  "result": {
+    "mode": "screen",
+    "language_hint": "eng",
+    "min_confidence": 0.4,
+    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
+    "blocks": [
+      {
+        "text": "Settings",
+        "confidence": 0.9821,
+        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
+      }
+    ]
+  }
+}
+```
+
+Notes:
+- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
+- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
+- Requires the `tesseract` executable plus the Python package `pytesseract`.
+- If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
+
 ## `POST /exec`

 Execute a shell command on the host running Clickthrough.
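
The `POST /ocr` response shape documented in `docs/API.md` above can be consumed on the client side roughly as follows. This is a minimal sketch, not part of the PR: the helper names `find_blocks` and `bbox_center` are hypothetical, and the sample payload is copied from the documented response example rather than from a live server.

```python
from typing import Any


def find_blocks(response: dict[str, Any], needle: str, min_confidence: float = 0.0) -> list[dict[str, Any]]:
    """Return OCR blocks whose text contains `needle` (case-insensitive).

    Hypothetical client helper; assumes the documented `result.blocks` shape.
    """
    blocks = response.get("result", {}).get("blocks", [])
    wanted = needle.casefold()
    return [
        b for b in blocks
        if wanted in b["text"].casefold() and b["confidence"] >= min_confidence
    ]


def bbox_center(bbox: dict[str, int]) -> tuple[int, int]:
    """Integer center of a bbox, in the same coordinate space as the bbox."""
    return bbox["x"] + bbox["width"] // 2, bbox["y"] + bbox["height"] // 2


# Sample copied from the "Response shape" example in docs/API.md.
sample = {
    "ok": True,
    "result": {
        "mode": "screen",
        "blocks": [
            {
                "text": "Settings",
                "confidence": 0.9821,
                "bbox": {"x": 144, "y": 92, "width": 96, "height": 21},
            }
        ],
    },
}

matches = find_blocks(sample, "settings", min_confidence=0.9)
center = bbox_center(matches[0]["bbox"])
print(center)
```

For `screen` and `region` modes the resulting point is in global screen space, per the notes in the API docs; for `image` mode it is image-local and should not be used directly as a click target.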
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,3 +4,4 @@ python-dotenv>=1.0.1
 mss>=9.0.1
 pillow>=10.4.0
 pyautogui>=0.9.54
+pytesseract>=0.3.10
--- a/server/app.py
+++ b/server/app.py
@@ -51,6 +51,7 @@ SETTINGS = {
    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
+    "tesseract_cmd": os.getenv("CLICKTHROUGH_TESSERACT_CMD", "").strip(),
 }


@@ -146,6 +147,27 @@ class ExecRequest(BaseModel):
    dry_run: bool = False


+class OCRRequest(BaseModel):
+    mode: Literal["screen", "region", "image"] = "screen"
+    region_x: int | None = Field(default=None, ge=0)
+    region_y: int | None = Field(default=None, ge=0)
+    region_width: int | None = Field(default=None, gt=0)
+    region_height: int | None = Field(default=None, gt=0)
+    image_base64: str | None = None
+    language_hint: str | None = Field(default=None, min_length=1, max_length=64)
+    min_confidence: float = Field(default=0.0, ge=0.0, le=1.0)
+
+    @model_validator(mode="after")
+    def _validate_mode_inputs(self):
+        if self.mode == "region":
+            required = [self.region_x, self.region_y, self.region_width, self.region_height]
+            if any(v is None for v in required):
+                raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
+        if self.mode == "image" and not self.image_base64:
+            raise ValueError("image_base64 is required for mode=image")
+        return self
+
+
 def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
    token = SETTINGS["token"]
    if token and x_clickthrough_token != token:
@@ -275,6 +297,105 @@ def _import_input_lib():
        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc


+def _import_ocr_libs():
+    try:
+        import pytesseract
+        from pytesseract import Output
+
+        tesseract_cmd = SETTINGS["tesseract_cmd"]
+        if tesseract_cmd:
+            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
+
+        return pytesseract, Output
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"ocr backend unavailable: {exc}") from exc
+
+
+def _decode_image_base64(value: str):
+    Image, _, _ = _import_capture_libs()
+    payload = value.strip()
+    if payload.startswith("data:"):
+        parts = payload.split(",", 1)
+        if len(parts) != 2:
+            raise HTTPException(status_code=400, detail="invalid data URL image payload")
+        payload = parts[1]
+
+    try:
+        image_bytes = base64.b64decode(payload, validate=True)
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail="invalid image_base64 payload") from exc
+
+    try:
+        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail="unsupported or unreadable image bytes") from exc
+
+    return image
+
+
+def _run_ocr(image, language_hint: str | None, min_confidence: float, offset_x: int = 0, offset_y: int = 0) -> list[dict]:
+    pytesseract, Output = _import_ocr_libs()
+
+    config = "--oem 3 --psm 6"
+    kwargs = {
+        "image": image,
+        "output_type": Output.DICT,
+        "config": config,
+    }
+    if language_hint:
+        kwargs["lang"] = language_hint
+
+    try:
+        data = pytesseract.image_to_data(**kwargs)
+    except pytesseract.TesseractNotFoundError as exc:
+        raise HTTPException(status_code=500, detail="tesseract executable not found") from exc
+    except pytesseract.TesseractError as exc:
+        raise HTTPException(status_code=400, detail=f"ocr failed: {exc}") from exc
+
+    blocks = []
+    count = len(data.get("text", []))
+    for idx in range(count):
+        text = (data["text"][idx] or "").strip()
+        if not text:
+            continue
+
+        raw_conf = str(data["conf"][idx]).strip()
+        try:
+            conf_0_100 = float(raw_conf)
+        except ValueError:
+            conf_0_100 = -1.0
+        if conf_0_100 < 0:
+            continue
+
+        confidence = round(conf_0_100 / 100.0, 4)
+        if confidence < min_confidence:
+            continue
+
+        left = int(data["left"][idx])
+        top = int(data["top"][idx])
+        width = int(data["width"][idx])
+        height = int(data["height"][idx])
+
+        blocks.append(
+            {
+                "text": text,
+                "confidence": confidence,
+                "bbox": {
+                    "x": left + offset_x,
+                    "y": top + offset_y,
+                    "width": width,
+                    "height": height,
+                },
+                "_sort": [top + offset_y, left + offset_x, idx],
+            }
+        )
+
+    blocks.sort(key=lambda b: (b["_sort"][0], b["_sort"][1], b["_sort"][2]))
+    for block in blocks:
+        block.pop("_sort", None)
+    return blocks
+
+
 def _pick_shell(explicit_shell: str | None) -> str:
    shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
    if shell_name not in {"powershell", "bash", "cmd"}:
@@ -600,6 +721,55 @@ def exec_command(
    }


+@app.post("/ocr")
+def ocr(req: OCRRequest, _: None = Depends(_auth)):
+    source = req.mode
+    if source == "image":
+        image = _decode_image_base64(req.image_base64 or "")
+        region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
+        blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
+    else:
+        base_img, mon = _capture_screen()
+        if source == "screen":
+            image = base_img
+            region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
+            offset_x = mon["x"]
+            offset_y = mon["y"]
+        else:
+            left = req.region_x - mon["x"]
+            top = req.region_y - mon["y"]
+            right = left + req.region_width
+            bottom = top + req.region_height
+
+            if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
+                raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
+
+            image = base_img.crop((left, top, right, bottom))
+            region = {
+                "x": req.region_x,
+                "y": req.region_y,
+                "width": req.region_width,
+                "height": req.region_height,
+            }
+            offset_x = req.region_x
+            offset_y = req.region_y
+
+        blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
+
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "result": {
+            "mode": source,
+            "language_hint": req.language_hint,
+            "min_confidence": req.min_confidence,
+            "region": region,
+            "blocks": blocks,
+        },
+    }
+
+
 @app.post("/batch")
 def batch(req: BatchRequest, _: None = Depends(_auth)):
    results = []
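
The filtering and ordering that `_run_ocr` applies in `server/app.py` above can be exercised without Tesseract installed. The sketch below is a standalone re-statement of that logic, not the code from the diff: `normalize_blocks` is a hypothetical name, and the input dict is hand-written to mimic the `text`/`conf`/`left`/`top`/`width`/`height` lists that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns.

```python
def normalize_blocks(data, min_confidence=0.0, offset_x=0, offset_y=0):
    """Filter and order word rows from a Tesseract-style dict.

    Hypothetical helper mirroring the steps in the diff: skip empty text,
    skip negative or unparsable confidences, scale 0-100 to 0-1, apply the
    threshold, shift by the capture offset, then sort top-to-bottom and
    left-to-right with the original index as a tie-breaker.
    """
    rows = []
    for idx, raw_text in enumerate(data.get("text", [])):
        text = (raw_text or "").strip()
        if not text:
            continue
        try:
            conf_0_100 = float(str(data["conf"][idx]).strip())
        except ValueError:
            continue
        if conf_0_100 < 0:
            continue
        confidence = round(conf_0_100 / 100.0, 4)
        if confidence < min_confidence:
            continue
        x = int(data["left"][idx]) + offset_x
        y = int(data["top"][idx]) + offset_y
        block = {
            "text": text,
            "confidence": confidence,
            "bbox": {
                "x": x,
                "y": y,
                "width": int(data["width"][idx]),
                "height": int(data["height"][idx]),
            },
        }
        rows.append((y, x, idx, block))
    rows.sort(key=lambda row: row[:3])
    return [row[3] for row in rows]


# Hand-written stand-in for Tesseract output: one layout row (conf -1),
# two real words on the same line, and one low-confidence fragment.
fake = {
    "text": ["", "Cancel", "OK", "noise"],
    "conf": ["-1", "91.5", 96, "12"],
    "left": [0, 300, 40, 500],
    "top": [0, 50, 50, 10],
    "width": [800, 60, 24, 30],
    "height": [600, 18, 18, 18],
}

blocks = normalize_blocks(fake, min_confidence=0.4, offset_x=100, offset_y=200)
print([b["text"] for b in blocks])
```

Because the sort key is the pixel `top` of each word, words on the same visual line are only ordered left-to-right when their `top` values are equal, which is worth keeping in mind when reading the "top-to-bottom, then left-to-right" note in the API docs.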
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,30 +1,88 @@
 ---
 name: clickthrough-http-control
-description: Control a local computer through the Clickthrough HTTP server using screenshot grids, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
+description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
 ---

 # Clickthrough HTTP Control

 Use a strict observe-decide-act-verify loop.

-## Workflow
+## Getting a computer instance (user-owned setup)
+
+The **user/operator** is responsible for provisioning and exposing the target machine.
+The agent should not assume it can self-install this stack.
+
+### What the user must do
+
+1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
+2. Expose an access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
+3. Configure secrets on the target machine:
+   - `CLICKTHROUGH_TOKEN` for general API auth
+   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
+4. Share connection details with the agent through a secure channel:
+   - `base_url`
+   - `x-clickthrough-token`
+   - `x-clickthrough-exec-secret` (only when `/exec` is needed)
+
+### What the agent should do
+
+1. Validate connection with `GET /health` using provided headers.
+2. Refuse `/exec` attempts when the exec secret is missing/invalid.
+3. Ask user for missing setup inputs instead of guessing infrastructure.
+
+## Mini API map
+
+- `GET /health` → server status + safety flags
+- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
+- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
+- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
+- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
+- `POST /batch` → sequential action list
+- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
+
+### OCR usage
+
+- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
+- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
+- Use `language_hint` when known (for example `eng`) to improve consistency.
+- Filter noise with `min_confidence` (start around `0.4` and tune per app).
+- Treat OCR as one signal, not the only signal, before high-impact clicks.
+
+### Header requirements
+
+- Always send `x-clickthrough-token` when token auth is enabled.
+- For `/exec`, also send `x-clickthrough-exec-secret`.
+
+## Core workflow (mandatory)

 1. Call `GET /screen` with coarse grid (e.g., 12x12).
-2. Identify likely cell/region for the target UI element.
-3. If confidence is low, call `POST /zoom` centered on the candidate and use denser grid (e.g., 20x20).
-4. Execute one minimal action via `POST /action`.
-5. Re-capture with `GET /screen` and verify the expected state change.
-6. Repeat until objective is complete.
+2. Identify likely target region and compute an initial confidence score.
+3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
+4. **Before any click**, verify target identity (OCR text/icon/location consistency).
+5. Execute one minimal action via `POST /action`.
+6. Re-capture with `GET /screen` and verify the expected state change.
+7. Repeat until objective is complete.
+
+## Verify-before-click rules
+
+- Never click if target identity is ambiguous.
+- Require at least two matching signals before click (example: OCR text + expected UI region).
+- If confidence is low, do not "test click"; zoom and re-localize first.
+- For high-impact actions (close/delete/send/purchase), use two-phase flow:
+  1) preview intended coordinate + reason
+  2) execute only after explicit confirmation.

 ## Precision rules

 - Prefer grid targets first, then use `dx/dy` for subcell precision.
 - Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
 - Use zoom before guessing offsets.
+- Avoid stale coordinates: re-capture before action if UI moved/scrolled.

 ## Safety rules

 - Respect `dry_run` and `allowed_region` restrictions from `/health`.
+- Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
 - Avoid destructive shortcuts unless explicitly requested.
 - Send one action at a time unless deterministic; then use `/batch`.

@@ -33,3 +91,20 @@ Use a strict observe-decide-act-verify loop.
 - After every meaningful action, verify with a fresh screenshot.
 - On mismatch, do not spam clicks: zoom, re-localize, and retry once.
 - Prefer short, reversible actions over long macros.
+- If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
+
+## App-specific playbooks (recommended)
+
+Build per-app routines for repetitive tasks instead of generic clicking.
+
+### Spotify playbook
+
+- Focus app window before search/navigation.
+- Prefer keyboard-first flow for song start:
+  1) `Ctrl+L` (search)
+  2) type exact query
+  3) Enter
+  4) verify exact song+artist text
+  5) click/double-click row
+  6) verify now-playing bar
+- If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
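
The "at least two matching signals" rule from the verify-before-click section of `skill/SKILL.md` above could be checked on the agent side along these lines. This is an illustrative sketch only: `matching_signals` and `should_click` are hypothetical names, the two signals (OCR text and expected UI region) follow the example given in the skill, and reusing `0.85` as an OCR confidence floor is an assumption, since the skill states that threshold for the agent's own localization confidence.

```python
def matching_signals(block, expected_text, expected_region):
    """List which independent signals agree that `block` is the target."""
    signals = []
    # Signal 1: OCR text matches the expected label exactly (case-insensitive).
    if block["text"].strip().casefold() == expected_text.strip().casefold():
        signals.append("ocr_text")
    # Signal 2: the bbox center falls inside the region where the control is expected.
    bbox = block["bbox"]
    cx = bbox["x"] + bbox["width"] // 2
    cy = bbox["y"] + bbox["height"] // 2
    inside_x = expected_region["x"] <= cx < expected_region["x"] + expected_region["width"]
    inside_y = expected_region["y"] <= cy < expected_region["y"] + expected_region["height"]
    if inside_x and inside_y:
        signals.append("expected_region")
    return signals


def should_click(block, expected_text, expected_region, min_confidence=0.85):
    """Allow a click only with enough OCR confidence and two matching signals."""
    if block["confidence"] < min_confidence:
        return False
    return len(matching_signals(block, expected_text, expected_region)) >= 2


# Block shaped like an entry of `result.blocks` from `POST /ocr`.
block = {
    "text": "Settings",
    "confidence": 0.9821,
    "bbox": {"x": 144, "y": 92, "width": 96, "height": 21},
}
toolbar = {"x": 100, "y": 80, "width": 200, "height": 60}
footer = {"x": 0, "y": 1000, "width": 1920, "height": 80}

print(should_click(block, "settings", toolbar))
print(should_click(block, "settings", footer))
```

A `False` result maps to the skill's guidance to zoom and re-localize rather than "test click"; high-impact actions would still go through the two-phase preview and confirmation flow regardless of this check.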
Author	SHA1	Message	Date
Luna	a8f2e01bb9	fix(ocr): allow configuring tesseract path	2026-04-06 19:02:50 +02:00
Luna	dccf7b209a	docs: add MIT license	2026-04-06 18:31:48 +02:00
Luna	89cf228d13	feat(ocr): add /ocr endpoint for text extraction (Merge PR #7: add OCR endpoint and skill/docs updates)	2026-04-06 13:53:01 +02:00
Luna	a6d7e37beb	docs(skill): include OCR endpoint workflow guidance	2026-04-06 13:50:34 +02:00
Luna	097c6a095c	feat(ocr): add /ocr endpoint for screen, region, and image input	2026-04-06 13:48:33 +02:00
Luna	2955426f14	docs(skill): clarify user-owned instance setup responsibilities	2026-04-05 20:35:35 +02:00
Luna	3a49560e82	docs(skill): add instance setup and mini API quick reference	2026-04-05 20:34:14 +02:00
Luna	2b84bf95f1	docs(skill): add verify-first workflow and app-specific playbooks	2026-04-05 20:32:19 +02:00