fix(ocr): allow configuring tesseract path

docs: add MIT license
feat(ocr): add /ocr endpoint for text extraction
2026-04-06 19:02:50 +02:00 · 2026-04-06 18:31:48 +02:00 · 2026-04-06 13:53:01 +02:00 · 2026-04-06 13:50:34 +02:00 · 2026-04-06 13:48:33 +02:00 · 2026-04-05 20:35:35 +02:00
8 changed files with 553 additions and 10 deletions
--- a/.env.example
+++ b/.env.example
@@ -7,3 +7,11 @@ CLICKTHROUGH_DRY_RUN=false
 CLICKTHROUGH_GRID_ROWS=12
 CLICKTHROUGH_GRID_COLS=12
 # CLICKTHROUGH_ALLOWED_REGION=0,0,1920,1080
 CLICKTHROUGH_EXEC_ENABLED=true
 CLICKTHROUGH_EXEC_SECRET=replace-with-a-strong-random-secret
 CLICKTHROUGH_EXEC_DEFAULT_SHELL=powershell
 CLICKTHROUGH_EXEC_TIMEOUT_S=30
 CLICKTHROUGH_EXEC_MAX_TIMEOUT_S=120
 CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS=20000
 # CLICKTHROUGH_TESSERACT_CMD=/usr/bin/tesseract
--- a/21
+++ b/21
@@ -0,0 +1,21 @@
 MIT License
 Copyright (c) 2026 Paul W.
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
--- a/README.md
+++ b/README.md
@@ -7,6 +7,8 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
 - **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
 - **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
 - **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
@@ -22,6 +24,8 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
 Server defaults to `127.0.0.1:8123`.
 For OCR support, install the native `tesseract` binary on the host (in addition to Python deps), or point `CLICKTHROUGH_TESSERACT_CMD` at the executable if it is not on `PATH`.
 `python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
 ## Minimal API flow
@@ -48,6 +52,13 @@ Environment variables:
 - `CLICKTHROUGH_GRID_ROWS` (default `12`)
 - `CLICKTHROUGH_GRID_COLS` (default `12`)
 - `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
 - `CLICKTHROUGH_EXEC_ENABLED` (default `true`)
 - `CLICKTHROUGH_EXEC_SECRET` (**required for `/exec` to run**)
 - `CLICKTHROUGH_EXEC_DEFAULT_SHELL` (default `powershell`; one of `powershell`, `bash`, `cmd`)
 - `CLICKTHROUGH_EXEC_TIMEOUT_S` (default `30`)
 - `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` (default `120`)
 - `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
 - `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
 ## Gitea CI
--- a/TODO.md
+++ b/TODO.md
@@ -17,5 +17,12 @@
 - CI workflow runs syntax checks on push + PR
 ## Next
- Manual runtime test on a desktop session (capture + click loop)
+- [x] Add `POST /exec` endpoint (PowerShell/Bash/CMD) with timeout + stdout/stderr
- Optional: add monitor selection and OCR helper endpoint
+- [x] Add exec configuration via env (`CLICKTHROUGH_EXEC_*`)
 - [x] Document exec API + config
 - [x] Create backlog issues for OCR/find/window/input/session-state improvements
 - [x] Open PR for exec feature branch and review/merge
 - [x] Require configured exec secret + per-request exec secret header
 - [x] Upgrade skill with verify-before-click rules, confidence thresholds, two-phase risky actions, and Spotify playbook
 - [x] Add top-level skill section for instance setup + mini API docs
 - [x] Clarify user-owned setup responsibilities vs agent responsibilities in skill docs
--- a/docs/API.md
+++ b/docs/API.md
@@ -10,7 +10,7 @@ x-clickthrough-token: <token>
 ## `GET /health`
-Returns status and runtime safety flags.
+Returns status and runtime safety flags, including `exec` capability config.
 ## `GET /screen`
@@ -143,6 +143,105 @@ Hotkey:
 }
 ```
 ## `POST /ocr`
 Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
 Body:
 ```json
 {
  "mode": "screen",
  "language_hint": "eng",
  "min_confidence": 0.4
 }
 ```
 Modes:
 - `screen` (default): OCR over full captured monitor
 - `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
 - `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
 Region mode example:
 ```json
 {
  "mode": "region",
  "region_x": 220,
  "region_y": 160,
  "region_width": 900,
  "region_height": 400,
  "language_hint": "eng",
  "min_confidence": 0.5
 }
 ```
 Image mode example:
 ```json
 {
  "mode": "image",
  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
  "language_hint": "eng"
 }
 ```
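A minimal sketch of building an `image` mode body from raw bytes. The helper name `build_image_ocr_body` and the `as_data_url` flag are hypothetical and not part of Clickthrough; the only grounded facts are that `image_base64` accepts plain base64 or a data URL.

```python
import base64
from typing import Optional


def build_image_ocr_body(image_bytes: bytes,
                         language_hint: Optional[str] = None,
                         as_data_url: bool = False) -> dict:
    """Build a JSON body for POST /ocr with mode=image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    if as_data_url:
        # The MIME type is an assumption; the caller should match the real image format.
        encoded = "data:image/png;base64," + encoded
    body = {"mode": "image", "image_base64": encoded}
    if language_hint:
        body["language_hint"] = language_hint
    return body
```

The resulting dict can be serialized with `json.dumps` and sent with any HTTP client.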
 Response shape:
 ```json
 {
  "ok": true,
  "request_id": "...",
  "time_ms": 1710000000000,
  "result": {
    "mode": "screen",
    "language_hint": "eng",
    "min_confidence": 0.4,
    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
    "blocks": [
      {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
      }
    ]
  }
 }
 ```
 Notes:
 - `blocks` are returned in a stable order: top-to-bottom, then left-to-right.
 - `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
 - Requires `tesseract` executable plus Python package `pytesseract`.
 - If `tesseract` is not on `PATH`, set `CLICKTHROUGH_TESSERACT_CMD` to the full executable path.
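As a sketch of how a client might consume the response shape above, the following looks up a block by its text and derives a click point from its `bbox`. The helper names `find_block` and `bbox_center` are illustrative only, and the sample data is copied from the response example.

```python
def find_block(blocks: list, needle: str):
    """Return the first OCR block whose text equals needle (case-insensitive), else None."""
    wanted = needle.strip().lower()
    for block in blocks:
        if block["text"].strip().lower() == wanted:
            return block
    return None


def bbox_center(bbox: dict) -> tuple:
    """Center pixel of a bbox given as x/y/width/height (integer division)."""
    return (bbox["x"] + bbox["width"] // 2, bbox["y"] + bbox["height"] // 2)


# Sample taken from the documented response shape.
sample_blocks = [
    {
        "text": "Settings",
        "confidence": 0.9821,
        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21},
    }
]

match = find_block(sample_blocks, "settings")
if match is not None:
    print(bbox_center(match["bbox"]))  # prints (192, 102)
```

For `screen` and `region` modes the resulting point is in global screen space; for `image` mode it is relative to the supplied image.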
 ## `POST /exec`
 Execute a shell command on the host running Clickthrough.
 Requirements:
 - `CLICKTHROUGH_EXEC_SECRET` must be configured on the server
 - send header `x-clickthrough-exec-secret: <secret>`
 ```json
 {
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
 }
 ```
 Notes:
 - `shell` supports `powershell`, `bash`, `cmd`
 - if `shell` is omitted, server uses `CLICKTHROUGH_EXEC_DEFAULT_SHELL`
 - output is truncated based on `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS`
 - endpoint can be disabled with `CLICKTHROUGH_EXEC_ENABLED=false`
 - if `CLICKTHROUGH_EXEC_SECRET` is not configured on the server, `/exec` is blocked (`403`); a missing or mismatched `x-clickthrough-exec-secret` header returns `401`
 Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution metadata.
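The timeout and truncation behaviour described in the notes can be sketched as two small functions. They mirror the logic in this diff's `_exec_command` and `_truncate_text`, but the names and default arguments here are illustrative; the defaults are taken from the documented environment variable defaults.

```python
from typing import Optional, Tuple


def effective_timeout(requested: Optional[int],
                      default_s: int = 30,
                      max_s: int = 120) -> int:
    """Use the requested timeout if given, else the default, capped at the server maximum."""
    chosen = requested if requested is not None else default_s
    return min(chosen, max_s)


def truncate_text(text: str, limit: int = 20000) -> Tuple[str, bool]:
    """Return (text, truncated_flag), cutting text to at most `limit` characters."""
    if len(text) <= limit:
        return text, False
    return text[:limit], True
```

A request with `timeout_s: 600` passes validation but would therefore run with at most `CLICKTHROUGH_EXEC_MAX_TIMEOUT_S` seconds.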
 ## `POST /batch`
 Runs multiple `action` payloads sequentially.
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,3 +4,4 @@ python-dotenv>=1.0.1
 mss>=9.0.1
 pillow>=10.4.0
 pyautogui>=0.9.54
 pytesseract>=0.3.10
--- a/server/app.py
+++ b/server/app.py
@@ -1,6 +1,8 @@
 import base64
 import hmac
 import io
 import os
 import subprocess
 import time
 import uuid
 from typing import Literal, Optional
@@ -43,6 +45,13 @@ SETTINGS = {
    "default_grid_rows": int(os.getenv("CLICKTHROUGH_GRID_ROWS", "12")),
    "default_grid_cols": int(os.getenv("CLICKTHROUGH_GRID_COLS", "12")),
    "allowed_region": _parse_allowed_region(),
    "exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
    "exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
    "exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
    "tesseract_cmd": os.getenv("CLICKTHROUGH_TESSERACT_CMD", "").strip(),
 }
@@ -130,6 +139,35 @@ class BatchRequest(BaseModel):
    stop_on_error: bool = True
 class ExecRequest(BaseModel):
    command: str = Field(min_length=1, max_length=10000)
    shell: Literal["powershell", "bash", "cmd"] | None = None
    timeout_s: int | None = Field(default=None, ge=1, le=600)
    cwd: str | None = None
    dry_run: bool = False
 class OCRRequest(BaseModel):
    mode: Literal["screen", "region", "image"] = "screen"
    region_x: int | None = Field(default=None, ge=0)
    region_y: int | None = Field(default=None, ge=0)
    region_width: int | None = Field(default=None, gt=0)
    region_height: int | None = Field(default=None, gt=0)
    image_base64: str | None = None
    language_hint: str | None = Field(default=None, min_length=1, max_length=64)
    min_confidence: float = Field(default=0.0, ge=0.0, le=1.0)
    @model_validator(mode="after")
    def _validate_mode_inputs(self):
        if self.mode == "region":
            required = [self.region_x, self.region_y, self.region_width, self.region_height]
            if any(v is None for v in required):
                raise ValueError("region_x, region_y, region_width, region_height are required for mode=region")
        if self.mode == "image" and not self.image_base64:
            raise ValueError("image_base64 is required for mode=image")
        return self
 def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
    token = SETTINGS["token"]
    if token and x_clickthrough_token != token:
@@ -259,6 +297,212 @@ def _import_input_lib():
        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
 def _import_ocr_libs():
    try:
        import pytesseract
        from pytesseract import Output
        tesseract_cmd = SETTINGS["tesseract_cmd"]
        if tesseract_cmd:
            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
        return pytesseract, Output
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"ocr backend unavailable: {exc}") from exc
 def _decode_image_base64(value: str):
    Image, _, _ = _import_capture_libs()
    payload = value.strip()
    if payload.startswith("data:"):
        parts = payload.split(",", 1)
        if len(parts) != 2:
            raise HTTPException(status_code=400, detail="invalid data URL image payload")
        payload = parts[1]
    try:
        image_bytes = base64.b64decode(payload, validate=True)
    except Exception as exc:
        raise HTTPException(status_code=400, detail="invalid image_base64 payload") from exc
    try:
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    except Exception as exc:
        raise HTTPException(status_code=400, detail="unsupported or unreadable image bytes") from exc
    return image
 def _run_ocr(image, language_hint: str | None, min_confidence: float, offset_x: int = 0, offset_y: int = 0) -> list[dict]:
    pytesseract, Output = _import_ocr_libs()
    config = "--oem 3 --psm 6"
    kwargs = {
        "image": image,
        "output_type": Output.DICT,
        "config": config,
    }
    if language_hint:
        kwargs["lang"] = language_hint
    try:
        data = pytesseract.image_to_data(**kwargs)
    except pytesseract.TesseractNotFoundError as exc:
        raise HTTPException(status_code=500, detail="tesseract executable not found") from exc
    except pytesseract.TesseractError as exc:
        raise HTTPException(status_code=400, detail=f"ocr failed: {exc}") from exc
    blocks = []
    count = len(data.get("text", []))
    for idx in range(count):
        text = (data["text"][idx] or "").strip()
        if not text:
            continue
        raw_conf = str(data["conf"][idx]).strip()
        try:
            conf_0_100 = float(raw_conf)
        except ValueError:
            conf_0_100 = -1.0
        if conf_0_100 < 0:
            continue
        confidence = round(conf_0_100 / 100.0, 4)
        if confidence < min_confidence:
            continue
        left = int(data["left"][idx])
        top = int(data["top"][idx])
        width = int(data["width"][idx])
        height = int(data["height"][idx])
        blocks.append(
            {
                "text": text,
                "confidence": confidence,
                "bbox": {
                    "x": left + offset_x,
                    "y": top + offset_y,
                    "width": width,
                    "height": height,
                },
                "_sort": [top + offset_y, left + offset_x, idx],
            }
        )
    blocks.sort(key=lambda b: (b["_sort"][0], b["_sort"][1], b["_sort"][2]))
    for block in blocks:
        block.pop("_sort", None)
    return blocks
 def _pick_shell(explicit_shell: str | None) -> str:
    shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
    if shell_name not in {"powershell", "bash", "cmd"}:
        raise HTTPException(status_code=400, detail="unsupported shell")
    return shell_name
 def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
    if len(text) <= limit:
        return text, False
    return text[:limit], True
 def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
    if shell_name == "powershell":
        return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
    if shell_name == "bash":
        return ["bash", "-lc", command]
    if shell_name == "cmd":
        return ["cmd", "/c", command]
    raise HTTPException(status_code=400, detail="unsupported shell")
 def _exec_command(req: ExecRequest) -> dict:
    if not SETTINGS["exec_enabled"]:
        raise HTTPException(status_code=403, detail="exec endpoint disabled")
    if not SETTINGS["exec_secret"]:
        raise HTTPException(status_code=403, detail="exec secret not configured")
    run_dry = SETTINGS["dry_run"] or req.dry_run
    shell_name = _pick_shell(req.shell)
    timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
    timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])
    cwd = None
    if req.cwd:
        cwd = os.path.abspath(req.cwd)
        if not os.path.isdir(cwd):
            raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
    argv = _resolve_exec_program(shell_name, req.command)
    if run_dry:
        return {
            "executed": False,
            "dry_run": True,
            "shell": shell_name,
            "command": req.command,
            "argv": argv,
            "timeout_s": timeout_s,
            "cwd": cwd,
        }
    start = time.time()
    try:
        completed = subprocess.run(
            argv,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired as exc:
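         stdout = exc.stdout or ""
         stderr = exc.stderr or ""
         # TimeoutExpired may carry bytes even with text=True; decode before truncating.
         if isinstance(stdout, bytes):
             stdout = stdout.decode(errors="replace")
         if isinstance(stderr, bytes):
             stderr = stderr.decode(errors="replace")
         stdout, stdout_truncated = _truncate_text(stdout, SETTINGS["exec_max_output_chars"])
         stderr, stderr_truncated = _truncate_text(stderr, SETTINGS["exec_max_output_chars"])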
        return {
            "executed": True,
            "timed_out": True,
            "shell": shell_name,
            "command": req.command,
            "argv": argv,
            "timeout_s": timeout_s,
            "cwd": cwd,
            "duration_ms": int((time.time() - start) * 1000),
            "exit_code": None,
            "stdout": stdout,
            "stderr": stderr,
            "stdout_truncated": stdout_truncated,
            "stderr_truncated": stderr_truncated,
        }
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc
    stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
    stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
    return {
        "executed": True,
        "timed_out": False,
        "shell": shell_name,
        "command": req.command,
        "argv": argv,
        "timeout_s": timeout_s,
        "cwd": cwd,
        "duration_ms": int((time.time() - start) * 1000),
        "exit_code": completed.returncode,
        "stdout": stdout,
        "stderr": stderr,
        "stdout_truncated": stdout_truncated,
        "stderr_truncated": stderr_truncated,
    }
 def _exec_action(req: ActionRequest) -> dict:
    run_dry = SETTINGS["dry_run"] or req.dry_run
@@ -331,6 +575,13 @@ def health(_: None = Depends(_auth)):
        "request_id": _request_id(),
        "dry_run": SETTINGS["dry_run"],
        "allowed_region": SETTINGS["allowed_region"],
        "exec": {
            "enabled": SETTINGS["exec_enabled"],
            "secret_configured": bool(SETTINGS["exec_secret"]),
            "default_shell": SETTINGS["exec_default_shell"],
            "default_timeout_s": SETTINGS["exec_default_timeout_s"],
            "max_timeout_s": SETTINGS["exec_max_timeout_s"],
        },
    }
@@ -449,6 +700,76 @@ def action(req: ActionRequest, _: None = Depends(_auth)):
    }
@app.post("/exec")
 def exec_command(
    req: ExecRequest,
    x_clickthrough_exec_secret: Optional[str] = Header(default=None),
    _: None = Depends(_auth),
 ):
    expected = SETTINGS["exec_secret"]
    if not expected:
        raise HTTPException(status_code=403, detail="exec secret not configured")
    if not x_clickthrough_exec_secret or not hmac.compare_digest(x_clickthrough_exec_secret, expected):
        raise HTTPException(status_code=401, detail="invalid exec secret")
    result = _exec_command(req)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": result,
    }
@app.post("/ocr")
 def ocr(req: OCRRequest, _: None = Depends(_auth)):
    source = req.mode
    if source == "image":
        image = _decode_image_base64(req.image_base64 or "")
        region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
        blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
    else:
        base_img, mon = _capture_screen()
        if source == "screen":
            image = base_img
            region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
            offset_x = mon["x"]
            offset_y = mon["y"]
        else:
            left = req.region_x - mon["x"]
            top = req.region_y - mon["y"]
            right = left + req.region_width
            bottom = top + req.region_height
            if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
                raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
            image = base_img.crop((left, top, right, bottom))
            region = {
                "x": req.region_x,
                "y": req.region_y,
                "width": req.region_width,
                "height": req.region_height,
            }
            offset_x = req.region_x
            offset_y = req.region_y
        blocks = _run_ocr(image, req.language_hint, req.min_confidence, offset_x, offset_y)
    return {
        "ok": True,
        "request_id": _request_id(),
        "time_ms": _now_ms(),
        "result": {
            "mode": source,
            "language_hint": req.language_hint,
            "min_confidence": req.min_confidence,
            "region": region,
            "blocks": blocks,
        },
    }
@app.post("/batch")
 def batch(req: BatchRequest, _: None = Depends(_auth)):
    results = []
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,30 +1,88 @@
 ---
 name: clickthrough-http-control
-description: Control a local computer through the Clickthrough HTTP server using screenshot grids, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
+description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
 ---
 # Clickthrough HTTP Control
 Use a strict observe-decide-act-verify loop.
-## Workflow
+## Getting a computer instance (user-owned setup)
 The **user/operator** is responsible for provisioning and exposing the target machine.
 The agent should not assume it can self-install this stack.
 ### What the user must do
 1. Install dependencies and run Clickthrough on the target computer (default bind: `127.0.0.1:8123`).
 2. Expose an access path to the agent (LAN/Tailscale/reverse proxy) and provide the base URL.
 3. Configure secrets on the target machine:
   - `CLICKTHROUGH_TOKEN` for general API auth
   - `CLICKTHROUGH_EXEC_SECRET` for `/exec` calls
 4. Share connection details with the agent through a secure channel:
   - `base_url`
   - `x-clickthrough-token`
   - `x-clickthrough-exec-secret` (only when `/exec` is needed)
 ### What the agent should do
 1. Validate connection with `GET /health` using provided headers.
 2. Refuse `/exec` attempts when exec secret is missing/invalid.
 3. Ask user for missing setup inputs instead of guessing infrastructure.
 ## Mini API map
 - `GET /health` → server status + safety flags
 - `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
 - `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
 - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
 - `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
 - `POST /batch` → sequential action list
 - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
 ### OCR usage
 - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
 - Use `mode=screen` for discovery, then `mode=region` for precision and speed.
 - Use `language_hint` when known (for example `eng`) to improve consistency.
 - Filter noise with `min_confidence` (start around `0.4` and tune per app).
 - Treat OCR as one signal, not the only signal, before high-impact clicks.
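One way to move from `mode=screen` discovery to a `mode=region` follow-up is to pad the `bbox` of a discovered block. This is a sketch only: `region_body_around` and its `padding` parameter are hypothetical, and clamping at zero assumes the monitor origin is `(0, 0)`. It does not clamp the right or bottom edge, so a region that extends past the captured monitor would still be rejected by the server.

```python
from typing import Optional


def region_body_around(bbox: dict, padding: int = 20,
                       language_hint: Optional[str] = None,
                       min_confidence: float = 0.4) -> dict:
    """Build a mode=region OCR body covering bbox plus padding (hypothetical helper)."""
    x = max(0, bbox["x"] - padding)
    y = max(0, bbox["y"] - padding)
    # Width/height include the padding actually applied on the left/top side.
    width = (bbox["x"] - x) + bbox["width"] + padding
    height = (bbox["y"] - y) + bbox["height"] + padding
    body = {
        "mode": "region",
        "region_x": x,
        "region_y": y,
        "region_width": width,
        "region_height": height,
        "min_confidence": min_confidence,
    }
    if language_hint:
        body["language_hint"] = language_hint
    return body
```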
 ### Header requirements
 - Always send `x-clickthrough-token` when token auth is enabled.
 - For `/exec`, also send `x-clickthrough-exec-secret`.
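The header rules can be expressed as a small helper. This is a sketch under the assumption that the agent keeps both secrets in memory; `build_headers` is an illustrative name and not part of Clickthrough.

```python
from typing import Optional


def build_headers(token: Optional[str] = None,
                  exec_secret: Optional[str] = None) -> dict:
    """Return request headers; pass exec_secret only for /exec calls."""
    headers = {}
    if token:
        headers["x-clickthrough-token"] = token
    if exec_secret:
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers
```

Passing `exec_secret` only on `/exec` calls keeps the secret out of every other request.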
 ## Core workflow (mandatory)
 1. Call `GET /screen` with coarse grid (e.g., 12x12).
-2. Identify likely cell/region for the target UI element.
+2. Identify likely target region and compute an initial confidence score.
-3. If confidence is low, call `POST /zoom` centered on the candidate and use denser grid (e.g., 20x20).
+3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
-4. Execute one minimal action via `POST /action`.
+4. **Before any click**, verify target identity (OCR text/icon/location consistency).
-5. Re-capture with `GET /screen` and verify the expected state change.
+5. Execute one minimal action via `POST /action`.
-6. Repeat until objective is complete.
+6. Re-capture with `GET /screen` and verify the expected state change.
 7. Repeat until objective is complete.
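The decision points in steps 3 to 5 can be sketched as a pure function. How the confidence score is computed is left to the agent; `next_step` and its return values are hypothetical names, and only the `0.85` threshold comes from the workflow.

```python
ZOOM_THRESHOLD = 0.85  # from step 3 of the core workflow


def next_step(confidence: float, identity_verified: bool) -> str:
    """Decide what the agent should do next for a candidate target."""
    if confidence < ZOOM_THRESHOLD:
        return "zoom"    # step 3: zoom with a denser grid and re-evaluate
    if not identity_verified:
        return "verify"  # step 4: confirm target identity before any click
    return "act"         # step 5: execute one minimal action
```

After `"act"`, the loop returns to a fresh `GET /screen` capture to verify the state change.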
 ## Verify-before-click rules
 - Never click if target identity is ambiguous.
 - Require at least two matching signals before click (example: OCR text + expected UI region).
 - If confidence is low, do not "test click"; zoom and re-localize first.
 - For high-impact actions (close/delete/send/purchase), use two-phase flow:
  1) preview intended coordinate + reason
  2) execute only after explicit confirmation.
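A minimal sketch of these rules, assuming the two signals are an OCR text match and containment in an expected UI region. The function names and the `high_impact`/`confirmed` flags are illustrative; a real agent may combine other signals.

```python
def bbox_inside(bbox: dict, region: dict) -> bool:
    """True when bbox lies fully inside region (both use x/y/width/height)."""
    return (
        bbox["x"] >= region["x"]
        and bbox["y"] >= region["y"]
        and bbox["x"] + bbox["width"] <= region["x"] + region["width"]
        and bbox["y"] + bbox["height"] <= region["y"] + region["height"]
    )


def may_click(block: dict, expected_text: str, expected_region: dict,
              high_impact: bool = False, confirmed: bool = False) -> bool:
    """Allow a click only when both signals match and any required confirmation was given."""
    text_ok = block["text"].strip().lower() == expected_text.strip().lower()
    region_ok = bbox_inside(block["bbox"], expected_region)
    if not (text_ok and region_ok):
        return False
    # High-impact actions wait for explicit confirmation (two-phase flow).
    return confirmed or not high_impact
```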
 ## Precision rules
 - Prefer grid targets first, then use `dx/dy` for subcell precision.
 - Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
 - Use zoom before guessing offsets.
 - Avoid stale coordinates: re-capture before action if UI moved/scrolled.
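To illustrate subcell offsets, here is a sketch of one possible mapping from a grid cell plus `dx/dy` to a pixel. The convention used (offset `0,0` is the cell center, `-1` and `1` are the cell edges) is an assumption made for this example; the authoritative mapping is the coordinate transform metadata returned by the visual endpoints, which is not shown in this diff.

```python
def cell_to_pixel(row: int, col: int, rows: int, cols: int, region: dict,
                  dx: float = 0.0, dy: float = 0.0) -> tuple:
    """Map a zero-based grid cell and a subcell offset to a pixel (assumed convention)."""
    if not (-1.0 <= dx <= 1.0 and -1.0 <= dy <= 1.0):
        raise ValueError("dx and dy must be within [-1, 1]")
    cell_w = region["width"] / cols
    cell_h = region["height"] / rows
    # Start at the cell center, then shift by up to half a cell in each direction.
    x = region["x"] + (col + 0.5) * cell_w + dx * cell_w / 2
    y = region["y"] + (row + 0.5) * cell_h + dy * cell_h / 2
    return round(x), round(y)
```

On a 1920x1080 capture with a 12x12 grid, each cell is 160x90 pixels, so under this convention an offset of `0.25` shifts the target by 20 pixels horizontally.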
 ## Safety rules
 - Respect `dry_run` and `allowed_region` restrictions from `/health`.
 - Respect `/exec` security requirements (`CLICKTHROUGH_EXEC_SECRET` + `x-clickthrough-exec-secret`).
 - Avoid destructive shortcuts unless explicitly requested.
 - Send one action at a time unless deterministic; then use `/batch`.
@@ -33,3 +91,20 @@ Use a strict observe-decide-act-verify loop.
 - After every meaningful action, verify with a fresh screenshot.
 - On mismatch, do not spam clicks: zoom, re-localize, and retry once.
 - Prefer short, reversible actions over long macros.
 - If two retries fail, switch strategy (hotkey/window focus/search) instead of repeating the same click.
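The retry rules above can be sketched with injected callables, so the policy stays independent of any HTTP client. All names here are hypothetical; `verified` stands for a fresh-screenshot check and `relocalize` for the zoom and re-localize step.

```python
from typing import Callable


def act_with_retries(act: Callable[[], None],
                     verified: Callable[[], bool],
                     relocalize: Callable[[], None],
                     switch_strategy: Callable[[], None],
                     max_retries: int = 2) -> str:
    """Act once, retry after re-localizing, then switch strategy instead of repeating clicks."""
    act()
    if verified():
        return "ok"
    for _ in range(max_retries):
        relocalize()  # zoom and re-localize before each retry
        act()
        if verified():
            return "ok"
    switch_strategy()  # e.g. hotkey, window focus, or search
    return "switched"
```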
 ## App-specific playbooks (recommended)
 Build per-app routines for repetitive tasks instead of generic clicking.
 ### Spotify playbook
 - Focus app window before search/navigation.
 - Prefer keyboard-first flow for song start:
  1) `Ctrl+L` (search)
  2) type exact query
  3) Enter
  4) verify exact song+artist text
  5) click/double-click row
  6) verify now-playing bar
 - If now-playing does not match target track, stop and re-localize; do not keep clicking nearby rows.
Author	SHA1	Message	Date
Luna	a8f2e01bb9	fix(ocr): allow configuring tesseract path	2026-04-06 19:02:50 +02:00
Luna	dccf7b209a	docs: add MIT license	2026-04-06 18:31:48 +02:00
Luna	89cf228d13	feat(ocr): add /ocr endpoint for text extraction (Merge PR #7: add OCR endpoint and skill/docs updates)	2026-04-06 13:53:01 +02:00
Luna	a6d7e37beb	docs(skill): include OCR endpoint workflow guidance	2026-04-06 13:50:34 +02:00
Luna	097c6a095c	feat(ocr): add /ocr endpoint for screen, region, and image input	2026-04-06 13:48:33 +02:00
Luna	2955426f14	docs(skill): clarify user-owned instance setup responsibilities	2026-04-05 20:35:35 +02:00
Luna	3a49560e82	docs(skill): add instance setup and mini API quick reference	2026-04-05 20:34:14 +02:00
Luna	2b84bf95f1	docs(skill): add verify-first workflow and app-specific playbooks	2026-04-05 20:32:19 +02:00
Luna	1efe999331	feat(exec): add low-friction shell execution endpoint	2026-04-05 20:23:48 +02:00
Luna	38c1127347	feat(exec): require configured secret and header auth for /exec	2026-04-05 20:22:18 +02:00
Luna	930cdd2887	feat(exec): add shell command execution endpoint	2026-04-05 20:18:07 +02:00