refactor: simplify to see/interact/exec and split server modules

2026-05-03 20:07:12 +02:00
parent aced5be25e
commit 1c03cab457
8 changed files with 911 additions and 1928 deletions
--- a/README.md
+++ b/README.md
@@ -1,49 +1,37 @@
 # Clickthrough
-Let an agent interact with a computer over HTTP.
+Clickthrough is a lightweight HTTP control layer that lets an AI agent safely operate a real computer through three methods under one consistent response contract: `see` captures structured screenshots with coordinate-aware grids, `interact` executes precise mouse/keyboard actions at those coordinates, and `exec` optionally runs authenticated shell commands for system-level tasks.
-## Primary mode (v2)
+## Core Methods
-Use the v2 contract for faster, less OCR-heavy control loops:
+- `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
- `POST /v2/observe`
+- `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
- `POST /v2/localize`
+- `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
- `POST /v2/act`
+- `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.
 - `POST /v2/act-verify`
-This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.
+## Why this works for AI agents
-## What this provides
+- Agents do not need live vision; they iterate on snapshots.
 - Grid metadata bridges image understanding to deterministic click coordinates.
 - Interaction stays explicit and auditable (one action per request).
 - A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.
- Screen/region capture with optional OCR and timing stats
+## Minimal Agent Loop
 - Observation IDs for deterministic follow-up localization
 - Text localization and image-tool coordinate localization
 - Action execution with resolved target IDs
 - Risk-aware action+verification defaults
 - Unified response envelope across all endpoints
-## Quick start
+1. Call `see` with a coarse grid.
 2. If uncertain, call `see/zoom` with a denser grid.
 3. Call `interact` once.
 4. Call `see` again to verify state change.
 5. Use `exec` only for explicit shell/system tasks.
-```bash
+## Safety and Auth
 cd /root/external-projects/clickthrough
 python3 -m venv .venv
 . .venv/bin/activate
 pip install -r requirements.txt
 CLICKTHROUGH_TOKEN=change-me python -m server.app
 ```
-Server defaults to `127.0.0.1:8123`.
+- `x-clickthrough-token` protects API access when enabled.
 - `x-clickthrough-exec-secret` is required for `/exec`.
 - Optional dry-run and allowed-region constraints reduce accidental risk.
-## Fast control loop
+## Docs
-1. `POST /v2/observe` on a tight region
+- API: `docs/API.md`
-2. If OCR is enough, `POST /v2/localize` with `text_query`
+- Agent procedure: `skill/SKILL.md`
-3. If ambiguous, ask image tool for one x,y in observation bounds
+- Coordinate system details: `docs/coordinate-system.md`
 4. `POST /v2/localize` with `image_tool_point`
 5. `POST /v2/act` or `POST /v2/act-verify`
 6. Re-observe only changed region
 ## See docs
 - `docs/API.md`
 - `skill/SKILL.md`
 - `docs/coordinate-system.md`
--- a/docs/API.md
+++ b/docs/API.md
@@ -1,116 +1,21 @@
-# API Reference (v2)
+# API Reference
 Base URL: `http://127.0.0.1:8123`
-If `CLICKTHROUGH_TOKEN` is set, include:
+Auth header when enabled:
 ```http
 x-clickthrough-token: <token>
 ```
-## Endpoints
+This API is intended for AI computer control through 3 methods only:
 - `see`
 - `interact`
 - `exec`
- `POST /v2/observe`
+All responses use one envelope.
 - `POST /v2/localize`
 - `POST /v2/act`
 - `POST /v2/act-verify`
 - `GET /health`
 - `GET /displays`
 - `GET /windows`
 - `POST /windows/action`
 - `POST /launch`
 - `POST /exec`
-No v1 endpoints are supported.
+## Response Envelope
 ## `POST /v2/observe`
 ```json
 {
  "mode": "region",
  "region_x": 800,
  "region_y": 420,
  "region_width": 700,
  "region_height": 420,
  "include_image": true,
  "image_format": "jpeg",
  "jpeg_quality": 75,
  "ocr_mode": "region",
  "language_hint": "eng",
  "min_confidence": 0.45,
  "max_ocr_area_px": 1500000,
  "group_lines": true
 }
 ```
 Returns observation metadata, optional image, OCR blocks/lines, and timing fields.
 ## `POST /v2/localize`
 Text localization:
 ```json
 {
  "observation_id": "...",
  "text_query": "Save",
  "text_match": "exact",
  "candidate_index": 0
 }
 ```
 Image-tool point localization:
 ```json
 {
  "observation_id": "...",
  "image_tool_point": {"x": 312, "y": 188}
 }
 ```
 Returns `resolved_target_id`, global pixel, and `localization_confidence`.
 ## `POST /v2/act`
 ```json
 {
  "action": {
    "action": "click",
    "target": {"resolved_target_id": "..."},
    "button": "left",
    "clicks": 1
  }
 }
 ```
 ## `POST /v2/act-verify`
 ```json
 {
  "action": {
    "action": "click",
    "target": {"resolved_target_id": "..."}
  },
  "condition": {
    "kind": "text",
    "mode": "region",
    "text": "Saved",
    "match": "contains",
    "present": true,
    "region_x": 820,
    "region_y": 420,
    "region_width": 500,
    "region_height": 140,
    "min_confidence": 0.4
  },
  "risk_level": "low"
 }
 ```
 Risk defaults:
 - `low`: retries `0`, timeout `2500ms`
 - `high`: retries `1`, timeout `6000ms`
 ## Response envelope
 Success:
@@ -133,9 +38,124 @@ Error:
  "time_ms": 1710000000000,
  "data": null,
  "error": {
-    "code": "http_error",
+    "code": "validation_error",
-    "message": "...",
+    "message": "request validation failed",
-    "details": {}
+    "details": []
  }
 }
 ```
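A client can branch on the envelope once and treat everything after that as plain data. The sketch below is a minimal illustration, assuming every response carries a boolean `ok` field alongside `data` and `error`; `unwrap` and `ClickthroughError` are hypothetical client-side names, not part of the server.

```python
from typing import Any


class ClickthroughError(Exception):
    """Hypothetical client-side error raised for a failed envelope."""

    def __init__(self, code: str, message: str, details: Any = None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def unwrap(envelope: dict[str, Any]) -> Any:
    """Return envelope['data'] on success, raise ClickthroughError otherwise."""
    if envelope.get("ok"):
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise ClickthroughError(
        error.get("code", "unknown_error"),
        error.get("message", ""),
        error.get("details"),
    )
```

With this in place, agent code calls `unwrap(response.json())` and handles one exception type instead of inspecting each endpoint's payload separately.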
 ## 1) See
 ### `POST /see`
 Capture a full screen or a region. Optional grid overlay returns coordinate metadata for click mapping.
 ```json
 {
  "screen": 0,
  "region_x": null,
  "region_y": null,
  "region_width": null,
  "region_height": null,
  "with_grid": true,
  "grid_rows": 12,
  "grid_cols": 12,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 85
 }
 ```
 Returns:
 - `data.image.base64`
 - `data.meta.region` (global desktop coords)
 - `data.meta.grid` (rows/cols/cell size + formula)
 ### `POST /see/zoom`
 Capture a tighter crop around a global point and draw another grid over that crop.
 ```json
 {
  "screen": 0,
  "center_x": 1200,
  "center_y": 720,
  "width": 500,
  "height": 350,
  "with_grid": true,
  "grid_rows": 20,
  "grid_cols": 20,
  "include_labels": true,
  "image_format": "png",
  "jpeg_quality": 90
 }
 ```
 Use this for precision before clicking tiny controls.
 ## 2) Interact
 ### `POST /interact`
 Mouse/keyboard action execution.
 ```json
 {
  "screen": 0,
  "action": {
    "action": "click",
    "target": {
      "mode": "grid",
      "region_x": 0,
      "region_y": 0,
      "region_width": 1920,
      "region_height": 1080,
      "rows": 12,
      "cols": 12,
      "row": 7,
      "col": 3,
      "dx": 0.0,
      "dy": 0.0
    },
    "button": "left",
    "clicks": 1
  }
 }
 ```
 Supported actions:
 - `move`, `click`, `right_click`, `double_click`, `middle_click`
 - `scroll` (`scroll_amount`; requires a `target`, because the pointer is moved there before scrolling)
 - `type` (`text`, `interval_ms`)
 - `hotkey` (`keys`)
 Target modes:
 - `pixel`: absolute global `x,y`
 - `grid`: grid cell from a `see`/`see/zoom` response
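To make the `grid` mode concrete, the sketch below mirrors the cell-to-pixel arithmetic that `resolve_target` in `server/services.py` applies to a grid target. The function name `grid_cell_to_pixel` is illustrative only; the formula is the one reported in `data.meta.grid.point_formula`, with `dx`/`dy` in `[-1, 1]` shifting the point from the cell center toward its edges.

```python
def grid_cell_to_pixel(
    region_x: int,
    region_y: int,
    region_width: int,
    region_height: int,
    rows: int,
    cols: int,
    row: int,
    col: int,
    dx: float = 0.0,
    dy: float = 0.0,
) -> tuple[int, int]:
    """Map a zero-based grid cell (plus an in-cell offset) to a global pixel."""
    cell_w = region_width / cols
    cell_h = region_height / rows
    # +0.5 selects the cell center; dx/dy of +/-1 reach the cell edges.
    x = region_x + int(round((col + 0.5 + dx * 0.5) * cell_w))
    y = region_y + int(round((row + 0.5 + dy * 0.5) * cell_h))
    return x, y


# The request shown above: a 12x12 grid over a 1920x1080 region, row 7, col 3.
print(grid_cell_to_pixel(0, 0, 1920, 1080, rows=12, cols=12, row=7, col=3))  # (560, 675)
```

Each cell here is 160 by 90 pixels, so the center of column 3 sits at `3.5 * 160 = 560` and the center of row 7 at `7.5 * 90 = 675`.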
 ## 3) Exec
 ### `POST /exec`
 Run host shell commands (PowerShell/Bash/CMD).
 ```json
 {
  "command": "Get-Process | Select-Object -First 5",
  "shell": "powershell",
  "timeout_s": 20,
  "cwd": "C:/Users/Paul",
  "dry_run": false
 }
 ```
 Required header:
 ```http
 x-clickthrough-exec-secret: <secret>
 ```
 ## Minimal Procedure for Agents
 1. `see` full screen with coarse grid.
 2. If uncertain, `see/zoom` target area with denser grid.
 3. `interact` one action.
 4. `see` again to confirm state change.
 5. Use `exec` only when GUI interaction is not the right tool.
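Steps 1, 3, and 4 of the procedure can be sketched as a single function. This is an illustration under stated assumptions: `post` is a hypothetical transport that takes a path and a JSON body and returns the decoded envelope, and picking `row`/`col` from the screenshot is left to the agent.

```python
from typing import Callable

# Hypothetical transport: (path, json_body) -> decoded response envelope.
Post = Callable[[str, dict], dict]


def click_cell_and_verify(
    post: Post, row: int, col: int, rows: int = 12, cols: int = 12, screen: int = 0
) -> dict:
    """See, click one grid cell, then see again so the caller can compare."""
    see_body = {"screen": screen, "with_grid": True, "grid_rows": rows, "grid_cols": cols}

    # Step 1: coarse observation; the region comes back in global desktop coords.
    before = post("/see", see_body)
    region = before["data"]["meta"]["region"]

    # Step 3: exactly one action, targeting a cell of the grid just observed.
    target = {
        "mode": "grid",
        "region_x": region["x"],
        "region_y": region["y"],
        "region_width": region["width"],
        "region_height": region["height"],
        "rows": rows,
        "cols": cols,
        "row": row,
        "col": col,
        "dx": 0.0,
        "dy": 0.0,
    }
    action = {"action": "click", "target": target, "button": "left", "clicks": 1}
    acted = post("/interact", {"screen": screen, "action": action})

    # Step 4: observe again; deciding whether the state changed is up to the agent.
    after = post("/see", see_body)
    return {"before": before, "acted": acted, "after": after}
```

Passing the transport in as a parameter keeps the loop testable without a running server, since a stub can stand in for the HTTP layer.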
--- a/examples/quickstart.py
+++ b/examples/quickstart.py
@@ -15,24 +15,25 @@ if TOKEN:
 def main():
    health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
    health.raise_for_status()
-    print("health ok:", health.json().get("ok"))
+    print("health:", health.json()["data"])
-    observe = requests.post(
+    see = requests.post(
-        f"{BASE_URL}/v2/observe",
+        f"{BASE_URL}/see",
        headers=headers,
        params={"screen": SCREEN},
        json={
-            "mode": "screen",
+            "screen": SCREEN,
-            "include_image": False,
+            "with_grid": True,
-            "ocr_mode": "none",
+            "grid_rows": 12,
            "grid_cols": 12,
            "image_format": "jpeg",
            "jpeg_quality": 70,
        },
-        timeout=20,
+        timeout=30,
    )
-    observe.raise_for_status()
+    see.raise_for_status()
-    payload = observe.json()["data"]
+    payload = see.json()["data"]
-    print("observation_id:", payload["observation_id"])
+    print("region:", payload["meta"]["region"])
-    print("region:", payload["region"])
+    print("grid:", payload["meta"].get("grid", {}))
    print("timing_ms:", payload["timing_ms"])
 if __name__ == "__main__":
--- a/server/app.py
+++ b/server/app.py
--- a/server/config.py
+++ b/server/config.py
@@ -0,0 +1,42 @@
 import os
 from typing import Optional
 from dotenv import load_dotenv
 load_dotenv(dotenv_path=".env", override=False)
 def _env_bool(name: str, default: bool) -> bool:
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}
 def _parse_allowed_region() -> Optional[tuple[int, int, int, int]]:
    raw = os.getenv("CLICKTHROUGH_ALLOWED_REGION")
    if not raw:
        return None
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("CLICKTHROUGH_ALLOWED_REGION width/height must be > 0")
    return x, y, w, h
 SETTINGS = {
    "host": os.getenv("CLICKTHROUGH_HOST", "127.0.0.1"),
    "port": int(os.getenv("CLICKTHROUGH_PORT", "8123")),
    "token": os.getenv("CLICKTHROUGH_TOKEN", "").strip(),
    "dry_run": _env_bool("CLICKTHROUGH_DRY_RUN", False),
    "allowed_region": _parse_allowed_region(),
    "exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
    "exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
    "exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
 }
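Note on `CLICKTHROUGH_ALLOWED_REGION`: the value is `x,y,width,height`, and the bounds check in `enforce_allowed_region` is half-open, so the right and bottom edges are excluded. The standalone sketch below reproduces that behavior for illustration; `parse_region` and `contains` are hypothetical names, not functions in the module.

```python
def parse_region(raw: str) -> tuple[int, int, int, int]:
    """Parse 'x,y,width,height' the same way the config loader does."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("region must be x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("region width/height must be > 0")
    return x, y, w, h


def contains(region: tuple[int, int, int, int], x: int, y: int) -> bool:
    """Half-open check: left/top edges are inside, right/bottom edges are not."""
    rx, ry, rw, rh = region
    return rx <= x < rx + rw and ry <= y < ry + rh


region = parse_region("100, 200, 800, 600")
print(contains(region, 100, 200))  # True
print(contains(region, 900, 200))  # False
```

A point at `x = 900` is rejected because the region spans `100 <= x < 900`.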
--- a/server/models.py
+++ b/server/models.py
@@ -0,0 +1,124 @@
 from typing import Literal, Optional
 from pydantic import BaseModel, Field, model_validator
 class PixelTarget(BaseModel):
    mode: Literal["pixel"]
    x: int
    y: int
    dx: int = 0
    dy: int = 0
 class GridTarget(BaseModel):
    mode: Literal["grid"]
    region_x: int
    region_y: int
    region_width: int = Field(gt=0)
    region_height: int = Field(gt=0)
    rows: int = Field(gt=0)
    cols: int = Field(gt=0)
    row: int = Field(ge=0)
    col: int = Field(ge=0)
    dx: float = 0.0
    dy: float = 0.0
    @model_validator(mode="after")
    def _validate_indices(self):
        if self.row >= self.rows or self.col >= self.cols:
            raise ValueError("row/col must be inside rows/cols")
        if not -1.0 <= self.dx <= 1.0:
            raise ValueError("dx must be in [-1, 1]")
        if not -1.0 <= self.dy <= 1.0:
            raise ValueError("dy must be in [-1, 1]")
        return self
 Target = PixelTarget | GridTarget
 class ActionRequest(BaseModel):
    action: Literal[
        "move",
        "click",
        "right_click",
        "double_click",
        "middle_click",
        "scroll",
        "type",
        "hotkey",
    ]
    target: Optional[Target] = None
    duration_ms: int = Field(default=0, ge=0, le=20000)
    button: Literal["left", "right", "middle"] = "left"
    clicks: int = Field(default=1, ge=1, le=10)
    scroll_amount: int = 0
    text: str = ""
    keys: list[str] = Field(default_factory=list)
    interval_ms: int = Field(default=20, ge=0, le=5000)
    dry_run: bool = False
 class ExecRequest(BaseModel):
    command: str = Field(min_length=1, max_length=10000)
    shell: Literal["powershell", "bash", "cmd"] | None = None
    timeout_s: int | None = Field(default=None, ge=1, le=600)
    cwd: str | None = None
    dry_run: bool = False
 class WindowQuery(BaseModel):
    title_contains: str | None = Field(default=None, max_length=512)
    title_regex: str | None = Field(default=None, max_length=512)
    process_name: str | None = Field(default=None, max_length=260)
    hwnd: int | None = Field(default=None, ge=1)
    visible_only: bool = True
 class WindowActionRequest(WindowQuery):
    action: Literal["focus", "restore", "minimize", "maximize", "close"]
    timeout_ms: int = Field(default=3000, ge=0, le=60000)
 class LaunchRequest(BaseModel):
    executable: str = Field(min_length=1, max_length=2048)
    args: list[str] = Field(default_factory=list, max_length=100)
    cwd: str | None = None
    wait_for_window: bool = False
    match: WindowQuery | None = None
    timeout_ms: int = Field(default=5000, ge=0, le=120000)
    dry_run: bool = False
 class SeeRequest(BaseModel):
    screen: int = 0
    region_x: int | None = Field(default=None, ge=0)
    region_y: int | None = Field(default=None, ge=0)
    region_width: int | None = Field(default=None, gt=0)
    region_height: int | None = Field(default=None, gt=0)
    with_grid: bool = True
    grid_rows: int = Field(default=12, ge=1, le=300)
    grid_cols: int = Field(default=12, ge=1, le=300)
    include_labels: bool = True
    image_format: Literal["png", "jpeg"] = "png"
    jpeg_quality: int = Field(default=85, ge=1, le=100)
 class SeeZoomRequest(BaseModel):
    screen: int = 0
    center_x: int = Field(ge=0)
    center_y: int = Field(ge=0)
    width: int = Field(default=500, ge=10)
    height: int = Field(default=350, ge=10)
    with_grid: bool = True
    grid_rows: int = Field(default=20, ge=1, le=300)
    grid_cols: int = Field(default=20, ge=1, le=300)
    include_labels: bool = True
    image_format: Literal["png", "jpeg"] = "png"
    jpeg_quality: int = Field(default=90, ge=1, le=100)
 class InteractRequest(BaseModel):
    screen: int = 0
    action: ActionRequest
--- a/server/services.py
+++ b/server/services.py
@@ -0,0 +1,462 @@
 import ctypes
 import io
 import os
 import re
 import subprocess
 import sys
 import time
 from typing import Literal
 from fastapi import HTTPException
 from PIL import ImageChops, ImageStat
 from .config import SETTINGS
 from .models import ActionRequest, GridTarget, LaunchRequest, PixelTarget, Target, WindowActionRequest, WindowQuery
 def import_capture_libs():
    try:
        from PIL import Image, ImageDraw
        import mss
        return Image, ImageDraw, mss
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
 def display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
    return {
        "screen": screen,
        "mss_index": mss_index,
        "primary": primary,
        "x": mon["left"],
        "y": mon["top"],
        "width": mon["width"],
        "height": mon["height"],
    }
 def ordered_displays(sct) -> list[dict]:
    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
    if not raw_monitors:
        raise HTTPException(status_code=500, detail="no displays detected")
    primary_pos = next((idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0), 0)
    ordered = [raw_monitors[primary_pos]] + [item for idx, item in enumerate(raw_monitors) if idx != primary_pos]
    return [display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0)) for index, (mss_index, mon) in enumerate(ordered)]
 def get_displays() -> list[dict]:
    _, _, mss = import_capture_libs()
    with mss.mss() as sct:
        return ordered_displays(sct)
 def select_display(screen: int) -> tuple[dict, list[dict], dict]:
    displays = get_displays()
    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
    return selected, displays, {"requested": screen, "selected": selected["screen"], "fallback": selected["screen"] != screen}
 def capture_screen(screen: int = 0):
    Image, _, mss = import_capture_libs()
    with mss.mss() as sct:
        displays = ordered_displays(sct)
        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
        shot = sct.grab({"left": mon["x"], "top": mon["y"], "width": mon["width"], "height": mon["height"]})
        image = Image.frombytes("RGB", shot.size, shot.rgb)
        selection = {"requested": screen, "selected": mon["screen"], "fallback": mon["screen"] != screen}
        return image, mon, displays, selection
 def capture_region_image(screen: int, region_x: int | None, region_y: int | None, region_width: int | None, region_height: int | None):
    base_img, mon, displays, screen_selection = capture_screen(screen)
    if None in {region_x, region_y, region_width, region_height}:
        return base_img, {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}, mon, displays, screen_selection
    left = region_x - mon["x"]
    top = region_y - mon["y"]
    right = left + region_width
    bottom = top + region_height
    if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
        raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
    crop = base_img.crop((left, top, right, bottom))
    return crop, {"x": region_x, "y": region_y, "width": region_width, "height": region_height}, mon, displays, screen_selection
 def serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
    buf = io.BytesIO()
    if image_format == "jpeg":
        image.save(buf, format="JPEG", quality=jpeg_quality)
    else:
        image.save(buf, format="PNG")
    return buf.getvalue()
 def encode_image(image, image_format: str, jpeg_quality: int) -> str:
    import base64
    return base64.b64encode(serialize_image(image, image_format, jpeg_quality)).decode("ascii")
 def draw_grid(image, region_x: int, region_y: int, rows: int, cols: int, include_labels: bool):
    _, ImageDraw, _ = import_capture_libs()
    out = image.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    cell_w = w / cols
    cell_h = h / rows
    for c in range(1, cols):
        x = int(round(c * cell_w))
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for r in range(1, rows):
        y = int(round(r * cell_h))
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
    draw.rectangle([(0, 0), (w - 1, h - 1)], outline=(255, 0, 0), width=2)
    if include_labels:
        for r in range(rows):
            for c in range(cols):
                cx = int((c + 0.5) * cell_w)
                cy = int((r + 0.5) * cell_h)
                draw.text((cx - 12, cy - 6), f"{r},{c}", fill=(255, 255, 0))
    meta = {
        "region": {"x": region_x, "y": region_y, "width": w, "height": h},
        "grid": {
            "rows": rows,
            "cols": cols,
            "cell_width": cell_w,
            "cell_height": cell_h,
            "indexing": "zero-based",
            "point_formula": {
                "pixel_x": "region.x + ((col + 0.5 + dx*0.5) * cell_width)",
                "pixel_y": "region.y + ((row + 0.5 + dy*0.5) * cell_height)",
                "dx_range": "[-1,1]",
                "dy_range": "[-1,1]",
            },
        },
    }
    return out, meta
 def resolve_target(target: Target) -> tuple[int, int, dict]:
    if isinstance(target, PixelTarget):
        x = target.x + target.dx
        y = target.y + target.dy
        return x, y, {"mode": "pixel", "source": target.model_dump()}
    cell_w = target.region_width / target.cols
    cell_h = target.region_height / target.rows
    x = target.region_x + int(round((target.col + 0.5 + (target.dx * 0.5)) * cell_w))
    y = target.region_y + int(round((target.row + 0.5 + (target.dy * 0.5)) * cell_h))
    return x, y, {"mode": "grid", "source": target.model_dump(), "derived": {"cell_width": cell_w, "cell_height": cell_h}}
 def enforce_allowed_region(x: int, y: int):
    region = SETTINGS["allowed_region"]
    if region is None:
        return
    rx, ry, rw, rh = region
    if not (rx <= x < rx + rw and ry <= y < ry + rh):
        raise HTTPException(status_code=403, detail="point outside allowed region")
 def import_input_lib():
    try:
        import pyautogui
        pyautogui.FAILSAFE = True
        return pyautogui
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
 def exec_action(req: ActionRequest, screen: int = 0) -> dict:
    run_dry = SETTINGS["dry_run"] or req.dry_run
    selected_display, _, screen_selection = select_display(screen)
    pyautogui = None if run_dry else import_input_lib()
    resolved_target = None
    if req.target is not None:
        x, y, info = resolve_target(req.target)
        enforce_allowed_region(x, y)
        resolved_target = {"x": x, "y": y, "target_info": info}
    duration_sec = req.duration_ms / 1000.0
    if req.action in {"move", "click", "right_click", "double_click", "middle_click"} and resolved_target is None:
        raise HTTPException(status_code=400, detail="target is required for pointer actions")
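    if req.action == "scroll" and resolved_target is None:
        raise HTTPException(status_code=400, detail="target is required for scroll")
    # Validate hotkey input up front so dry runs reject an empty key list too.
    if req.action == "hotkey" and not req.keys:
        raise HTTPException(status_code=400, detail="keys is required for hotkey")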
    if not run_dry:
        if req.action == "move":
            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
        elif req.action == "click":
            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], clicks=req.clicks, interval=req.interval_ms / 1000.0, button=req.button, duration=duration_sec)
        elif req.action == "right_click":
            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="right", duration=duration_sec)
        elif req.action == "double_click":
            pyautogui.doubleClick(x=resolved_target["x"], y=resolved_target["y"], interval=req.interval_ms / 1000.0)
        elif req.action == "middle_click":
            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="middle", duration=duration_sec)
        elif req.action == "scroll":
            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
            pyautogui.scroll(req.scroll_amount)
        elif req.action == "type":
            pyautogui.write(req.text, interval=req.interval_ms / 1000.0)
        elif req.action == "hotkey":
            if len(req.keys) < 1:
                raise HTTPException(status_code=400, detail="keys is required for hotkey")
            pyautogui.hotkey(*req.keys)
    return {"action": req.action, "executed": not run_dry, "dry_run": run_dry, "screen": screen_selection, "display": selected_display, "resolved_target": resolved_target}
 def windows_only(feature: str):
    if sys.platform != "win32":
        raise HTTPException(status_code=501, detail=f"{feature} is currently supported on Windows hosts only")
 def tasklist_process_name(pid: int) -> str | None:
    try:
        completed = subprocess.run(["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"], capture_output=True, text=True, timeout=5, check=False)
    except Exception:
        return None
    line = (completed.stdout or "").strip().splitlines()
    if not line:
        return None
    row = line[0].strip()
    if not row or row.startswith("INFO:"):
        return None
    if row.startswith('"') and '","' in row:
        return row.split('","', 1)[0].strip('"')
    return None
 def list_windows(query: WindowQuery | None = None) -> list[dict]:
    windows_only("window endpoints")
    query = query or WindowQuery()
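    # A bare `import ctypes` does not load the wintypes submodule used below (RECT, DWORD, HANDLE).
    import ctypes.wintypes
    user32 = ctypes.windll.user32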
    kernel32 = ctypes.windll.kernel32
    psapi = ctypes.windll.psapi
    user32.GetWindowTextLengthW.argtypes = [ctypes.c_void_p]
    user32.GetWindowTextLengthW.restype = ctypes.c_int
    user32.GetWindowTextW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
    user32.GetWindowTextW.restype = ctypes.c_int
    user32.IsWindowVisible.argtypes = [ctypes.c_void_p]
    user32.IsWindowVisible.restype = ctypes.c_bool
    user32.IsWindowEnabled.argtypes = [ctypes.c_void_p]
    user32.IsWindowEnabled.restype = ctypes.c_bool
    user32.IsIconic.argtypes = [ctypes.c_void_p]
    user32.IsIconic.restype = ctypes.c_bool
    user32.IsZoomed.argtypes = [ctypes.c_void_p]
    user32.IsZoomed.restype = ctypes.c_bool
    user32.GetForegroundWindow.restype = ctypes.c_void_p
    user32.GetWindowRect.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.wintypes.RECT)]
    user32.GetWindowRect.restype = ctypes.c_bool
    user32.GetClassNameW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
    user32.GetClassNameW.restype = ctypes.c_int
    kernel32.OpenProcess.argtypes = [ctypes.wintypes.DWORD, ctypes.wintypes.BOOL, ctypes.wintypes.DWORD]
    kernel32.OpenProcess.restype = ctypes.wintypes.HANDLE
    kernel32.CloseHandle.argtypes = [ctypes.wintypes.HANDLE]
    kernel32.CloseHandle.restype = ctypes.wintypes.BOOL
    psapi.GetModuleBaseNameW.argtypes = [ctypes.wintypes.HANDLE, ctypes.wintypes.HMODULE, ctypes.c_wchar_p, ctypes.wintypes.DWORD]
    psapi.GetModuleBaseNameW.restype = ctypes.wintypes.DWORD
    foreground = int(user32.GetForegroundWindow() or 0)
    results: list[dict] = []
    def callback(hwnd, _lparam):
        hwnd_int = int(hwnd)
        if query.hwnd and hwnd_int != query.hwnd:
            return True
        visible = bool(user32.IsWindowVisible(hwnd))
        if query.visible_only and not visible:
            return True
        length = user32.GetWindowTextLengthW(hwnd)
        title_buf = ctypes.create_unicode_buffer(max(1, length + 1))
        user32.GetWindowTextW(hwnd, title_buf, len(title_buf))
        title = title_buf.value or ""
        if query.title_contains and query.title_contains.lower() not in title.lower():
            return True
        if query.title_regex and re.search(query.title_regex, title, flags=re.IGNORECASE) is None:
            return True
        pid = ctypes.wintypes.DWORD(0)
        user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
        process_name = tasklist_process_name(pid.value)
        if query.process_name and (process_name or "").lower() != query.process_name.lower():
            return True
        class_buf = ctypes.create_unicode_buffer(256)
        user32.GetClassNameW(hwnd, class_buf, len(class_buf))
        rect = ctypes.wintypes.RECT()
        user32.GetWindowRect(hwnd, ctypes.byref(rect))
        results.append(
            {
                "hwnd": hwnd_int,
                "title": title,
                "class_name": class_buf.value,
                "pid": int(pid.value),
                "process_name": process_name,
                "visible": visible,
                "enabled": bool(user32.IsWindowEnabled(hwnd)),
                "minimized": bool(user32.IsIconic(hwnd)),
                "maximized": bool(user32.IsZoomed(hwnd)),
                "foreground": hwnd_int == foreground,
                "rect": {"x": int(rect.left), "y": int(rect.top), "width": int(rect.right - rect.left), "height": int(rect.bottom - rect.top)},
            }
        )
        return True
    enum_proc = ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)(callback)
    user32.EnumWindows(enum_proc, 0)
    results.sort(key=lambda item: (not item["foreground"], item["title"].lower(), item["hwnd"]))
    return results
 def _pick_single_window(query: WindowQuery) -> dict:
    matches = list_windows(query)
    if not matches:
        raise HTTPException(status_code=404, detail="no window matched")
    if len(matches) > 1:
        raise HTTPException(status_code=409, detail={"message": "multiple windows matched", "matches": matches[:10]})
    return matches[0]
 def apply_window_action(req: WindowActionRequest) -> dict:
    windows_only("window endpoints")
    match = _pick_single_window(req)
    hwnd = match["hwnd"]
    user32 = ctypes.windll.user32
    SW_RESTORE, SW_MINIMIZE, SW_MAXIMIZE = 9, 6, 3
    WM_CLOSE = 0x0010
    if req.action == "focus":
        user32.ShowWindow(hwnd, SW_RESTORE)
        ok = bool(user32.SetForegroundWindow(hwnd))
        if not ok:
            raise HTTPException(status_code=500, detail="failed to focus window")
    elif req.action == "restore":
        user32.ShowWindow(hwnd, SW_RESTORE)
    elif req.action == "minimize":
        user32.ShowWindow(hwnd, SW_MINIMIZE)
    elif req.action == "maximize":
        user32.ShowWindow(hwnd, SW_MAXIMIZE)
    elif req.action == "close":
        user32.PostMessageW(hwnd, WM_CLOSE, 0, 0)
    deadline = time.time() + (req.timeout_ms / 1000.0)
    final = None
    while time.time() <= deadline:
        current = list_windows(WindowQuery(hwnd=hwnd, visible_only=False))
        if not current:
            if req.action == "close":
                return {"matched": match, "closed": True, "final": None}
            time.sleep(0.05)
            continue
        final = current[0]
        if req.action == "focus" and final.get("foreground"):
            break
        if req.action in {"restore", "minimize", "maximize"}:
            break
        time.sleep(0.05)
    return {"matched": match, "closed": False, "final": final}
 def launch_app(req: LaunchRequest) -> dict:
    if req.cwd and not os.path.isdir(req.cwd):
        raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
    argv = [req.executable, *req.args]
    cwd = req.cwd or None
    if req.dry_run or SETTINGS["dry_run"]:
        return {"executed": False, "dry_run": True, "argv": argv, "cwd": cwd}
    try:
        proc = subprocess.Popen(argv, cwd=cwd)
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"executable not found: {exc}") from exc
    except OSError as exc:
        raise HTTPException(status_code=400, detail=f"failed to launch process: {exc}") from exc
    result = {"executed": True, "dry_run": False, "argv": argv, "cwd": cwd, "pid": proc.pid}
    if req.wait_for_window:
        query = req.match or WindowQuery(process_name=os.path.basename(req.executable), visible_only=True)
        deadline = time.time() + (req.timeout_ms / 1000.0)
        match = None
        while time.time() <= deadline:
            matches = list_windows(query)
            if matches:
                match = matches[0]
                break
            time.sleep(0.2)
        result["window"] = match
        result["window_found"] = match is not None
    return result
 def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
    if len(text) <= limit:
        return text, False
    return text[:limit], True
 def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
    if shell_name == "powershell":
        return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
    if shell_name == "bash":
        return ["bash", "-lc", command]
    if shell_name == "cmd":
        return ["cmd", "/c", command]
    raise HTTPException(status_code=400, detail="unsupported shell")
 def exec_command(req):
    if not SETTINGS["exec_enabled"]:
        raise HTTPException(status_code=403, detail="exec endpoint disabled")
    if not SETTINGS["exec_secret"]:
        raise HTTPException(status_code=403, detail="exec secret not configured")
    shell_name = (req.shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
    if shell_name not in {"powershell", "bash", "cmd"}:
        raise HTTPException(status_code=400, detail="unsupported shell")
    run_dry = SETTINGS["dry_run"] or req.dry_run
    timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
    timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])
    cwd = None
    if req.cwd:
        cwd = os.path.abspath(req.cwd)
        if not os.path.isdir(cwd):
            raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
    argv = _resolve_exec_program(shell_name, req.command)
    if run_dry:
        return {"executed": False, "dry_run": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd}
    start = time.time()
    try:
        completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired as exc:
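        # exc.stdout/exc.stderr are bytes even with text=True; decode them, since str() on bytes would yield "b'...'".
        raw_stdout = exc.stdout.decode(errors="replace") if isinstance(exc.stdout, bytes) else (exc.stdout or "")
        raw_stderr = exc.stderr.decode(errors="replace") if isinstance(exc.stderr, bytes) else (exc.stderr or "")
        stdout, stdout_truncated = _truncate_text(raw_stdout, SETTINGS["exec_max_output_chars"])
        stderr, stderr_truncated = _truncate_text(raw_stderr, SETTINGS["exec_max_output_chars"])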
        return {"executed": True, "timed_out": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": None, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
    except FileNotFoundError as exc:
        raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc
    stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
    stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
    return {"executed": True, "timed_out": False, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": completed.returncode, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
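A caller has to tell the three result shapes of `exec_command` apart: the dry-run dict (which has no `timed_out` or `exit_code` keys), the timeout dict (`exit_code` is `None`), and the completed dict. The sketch below is one way a client might do that. The function name `classify_exec_result` and the returned labels are hypothetical, and it assumes the caller has already unwrapped the dict from any response envelope.

```python
def classify_exec_result(result: dict) -> str:
    """Map an exec result dict to a coarse status label (labels are illustrative)."""
    # Dry-run responses only describe what would have been executed.
    if result.get("dry_run"):
        return "dry_run"
    # On timeout the server reports exit_code=None and any partial output.
    if result.get("timed_out"):
        return "timed_out"
    # A non-zero exit code means the command ran but reported failure.
    if result.get("exit_code") != 0:
        return "failed"
    # Output longer than the configured cap is cut; flag it so the agent can re-run narrower.
    if result.get("stdout_truncated") or result.get("stderr_truncated"):
        return "ok_truncated"
    return "ok"
```

Checking `dry_run` first matters because that response lacks the keys the later branches read.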
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,97 +1,60 @@
 ---
 name: clickthrough-http-control
-description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
+description: "Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell)."
 ---
-# Clickthrough HTTP Control (v2)
+# Clickthrough Computer Control
-Agents do not see live desktop video. They operate on snapshots.
+Use exactly 3 methods:
-Use this loop: **observe -> localize -> act -> verify**.
+- `see`
 - `interact`
 - `exec`
-## Fast defaults
+## Method 1: See
- Start with `POST /v2/observe` on a tight region, not full screen.
+Use `POST /see` to capture the full screen or a region with a grid overlay.
- Set `ocr_mode` to `none` unless text is required immediately.
+Use `POST /see/zoom` to capture a tighter crop with a denser grid.
 - Use `image` tool localization for icon-heavy or dense controls.
 - Use `POST /v2/act-verify` instead of manual sleep/poll loops.
 ## Mandatory image-tool click localization
 When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
 Prompt template:
 - "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
 Rules:
- Ask for one point only.
+- Start with a coarse grid (`12x12`).
- Include bounds in the prompt.
+- For precision, zoom and use a denser grid (`20x20` or higher).
- If answer is not parseable `x,y`, re-ask once with stricter format.
+- Always use returned `meta.region` and `meta.grid` when computing click targets.
- Send returned point to `POST /v2/localize` via `image_tool_point`.
+- Coordinates are global desktop coordinates.
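The rule about `meta.region` and `meta.grid` amounts to mapping a grid cell back to a global desktop point. The following is a minimal sketch of that arithmetic, not the server's actual schema: the field names `x`, `y`, `width`, `height`, `rows`, and `cols` are assumptions, as is the helper name `grid_cell_center`.

```python
def grid_cell_center(region: dict, grid: dict, row: int, col: int) -> tuple[int, int]:
    """Return the global (x, y) of the center of a zero-based grid cell.

    Assumed shapes (hypothetical, check the real response):
      region = {"x": ..., "y": ..., "width": ..., "height": ...}
      grid   = {"rows": ..., "cols": ...}
    """
    if not (0 <= row < grid["rows"] and 0 <= col < grid["cols"]):
        raise ValueError("cell is outside the grid")
    cell_w = region["width"] / grid["cols"]
    cell_h = region["height"] / grid["rows"]
    # Offset by the region origin so the result is in global desktop coordinates.
    x = region["x"] + (col + 0.5) * cell_w
    y = region["y"] + (row + 0.5) * cell_h
    return round(x), round(y)
```

Adding the region origin is the step that is easy to forget after a zoomed capture, where the image's top-left corner is no longer the desktop's `(0, 0)`.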
-## API playbook
+## Method 2: Interact
-1. **Observe**
+Use `POST /interact` for one action at a time.
-```json
+Mouse actions:
-POST /v2/observe?screen=0
+- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`
 {
  "mode": "region",
  "region_x": 820,
  "region_y": 420,
  "region_width": 700,
  "region_height": 420,
  "include_image": true,
  "ocr_mode": "none"
 }
 ```
-2. **Localize** (choose one)
+Keyboard actions:
 - `type`, `hotkey`
-Text:
+Rules:
-```json
+- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
-POST /v2/localize
+- Use `pixel` only when you already have reliable coordinates.
-{"observation_id":"...","text_query":"Save","text_match":"exact"}
+- After each important action, call `see` again before continuing.
 ```
-Image-tool point:
+## Method 3: Exec
 ```json
 POST /v2/localize
 {"observation_id":"...","image_tool_point":{"x":312,"y":188}}
 ```
-3. **Act**
+Use `POST /exec` only for shell/system tasks.
-```json
+Rules:
-POST /v2/act?screen=0
+- Requires `x-clickthrough-exec-secret`.
-{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
+- Do not use exec for normal clicking/typing flows.
-```
+- Prefer GUI interaction first; exec is fallback or explicit shell task.
-4. **Verify**
+## Lightweight Procedure
-```json
+1. `see` capture.
-POST /v2/act-verify?screen=0
+2. If needed, `see/zoom` refine.
-{
+3. `interact` one step.
-  "action":{"action":"click","target":{"resolved_target_id":"..."}},
+4. `see` verify.
-  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
+5. Repeat.
  "risk_level":"low"
 }
 ```
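The see, interact, see-again procedure can be sketched as a small loop. This is an illustration under stated assumptions rather than the project's client: `see`, `interact`, and `choose_action` are hypothetical callables supplied by the caller (for example thin wrappers around `POST /see` and `POST /interact`), and the optional `see/zoom` refinement is left out for brevity.

```python
from typing import Any, Callable, Optional


def run_agent_loop(
    see: Callable[[], Any],
    interact: Callable[[dict], None],
    choose_action: Callable[[Any], Optional[dict]],
    max_steps: int = 10,
) -> int:
    """Capture, act once, and re-capture until the chooser returns None.

    Returns the number of actions that were sent.
    """
    steps = 0
    for _ in range(max_steps):
        # A fresh capture on every iteration doubles as verification of the
        # previous action, so no click is ever based on a stale screenshot.
        snapshot = see()
        action = choose_action(snapshot)
        if action is None:
            break
        # Exactly one action per iteration, matching "interact one step".
        interact(action)
        steps += 1
    return steps
```

Passing the callables in keeps the loop independent of any particular HTTP library, and `max_steps` bounds the run if the chooser never signals completion.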
-## Risk policy
+## Quick Safety Rules
- Low risk (navigation, focus, benign clicks): single verification signal.
+- Never click with stale screenshots.
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
+- Never send multiple uncertain clicks in a row.
- Never do speculative repeat clicks; switch strategy after one failed verify.
+- If localization is ambiguous, re-capture with a tighter zoom.
 ## Anti-latency rules
 - Never repeat full-screen OCR by default.
 - Re-observe only the active pane/region.
 - Prefer keyboard + window APIs for app switching.
 - Use OCR on region only and cap area with `max_ocr_area_px`.
 ## Setup and auth
 - Include `x-clickthrough-token` when token auth is enabled.
 - `/exec` additionally requires `x-clickthrough-exec-secret`.
 - Validate server first: `GET /health`.
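The two header rules can be captured in one small helper. The header names come from the list above; the helper name `build_headers`, its parameters, and the JSON content type are assumptions made for illustration.

```python
from typing import Optional


def build_headers(token: Optional[str] = None, exec_secret: Optional[str] = None) -> dict:
    """Build request headers for a Clickthrough call.

    Pass `token` when token auth is enabled; pass `exec_secret` only for /exec.
    """
    headers = {"Content-Type": "application/json"}  # assumed: request bodies are JSON
    if token:
        headers["x-clickthrough-token"] = token
    if exec_secret:
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers
```

Keeping the exec secret out of the default headers means ordinary `see` and `interact` calls never carry it.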