refactor: simplify to see/interact/exec and split server modules

2026-05-03 20:07:12 +02:00
parent aced5be25e
commit 1c03cab457
8 changed files with 911 additions and 1928 deletions
--- a/README.md
+++ b/README.md
@@ -1,49 +1,37 @@
 # Clickthrough

-Let an agent interact with a computer over HTTP.
+Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer. The agent repeatedly captures structured screenshots with coordinate-aware grids (`see`), executes precise mouse/keyboard actions from those coordinates (`interact`), and optionally runs authenticated shell commands for system-level tasks (`exec`). All three methods share a consistent response contract.

-## Primary mode (v2)
+## Core Methods

-Use the v2 contract for faster, less OCR-heavy control loops:
- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`
+- `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
+- `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
+- `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
+- `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.

-This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.
+## Why this works for AI agents

-## What this provides
+- Agents do not need live vision; they iterate on snapshots.
+- Grid metadata bridges image understanding to deterministic click coordinates.
+- Interaction stays explicit and auditable (one action per request).
+- A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.

- Screen/region capture with optional OCR and timing stats
- Observation IDs for deterministic follow-up localization
- Text localization and image-tool coordinate localization
- Action execution with resolved target IDs
- Risk-aware action+verification defaults
- Unified response envelope across all endpoints
+## Minimal Agent Loop

-## Quick start
+1. Call `see` with a coarse grid.
+2. If uncertain, call `see/zoom` with a denser grid.
+3. Call `interact` once.
+4. Call `see` again to verify state change.
+5. Use `exec` only for explicit shell/system tasks.

-```bash
-cd /root/external-projects/clickthrough
-python3 -m venv .venv
-. .venv/bin/activate
-pip install -r requirements.txt
-CLICKTHROUGH_TOKEN=change-me python -m server.app
-```
+## Safety and Auth

-Server defaults to `127.0.0.1:8123`.
+- `x-clickthrough-token` protects API access when enabled.
+- `x-clickthrough-exec-secret` is required for `/exec`.
+- Optional dry-run and allowed-region constraints reduce accidental risk.

-## Fast control loop
+## Docs

-1. `POST /v2/observe` on a tight region
-2. If OCR is enough, `POST /v2/localize` with `text_query`
-3. If ambiguous, ask image tool for one x,y in observation bounds
-4. `POST /v2/localize` with `image_tool_point`
-5. `POST /v2/act` or `POST /v2/act-verify`
-6. Re-observe only changed region
-
-## See docs
-
- `docs/API.md`
- `skill/SKILL.md`
- `docs/coordinate-system.md`
+- API: `docs/API.md`
+- Agent procedure: `skill/SKILL.md`
+- Coordinate system details: `docs/coordinate-system.md`
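
Reviewer sketch (not part of the commit): the README's claim that a unified envelope (`ok`, `data`, `error`) reduces agent-side branching can be illustrated with a small client-side helper. `ClickthroughError` and `unwrap` are hypothetical names introduced here; the `code`, `message`, and `details` fields are taken from the error example in `docs/API.md`.

```python
from typing import Any


class ClickthroughError(RuntimeError):
    """Hypothetical client-side error type; not defined anywhere in the repo."""

    def __init__(self, code: str, message: str, details: Any = None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details


def unwrap(envelope: dict) -> Any:
    """Return envelope["data"] when ok is truthy, otherwise raise ClickthroughError."""
    if envelope.get("ok"):
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise ClickthroughError(
        code=error.get("code", "unknown_error"),
        message=error.get("message", "no message provided"),
        details=error.get("details"),
    )
```

With a helper like this, every endpoint call can go through one success path and one failure path, for example `payload = unwrap(response.json())`.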
--- a/docs/API.md
+++ b/docs/API.md
@@ -1,116 +1,21 @@
-# API Reference (v2)
+# API Reference

 Base URL: `http://127.0.0.1:8123`

-If `CLICKTHROUGH_TOKEN` is set, include:
+Auth header when enabled:

 ```http
 x-clickthrough-token: <token>
 ```

-## Endpoints
+This API is intended for AI computer control through three methods only:
+- `see`
+- `interact`
+- `exec`

- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`
- `GET /health`
- `GET /displays`
- `GET /windows`
- `POST /windows/action`
- `POST /launch`
- `POST /exec`
+All responses use one envelope.

-No v1 endpoints are supported.
-
-## `POST /v2/observe`
-
-```json
-{
-  "mode": "region",
-  "region_x": 800,
-  "region_y": 420,
-  "region_width": 700,
-  "region_height": 420,
-  "include_image": true,
-  "image_format": "jpeg",
-  "jpeg_quality": 75,
-  "ocr_mode": "region",
-  "language_hint": "eng",
-  "min_confidence": 0.45,
-  "max_ocr_area_px": 1500000,
-  "group_lines": true
-}
-```
-
-Returns observation metadata, optional image, OCR blocks/lines, and timing fields.
-
-## `POST /v2/localize`
-
-Text localization:
-
-```json
-{
-  "observation_id": "...",
-  "text_query": "Save",
-  "text_match": "exact",
-  "candidate_index": 0
-}
-```
-
-Image-tool point localization:
-
-```json
-{
-  "observation_id": "...",
-  "image_tool_point": {"x": 312, "y": 188}
-}
-```
-
-Returns `resolved_target_id`, global pixel, and `localization_confidence`.
-
-## `POST /v2/act`
-
-```json
-{
-  "action": {
-    "action": "click",
-    "target": {"resolved_target_id": "..."},
-    "button": "left",
-    "clicks": 1
-  }
-}
-```
-
-## `POST /v2/act-verify`
-
-```json
-{
-  "action": {
-    "action": "click",
-    "target": {"resolved_target_id": "..."}
-  },
-  "condition": {
-    "kind": "text",
-    "mode": "region",
-    "text": "Saved",
-    "match": "contains",
-    "present": true,
-    "region_x": 820,
-    "region_y": 420,
-    "region_width": 500,
-    "region_height": 140,
-    "min_confidence": 0.4
-  },
-  "risk_level": "low"
-}
-```
-
-Risk defaults:
- `low`: retries `0`, timeout `2500ms`
- `high`: retries `1`, timeout `6000ms`
-
-## Response envelope
+## Response Envelope

 Success:

@@ -133,9 +38,124 @@ Error:
  "time_ms": 1710000000000,
  "data": null,
  "error": {
-    "code": "http_error",
-    "message": "...",
-    "details": {}
+    "code": "validation_error",
+    "message": "request validation failed",
+    "details": []
  }
 }
 ```
+
+## 1) See
+
+### `POST /see`
+Capture a full screen or a region. Optional grid overlay returns coordinate metadata for click mapping.
+
+```json
+{
+  "screen": 0,
+  "region_x": null,
+  "region_y": null,
+  "region_width": null,
+  "region_height": null,
+  "with_grid": true,
+  "grid_rows": 12,
+  "grid_cols": 12,
+  "include_labels": true,
+  "image_format": "png",
+  "jpeg_quality": 85
+}
+```
+
+Returns:
+- `data.image.base64`
+- `data.meta.region` (global desktop coords)
+- `data.meta.grid` (rows/cols/cell size + formula)
+
+### `POST /see/zoom`
+Capture a tighter crop around a global point and draw another grid over that crop.
+
+```json
+{
+  "screen": 0,
+  "center_x": 1200,
+  "center_y": 720,
+  "width": 500,
+  "height": 350,
+  "with_grid": true,
+  "grid_rows": 20,
+  "grid_cols": 20,
+  "include_labels": true,
+  "image_format": "png",
+  "jpeg_quality": 90
+}
+```
+
+Use this for precision before clicking tiny controls.
+
+## 2) Interact
+
+### `POST /interact`
+Mouse/keyboard action execution.
+
+```json
+{
+  "screen": 0,
+  "action": {
+    "action": "click",
+    "target": {
+      "mode": "grid",
+      "region_x": 0,
+      "region_y": 0,
+      "region_width": 1920,
+      "region_height": 1080,
+      "rows": 12,
+      "cols": 12,
+      "row": 7,
+      "col": 3,
+      "dx": 0.0,
+      "dy": 0.0
+    },
+    "button": "left",
+    "clicks": 1
+  }
+}
+```
+
+Supported actions:
+- `move`, `click`, `right_click`, `double_click`, `middle_click`
+- `scroll` (`scroll_amount`)
+- `type` (`text`, `interval_ms`)
+- `hotkey` (`keys`)
+
+Target modes:
+- `pixel`: absolute global `x,y`
+- `grid`: grid cell from a `see`/`see/zoom` response
+
+## 3) Exec
+
+### `POST /exec`
+Run host shell commands (PowerShell/Bash/CMD).
+
+```json
+{
+  "command": "Get-Process | Select-Object -First 5",
+  "shell": "powershell",
+  "timeout_s": 20,
+  "cwd": "C:/Users/Paul",
+  "dry_run": false
+}
+```
+
+Required header:
+
+```http
+x-clickthrough-exec-secret: <secret>
+```
+
+## Minimal Procedure for Agents
+
+1. `see` full screen with coarse grid.
+2. If uncertain, `see/zoom` target area with denser grid.
+3. `interact` one action.
+4. `see` again to confirm state change.
+5. Use `exec` only when GUI interaction is not the right tool.
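
Reviewer sketch (not part of the commit): steps 1, 3, and 4 of the procedure above could look roughly like this on the client side. The `post` callable is an assumption of this sketch: it stands for any function that sends a JSON body to a path and returns the envelope's `data` payload. The request fields mirror the `/see` and `/interact` examples in this file; the optional `see/zoom` step is omitted for brevity.

```python
from typing import Callable

# post(path, body) -> "data" payload of the response envelope.
# Hypothetical signature; in practice a thin wrapper around an HTTP client.
Post = Callable[[str, dict], dict]


def click_cell(post: Post, row: int, col: int, rows: int = 12, cols: int = 12, screen: int = 0) -> dict:
    """Run see -> interact -> see for a single grid-cell click."""
    see_body = {"screen": screen, "with_grid": True, "grid_rows": rows, "grid_cols": cols}

    # Step 1: capture with a coarse grid and read the captured region.
    before = post("/see", see_body)
    region = before["meta"]["region"]  # documented as global desktop coords

    # Step 3: exactly one action, targeting the chosen cell of that grid.
    target = {
        "mode": "grid",
        "region_x": region["x"],
        "region_y": region["y"],
        "region_width": region["width"],
        "region_height": region["height"],
        "rows": rows,
        "cols": cols,
        "row": row,
        "col": col,
        "dx": 0.0,
        "dy": 0.0,
    }
    action = {"action": "click", "target": target, "button": "left", "clicks": 1}
    result = post("/interact", {"screen": screen, "action": action})

    # Step 4: capture again so the caller can compare before and after.
    after = post("/see", see_body)
    return {"before": before, "result": result, "after": after}
```

Passing `post` in as a parameter keeps the loop independent of any particular HTTP library and makes it easy to exercise against a stub.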
--- a/examples/quickstart.py
+++ b/examples/quickstart.py
@@ -15,24 +15,25 @@ if TOKEN:
 def main():
    health = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
    health.raise_for_status()
-    print("health ok:", health.json().get("ok"))
+    print("health:", health.json()["data"])

-    observe = requests.post(
-        f"{BASE_URL}/v2/observe",
+    see = requests.post(
+        f"{BASE_URL}/see",
        headers=headers,
-        params={"screen": SCREEN},
        json={
-            "mode": "screen",
-            "include_image": False,
-            "ocr_mode": "none",
+            "screen": SCREEN,
+            "with_grid": True,
+            "grid_rows": 12,
+            "grid_cols": 12,
+            "image_format": "jpeg",
+            "jpeg_quality": 70,
        },
-        timeout=20,
+        timeout=30,
    )
-    observe.raise_for_status()
-    payload = observe.json()["data"]
-    print("observation_id:", payload["observation_id"])
-    print("region:", payload["region"])
-    print("timing_ms:", payload["timing_ms"])
+    see.raise_for_status()
+    payload = see.json()["data"]
+    print("region:", payload["meta"]["region"])
+    print("grid:", payload["meta"].get("grid", {}))


 if __name__ == "__main__":
--- a/server/app.py
+++ b/server/app.py
--- a/server/config.py
+++ b/server/config.py
@@ -0,0 +1,42 @@
+import os
+from typing import Optional
+
+from dotenv import load_dotenv
+
+
+load_dotenv(dotenv_path=".env", override=False)
+
+
+def _env_bool(name: str, default: bool) -> bool:
+    raw = os.getenv(name)
+    if raw is None:
+        return default
+    return raw.strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _parse_allowed_region() -> Optional[tuple[int, int, int, int]]:
+    raw = os.getenv("CLICKTHROUGH_ALLOWED_REGION")
+    if not raw:
+        return None
+    parts = [p.strip() for p in raw.split(",")]
+    if len(parts) != 4:
+        raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
+    x, y, w, h = (int(p) for p in parts)
+    if w <= 0 or h <= 0:
+        raise ValueError("CLICKTHROUGH_ALLOWED_REGION width/height must be > 0")
+    return x, y, w, h
+
+
+SETTINGS = {
+    "host": os.getenv("CLICKTHROUGH_HOST", "127.0.0.1"),
+    "port": int(os.getenv("CLICKTHROUGH_PORT", "8123")),
+    "token": os.getenv("CLICKTHROUGH_TOKEN", "").strip(),
+    "dry_run": _env_bool("CLICKTHROUGH_DRY_RUN", False),
+    "allowed_region": _parse_allowed_region(),
+    "exec_enabled": _env_bool("CLICKTHROUGH_EXEC_ENABLED", True),
+    "exec_default_shell": os.getenv("CLICKTHROUGH_EXEC_DEFAULT_SHELL", "powershell").strip().lower(),
+    "exec_default_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_TIMEOUT_S", "30")),
+    "exec_max_timeout_s": int(os.getenv("CLICKTHROUGH_EXEC_MAX_TIMEOUT_S", "120")),
+    "exec_max_output_chars": int(os.getenv("CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS", "20000")),
+    "exec_secret": os.getenv("CLICKTHROUGH_EXEC_SECRET", "").strip(),
+}
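
Reviewer sketch (not part of the commit): `CLICKTHROUGH_ALLOWED_REGION` is parsed as `x,y,width,height`, and `enforce_allowed_region` in `server/services.py` treats it as a half-open rectangle. The standalone functions below restate that behaviour for illustration only; `parse_region` and `contains` are hypothetical names, not functions from the repo.

```python
def parse_region(raw: str) -> tuple[int, int, int, int]:
    """Parse "x,y,width,height" the way the config module does."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("expected x,y,width,height")
    x, y, w, h = (int(p) for p in parts)
    if w <= 0 or h <= 0:
        raise ValueError("width/height must be > 0")
    return x, y, w, h


def contains(region: tuple[int, int, int, int], px: int, py: int) -> bool:
    """Half-open containment: the right and bottom edges are excluded."""
    rx, ry, rw, rh = region
    return rx <= px < rx + rw and ry <= py < ry + rh
```

Under this reading, a region of `0,0,1920,1080` accepts the point `(1919, 1079)` but rejects `(1920, 0)`.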
--- a/server/models.py
+++ b/server/models.py
@@ -0,0 +1,124 @@
+from typing import Literal, Optional
+
+from pydantic import BaseModel, Field, model_validator
+
+
+class PixelTarget(BaseModel):
+    mode: Literal["pixel"]
+    x: int
+    y: int
+    dx: int = 0
+    dy: int = 0
+
+
+class GridTarget(BaseModel):
+    mode: Literal["grid"]
+    region_x: int
+    region_y: int
+    region_width: int = Field(gt=0)
+    region_height: int = Field(gt=0)
+    rows: int = Field(gt=0)
+    cols: int = Field(gt=0)
+    row: int = Field(ge=0)
+    col: int = Field(ge=0)
+    dx: float = 0.0
+    dy: float = 0.0
+
+    @model_validator(mode="after")
+    def _validate_indices(self):
+        if self.row >= self.rows or self.col >= self.cols:
+            raise ValueError("row/col must be inside rows/cols")
+        if not -1.0 <= self.dx <= 1.0:
+            raise ValueError("dx must be in [-1, 1]")
+        if not -1.0 <= self.dy <= 1.0:
+            raise ValueError("dy must be in [-1, 1]")
+        return self
+
+
+Target = PixelTarget | GridTarget
+
+
+class ActionRequest(BaseModel):
+    action: Literal[
+        "move",
+        "click",
+        "right_click",
+        "double_click",
+        "middle_click",
+        "scroll",
+        "type",
+        "hotkey",
+    ]
+    target: Optional[Target] = None
+    duration_ms: int = Field(default=0, ge=0, le=20000)
+    button: Literal["left", "right", "middle"] = "left"
+    clicks: int = Field(default=1, ge=1, le=10)
+    scroll_amount: int = 0
+    text: str = ""
+    keys: list[str] = Field(default_factory=list)
+    interval_ms: int = Field(default=20, ge=0, le=5000)
+    dry_run: bool = False
+
+
+class ExecRequest(BaseModel):
+    command: str = Field(min_length=1, max_length=10000)
+    shell: Literal["powershell", "bash", "cmd"] | None = None
+    timeout_s: int | None = Field(default=None, ge=1, le=600)
+    cwd: str | None = None
+    dry_run: bool = False
+
+
+class WindowQuery(BaseModel):
+    title_contains: str | None = Field(default=None, max_length=512)
+    title_regex: str | None = Field(default=None, max_length=512)
+    process_name: str | None = Field(default=None, max_length=260)
+    hwnd: int | None = Field(default=None, ge=1)
+    visible_only: bool = True
+
+
+class WindowActionRequest(WindowQuery):
+    action: Literal["focus", "restore", "minimize", "maximize", "close"]
+    timeout_ms: int = Field(default=3000, ge=0, le=60000)
+
+
+class LaunchRequest(BaseModel):
+    executable: str = Field(min_length=1, max_length=2048)
+    args: list[str] = Field(default_factory=list, max_length=100)
+    cwd: str | None = None
+    wait_for_window: bool = False
+    match: WindowQuery | None = None
+    timeout_ms: int = Field(default=5000, ge=0, le=120000)
+    dry_run: bool = False
+
+
+class SeeRequest(BaseModel):
+    screen: int = 0
+    region_x: int | None = Field(default=None, ge=0)
+    region_y: int | None = Field(default=None, ge=0)
+    region_width: int | None = Field(default=None, gt=0)
+    region_height: int | None = Field(default=None, gt=0)
+    with_grid: bool = True
+    grid_rows: int = Field(default=12, ge=1, le=300)
+    grid_cols: int = Field(default=12, ge=1, le=300)
+    include_labels: bool = True
+    image_format: Literal["png", "jpeg"] = "png"
+    jpeg_quality: int = Field(default=85, ge=1, le=100)
+
+
+class SeeZoomRequest(BaseModel):
+    screen: int = 0
+    center_x: int = Field(ge=0)
+    center_y: int = Field(ge=0)
+    width: int = Field(default=500, ge=10)
+    height: int = Field(default=350, ge=10)
+    with_grid: bool = True
+    grid_rows: int = Field(default=20, ge=1, le=300)
+    grid_cols: int = Field(default=20, ge=1, le=300)
+    include_labels: bool = True
+    image_format: Literal["png", "jpeg"] = "png"
+    jpeg_quality: int = Field(default=90, ge=1, le=100)
+
+
+class InteractRequest(BaseModel):
+    screen: int = 0
+    action: ActionRequest
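
Reviewer sketch (not part of the commit): `GridTarget` carries everything needed to turn a cell into a pixel. The function below reproduces the mapping used by `resolve_target` in `server/services.py` (cell centre plus a `dx`/`dy` offset of up to half a cell), so an agent can predict where a click will land. `grid_to_pixel` is a hypothetical name for this illustration.

```python
def grid_to_pixel(
    region_x: int,
    region_y: int,
    region_width: int,
    region_height: int,
    rows: int,
    cols: int,
    row: int,
    col: int,
    dx: float = 0.0,
    dy: float = 0.0,
) -> tuple[int, int]:
    """Map a zero-based grid cell plus offsets in [-1, 1] to a global pixel."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("row/col must be inside rows/cols")
    if not (-1.0 <= dx <= 1.0 and -1.0 <= dy <= 1.0):
        raise ValueError("dx/dy must be in [-1, 1]")

    cell_w = region_width / cols
    cell_h = region_height / rows
    # 0.5 selects the cell centre; dx/dy shift by up to half a cell either way.
    x = region_x + int(round((col + 0.5 + dx * 0.5) * cell_w))
    y = region_y + int(round((row + 0.5 + dy * 0.5) * cell_h))
    return x, y


# Full-HD screen, 12x12 grid, cell (row=7, col=3): cells are 160x90 pixels.
print(grid_to_pixel(0, 0, 1920, 1080, 12, 12, row=7, col=3))  # (560, 675)
```

Setting `dx=1.0` moves the point to the right edge of the cell (x = 640 in this example), and `dx=-1.0` moves it to the left edge (x = 480).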
--- a/server/services.py
+++ b/server/services.py
@@ -0,0 +1,462 @@
+import ctypes.wintypes  # binds ctypes too; plain "import ctypes" does not load the wintypes submodule
+import io
+import os
+import re
+import subprocess
+import sys
+import time
+from typing import Literal
+
+from fastapi import HTTPException
+from PIL import ImageChops, ImageStat
+
+from .config import SETTINGS
+from .models import ActionRequest, GridTarget, LaunchRequest, PixelTarget, Target, WindowActionRequest, WindowQuery
+
+
+def import_capture_libs():
+    try:
+        from PIL import Image, ImageDraw
+        import mss
+
+        return Image, ImageDraw, mss
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
+
+
+def display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
+    return {
+        "screen": screen,
+        "mss_index": mss_index,
+        "primary": primary,
+        "x": mon["left"],
+        "y": mon["top"],
+        "width": mon["width"],
+        "height": mon["height"],
+    }
+
+
+def ordered_displays(sct) -> list[dict]:
+    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
+    if not raw_monitors:
+        raise HTTPException(status_code=500, detail="no displays detected")
+
+    primary_pos = next((idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0), 0)
+    ordered = [raw_monitors[primary_pos]] + [item for idx, item in enumerate(raw_monitors) if idx != primary_pos]
+    return [display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0)) for index, (mss_index, mon) in enumerate(ordered)]
+
+
+def get_displays() -> list[dict]:
+    _, _, mss = import_capture_libs()
+    with mss.mss() as sct:
+        return ordered_displays(sct)
+
+
+def select_display(screen: int) -> tuple[dict, list[dict], dict]:
+    displays = get_displays()
+    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
+    return selected, displays, {"requested": screen, "selected": selected["screen"], "fallback": selected["screen"] != screen}
+
+
+def capture_screen(screen: int = 0):
+    Image, _, mss = import_capture_libs()
+    with mss.mss() as sct:
+        displays = ordered_displays(sct)
+        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
+        shot = sct.grab({"left": mon["x"], "top": mon["y"], "width": mon["width"], "height": mon["height"]})
+        image = Image.frombytes("RGB", shot.size, shot.rgb)
+        selection = {"requested": screen, "selected": mon["screen"], "fallback": mon["screen"] != screen}
+        return image, mon, displays, selection
+
+
+def capture_region_image(screen: int, region_x: int | None, region_y: int | None, region_width: int | None, region_height: int | None):
+    base_img, mon, displays, screen_selection = capture_screen(screen)
+    if None in {region_x, region_y, region_width, region_height}:
+        return base_img, {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}, mon, displays, screen_selection
+
+    left = region_x - mon["x"]
+    top = region_y - mon["y"]
+    right = left + region_width
+    bottom = top + region_height
+    if left < 0 or top < 0 or right > base_img.size[0] or bottom > base_img.size[1]:
+        raise HTTPException(status_code=400, detail="requested region is outside the captured monitor")
+
+    crop = base_img.crop((left, top, right, bottom))
+    return crop, {"x": region_x, "y": region_y, "width": region_width, "height": region_height}, mon, displays, screen_selection
+
+
+def serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
+    buf = io.BytesIO()
+    if image_format == "jpeg":
+        image.save(buf, format="JPEG", quality=jpeg_quality)
+    else:
+        image.save(buf, format="PNG")
+    return buf.getvalue()
+
+
+def encode_image(image, image_format: str, jpeg_quality: int) -> str:
+    import base64
+
+    return base64.b64encode(serialize_image(image, image_format, jpeg_quality)).decode("ascii")
+
+
+def draw_grid(image, region_x: int, region_y: int, rows: int, cols: int, include_labels: bool):
+    _, ImageDraw, _ = import_capture_libs()
+    out = image.copy()
+    draw = ImageDraw.Draw(out)
+    w, h = out.size
+    cell_w = w / cols
+    cell_h = h / rows
+
+    for c in range(1, cols):
+        x = int(round(c * cell_w))
+        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
+    for r in range(1, rows):
+        y = int(round(r * cell_h))
+        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
+
+    draw.rectangle([(0, 0), (w - 1, h - 1)], outline=(255, 0, 0), width=2)
+    if include_labels:
+        for r in range(rows):
+            for c in range(cols):
+                cx = int((c + 0.5) * cell_w)
+                cy = int((r + 0.5) * cell_h)
+                draw.text((cx - 12, cy - 6), f"{r},{c}", fill=(255, 255, 0))
+
+    meta = {
+        "region": {"x": region_x, "y": region_y, "width": w, "height": h},
+        "grid": {
+            "rows": rows,
+            "cols": cols,
+            "cell_width": cell_w,
+            "cell_height": cell_h,
+            "indexing": "zero-based",
+            "point_formula": {
+                "pixel_x": "region.x + ((col + 0.5 + dx*0.5) * cell_width)",
+                "pixel_y": "region.y + ((row + 0.5 + dy*0.5) * cell_height)",
+                "dx_range": "[-1,1]",
+                "dy_range": "[-1,1]",
+            },
+        },
+    }
+    return out, meta
+
+
+def resolve_target(target: Target) -> tuple[int, int, dict]:
+    if isinstance(target, PixelTarget):
+        x = target.x + target.dx
+        y = target.y + target.dy
+        return x, y, {"mode": "pixel", "source": target.model_dump()}
+
+    cell_w = target.region_width / target.cols
+    cell_h = target.region_height / target.rows
+    x = target.region_x + int(round((target.col + 0.5 + (target.dx * 0.5)) * cell_w))
+    y = target.region_y + int(round((target.row + 0.5 + (target.dy * 0.5)) * cell_h))
+    return x, y, {"mode": "grid", "source": target.model_dump(), "derived": {"cell_width": cell_w, "cell_height": cell_h}}
+
+
+def enforce_allowed_region(x: int, y: int):
+    region = SETTINGS["allowed_region"]
+    if region is None:
+        return
+    rx, ry, rw, rh = region
+    if not (rx <= x < rx + rw and ry <= y < ry + rh):
+        raise HTTPException(status_code=403, detail="point outside allowed region")
+
+
+def import_input_lib():
+    try:
+        import pyautogui
+
+        pyautogui.FAILSAFE = True
+        return pyautogui
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
+
+
+def exec_action(req: ActionRequest, screen: int = 0) -> dict:
+    run_dry = SETTINGS["dry_run"] or req.dry_run
+    selected_display, _, screen_selection = select_display(screen)
+    pyautogui = None if run_dry else import_input_lib()
+    resolved_target = None
+
+    if req.target is not None:
+        x, y, info = resolve_target(req.target)
+        enforce_allowed_region(x, y)
+        resolved_target = {"x": x, "y": y, "target_info": info}
+
+    duration_sec = req.duration_ms / 1000.0
+    if req.action in {"move", "click", "right_click", "double_click", "middle_click"} and resolved_target is None:
+        raise HTTPException(status_code=400, detail="target is required for pointer actions")
+    if req.action == "scroll" and resolved_target is None:
+        raise HTTPException(status_code=400, detail="target is required for scroll")
+
+    if not run_dry:
+        if req.action == "move":
+            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
+        elif req.action == "click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], clicks=req.clicks, interval=req.interval_ms / 1000.0, button=req.button, duration=duration_sec)
+        elif req.action == "right_click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="right", duration=duration_sec)
+        elif req.action == "double_click":
+            pyautogui.doubleClick(x=resolved_target["x"], y=resolved_target["y"], interval=req.interval_ms / 1000.0)
+        elif req.action == "middle_click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="middle", duration=duration_sec)
+        elif req.action == "scroll":
+            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
+            pyautogui.scroll(req.scroll_amount)
+        elif req.action == "type":
+            pyautogui.write(req.text, interval=req.interval_ms / 1000.0)
+        elif req.action == "hotkey":
+            if len(req.keys) < 1:
+                raise HTTPException(status_code=400, detail="keys is required for hotkey")
+            pyautogui.hotkey(*req.keys)
+
+    return {"action": req.action, "executed": not run_dry, "dry_run": run_dry, "screen": screen_selection, "display": selected_display, "resolved_target": resolved_target}
+
+
+def windows_only(feature: str):
+    if sys.platform != "win32":
+        raise HTTPException(status_code=501, detail=f"{feature} is currently supported on Windows hosts only")
+
+
+def tasklist_process_name(pid: int) -> str | None:
+    try:
+        completed = subprocess.run(["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"], capture_output=True, text=True, timeout=5, check=False)
+    except Exception:
+        return None
+    line = (completed.stdout or "").strip().splitlines()
+    if not line:
+        return None
+    row = line[0].strip()
+    if not row or row.startswith("INFO:"):
+        return None
+    if row.startswith('"') and '","' in row:
+        return row.split('","', 1)[0].strip('"')
+    return None
+
+
+def list_windows(query: WindowQuery | None = None) -> list[dict]:
+    windows_only("window endpoints")
+    query = query or WindowQuery()
+
+    user32 = ctypes.windll.user32
+    kernel32 = ctypes.windll.kernel32
+    psapi = ctypes.windll.psapi
+
+    user32.GetWindowTextLengthW.argtypes = [ctypes.c_void_p]
+    user32.GetWindowTextLengthW.restype = ctypes.c_int
+    user32.GetWindowTextW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
+    user32.GetWindowTextW.restype = ctypes.c_int
+    user32.IsWindowVisible.argtypes = [ctypes.c_void_p]
+    user32.IsWindowVisible.restype = ctypes.c_bool
+    user32.IsWindowEnabled.argtypes = [ctypes.c_void_p]
+    user32.IsWindowEnabled.restype = ctypes.c_bool
+    user32.IsIconic.argtypes = [ctypes.c_void_p]
+    user32.IsIconic.restype = ctypes.c_bool
+    user32.IsZoomed.argtypes = [ctypes.c_void_p]
+    user32.IsZoomed.restype = ctypes.c_bool
+    user32.GetForegroundWindow.restype = ctypes.c_void_p
+    user32.GetWindowRect.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.wintypes.RECT)]
+    user32.GetWindowRect.restype = ctypes.c_bool
+    user32.GetClassNameW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
+    user32.GetClassNameW.restype = ctypes.c_int
+
+    kernel32.OpenProcess.argtypes = [ctypes.wintypes.DWORD, ctypes.wintypes.BOOL, ctypes.wintypes.DWORD]
+    kernel32.OpenProcess.restype = ctypes.wintypes.HANDLE
+    kernel32.CloseHandle.argtypes = [ctypes.wintypes.HANDLE]
+    kernel32.CloseHandle.restype = ctypes.wintypes.BOOL
+    psapi.GetModuleBaseNameW.argtypes = [ctypes.wintypes.HANDLE, ctypes.wintypes.HMODULE, ctypes.c_wchar_p, ctypes.wintypes.DWORD]
+    psapi.GetModuleBaseNameW.restype = ctypes.wintypes.DWORD
+
+    foreground = int(user32.GetForegroundWindow() or 0)
+    results: list[dict] = []
+
+    def callback(hwnd, _lparam):
+        hwnd_int = int(hwnd)
+        if query.hwnd and hwnd_int != query.hwnd:
+            return True
+        visible = bool(user32.IsWindowVisible(hwnd))
+        if query.visible_only and not visible:
+            return True
+
+        length = user32.GetWindowTextLengthW(hwnd)
+        title_buf = ctypes.create_unicode_buffer(max(1, length + 1))
+        user32.GetWindowTextW(hwnd, title_buf, len(title_buf))
+        title = title_buf.value or ""
+
+        if query.title_contains and query.title_contains.lower() not in title.lower():
+            return True
+        if query.title_regex and re.search(query.title_regex, title, flags=re.IGNORECASE) is None:
+            return True
+
+        pid = ctypes.wintypes.DWORD(0)
+        user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
+        process_name = tasklist_process_name(pid.value)
+        if query.process_name and (process_name or "").lower() != query.process_name.lower():
+            return True
+
+        class_buf = ctypes.create_unicode_buffer(256)
+        user32.GetClassNameW(hwnd, class_buf, len(class_buf))
+        rect = ctypes.wintypes.RECT()
+        user32.GetWindowRect(hwnd, ctypes.byref(rect))
+
+        results.append(
+            {
+                "hwnd": hwnd_int,
+                "title": title,
+                "class_name": class_buf.value,
+                "pid": int(pid.value),
+                "process_name": process_name,
+                "visible": visible,
+                "enabled": bool(user32.IsWindowEnabled(hwnd)),
+                "minimized": bool(user32.IsIconic(hwnd)),
+                "maximized": bool(user32.IsZoomed(hwnd)),
+                "foreground": hwnd_int == foreground,
+                "rect": {"x": int(rect.left), "y": int(rect.top), "width": int(rect.right - rect.left), "height": int(rect.bottom - rect.top)},
+            }
+        )
+        return True
+
+    enum_proc = ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)(callback)
+    user32.EnumWindows(enum_proc, 0)
+    results.sort(key=lambda item: (not item["foreground"], item["title"].lower(), item["hwnd"]))
+    return results
+
+
+def _pick_single_window(query: WindowQuery) -> dict:
+    matches = list_windows(query)
+    if not matches:
+        raise HTTPException(status_code=404, detail="no window matched")
+    if len(matches) > 1:
+        raise HTTPException(status_code=409, detail={"message": "multiple windows matched", "matches": matches[:10]})
+    return matches[0]
+
+
+def apply_window_action(req: WindowActionRequest) -> dict:
+    windows_only("window endpoints")
+    match = _pick_single_window(req)
+    hwnd = match["hwnd"]
+    user32 = ctypes.windll.user32
+
+    SW_RESTORE, SW_MINIMIZE, SW_MAXIMIZE = 9, 6, 3
+    WM_CLOSE = 0x0010
+
+    if req.action == "focus":
+        user32.ShowWindow(hwnd, SW_RESTORE)
+        ok = bool(user32.SetForegroundWindow(hwnd))
+        if not ok:
+            raise HTTPException(status_code=500, detail="failed to focus window")
+    elif req.action == "restore":
+        user32.ShowWindow(hwnd, SW_RESTORE)
+    elif req.action == "minimize":
+        user32.ShowWindow(hwnd, SW_MINIMIZE)
+    elif req.action == "maximize":
+        user32.ShowWindow(hwnd, SW_MAXIMIZE)
+    elif req.action == "close":
+        user32.PostMessageW(hwnd, WM_CLOSE, 0, 0)
+
+    deadline = time.time() + (req.timeout_ms / 1000.0)
+    final = None
+    while time.time() <= deadline:
+        current = list_windows(WindowQuery(hwnd=hwnd, visible_only=False))
+        if not current:
+            if req.action == "close":
+                return {"matched": match, "closed": True, "final": None}
+            time.sleep(0.05)
+            continue
+        final = current[0]
+        if req.action == "focus" and final.get("foreground"):
+            break
+        if req.action in {"restore", "minimize", "maximize"}:
+            break
+        time.sleep(0.05)
+
+    return {"matched": match, "closed": False, "final": final}
+
+
+def launch_app(req: LaunchRequest) -> dict:
+    if req.cwd and not os.path.isdir(req.cwd):
+        raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
+    argv = [req.executable, *req.args]
+    cwd = req.cwd or None
+
+    if req.dry_run or SETTINGS["dry_run"]:
+        return {"executed": False, "dry_run": True, "argv": argv, "cwd": cwd}
+
+    try:
+        proc = subprocess.Popen(argv, cwd=cwd)
+    except FileNotFoundError as exc:
+        raise HTTPException(status_code=400, detail=f"executable not found: {exc}") from exc
+    except OSError as exc:
+        raise HTTPException(status_code=400, detail=f"failed to launch process: {exc}") from exc
+
+    result = {"executed": True, "dry_run": False, "argv": argv, "cwd": cwd, "pid": proc.pid}
+    if req.wait_for_window:
+        query = req.match or WindowQuery(process_name=os.path.basename(req.executable), visible_only=True)
+        deadline = time.time() + (req.timeout_ms / 1000.0)
+        match = None
+        while time.time() <= deadline:
+            matches = list_windows(query)
+            if matches:
+                match = matches[0]
+                break
+            time.sleep(0.2)
+        result["window"] = match
+        result["window_found"] = match is not None
+    return result
+
+
+def _truncate_text(text: str, limit: int) -> tuple[str, bool]:
+    if len(text) <= limit:
+        return text, False
+    return text[:limit], True
+
+
+def _resolve_exec_program(shell_name: str, command: str) -> list[str]:
+    if shell_name == "powershell":
+        return ["powershell", "-NoProfile", "-NonInteractive", "-ExecutionPolicy", "Bypass", "-Command", command]
+    if shell_name == "bash":
+        return ["bash", "-lc", command]
+    if shell_name == "cmd":
+        return ["cmd", "/c", command]
+    raise HTTPException(status_code=400, detail="unsupported shell")
+
+
+def exec_command(req) -> dict:
+    if not SETTINGS["exec_enabled"]:
+        raise HTTPException(status_code=403, detail="exec endpoint disabled")
+    if not SETTINGS["exec_secret"]:
+        raise HTTPException(status_code=403, detail="exec secret not configured")
+
+    shell_name = (req.shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
+    if shell_name not in {"powershell", "bash", "cmd"}:
+        raise HTTPException(status_code=400, detail="unsupported shell")
+
+    run_dry = SETTINGS["dry_run"] or req.dry_run
+    timeout_s = req.timeout_s if req.timeout_s is not None else SETTINGS["exec_default_timeout_s"]
+    timeout_s = min(timeout_s, SETTINGS["exec_max_timeout_s"])
+
+    cwd = None
+    if req.cwd:
+        cwd = os.path.abspath(req.cwd)
+        if not os.path.isdir(cwd):
+            raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
+
+    argv = _resolve_exec_program(shell_name, req.command)
+    if run_dry:
+        return {"executed": False, "dry_run": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd}
+
+    start = time.time()
+    try:
+        completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s, check=False)
+    except subprocess.TimeoutExpired as exc:
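+        stdout, stdout_truncated = _truncate_text(exc.stdout.decode(errors="replace") if isinstance(exc.stdout, bytes) else (exc.stdout or ""), SETTINGS["exec_max_output_chars"])  # partial output can arrive as bytes on POSIX even with text=True
+        stderr, stderr_truncated = _truncate_text(exc.stderr.decode(errors="replace") if isinstance(exc.stderr, bytes) else (exc.stderr or ""), SETTINGS["exec_max_output_chars"])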
+        return {"executed": True, "timed_out": True, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": None, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
+    except FileNotFoundError as exc:
+        raise HTTPException(status_code=400, detail=f"shell executable not found: {exc}") from exc
+
+    stdout, stdout_truncated = _truncate_text(completed.stdout or "", SETTINGS["exec_max_output_chars"])
+    stderr, stderr_truncated = _truncate_text(completed.stderr or "", SETTINGS["exec_max_output_chars"])
+    return {"executed": True, "timed_out": False, "shell": shell_name, "command": req.command, "argv": argv, "timeout_s": timeout_s, "cwd": cwd, "duration_ms": int((time.time() - start) * 1000), "exit_code": completed.returncode, "stdout": stdout, "stderr": stderr, "stdout_truncated": stdout_truncated, "stderr_truncated": stderr_truncated}
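
The dicts returned by `exec_command` separate three outcomes: a dry run, a timeout (`exit_code` is `None`), and a completed run with an exit code and truncation flags. A caller might classify them roughly as follows. This is a minimal sketch: `summarize_exec_result` is a hypothetical client-side helper, not part of this commit, and it assumes it receives the dict itself (whether that dict sits under `data` in the response envelope is not shown here).

```python
def summarize_exec_result(result: dict) -> str:
    """Hypothetical helper: reduce an exec result dict to a short status label."""
    # Dry runs never execute, so there is no exit code to inspect.
    if result.get("dry_run"):
        return "dry_run"
    # On timeout the server reports exit_code=None and whatever partial output it captured.
    if result.get("timed_out"):
        return "timed_out"
    status = "ok" if result.get("exit_code") == 0 else "failed"
    # Truncated output means stdout/stderr were cut off at the server's output limit.
    if result.get("stdout_truncated") or result.get("stderr_truncated"):
        status += "+truncated"
    return status
```

Checking `timed_out` before `exit_code` matters because a timed-out result carries `exit_code: None`, which would otherwise read as a plain failure.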
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -1,97 +1,60 @@
 ---
 name: clickthrough-http-control
-description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.
+description: "Use 3 methods to control a computer: see (screenshot+grid), interact (mouse/keyboard), and exec (shell)."
 ---

-# Clickthrough HTTP Control (v2)
+# Clickthrough Computer Control

-Agents do not see live desktop video. They operate on snapshots.
-Use this loop: **observe -> localize -> act -> verify**.
+Use exactly 3 methods:
+- `see`
+- `interact`
+- `exec`

-## Fast defaults
+## Method 1: See

- Start with `POST /v2/observe` on a tight region, not full screen.
- Set `ocr_mode` to `none` unless text is required immediately.
- Use `image` tool localization for icon-heavy or dense controls.
- Use `POST /v2/act-verify` instead of manual sleep/poll loops.
-
-## Mandatory image-tool click localization
-
-When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
-
-Prompt template:
- "Return one click point as JSON `{\"x\":<int>,\"y\":<int>}` inside this image (`width=W`, `height=H`) for the **<exact target>** control."
+Use `POST /see` to capture full screen or a region with a grid overlay.
+Use `POST /see/zoom` to capture a tighter crop with a denser grid.

 Rules:
- Ask for one point only.
- Include bounds in the prompt.
- If answer is not parseable `x,y`, re-ask once with stricter format.
- Send returned point to `POST /v2/localize` via `image_tool_point`.
+- Start with a coarse grid (`12x12`).
+- For precision, zoom in and use a denser grid (`20x20` or higher).
+- Always use returned `meta.region` and `meta.grid` when computing click targets.
+- Coordinates are global desktop coordinates.

-## API playbook
+## Method 2: Interact

-1. **Observe**
+Use `POST /interact` for one action at a time.

-```json
-POST /v2/observe?screen=0
-{
-  "mode": "region",
-  "region_x": 820,
-  "region_y": 420,
-  "region_width": 700,
-  "region_height": 420,
-  "include_image": true,
-  "ocr_mode": "none"
-}
-```
+Mouse actions:
+- `move`, `click`, `right_click`, `double_click`, `middle_click`, `scroll`

-2. **Localize** (choose one)
+Keyboard actions:
+- `type`, `hotkey`

-Text:
-```json
-POST /v2/localize
-{"observation_id":"...","text_query":"Save","text_match":"exact"}
-```
+Rules:
+- Prefer `grid` targets derived from fresh `see`/`see/zoom` captures.
+- Use `pixel` only when you already have reliable coordinates.
+- After each important action, call `see` again before continuing.

-Image-tool point:
-```json
-POST /v2/localize
-{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
-```
+## Method 3: Exec

-3. **Act**
+Use `POST /exec` only for shell/system tasks.

-```json
-POST /v2/act?screen=0
-{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
-```
+Rules:
+- Requires `x-clickthrough-exec-secret`.
+- Do not use exec for normal clicking/typing flows.
+- Prefer GUI interaction first; use exec as a fallback or for explicit shell tasks.

-4. **Verify**
+## Lightweight Procedure

-```json
-POST /v2/act-verify?screen=0
-{
-  "action":{"action":"click","target":{"resolved_target_id":"..."}},
-  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
-  "risk_level":"low"
-}
-```
+1. `see` capture.
+2. If needed, `see/zoom` refine.
+3. `interact` one step.
+4. `see` verify.
+5. Repeat.

-## Risk policy
+## Quick Safety Rules

- Low risk (navigation, focus, benign clicks): single verification signal.
- High risk (delete/send/purchase/close-lossy): use `risk_level=high` and require two checks before act.
- Never do speculative repeat clicks; switch strategy after one failed verify.
-
-## Anti-latency rules
-
- Never repeat full-screen OCR by default.
- Re-observe only the active pane/region.
- Prefer keyboard + window APIs for app switching.
- Use OCR on region only and cap area with `max_ocr_area_px`.
-
-## Setup and auth
-
- Include `x-clickthrough-token` when token auth is enabled.
- `/exec` additionally requires `x-clickthrough-exec-secret`.
- Validate server first: `GET /health`.
+- Never click based on a stale screenshot.
+- Never send multiple uncertain clicks in a row.
+- If localization is ambiguous, re-capture with a tighter zoom.
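
The See rules in the new SKILL.md ask the agent to derive click targets from `meta.region` and `meta.grid`. One way that arithmetic could look is sketched below. The field layout is an assumption: the region is taken to expose `x`, `y`, `width`, `height` in global desktop coordinates, the grid to expose row and column counts, and cells to be addressed by zero-based `(row, col)`. `Region` and `cell_center` are hypothetical names, not the server's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Assumed shape of meta.region: top-left corner plus size, in global desktop pixels."""
    x: int
    y: int
    width: int
    height: int


def cell_center(region: Region, rows: int, cols: int, row: int, col: int) -> tuple[int, int]:
    """Return the global pixel at the center of a zero-based grid cell."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("cell index outside the grid")
    cell_w = region.width / cols
    cell_h = region.height / rows
    # Offset by half a cell so the click lands in the middle, not on a grid line.
    cx = region.x + (col + 0.5) * cell_w
    cy = region.y + (row + 0.5) * cell_h
    return int(cx), int(cy)
```

Because the result already includes the region offset, the same function covers both full-screen captures and zoomed crops, as long as the region comes from the same capture as the grid.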