diff --git a/.gitea/workflows/python-syntax.yml b/.gitea/workflows/python-syntax.yml
new file mode 100644
index 0000000..8d1df51
--- /dev/null
+++ b/.gitea/workflows/python-syntax.yml
@@ -0,0 +1,29 @@
+name: python-syntax
+
+on:
+  push:
+  pull_request:
+
+jobs:
+  syntax-check:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Syntax check (py_compile)
+        run: |
+          files=$(git ls-files '*.py')
+          if [ -z "$files" ]; then
+            echo "No Python files found"
+            exit 0
+          fi
+          python -m py_compile $files
+
+      - name: Compile all (sanity)
+        run: python -m compileall -q server examples
diff --git a/.gitignore b/.gitignore
index 065ac65..c5cdd1a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,10 @@
-__pycache__
+__pycache__/
+*.pyc
 .env
-venv
\ No newline at end of file
+venv/
+.venv/
+*.log
+*.png
+*.jpg
+*.jpeg
+*.webp
\ No newline at end of file
diff --git a/README.md b/README.md
index 5a2c84e..dbf6627 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,53 @@
 # Clickthrough
-Let an Agent interact with your Computer.
\ No newline at end of file
+
+Let an Agent interact with your computer over HTTP, with grid-aware screenshots and precise input actions.
+
+## What this provides
+
+- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells
+- **Zoom endpoint**: crop around a point with denser grid for fine targeting
+- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
+- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
+- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
+
+## Quick start
+
+```bash
+cd clickthrough
+python3 -m venv .venv
+. .venv/bin/activate
+pip install -r requirements.txt
+CLICKTHROUGH_TOKEN=change-me python -m server.app
+```
+
+Server defaults to `127.0.0.1:8123`.
+
+## Minimal API flow
+
+1. `GET /screen` with grid
+2. Decide cell / target
+3. Optional `POST /zoom` for finer targeting
+4. `POST /action` to execute
+5. `GET /screen` again to verify result
+
+See:
+- `docs/API.md`
+- `docs/coordinate-system.md`
+- `skill/SKILL.md`
+
+## Configuration
+
+Environment variables:
+
+- `CLICKTHROUGH_HOST` (default `127.0.0.1`)
+- `CLICKTHROUGH_PORT` (default `8123`)
+- `CLICKTHROUGH_TOKEN` (optional; if set, every request must send a matching `x-clickthrough-token` header)
+- `CLICKTHROUGH_DRY_RUN` (`true`/`false`; default `false`)
+- `CLICKTHROUGH_GRID_ROWS` (default `12`)
+- `CLICKTHROUGH_GRID_COLS` (default `12`)
+- `CLICKTHROUGH_ALLOWED_REGION` (optional `x,y,width,height`)
+
+## Gitea CI
+
+A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.
+It runs Python syntax checks (`py_compile`) on every push and pull request.
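A minimal client-side sketch of steps 2 to 4 of the API flow in the README, assuming the `/screen` and `/zoom` response shape defined in `server/app.py` (`meta.region` plus `meta.grid.rows`/`meta.grid.cols`). The helper `grid_click_payload` is hypothetical and not part of this change; it only builds the JSON body for `POST /action`, so it can be checked without a running server or a desktop session.

```python
import json


def grid_click_payload(meta: dict, row: int, col: int, dx: float = 0.0, dy: float = 0.0) -> dict:
    """Build a POST /action body that left-clicks one grid cell.

    `meta` is the "meta" object of a /screen or /zoom response captured with
    a grid; it must contain "region" and "grid" (hypothetical helper).
    """
    region = meta["region"]
    grid = meta["grid"]
    rows, cols = grid["rows"], grid["cols"]
    # Mirror the server-side validation so bad targets fail before any request.
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("row/col outside the grid")
    if not (-1.0 <= dx <= 1.0 and -1.0 <= dy <= 1.0):
        raise ValueError("dx/dy must be in [-1, 1]")
    return {
        "action": "click",
        "target": {
            "mode": "grid",
            # Copy the region from the same response the agent looked at.
            "region_x": region["x"],
            "region_y": region["y"],
            "region_width": region["width"],
            "region_height": region["height"],
            "rows": rows,
            "cols": cols,
            "row": row,
            "col": col,
            "dx": dx,
            "dy": dy,
        },
        "clicks": 1,
        "button": "left",
    }


if __name__ == "__main__":
    # Example "meta" for a 12x12 grid over a 1920x1080 primary monitor.
    meta = {
        "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
        "grid": {"rows": 12, "cols": 12, "cell_width": 160.0, "cell_height": 90.0},
    }
    print(json.dumps(grid_click_payload(meta, row=7, col=3, dx=0.2, dy=-0.1), indent=2))
```

To execute the click, the dict could be sent with `requests.post(f"{BASE_URL}/action", json=payload, headers=headers, timeout=10)`, reusing `BASE_URL` and `headers` as set up in `examples/quickstart.py`. Taking the region from the same response that produced the grid keeps the server's cell math identical to what the agent saw, which matters most for `/zoom` crops whose region is not at the screen origin.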
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..37240d6
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,21 @@
+# TODO
+
+## Project: Clickthrough v1
+
+## Current Status
+- [x] Draft implementation plan approved
+- [x] Build FastAPI server with screenshot + grid + zoom + actions
+- [x] Add auth + safety guardrails (token, dry-run, bounds)
+- [x] Add AgentSkill docs for operating the API reliably
+- [x] Add Gitea CI workflow for Python syntax safety
+- [x] Add usage docs + quickstart
+- [x] Run local syntax validation
+
+## Notes
+- API responses now include request IDs, timestamps, and coordinate metadata
+- Local syntax checks passed (`py_compile`, `compileall`)
+- CI workflow runs syntax checks on push + PR
+
+## Next
+- Manual runtime test on a desktop session (capture + click loop)
+- Optional: add monitor selection and OCR helper endpoint
diff --git a/docs/API.md b/docs/API.md
new file mode 100644
index 0000000..659eeaf
--- /dev/null
+++ b/docs/API.md
@@ -0,0 +1,153 @@
+# API Reference (v0.1)
+
+Base URL: `http://127.0.0.1:8123`
+
+If `CLICKTHROUGH_TOKEN` is set, include header:
+
+```http
+x-clickthrough-token: <token>
+```
+
+## `GET /health`
+
+Returns status and runtime safety flags.
+
+## `GET /screen`
+
+Query params:
+
+- `with_grid` (bool, default `true`)
+- `grid_rows` (int, default env or `12`)
+- `grid_cols` (int, default env or `12`)
+- `include_labels` (bool, default `true`)
+- `image_format` (`png`|`jpeg`, default `png`)
+- `jpeg_quality` (1-100, default `85`)
+
+Response includes base64 image and metadata (`meta.region`, optional `meta.grid`).
+
+## `POST /zoom`
+
+Body:
+
+```json
+{
+  "center_x": 1200,
+  "center_y": 700,
+  "width": 500,
+  "height": 350,
+  "with_grid": true,
+  "grid_rows": 20,
+  "grid_cols": 20,
+  "include_labels": true,
+  "image_format": "png",
+  "jpeg_quality": 90
+}
+```
+
+Returns cropped image + region metadata in global pixel coordinates.
+
+## `POST /action`
+
+Body: one action.
+
+### Pointer target modes
+
+#### Pixel target
+
+```json
+{
+  "mode": "pixel",
+  "x": 100,
+  "y": 200,
+  "dx": 0,
+  "dy": 0
+}
+```
+
+#### Grid target
+
+```json
+{
+  "mode": "grid",
+  "region_x": 0,
+  "region_y": 0,
+  "region_width": 1920,
+  "region_height": 1080,
+  "rows": 12,
+  "cols": 12,
+  "row": 5,
+  "col": 9,
+  "dx": 0.0,
+  "dy": 0.0
+}
+```
+
+`dx`/`dy` are normalized offsets in `[-1, 1]` inside the selected cell.
+
+### Action examples
+
+Click:
+
+```json
+{
+  "action": "click",
+  "target": {
+    "mode": "grid",
+    "region_x": 0,
+    "region_y": 0,
+    "region_width": 1920,
+    "region_height": 1080,
+    "rows": 12,
+    "cols": 12,
+    "row": 7,
+    "col": 3,
+    "dx": 0.2,
+    "dy": -0.1
+  },
+  "clicks": 1,
+  "button": "left"
+}
+```
+
+Scroll:
+
+```json
+{
+  "action": "scroll",
+  "target": {"mode": "pixel", "x": 1300, "y": 740},
+  "scroll_amount": -500
+}
+```
+
+Type text:
+
+```json
+{
+  "action": "type",
+  "text": "hello world",
+  "interval_ms": 20
+}
+```
+
+Hotkey:
+
+```json
+{
+  "action": "hotkey",
+  "keys": ["ctrl", "l"]
+}
+```
+
+## `POST /batch`
+
+Runs multiple `action` payloads sequentially.
+
+```json
+{
+  "actions": [
+    {"action": "move", "target": {"mode": "pixel", "x": 100, "y": 100}},
+    {"action": "click", "target": {"mode": "pixel", "x": 100, "y": 100}}
+  ],
+  "stop_on_error": true
+}
+```
diff --git a/docs/coordinate-system.md b/docs/coordinate-system.md
new file mode 100644
index 0000000..07d5f34
--- /dev/null
+++ b/docs/coordinate-system.md
@@ -0,0 +1,43 @@
+# Coordinate System
+
+All interactions ultimately execute in **global pixel coordinates** of the primary monitor.
+
+## Regions
+
+Visual endpoints return a `region` object:
+
+```json
+{"x": 0, "y": 0, "width": 1920, "height": 1080}
+```
+
+This describes where the image sits in global desktop space.
+
+## Grid indexing
+
+- Rows/cols are **zero-based**
+- Cell `(row=0, col=0)` is top-left
+- Each cell has:
+  - `cell_width = region.width / cols`
+  - `cell_height = region.height / rows`
+
+## Point formula
+
+Given `(row, col, dx, dy)` where `dx,dy ∈ [-1,1]`:
+
+- `x = region.x + ((col + 0.5 + dx*0.5) * cell_width)`
+- `y = region.y + ((row + 0.5 + dy*0.5) * cell_height)`
+
+Interpretation:
+- `dx = -1` -> left edge of cell
+- `dx = 0` -> center
+- `dx = 1` -> right edge
+- same concept for `dy`
+
+## Recommended agent loop
+
+1. Capture `/screen` with coarse grid
+2. Find candidate cell
+3. If uncertain, use `/zoom` around candidate
+4. Convert target to grid action
+5. Execute `/action`
+6. Re-capture and verify
diff --git a/examples/quickstart.py b/examples/quickstart.py
new file mode 100644
index 0000000..876d9d1
--- /dev/null
+++ b/examples/quickstart.py
@@ -0,0 +1,31 @@
+import os
+
+import requests
+
+
+BASE_URL = os.getenv("CLICKTHROUGH_URL", "http://127.0.0.1:8123")
+TOKEN = os.getenv("CLICKTHROUGH_TOKEN", "")
+
+headers = {}
+if TOKEN:
+    headers["x-clickthrough-token"] = TOKEN
+
+
+def main():
+    r = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
+    r.raise_for_status()
+    print("health:", r.json())
+
+    s = requests.get(
+        f"{BASE_URL}/screen",
+        headers=headers,
+        params={"with_grid": True, "grid_rows": 12, "grid_cols": 12},
+        timeout=30,
+    )
+    s.raise_for_status()
+    payload = s.json()
+    print("screen meta:", payload.get("meta", {}))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..6e34dd0
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,8 @@
+[project]
+name = "clickthrough"
+version = "0.1.0"
+description = "HTTP computer control bridge for agents"
+requires-python = ">=3.11"
+
+[tool.pytest.ini_options]
+pythonpath = ["."]
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..2ba21ce
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,5 @@
+fastapi>=0.115.0
+uvicorn>=0.30.0
+mss>=9.0.1
+pillow>=10.4.0
+pyautogui>=0.9.54
diff --git a/server/__init__.py b/server/__init__.py
new file mode 100644
index 0000000..a32e63c
--- /dev/null
+++ b/server/__init__.py
@@ -0,0 +1 @@
+__all__ = ["app"]
diff --git a/server/app.py b/server/app.py
new file mode 100644
index 0000000..f2fd12f
--- /dev/null
+++ b/server/app.py
@@ -0,0 +1,457 @@
+import base64
+import io
+import os
+import time
+import uuid
+from typing import Literal, Optional
+
+from fastapi import Depends, FastAPI, Header, HTTPException
+from pydantic import BaseModel, Field, model_validator
+
+
+app = FastAPI(title="clickthrough", version="0.1.0")
+
+
+def _env_bool(name: str, default: bool) -> bool:
+    raw = os.getenv(name)
+    if raw is None:
+        return default
+    return raw.strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _parse_allowed_region() -> Optional[tuple[int, int, int, int]]:
+    raw = os.getenv("CLICKTHROUGH_ALLOWED_REGION")
+    if not raw:
+        return None
+    parts = [p.strip() for p in raw.split(",")]
+    if len(parts) != 4:
+        raise ValueError("CLICKTHROUGH_ALLOWED_REGION must be x,y,width,height")
+    x, y, w, h = (int(p) for p in parts)
+    if w <= 0 or h <= 0:
+        raise ValueError("CLICKTHROUGH_ALLOWED_REGION width/height must be > 0")
+    return x, y, w, h
+
+
+SETTINGS = {
+    "host": os.getenv("CLICKTHROUGH_HOST", "127.0.0.1"),
+    "port": int(os.getenv("CLICKTHROUGH_PORT", "8123")),
+    "token": os.getenv("CLICKTHROUGH_TOKEN", "").strip(),
+    "dry_run": _env_bool("CLICKTHROUGH_DRY_RUN", False),
+    "default_grid_rows": int(os.getenv("CLICKTHROUGH_GRID_ROWS", "12")),
+    "default_grid_cols": int(os.getenv("CLICKTHROUGH_GRID_COLS", "12")),
+    "allowed_region": _parse_allowed_region(),
+}
+
+
+class ScreenRequest(BaseModel):
+    with_grid: bool = True
+    grid_rows: int = Field(default=SETTINGS["default_grid_rows"], ge=1, le=200)
+    grid_cols: int = Field(default=SETTINGS["default_grid_cols"], ge=1, le=200)
+    include_labels: bool = True
+    image_format: Literal["png", "jpeg"] = "png"
+    jpeg_quality: int = Field(default=85, ge=1, le=100)
+
+
+class ZoomRequest(BaseModel):
+    center_x: int = Field(ge=0)
+    center_y: int = Field(ge=0)
+    width: int = Field(default=500, ge=10)
+    height: int = Field(default=350, ge=10)
+    with_grid: bool = True
+    grid_rows: int = Field(default=20, ge=1, le=300)
+    grid_cols: int = Field(default=20, ge=1, le=300)
+    include_labels: bool = True
+    image_format: Literal["png", "jpeg"] = "png"
+    jpeg_quality: int = Field(default=90, ge=1, le=100)
+
+
+class PixelTarget(BaseModel):
+    mode: Literal["pixel"]
+    x: int
+    y: int
+    dx: int = 0
+    dy: int = 0
+
+
+class GridTarget(BaseModel):
+    mode: Literal["grid"]
+    region_x: int
+    region_y: int
+    region_width: int = Field(gt=0)
+    region_height: int = Field(gt=0)
+    rows: int = Field(gt=0)
+    cols: int = Field(gt=0)
+    row: int = Field(ge=0)
+    col: int = Field(ge=0)
+    dx: float = 0.0
+    dy: float = 0.0
+
+    @model_validator(mode="after")
+    def _validate_indices(self):
+        if self.row >= self.rows or self.col >= self.cols:
+            raise ValueError("row/col must be inside rows/cols")
+        if not -1.0 <= self.dx <= 1.0:
+            raise ValueError("dx must be in [-1, 1]")
+        if not -1.0 <= self.dy <= 1.0:
+            raise ValueError("dy must be in [-1, 1]")
+        return self
+
+
+Target = PixelTarget | GridTarget
+
+
+class ActionRequest(BaseModel):
+    action: Literal[
+        "move",
+        "click",
+        "right_click",
+        "double_click",
+        "middle_click",
+        "scroll",
+        "type",
+        "hotkey",
+    ]
+    target: Optional[Target] = None
+    duration_ms: int = Field(default=0, ge=0, le=20000)
+    button: Literal["left", "right", "middle"] = "left"
+    clicks: int = Field(default=1, ge=1, le=10)
+    scroll_amount: int = 0
+    text: str = ""
+    keys: list[str] = Field(default_factory=list)
+    interval_ms: int = Field(default=20, ge=0, le=5000)
+    dry_run: bool = False
+
+
+class BatchRequest(BaseModel):
+    actions: list[ActionRequest] = Field(min_length=1, max_length=100)
+    stop_on_error: bool = True
+
+
+def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
+    token = SETTINGS["token"]
+    if token and x_clickthrough_token != token:
+        raise HTTPException(status_code=401, detail="invalid token")
+
+
+def _now_ms() -> int:
+    return int(time.time() * 1000)
+
+
+def _request_id() -> str:
+    return str(uuid.uuid4())
+
+
+def _import_capture_libs():
+    try:
+        from PIL import Image, ImageDraw
+        import mss
+
+        return Image, ImageDraw, mss
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
+
+
+def _capture_screen():
+    Image, _, mss = _import_capture_libs()
+    with mss.mss() as sct:
+        mon = sct.monitors[1]
+        shot = sct.grab(mon)
+        image = Image.frombytes("RGB", shot.size, shot.rgb)
+        return image, {"x": mon["left"], "y": mon["top"], "width": mon["width"], "height": mon["height"]}
+
+
+def _encode_image(image, image_format: str, jpeg_quality: int) -> str:
+    buf = io.BytesIO()
+    if image_format == "jpeg":
+        image.save(buf, format="JPEG", quality=jpeg_quality)
+    else:
+        image.save(buf, format="PNG")
+    return base64.b64encode(buf.getvalue()).decode("ascii")
+
+
+def _draw_grid(image, region_x: int, region_y: int, rows: int, cols: int, include_labels: bool):
+    _, ImageDraw, _ = _import_capture_libs()
+    out = image.copy()
+    draw = ImageDraw.Draw(out)
+    w, h = out.size
+
+    cell_w = w / cols
+    cell_h = h / rows
+
+    for c in range(1, cols):
+        x = int(round(c * cell_w))
+        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
+    for r in range(1, rows):
+        y = int(round(r * cell_h))
+        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
+
+    draw.rectangle([(0, 0), (w - 1, h - 1)], outline=(255, 0, 0), width=2)
+
+    if include_labels:
+        for r in range(rows):
+            for c in range(cols):
+                cx = int((c + 0.5) * cell_w)
+                cy = int((r + 0.5) * cell_h)
+                label = f"{r},{c}"
+                draw.text((cx - 12, cy - 6), label, fill=(255, 255, 0))
+
+    meta = {
+        "region": {"x": region_x, "y": region_y, "width": w, "height": h},
+        "grid": {
+            "rows": rows,
+            "cols": cols,
+            "cell_width": cell_w,
+            "cell_height": cell_h,
+            "indexing": "zero-based",
+            "point_formula": {
+                "pixel_x": "region.x + ((col + 0.5 + dx*0.5) * cell_width)",
+                "pixel_y": "region.y + ((row + 0.5 + dy*0.5) * cell_height)",
+                "dx_range": "[-1,1]",
+                "dy_range": "[-1,1]",
+            },
+        },
+    }
+    return out, meta
+
+
+def _resolve_target(target: Target) -> tuple[int, int, dict]:
+    if isinstance(target, PixelTarget):
+        x = target.x + target.dx
+        y = target.y + target.dy
+        return x, y, {"mode": "pixel", "source": target.model_dump()}
+
+    cell_w = target.region_width / target.cols
+    cell_h = target.region_height / target.rows
+
+    x = target.region_x + int(round((target.col + 0.5 + (target.dx * 0.5)) * cell_w))
+    y = target.region_y + int(round((target.row + 0.5 + (target.dy * 0.5)) * cell_h))
+
+    return x, y, {
+        "mode": "grid",
+        "source": target.model_dump(),
+        "derived": {"cell_width": cell_w, "cell_height": cell_h},
+    }
+
+
+def _enforce_allowed_region(x: int, y: int):
+    region = SETTINGS["allowed_region"]
+    if region is None:
+        return
+    rx, ry, rw, rh = region
+    if not (rx <= x < rx + rw and ry <= y < ry + rh):
+        raise HTTPException(status_code=403, detail="point outside allowed region")
+
+
+def _import_input_lib():
+    try:
+        import pyautogui
+
+        pyautogui.FAILSAFE = True
+        return pyautogui
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"input backend unavailable: {exc}") from exc
+
+
+def _exec_action(req: ActionRequest) -> dict:
+    run_dry = SETTINGS["dry_run"] or req.dry_run
+
+    pyautogui = None if run_dry else _import_input_lib()
+    resolved_target = None
+
+    if req.target is not None:
+        x, y, info = _resolve_target(req.target)
+        _enforce_allowed_region(x, y)
+        resolved_target = {"x": x, "y": y, "target_info": info}
+
+    duration_sec = req.duration_ms / 1000.0
+
+    if req.action in {"move", "click", "right_click", "double_click", "middle_click"} and resolved_target is None:
+        raise HTTPException(status_code=400, detail="target is required for pointer actions")
+
+    if req.action == "scroll" and resolved_target is None:
+        raise HTTPException(status_code=400, detail="target is required for scroll")
+
+    if not run_dry:
+        if req.action == "move":
+            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
+
+        elif req.action == "click":
+            pyautogui.click(
+                x=resolved_target["x"],
+                y=resolved_target["y"],
+                clicks=req.clicks,
+                interval=req.interval_ms / 1000.0,
+                button=req.button,
+                duration=duration_sec,
+            )
+
+        elif req.action == "right_click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="right", duration=duration_sec)
+
+        elif req.action == "double_click":
+            pyautogui.doubleClick(x=resolved_target["x"], y=resolved_target["y"], interval=req.interval_ms / 1000.0)
+
+        elif req.action == "middle_click":
+            pyautogui.click(x=resolved_target["x"], y=resolved_target["y"], button="middle", duration=duration_sec)
+
+        elif req.action == "scroll":
+            pyautogui.moveTo(resolved_target["x"], resolved_target["y"], duration=duration_sec)
+            pyautogui.scroll(req.scroll_amount)
+
+        elif req.action == "type":
+            pyautogui.write(req.text, interval=req.interval_ms / 1000.0)
+
+        elif req.action == "hotkey":
+            if len(req.keys) < 1:
+                raise HTTPException(status_code=400, detail="keys is required for hotkey")
+            pyautogui.hotkey(*req.keys)
+
+    return {
+        "action": req.action,
+        "executed": not run_dry,
+        "dry_run": run_dry,
+        "resolved_target": resolved_target,
+    }
+
+
+@app.get("/health")
+def health(_: None = Depends(_auth)):
+    return {
+        "ok": True,
+        "service": "clickthrough",
+        "version": app.version,
+        "time_ms": _now_ms(),
+        "request_id": _request_id(),
+        "dry_run": SETTINGS["dry_run"],
+        "allowed_region": SETTINGS["allowed_region"],
+    }
+
+
+@app.get("/screen")
+def screen(
+    with_grid: bool = True,
+    grid_rows: int = SETTINGS["default_grid_rows"],
+    grid_cols: int = SETTINGS["default_grid_cols"],
+    include_labels: bool = True,
+    image_format: Literal["png", "jpeg"] = "png",
+    jpeg_quality: int = 85,
+    _: None = Depends(_auth),
+):
+    req = ScreenRequest(
+        with_grid=with_grid,
+        grid_rows=grid_rows,
+        grid_cols=grid_cols,
+        include_labels=include_labels,
+        image_format=image_format,
+        jpeg_quality=jpeg_quality,
+    )
+
+    base_img, mon = _capture_screen()
+    meta = {"region": mon}
+    out_img = base_img
+
+    if req.with_grid:
+        out_img, grid_meta = _draw_grid(base_img, mon["x"], mon["y"], req.grid_rows, req.grid_cols, req.include_labels)
+        meta.update(grid_meta)
+
+    encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "image": {
+            "format": req.image_format,
+            "base64": encoded,
+            "width": out_img.size[0],
+            "height": out_img.size[1],
+        },
+        "meta": meta,
+    }
+
+
+@app.post("/zoom")
+def zoom(req: ZoomRequest, _: None = Depends(_auth)):
+    base_img, mon = _capture_screen()
+
+    cx = req.center_x - mon["x"]
+    cy = req.center_y - mon["y"]
+
+    half_w = req.width // 2
+    half_h = req.height // 2
+
+    left = max(0, min(cx - half_w, base_img.size[0] - req.width))
+    top = max(0, min(cy - half_h, base_img.size[1] - req.height))
+    right = min(base_img.size[0], left + req.width)
+    bottom = min(base_img.size[1], top + req.height)
+
+    crop = base_img.crop((left, top, right, bottom))
+
+    region_x = mon["x"] + left
+    region_y = mon["y"] + top
+
+    meta = {
+        "source_monitor": mon,
+        "region": {
+            "x": region_x,
+            "y": region_y,
+            "width": crop.size[0],
+            "height": crop.size[1],
+        },
+    }
+
+    out_img = crop
+    if req.with_grid:
+        out_img, grid_meta = _draw_grid(crop, region_x, region_y, req.grid_rows, req.grid_cols, req.include_labels)
+        meta.update(grid_meta)
+
+    encoded = _encode_image(out_img, req.image_format, req.jpeg_quality)
+
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "image": {
+            "format": req.image_format,
+            "base64": encoded,
+            "width": out_img.size[0],
+            "height": out_img.size[1],
+        },
+        "meta": meta,
+    }
+
+
+@app.post("/action")
+def action(req: ActionRequest, _: None = Depends(_auth)):
+    result = _exec_action(req)
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "result": result,
+    }
+
+
+@app.post("/batch")
+def batch(req: BatchRequest, _: None = Depends(_auth)):
+    results = []
+    for index, item in enumerate(req.actions):
+        try:
+            item_result = _exec_action(item)
+            results.append({"index": index, "ok": True, "result": item_result})
+        except Exception as exc:
+            results.append({"index": index, "ok": False, "error": str(exc)})
+            if req.stop_on_error:
+                break
+
+    return {
+        "ok": all(r["ok"] for r in results),
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "results": results,
+    }
+
+
+if __name__ == "__main__":
+    import uvicorn
+
+    uvicorn.run("server.app:app", host=SETTINGS["host"], port=SETTINGS["port"], reload=False)
diff --git a/skill/SKILL.md b/skill/SKILL.md
new file mode 100644
index 0000000..f96c3c2
--- /dev/null
+++ b/skill/SKILL.md
@@ -0,0 +1,35 @@
+---
+name: clickthrough-http-control
+description: Control a local computer through the Clickthrough HTTP server using screenshot grids, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
+---
+
+# Clickthrough HTTP Control
+
+Use a strict observe-decide-act-verify loop.
+
+## Workflow
+
+1. Call `GET /screen` with a coarse grid (e.g., 12x12).
+2. Identify the likely cell/region for the target UI element.
+3. If confidence is low, call `POST /zoom` centered on the candidate and use a denser grid (e.g., 20x20).
+4. Execute one minimal action via `POST /action`.
+5. Re-capture with `GET /screen` and verify the expected state change.
+6. Repeat until the objective is complete.
+
+## Precision rules
+
+- Prefer grid targets first, then use `dx/dy` for subcell precision.
+- Keep `dx/dy` in `[-1,1]`; start at `0,0` and only offset when needed.
+- Use zoom before guessing offsets.
+
+## Safety rules
+
+- Respect `dry_run` and `allowed_region` restrictions from `/health`.
+- Avoid destructive shortcuts unless explicitly requested.
+- Send one action at a time unless the sequence is deterministic; in that case, use `/batch`.
+
+## Reliability rules
+
+- After every meaningful action, verify with a fresh screenshot.
+- On mismatch, do not spam clicks: zoom, re-localize, and retry once.
+- Prefer short, reversible actions over long macros.
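The precision rules in `skill/SKILL.md` depend on the point formula documented in `docs/coordinate-system.md`. The sketch below mirrors `_resolve_target` and `_enforce_allowed_region` from `server/app.py`, so a client could predict which global pixel a grid target will hit, or cross-check the `resolved_target` returned by a dry-run action. The function names are hypothetical, and the sketch assumes the server keeps rounding with Python's built-in `round()`.

```python
from typing import Optional


def grid_to_pixel(region_x: int, region_y: int, region_width: int, region_height: int,
                  rows: int, cols: int, row: int, col: int,
                  dx: float = 0.0, dy: float = 0.0) -> tuple[int, int]:
    """Map a grid cell plus normalized offsets to global pixel coordinates (hypothetical helper)."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("row/col must be inside rows/cols")
    if not (-1.0 <= dx <= 1.0 and -1.0 <= dy <= 1.0):
        raise ValueError("dx/dy must be in [-1, 1]")
    cell_w = region_width / cols
    cell_h = region_height / rows
    # dx = -1 -> left edge, 0 -> center, 1 -> right edge (same idea for dy).
    # round() rounds half to even, matching int(round(...)) in _resolve_target.
    x = region_x + int(round((col + 0.5 + dx * 0.5) * cell_w))
    y = region_y + int(round((row + 0.5 + dy * 0.5) * cell_h))
    return x, y


def inside_allowed_region(x: int, y: int, region: Optional[tuple[int, int, int, int]]) -> bool:
    """Half-open bounds check like the server: right and bottom edges are excluded."""
    if region is None:
        return True
    rx, ry, rw, rh = region
    return rx <= x < rx + rw and ry <= y < ry + rh


if __name__ == "__main__":
    # Cell (5, 9) of a 12x12 grid on a 1920x1080 screen; each cell is 160x90 px.
    print(grid_to_pixel(0, 0, 1920, 1080, 12, 12, row=5, col=9))  # (1520, 495)
```

Because rounding is half-to-even, a point that lands exactly on a half pixel (for example `0.5 * 25 = 12.5`) resolves to the even neighbour, so a client that reimplements the formula in another language should use the same rounding mode to agree with the server.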