diff --git a/README.md b/README.md
index 8714d73..4658a10 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
+- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
 - **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
@@ -30,11 +31,12 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 
 ## Minimal API flow
 
-1. `GET /screen` with grid
-2. Decide cell / target
-3. Optional `POST /zoom` for finer targeting
-4. `POST /action` to execute
-5. `GET /screen` again to verify result
+1. `GET /displays` if you need a non-primary monitor
+2. `GET /screen?screen=0` with grid
+3. Decide cell / target
+4. Optional `POST /zoom?screen=0` for finer targeting
+5. `POST /action?screen=0` to execute
+6. `GET /screen?screen=0` again to verify result
 
 See:
 - `docs/API.md`
diff --git a/docs/API.md b/docs/API.md
index dbfb58e..26e10af 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -12,19 +12,39 @@ x-clickthrough-token:
 
 Returns status and runtime safety flags, including `exec` capability config.
 
+## `GET /displays`
+
+Returns detected displays in API screen order.
+
+```json
+{
+  "ok": true,
+  "default_screen": 0,
+  "displays": [
+    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
+    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
+  ]
+}
+```
+
+`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
+Invalid `screen` values fall back to `0`.
+
 ## `GET /screen`
 
 Query params:
 
+- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
 - `with_grid` (bool, default `true`)
 - `grid_rows` (int, default env or `12`)
 - `grid_cols` (int, default env or `12`)
 - `include_labels` (bool, default `true`)
 - `image_format` (`png`|`jpeg`, default `png`)
 - `jpeg_quality` (1-100, default `85`)
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 
-Default response includes base64 image and metadata (`meta.region`, optional `meta.grid`).
+Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
+`meta.region` uses global desktop coordinates.
 
 ## `POST /zoom`
 
@@ -47,14 +67,21 @@ Body:
 
 Query params:
 
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 
-Default response returns cropped image + region metadata in global pixel coordinates.
+Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 
 ## `POST /action`
 
 Body: one action.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
+
+Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
+
 ### Pointer target modes
 
 #### Pixel target
@@ -147,6 +174,10 @@ Hotkey:
 
 Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
+
 Body:
 
 ```json
@@ -158,7 +189,7 @@ Body:
 ```
 
 Modes:
-- `screen` (default): OCR over full captured monitor
+- `screen` (default): OCR over full selected monitor
 - `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
 - `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
 
@@ -246,6 +277,10 @@ Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution
 
 Runs multiple `action` payloads sequentially.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
+
 ```json
 {
   "actions": [
diff --git a/docs/coordinate-system.md b/docs/coordinate-system.md
index 07d5f34..047af84 100644
--- a/docs/coordinate-system.md
+++ b/docs/coordinate-system.md
@@ -1,6 +1,8 @@
 # Coordinate System
 
-All interactions ultimately execute in **global pixel coordinates** of the primary monitor.
+All interactions ultimately execute in **global desktop pixel coordinates**.
+
+Use `GET /displays` to list available displays. Visual endpoints accept `?screen=X` where `X` is a zero-based display index. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend. Invalid screen values fall back to `0`.
 
 ## Regions
 
@@ -12,6 +14,12 @@ Visual endpoints return a `region` object:
 
 This describes where the image sits in global desktop space.
 
+For a second display to the right of the primary display, `GET /screen?screen=1` might return:
+
+```json
+{"x": 1920, "y": 0, "width": 1920, "height": 1080}
+```
+
 ## Grid indexing
 
 - Rows/cols are **zero-based**
@@ -35,7 +43,7 @@ Interpretation:
 
 ## Recommended agent loop
 
-1. Capture `/screen` with coarse grid
+1. Capture `/screen?screen=0` with coarse grid, or choose another display with `/screen?screen=1`
 2. Find candidate cell
 3. If uncertain, use `/zoom` around candidate
 4. Convert target to grid action
diff --git a/examples/quickstart.py b/examples/quickstart.py
index 876d9d1..5aba923 100644
--- a/examples/quickstart.py
+++ b/examples/quickstart.py
@@ -5,6 +5,7 @@ import requests
 
 BASE_URL = os.getenv("CLICKTHROUGH_URL", "http://127.0.0.1:8123")
 TOKEN = os.getenv("CLICKTHROUGH_TOKEN", "")
+SCREEN = int(os.getenv("CLICKTHROUGH_SCREEN", "0"))
 
 headers = {}
 if TOKEN:
@@ -16,10 +17,14 @@ def main():
     r.raise_for_status()
     print("health:", r.json())
 
+    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
+    d.raise_for_status()
+    print("displays:", d.json().get("displays", []))
+
     s = requests.get(
         f"{BASE_URL}/screen",
         headers=headers,
-        params={"with_grid": True, "grid_rows": 12, "grid_cols": 12},
+        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
         timeout=30,
     )
     s.raise_for_status()
diff --git a/server/app.py b/server/app.py
index 5726da2..fe19ce1 100644
--- a/server/app.py
+++ b/server/app.py
@@ -192,13 +192,73 @@ def _import_capture_libs():
         raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
 
 
-def _capture_screen():
+def _display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
+    return {
+        "screen": screen,
+        "mss_index": mss_index,
+        "primary": primary,
+        "x": mon["left"],
+        "y": mon["top"],
+        "width": mon["width"],
+        "height": mon["height"],
+    }
+
+
+def _ordered_displays(sct) -> list[dict]:
+    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
+    if not raw_monitors:
+        raise HTTPException(status_code=500, detail="no displays detected")
+
+    primary_pos = next(
+        (idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0),
+        0,
+    )
+    ordered = [raw_monitors[primary_pos]] + [
+        item for idx, item in enumerate(raw_monitors) if idx != primary_pos
+    ]
+    return [
+        _display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0))
+        for index, (mss_index, mon) in enumerate(ordered)
+    ]
+
+
+def _get_displays() -> list[dict]:
+    _, _, mss = _import_capture_libs()
+    with mss.mss() as sct:
+        return _ordered_displays(sct)
+
+
+def _select_display(screen: int) -> tuple[dict, list[dict], dict]:
+    displays = _get_displays()
+    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
+    selection = {
+        "requested": screen,
+        "selected": selected["screen"],
+        "fallback": selected["screen"] != screen,
+    }
+    return selected, displays, selection
+
+
+def _capture_screen(screen: int = 0):
     Image, _, mss = _import_capture_libs()
     with mss.mss() as sct:
-        mon = sct.monitors[1]
-        shot = sct.grab(mon)
+        displays = _ordered_displays(sct)
+        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
+        shot = sct.grab(
+            {
+                "left": mon["x"],
+                "top": mon["y"],
+                "width": mon["width"],
+                "height": mon["height"],
+            }
+        )
     image = Image.frombytes("RGB", shot.size, shot.rgb)
-    return image, {"x": mon["left"], "y": mon["top"], "width": mon["width"], "height": mon["height"]}
+    selection = {
+        "requested": screen,
+        "selected": mon["screen"],
+        "fallback": mon["screen"] != screen,
+    }
+    return image, mon, displays, selection
 
 
 def _serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
@@ -503,8 +563,9 @@ def _exec_command(req: ExecRequest) -> dict:
     }
 
 
-def _exec_action(req: ActionRequest) -> dict:
+def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
     run_dry = SETTINGS["dry_run"] or req.dry_run
+    selected_display, displays, screen_selection = _select_display(screen)
     pyautogui = None if run_dry else _import_input_lib()
 
     resolved_target = None
@@ -561,6 +622,8 @@ def _exec_action(req: ActionRequest) -> dict:
         "action": req.action,
         "executed": not run_dry,
         "dry_run": run_dry,
+        "screen": screen_selection,
+        "display": selected_display,
         "resolved_target": resolved_target,
     }
 
@@ -585,6 +648,18 @@ def health(_: None = Depends(_auth)):
     }
 
 
+@app.get("/displays")
+def displays(_: None = Depends(_auth)):
+    detected = _get_displays()
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "displays": detected,
+        "default_screen": 0,
+    }
+
+
 @app.get("/screen")
 def screen(
     with_grid: bool = True,
@@ -594,6 +669,7 @@ def screen(
     image_format: Literal["png", "jpeg"] = "png",
     jpeg_quality: int = 85,
     asImage: bool = False,
+    screen: int = 0,
     _: None = Depends(_auth),
 ):
     req = ScreenRequest(
@@ -605,8 +681,8 @@ def screen(
         jpeg_quality=jpeg_quality,
     )
 
-    base_img, mon = _capture_screen()
-    meta = {"region": mon}
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
+    meta = {"region": mon, "screen": screen_selection, "displays": displays}
 
     out_img = base_img
     if req.with_grid:
@@ -634,8 +710,8 @@ def screen(
 
 
 @app.post("/zoom")
-def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
-    base_img, mon = _capture_screen()
+def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
 
     cx = req.center_x - mon["x"]
     cy = req.center_y - mon["y"]
@@ -655,6 +731,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 
     meta = {
         "source_monitor": mon,
+        "screen": screen_selection,
+        "displays": displays,
         "region": {
             "x": region_x,
             "y": region_y,
@@ -690,8 +768,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 
 
 @app.post("/action")
-def action(req: ActionRequest, _: None = Depends(_auth)):
-    result = _exec_action(req)
+def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
+    result = _exec_action(req, screen)
     return {
         "ok": True,
         "request_id": _request_id(),
@@ -722,14 +800,14 @@ def exec_command(
 
 
 @app.post("/ocr")
-def ocr(req: OCRRequest, _: None = Depends(_auth)):
+def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
     source = req.mode
     if source == "image":
         image = _decode_image_base64(req.image_base64 or "")
         region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
         blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
     else:
-        base_img, mon = _capture_screen()
+        base_img, mon, displays, screen_selection = _capture_screen(screen)
         if source == "screen":
             image = base_img
             region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
@@ -762,6 +840,8 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
         "time_ms": _now_ms(),
         "result": {
             "mode": source,
+            "screen": screen_selection if source != "image" else None,
+            "display": mon if source != "image" else None,
             "language_hint": req.language_hint,
             "min_confidence": req.min_confidence,
             "region": region,
@@ -771,11 +851,11 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
 
 
 @app.post("/batch")
-def batch(req: BatchRequest, _: None = Depends(_auth)):
+def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
     results = []
     for index, item in enumerate(req.actions):
         try:
-            item_result = _exec_action(item)
+            item_result = _exec_action(item, screen)
             results.append({"index": index, "ok": True, "result": item_result})
         except Exception as exc:
             results.append({"index": index, "ok": False, "error": str(exc)})
diff --git a/skill/SKILL.md b/skill/SKILL.md
index fc11c9a..6a06972 100644
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -33,13 +33,20 @@ The agent should not assume it can self-install this stack.
 ## Mini API map
 
 - `GET /health` → server status + safety flags
-- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
-- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
+- `GET /displays` → detected displays in zero-based API order
+- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
+- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
 - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
-- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
-- `POST /batch` → sequential action list
+- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
+- `POST /batch?screen=0` → sequential action list
 - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
 
+### Display selection
+
+- Use `GET /displays` before operating on multi-monitor systems.
+- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
+- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
+
 ### OCR usage
 
 - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
@@ -55,7 +62,7 @@ The agent should not assume it can self-install this stack.
 
 ## Core workflow (mandatory)
 
-1. Call `GET /screen` with coarse grid (e.g., 12x12).
+1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).
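
The docs in this patch say pointer targets stay in global desktop coordinates and should be computed from the selected display's `meta.region`. Below is a minimal sketch of that conversion. The helper name `cell_center_to_global` and the uniform-grid, cell-center mapping are assumptions for illustration and are not taken from `server/app.py`. The region values mirror the second-display example in `docs/coordinate-system.md`.

```python
from typing import Dict, Tuple


def cell_center_to_global(
    region: Dict[str, int], rows: int, cols: int, row: int, col: int
) -> Tuple[int, int]:
    """Map a zero-based grid cell to the global desktop pixel at its center.

    Assumes (hypothetically) that the grid divides the region uniformly.
    """
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"cell ({row}, {col}) is outside a {rows}x{cols} grid")
    cell_w = region["width"] / cols
    cell_h = region["height"] / rows
    # Adding the region origin turns screen-local offsets into global coordinates.
    x = region["x"] + int((col + 0.5) * cell_w)
    y = region["y"] + int((row + 0.5) * cell_h)
    return x, y


# Example region for a second display placed to the right of the primary one.
second_display = {"x": 1920, "y": 0, "width": 1920, "height": 1080}
target = cell_center_to_global(second_display, rows=12, cols=12, row=0, col=0)
print(target)  # (2000, 45)
```

The resulting pair could then be sent as a pixel target to `POST /action?screen=1`. According to the docs above, `screen` on that endpoint is only included in the response metadata, so the display offset has to come from the region itself.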