Support multi-display screen selection

Space-Banane
2026-04-29 21:52:01 +02:00
parent a8f2e01bb9
commit 775c188732
6 changed files with 170 additions and 33 deletions

@@ -6,6 +6,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
+- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
 - **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
@@ -30,11 +31,12 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 ## Minimal API flow
-1. `GET /screen` with grid
-2. Decide cell / target
-3. Optional `POST /zoom` for finer targeting
-4. `POST /action` to execute
-5. `GET /screen` again to verify result
+1. `GET /displays` if you need a non-primary monitor
+2. `GET /screen?screen=0` with grid
+3. Decide cell / target
+4. Optional `POST /zoom?screen=0` for finer targeting
+5. `POST /action?screen=0` to execute
+6. `GET /screen?screen=0` again to verify result
 See:
 - `docs/API.md`

@@ -12,19 +12,39 @@ x-clickthrough-token: <token>
 Returns status and runtime safety flags, including `exec` capability config.
+## `GET /displays`
+Returns detected displays in API screen order.
+```json
+{
+  "ok": true,
+  "default_screen": 0,
+  "displays": [
+    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
+    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
+  ]
+}
+```
+`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
+Invalid `screen` values fall back to `0`.
 ## `GET /screen`
 Query params:
+- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
 - `with_grid` (bool, default `true`)
 - `grid_rows` (int, default env or `12`)
 - `grid_cols` (int, default env or `12`)
 - `include_labels` (bool, default `true`)
 - `image_format` (`png`|`jpeg`, default `png`)
 - `jpeg_quality` (1-100, default `85`)
-- `asImage` (bool, default `false`) if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
-Default response includes base64 image and metadata (`meta.region`, optional `meta.grid`).
+Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
+`meta.region` uses global desktop coordinates.
 ## `POST /zoom`
@@ -47,14 +67,21 @@ Body:
 Query params:
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
-Default response returns cropped image + region metadata in global pixel coordinates.
+Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 ## `POST /action`
 Body: one action.
+Query params:
+- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
+Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
 ### Pointer target modes
 #### Pixel target
@@ -147,6 +174,10 @@ Hotkey:
 Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
+Query params:
+- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
 Body:
 ```json
@@ -158,7 +189,7 @@ Body:
 ```
 Modes:
-- `screen` (default): OCR over full captured monitor
+- `screen` (default): OCR over full selected monitor
 - `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
 - `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
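The `screen` query parameter rule used across these endpoints (a zero-based index, with invalid values falling back to `0`) can be sketched as a small pure function. This is a minimal illustration that only mirrors the documented `displays` payload shape; the helper name `pick_display` and the sample data are assumptions for illustration, not part of the server.

```python
from typing import Dict, List, Tuple

# Sample payload shaped like the `displays` array documented for GET /displays.
SAMPLE_DISPLAYS: List[Dict] = [
    {"screen": 0, "mss_index": 1, "primary": True, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": False, "x": 1920, "y": 0, "width": 1920, "height": 1080},
]


def pick_display(displays: List[Dict], requested: int) -> Tuple[Dict, Dict]:
    """Return (display, selection); out-of-range indexes fall back to screen 0."""
    if not displays:
        raise ValueError("no displays available")
    in_range = 0 <= requested < len(displays)
    selected = displays[requested] if in_range else displays[0]
    # Hypothetical bookkeeping dict so a caller can tell whether a fallback happened.
    selection = {
        "requested": requested,
        "selected": selected["screen"],
        "fallback": selected["screen"] != requested,
    }
    return selected, selection


if __name__ == "__main__":
    for requested in (0, 1, 5, -1):
        display, selection = pick_display(SAMPLE_DISPLAYS, requested)
        print(requested, "->", selection, (display["x"], display["y"]))
```

A client that checks the `fallback` flag can notice that it asked for a display that does not exist, rather than silently acting on the primary one.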
@@ -246,6 +277,10 @@ Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution
 Runs multiple `action` payloads sequentially.
+Query params:
+- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
 ```json
 {
   "actions": [

@@ -1,6 +1,8 @@
 # Coordinate System
-All interactions ultimately execute in **global pixel coordinates** of the primary monitor.
+All interactions ultimately execute in **global desktop pixel coordinates**.
+Use `GET /displays` to list available displays. Visual endpoints accept `?screen=X` where `X` is a zero-based display index. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend. Invalid screen values fall back to `0`.
 ## Regions
@@ -12,6 +14,12 @@ Visual endpoints return a `region` object:
 This describes where the image sits in global desktop space.
+For a second display to the right of the primary display, `GET /screen?screen=1` might return:
+```json
+{"x": 1920, "y": 0, "width": 1920, "height": 1080}
+```
 ## Grid indexing
 - Rows/cols are **zero-based**
@@ -35,7 +43,7 @@ Interpretation:
 ## Recommended agent loop
-1. Capture `/screen` with coarse grid
+1. Capture `/screen?screen=0` with coarse grid, or choose another display with `/screen?screen=1`
 2. Find candidate cell
 3. If uncertain, use `/zoom` around candidate
 4. Convert target to grid action
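To make the relationship between a display's `region` and a zero-based grid cell concrete, here is a rough sketch that maps a cell to the global pixel at its center. It assumes the grid splits the region into equally sized cells; the function name `cell_center` is hypothetical, and the server's own grid math is not shown in this diff and may round differently.

```python
from typing import Dict, Tuple


def cell_center(region: Dict[str, int], rows: int, cols: int, row: int, col: int) -> Tuple[int, int]:
    """Map a zero-based grid cell to the global desktop pixel at its center.

    Assumption: the grid divides the region into equally sized cells.
    """
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("cell outside grid")
    cell_w = region["width"] / cols
    cell_h = region["height"] / rows
    # Offset by the region origin so the result is in global coordinates.
    x = region["x"] + int((col + 0.5) * cell_w)
    y = region["y"] + int((row + 0.5) * cell_h)
    return x, y


if __name__ == "__main__":
    # Region of a second display to the right of a 1920-wide primary display.
    second = {"x": 1920, "y": 0, "width": 1920, "height": 1080}
    print(cell_center(second, 12, 12, 0, 0))
    print(cell_center(second, 12, 12, 11, 11))
```

Because the region origin is added back in, a cell on the second display resolves to an x value of 1920 or more, which is what pointer actions in global coordinates expect.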

@@ -5,6 +5,7 @@ import requests
 BASE_URL = os.getenv("CLICKTHROUGH_URL", "http://127.0.0.1:8123")
 TOKEN = os.getenv("CLICKTHROUGH_TOKEN", "")
+SCREEN = int(os.getenv("CLICKTHROUGH_SCREEN", "0"))
 headers = {}
 if TOKEN:
@@ -16,10 +17,14 @@ def main():
     r.raise_for_status()
     print("health:", r.json())
+    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
+    d.raise_for_status()
+    print("displays:", d.json().get("displays", []))
     s = requests.get(
         f"{BASE_URL}/screen",
         headers=headers,
-        params={"with_grid": True, "grid_rows": 12, "grid_cols": 12},
+        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
         timeout=30,
     )
     s.raise_for_status()

@@ -192,13 +192,73 @@ def _import_capture_libs():
         raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
-def _capture_screen():
+def _display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
+    return {
+        "screen": screen,
+        "mss_index": mss_index,
+        "primary": primary,
+        "x": mon["left"],
+        "y": mon["top"],
+        "width": mon["width"],
+        "height": mon["height"],
+    }
+def _ordered_displays(sct) -> list[dict]:
+    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
+    if not raw_monitors:
+        raise HTTPException(status_code=500, detail="no displays detected")
+    primary_pos = next(
+        (idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0),
+        0,
+    )
+    ordered = [raw_monitors[primary_pos]] + [
+        item for idx, item in enumerate(raw_monitors) if idx != primary_pos
+    ]
+    return [
+        _display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0))
+        for index, (mss_index, mon) in enumerate(ordered)
+    ]
+def _get_displays() -> list[dict]:
+    _, _, mss = _import_capture_libs()
+    with mss.mss() as sct:
+        return _ordered_displays(sct)
+def _select_display(screen: int) -> tuple[dict, list[dict], dict]:
+    displays = _get_displays()
+    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
+    selection = {
+        "requested": screen,
+        "selected": selected["screen"],
+        "fallback": selected["screen"] != screen,
+    }
+    return selected, displays, selection
+def _capture_screen(screen: int = 0):
     Image, _, mss = _import_capture_libs()
     with mss.mss() as sct:
-        mon = sct.monitors[1]
-        shot = sct.grab(mon)
+        displays = _ordered_displays(sct)
+        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
+        shot = sct.grab(
+            {
+                "left": mon["x"],
+                "top": mon["y"],
+                "width": mon["width"],
+                "height": mon["height"],
+            }
+        )
         image = Image.frombytes("RGB", shot.size, shot.rgb)
-    return image, {"x": mon["left"], "y": mon["top"], "width": mon["width"], "height": mon["height"]}
+    selection = {
+        "requested": screen,
+        "selected": mon["screen"],
+        "fallback": mon["screen"] != screen,
+    }
+    return image, mon, displays, selection
 def _serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
@@ -503,8 +563,9 @@ def _exec_command(req: ExecRequest) -> dict:
     }
-def _exec_action(req: ActionRequest) -> dict:
+def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
     run_dry = SETTINGS["dry_run"] or req.dry_run
+    selected_display, displays, screen_selection = _select_display(screen)
     pyautogui = None if run_dry else _import_input_lib()
     resolved_target = None
@@ -561,6 +622,8 @@ def _exec_action(req: ActionRequest) -> dict:
         "action": req.action,
         "executed": not run_dry,
         "dry_run": run_dry,
+        "screen": screen_selection,
+        "display": selected_display,
         "resolved_target": resolved_target,
     }
@@ -585,6 +648,18 @@ def health(_: None = Depends(_auth)):
     }
+@app.get("/displays")
+def displays(_: None = Depends(_auth)):
+    detected = _get_displays()
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "displays": detected,
+        "default_screen": 0,
+    }
 @app.get("/screen")
 def screen(
     with_grid: bool = True,
@@ -594,6 +669,7 @@ def screen(
     image_format: Literal["png", "jpeg"] = "png",
     jpeg_quality: int = 85,
     asImage: bool = False,
+    screen: int = 0,
     _: None = Depends(_auth),
 ):
     req = ScreenRequest(
@@ -605,8 +681,8 @@ def screen(
         jpeg_quality=jpeg_quality,
     )
-    base_img, mon = _capture_screen()
-    meta = {"region": mon}
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
+    meta = {"region": mon, "screen": screen_selection, "displays": displays}
     out_img = base_img
     if req.with_grid:
@@ -634,8 +710,8 @@ def screen(
 @app.post("/zoom")
-def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
-    base_img, mon = _capture_screen()
+def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
     cx = req.center_x - mon["x"]
     cy = req.center_y - mon["y"]
@@ -655,6 +731,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
     meta = {
         "source_monitor": mon,
+        "screen": screen_selection,
+        "displays": displays,
         "region": {
             "x": region_x,
             "y": region_y,
@@ -690,8 +768,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 @app.post("/action")
-def action(req: ActionRequest, _: None = Depends(_auth)):
-    result = _exec_action(req)
+def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
+    result = _exec_action(req, screen)
     return {
         "ok": True,
         "request_id": _request_id(),
@@ -722,14 +800,14 @@ def exec_command(
 @app.post("/ocr")
-def ocr(req: OCRRequest, _: None = Depends(_auth)):
+def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
     source = req.mode
     if source == "image":
         image = _decode_image_base64(req.image_base64 or "")
         region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
         blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
     else:
-        base_img, mon = _capture_screen()
+        base_img, mon, displays, screen_selection = _capture_screen(screen)
         if source == "screen":
             image = base_img
             region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
@@ -762,6 +840,8 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
         "time_ms": _now_ms(),
         "result": {
             "mode": source,
+            "screen": screen_selection if source != "image" else None,
+            "display": mon if source != "image" else None,
             "language_hint": req.language_hint,
             "min_confidence": req.min_confidence,
             "region": region,
@@ -771,11 +851,11 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
 @app.post("/batch")
-def batch(req: BatchRequest, _: None = Depends(_auth)):
+def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
     results = []
     for index, item in enumerate(req.actions):
         try:
-            item_result = _exec_action(item)
+            item_result = _exec_action(item, screen)
             results.append({"index": index, "ok": True, "result": item_result})
         except Exception as exc:
             results.append({"index": index, "ok": False, "error": str(exc)})
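The primary-first ordering that this file's `_ordered_displays` applies can be exercised without a capture backend. The sketch below reimplements the same heuristic on plain dictionaries shaped like the `mss` monitor list, where entry 0 is the combined virtual desktop and entries 1..N are individual monitors. The function name `order_displays` and the `FAKE_MONITORS` data are illustrative assumptions, and it raises `ValueError` where the server raises `HTTPException`.

```python
from typing import Dict, List


def order_displays(monitors: List[Dict[str, int]]) -> List[Dict]:
    """Order monitors primary-first and assign zero-based `screen` indexes."""
    # Skip entry 0 (combined desktop) and remember each monitor's backend index.
    raw = list(enumerate(monitors[1:], start=1))
    if not raw:
        raise ValueError("no displays detected")
    # Heuristic from the diff: the monitor at the desktop origin is treated as
    # primary; if none sits at (0, 0), the first reported monitor is used.
    primary_pos = next(
        (pos for pos, (_, mon) in enumerate(raw) if mon["left"] == 0 and mon["top"] == 0),
        0,
    )
    ordered = [raw[primary_pos]] + [item for pos, item in enumerate(raw) if pos != primary_pos]
    return [
        {
            "screen": screen,
            "mss_index": mss_index,
            "primary": screen == 0,
            "x": mon["left"],
            "y": mon["top"],
            "width": mon["width"],
            "height": mon["height"],
        }
        for screen, (mss_index, mon) in enumerate(ordered)
    ]


# Hypothetical layout: a secondary monitor placed to the left of the primary.
FAKE_MONITORS = [
    {"left": -1280, "top": 0, "width": 3200, "height": 1080},  # combined desktop
    {"left": -1280, "top": 0, "width": 1280, "height": 1024},  # secondary
    {"left": 0, "top": 0, "width": 1920, "height": 1080},      # primary at origin
]

if __name__ == "__main__":
    for display in order_displays(FAKE_MONITORS):
        print(display)
```

In this layout the API `screen` index differs from the backend index (`screen=0` maps to `mss_index=2`), and the secondary display has a negative `x`, so clients should not assume global coordinates are non-negative.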

@@ -33,13 +33,20 @@ The agent should not assume it can self-install this stack.
 ## Mini API map
 - `GET /health` → server status + safety flags
-- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
-- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
+- `GET /displays` → detected displays in zero-based API order
+- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
+- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
 - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
-- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
-- `POST /batch` → sequential action list
+- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
+- `POST /batch?screen=0` → sequential action list
 - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
+### Display selection
+- Use `GET /displays` before operating on multi-monitor systems.
+- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
+- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
 ### OCR usage
 - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
@@ -55,7 +62,7 @@ The agent should not assume it can self-install this stack.
 ## Core workflow (mandatory)
-1. Call `GET /screen` with coarse grid (e.g., 12x12).
+1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).