Support multi-display screen selection
All checks were successful
python-syntax / syntax-check (push) Successful in 1m33s
12
README.md
@@ -6,6 +6,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
+- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
 - **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
@@ -30,11 +31,12 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 
 ## Minimal API flow
 
-1. `GET /screen` with grid
-2. Decide cell / target
-3. Optional `POST /zoom` for finer targeting
-4. `POST /action` to execute
-5. `GET /screen` again to verify result
+1. `GET /displays` if you need a non-primary monitor
+2. `GET /screen?screen=0` with grid
+3. Decide cell / target
+4. Optional `POST /zoom?screen=0` for finer targeting
+5. `POST /action?screen=0` to execute
+6. `GET /screen?screen=0` again to verify result
 
 See:
 - `docs/API.md`
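Reviewer sketch of the README flow in the hunk above: the helpers below only build the request URLs for each HTTP step, so they run without a server. `endpoint_url` and `flow_urls` are hypothetical names, not part of the project, and the "decide cell / target" step is omitted because it is not a request.

```python
from typing import Optional
from urllib.parse import urlencode


def endpoint_url(base_url: str, path: str, screen: Optional[int] = None) -> str:
    """Build an endpoint URL, appending ?screen=N only when a display is given."""
    query = f"?{urlencode({'screen': screen})}" if screen is not None else ""
    return f"{base_url.rstrip('/')}{path}{query}"


def flow_urls(base_url: str, screen: int = 0) -> list[tuple[str, str]]:
    """Return (method, url) pairs for the HTTP steps of the flow, in order."""
    return [
        ("GET", endpoint_url(base_url, "/displays")),        # list displays
        ("GET", endpoint_url(base_url, "/screen", screen)),  # capture with grid
        ("POST", endpoint_url(base_url, "/zoom", screen)),   # optional refinement
        ("POST", endpoint_url(base_url, "/action", screen)), # execute
        ("GET", endpoint_url(base_url, "/screen", screen)),  # verify result
    ]


if __name__ == "__main__":
    for method, url in flow_urls("http://127.0.0.1:8123", screen=1):
        print(method, url)
```

Passing the same `screen` value to every step keeps capture, zoom, and action metadata pointed at one display.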
45
docs/API.md
@@ -12,19 +12,39 @@ x-clickthrough-token: <token>
 
 Returns status and runtime safety flags, including `exec` capability config.
 
+## `GET /displays`
+
+Returns detected displays in API screen order.
+
+```json
+{
+  "ok": true,
+  "default_screen": 0,
+  "displays": [
+    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
+    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
+  ]
+}
+```
+
+`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
+Invalid `screen` values fall back to `0`.
+
 ## `GET /screen`
 
 Query params:
 
+- `screen` (int, default `0`) — zero-based display selector; invalid values fall back to `0`
 - `with_grid` (bool, default `true`)
 - `grid_rows` (int, default env or `12`)
 - `grid_cols` (int, default env or `12`)
 - `include_labels` (bool, default `true`)
 - `image_format` (`png`|`jpeg`, default `png`)
 - `jpeg_quality` (1-100, default `85`)
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 
-Default response includes base64 image and metadata (`meta.region`, optional `meta.grid`).
+Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
+`meta.region` uses global desktop coordinates.
 
 ## `POST /zoom`
 
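Reviewer sketch of the fallback rule documented for `GET /displays` above. It mirrors the `_select_display` helper added to `server/app.py` in this commit, but runs on plain dicts so the behavior can be checked without a capture backend. `select_display` is an illustrative name, and the sample list copies the example response.

```python
def select_display(displays: list[dict], screen: int) -> tuple[dict, dict]:
    """Pick a display by zero-based index, falling back to index 0 when out of range."""
    if not displays:
        raise ValueError("no displays detected")
    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
    selection = {
        "requested": screen,
        "selected": selected["screen"],
        # True whenever the server had to substitute a different display.
        "fallback": selected["screen"] != screen,
    }
    return selected, selection


SAMPLE_DISPLAYS = [
    {"screen": 0, "mss_index": 1, "primary": True, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "mss_index": 2, "primary": False, "x": 1920, "y": 0, "width": 1920, "height": 1080},
]

if __name__ == "__main__":
    print(select_display(SAMPLE_DISPLAYS, 1)[1])
    print(select_display(SAMPLE_DISPLAYS, 7)[1])
```

Because an invalid index silently resolves to display 0, a client that cares which monitor it is driving should check the `fallback` flag in the returned `screen` metadata rather than assume its request was honored.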
@@ -47,14 +67,21 @@ Body:
 
 Query params:
 
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 
-Default response returns cropped image + region metadata in global pixel coordinates.
+Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 
 ## `POST /action`
 
 Body: one action.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector included in the response metadata; invalid values fall back to `0`
+
+Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
+
 ### Pointer target modes
 
 #### Pixel target
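Reviewer sketch of the coordinate-base rule stated in the hunk above: a pixel position measured inside a captured display image is shifted by that display's `meta.region` origin to get the global `center_x` / `center_y` for `/zoom` (or a pixel target for `/action`). `to_global` is a hypothetical helper, and the sketch assumes image pixels map 1:1 to desktop pixels, with no HiDPI scaling.

```python
def to_global(region: dict, local_x: int, local_y: int) -> tuple[int, int]:
    """Convert a pixel offset inside a captured display image to global desktop coordinates."""
    inside = 0 <= local_x < region["width"] and 0 <= local_y < region["height"]
    if not inside:
        raise ValueError("point lies outside the captured region")
    return region["x"] + local_x, region["y"] + local_y


if __name__ == "__main__":
    # meta.region for a second display placed to the right of the primary one.
    region = {"x": 1920, "y": 0, "width": 1920, "height": 1080}
    center_x, center_y = to_global(region, 100, 50)
    # Only the two coordinate fields are shown; other zoom body fields are omitted.
    print({"center_x": center_x, "center_y": center_y})
```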
@@ -147,6 +174,10 @@ Hotkey:
 
 Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
+
 Body:
 
 ```json
@@ -158,7 +189,7 @@ Body:
 ```
 
 Modes:
-- `screen` (default): OCR over full captured monitor
+- `screen` (default): OCR over full selected monitor
 - `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
 - `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
 
@@ -246,6 +277,10 @@ Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution
 
 Runs multiple `action` payloads sequentially.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector applied to each action response; invalid values fall back to `0`
+
 ```json
 {
   "actions": [
@@ -1,6 +1,8 @@
 # Coordinate System
 
-All interactions ultimately execute in **global pixel coordinates** of the primary monitor.
+All interactions ultimately execute in **global desktop pixel coordinates**.
 
+Use `GET /displays` to list available displays. Visual endpoints accept `?screen=X` where `X` is a zero-based display index. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend. Invalid screen values fall back to `0`.
+
 ## Regions
 
@@ -12,6 +14,12 @@ Visual endpoints return a `region` object:
 
 This describes where the image sits in global desktop space.
 
+For a second display to the right of the primary display, `GET /screen?screen=1` might return:
+
+```json
+{"x": 1920, "y": 0, "width": 1920, "height": 1080}
+```
+
 ## Grid indexing
 
 - Rows/cols are **zero-based**
@@ -35,7 +43,7 @@ Interpretation:
 
 ## Recommended agent loop
 
-1. Capture `/screen` with coarse grid
+1. Capture `/screen?screen=0` with coarse grid, or choose another display with `/screen?screen=1`
 2. Find candidate cell
 3. If uncertain, use `/zoom` around candidate
 4. Convert target to grid action
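Reviewer sketch tying the region and grid-indexing notes above together: given a display's `region` and a zero-based `(row, col)`, estimate the cell center in global desktop pixels. `cell_center` is a hypothetical client-side helper. The server's own grid resolution is not shown in this diff and may round or offset differently, so treat this as an approximation for sanity checks rather than a replacement for grid actions.

```python
def cell_center(region: dict, rows: int, cols: int, row: int, col: int) -> tuple[int, int]:
    """Estimate the global pixel at the center of a zero-based grid cell."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("row/col outside the grid")
    cell_w = region["width"] / cols
    cell_h = region["height"] / rows
    # Offset by the region origin so the result is in global desktop space.
    x = region["x"] + int((col + 0.5) * cell_w)
    y = region["y"] + int((row + 0.5) * cell_h)
    return x, y


if __name__ == "__main__":
    second_display = {"x": 1920, "y": 0, "width": 1920, "height": 1080}
    print(cell_center(second_display, rows=12, cols=12, row=0, col=0))
    print(cell_center(second_display, rows=12, cols=12, row=11, col=11))
```

With a 12x12 grid on a 1920x1080 display each cell is 160x90 pixels, so the top-left cell of the second display centers at x = 1920 + 80.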
@@ -5,6 +5,7 @@ import requests
 
 BASE_URL = os.getenv("CLICKTHROUGH_URL", "http://127.0.0.1:8123")
 TOKEN = os.getenv("CLICKTHROUGH_TOKEN", "")
+SCREEN = int(os.getenv("CLICKTHROUGH_SCREEN", "0"))
 
 headers = {}
 if TOKEN:
@@ -16,10 +17,14 @@ def main():
     r.raise_for_status()
     print("health:", r.json())
 
+    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
+    d.raise_for_status()
+    print("displays:", d.json().get("displays", []))
+
     s = requests.get(
         f"{BASE_URL}/screen",
         headers=headers,
-        params={"with_grid": True, "grid_rows": 12, "grid_cols": 12},
+        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
         timeout=30,
     )
     s.raise_for_status()
110
server/app.py
@@ -192,13 +192,73 @@ def _import_capture_libs():
         raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
 
 
-def _capture_screen():
+def _display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
+    return {
+        "screen": screen,
+        "mss_index": mss_index,
+        "primary": primary,
+        "x": mon["left"],
+        "y": mon["top"],
+        "width": mon["width"],
+        "height": mon["height"],
+    }
+
+
+def _ordered_displays(sct) -> list[dict]:
+    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
+    if not raw_monitors:
+        raise HTTPException(status_code=500, detail="no displays detected")
+
+    primary_pos = next(
+        (idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0),
+        0,
+    )
+    ordered = [raw_monitors[primary_pos]] + [
+        item for idx, item in enumerate(raw_monitors) if idx != primary_pos
+    ]
+    return [
+        _display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0))
+        for index, (mss_index, mon) in enumerate(ordered)
+    ]
+
+
+def _get_displays() -> list[dict]:
+    _, _, mss = _import_capture_libs()
+    with mss.mss() as sct:
+        return _ordered_displays(sct)
+
+
+def _select_display(screen: int) -> tuple[dict, list[dict], dict]:
+    displays = _get_displays()
+    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
+    selection = {
+        "requested": screen,
+        "selected": selected["screen"],
+        "fallback": selected["screen"] != screen,
+    }
+    return selected, displays, selection
+
+
+def _capture_screen(screen: int = 0):
     Image, _, mss = _import_capture_libs()
     with mss.mss() as sct:
-        mon = sct.monitors[1]
-        shot = sct.grab(mon)
+        displays = _ordered_displays(sct)
+        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
+        shot = sct.grab(
+            {
+                "left": mon["x"],
+                "top": mon["y"],
+                "width": mon["width"],
+                "height": mon["height"],
+            }
+        )
         image = Image.frombytes("RGB", shot.size, shot.rgb)
-    return image, {"x": mon["left"], "y": mon["top"], "width": mon["width"], "height": mon["height"]}
+    selection = {
+        "requested": screen,
+        "selected": mon["screen"],
+        "fallback": mon["screen"] != screen,
+    }
+    return image, mon, displays, selection
 
 
 def _serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
@@ -503,8 +563,9 @@ def _exec_command(req: ExecRequest) -> dict:
     }
 
 
-def _exec_action(req: ActionRequest) -> dict:
+def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
     run_dry = SETTINGS["dry_run"] or req.dry_run
+    selected_display, displays, screen_selection = _select_display(screen)
 
     pyautogui = None if run_dry else _import_input_lib()
     resolved_target = None
@@ -561,6 +622,8 @@ def _exec_action(req: ActionRequest) -> dict:
         "action": req.action,
         "executed": not run_dry,
         "dry_run": run_dry,
+        "screen": screen_selection,
+        "display": selected_display,
         "resolved_target": resolved_target,
     }
 
@@ -585,6 +648,18 @@ def health(_: None = Depends(_auth)):
     }
 
 
+@app.get("/displays")
+def displays(_: None = Depends(_auth)):
+    detected = _get_displays()
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "displays": detected,
+        "default_screen": 0,
+    }
+
+
 @app.get("/screen")
 def screen(
     with_grid: bool = True,
@@ -594,6 +669,7 @@ def screen(
     image_format: Literal["png", "jpeg"] = "png",
     jpeg_quality: int = 85,
     asImage: bool = False,
+    screen: int = 0,
     _: None = Depends(_auth),
 ):
     req = ScreenRequest(
@@ -605,8 +681,8 @@ def screen(
         jpeg_quality=jpeg_quality,
     )
 
-    base_img, mon = _capture_screen()
-    meta = {"region": mon}
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
+    meta = {"region": mon, "screen": screen_selection, "displays": displays}
     out_img = base_img
 
     if req.with_grid:
@@ -634,8 +710,8 @@ def screen(
 
 
 @app.post("/zoom")
-def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
-    base_img, mon = _capture_screen()
+def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
 
     cx = req.center_x - mon["x"]
     cy = req.center_y - mon["y"]
@@ -655,6 +731,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 
     meta = {
         "source_monitor": mon,
+        "screen": screen_selection,
+        "displays": displays,
         "region": {
             "x": region_x,
             "y": region_y,
@@ -690,8 +768,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 
 
 @app.post("/action")
-def action(req: ActionRequest, _: None = Depends(_auth)):
-    result = _exec_action(req)
+def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
+    result = _exec_action(req, screen)
     return {
         "ok": True,
         "request_id": _request_id(),
@@ -722,14 +800,14 @@ def exec_command(
 
 
 @app.post("/ocr")
-def ocr(req: OCRRequest, _: None = Depends(_auth)):
+def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
     source = req.mode
     if source == "image":
         image = _decode_image_base64(req.image_base64 or "")
         region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
         blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
     else:
-        base_img, mon = _capture_screen()
+        base_img, mon, displays, screen_selection = _capture_screen(screen)
         if source == "screen":
             image = base_img
             region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
@@ -762,6 +840,8 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
         "time_ms": _now_ms(),
         "result": {
             "mode": source,
+            "screen": screen_selection if source != "image" else None,
+            "display": mon if source != "image" else None,
             "language_hint": req.language_hint,
             "min_confidence": req.min_confidence,
             "region": region,
@@ -771,11 +851,11 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
 
 
 @app.post("/batch")
-def batch(req: BatchRequest, _: None = Depends(_auth)):
+def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
     results = []
     for index, item in enumerate(req.actions):
         try:
-            item_result = _exec_action(item)
+            item_result = _exec_action(item, screen)
             results.append({"index": index, "ok": True, "result": item_result})
         except Exception as exc:
             results.append({"index": index, "ok": False, "error": str(exc)})
@@ -33,13 +33,20 @@ The agent should not assume it can self-install this stack.
 ## Mini API map
 
 - `GET /health` → server status + safety flags
-- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
-- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
+- `GET /displays` → detected displays in zero-based API order
+- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
+- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
 - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
-- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
-- `POST /batch` → sequential action list
+- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
+- `POST /batch?screen=0` → sequential action list
 - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
 
+### Display selection
+
+- Use `GET /displays` before operating on multi-monitor systems.
+- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
+- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
+
 ### OCR usage
 
 - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
@@ -55,7 +62,7 @@ The agent should not assume it can self-install this stack.
 
 ## Core workflow (mandatory)
 
-1. Call `GET /screen` with coarse grid (e.g., 12x12).
+1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or another selected display.
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).
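Reviewer sketch for the "Display selection" note above that regions and OCR boxes are global, not screen-local: before acting on a global point, an agent can check which display from `GET /displays` actually contains it. `display_for_point` is a hypothetical helper, and the sample list copies the shape of the `/displays` example in `docs/API.md`.

```python
from typing import Optional


def display_for_point(displays: list[dict], x: int, y: int) -> Optional[dict]:
    """Return the display whose global rectangle contains (x, y), or None if off-desktop."""
    for display in displays:
        in_x = display["x"] <= x < display["x"] + display["width"]
        in_y = display["y"] <= y < display["y"] + display["height"]
        if in_x and in_y:
            return display
    return None


DISPLAYS = [
    {"screen": 0, "x": 0, "y": 0, "width": 1920, "height": 1080},
    {"screen": 1, "x": 1920, "y": 0, "width": 1920, "height": 1080},
]

if __name__ == "__main__":
    hit = display_for_point(DISPLAYS, 2000, 45)
    print(hit["screen"] if hit else "point is outside every display")
```

A mismatch between the display found here and the `screen` value sent with the request is a cheap signal that coordinates were treated as screen-local somewhere upstream.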