feat(window): add window lifecycle and launch endpoints
All checks were successful
python-syntax / syntax-check (push) Successful in 28s

This commit is contained in:
2026-05-01 15:52:02 +02:00
parent 1429e90be2
commit 493e5499e8
4 changed files with 382 additions and 2 deletions

View File

@@ -8,6 +8,8 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported) - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ... - **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
- **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
- **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr` - **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec` - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels - **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
@@ -41,7 +43,7 @@ For OCR support, install the native `tesseract` binary on the host (in addition
Important: Important:
- `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields. - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.
- Pixel coordinates and OCR bounding boxes are always global desktop coordinates. - Pixel coordinates and OCR bounding boxes are always global desktop coordinates.
- Prefer structured GUI interaction first; use `/exec` for launch, recovery, or explicit system-level tasks. - Prefer structured GUI interaction first; use `/windows`, `/launch`, and `/action` before reaching for `/exec`.
See: See:
- `docs/API.md` - `docs/API.md`
@@ -67,6 +69,8 @@ Environment variables:
- `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`) - `CLICKTHROUGH_EXEC_MAX_OUTPUT_CHARS` (default `20000`)
- `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable) - `CLICKTHROUGH_TESSERACT_CMD` (optional path to the `tesseract` executable)
Window management endpoints currently target Windows hosts. On non-Windows hosts they return `501` instead of guessing.
## Gitea CI ## Gitea CI
A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`. A Gitea Actions workflow is included at `.gitea/workflows/python-syntax.yml`.

View File

@@ -194,6 +194,89 @@ Move only:
} }
``` ```
## `GET /windows`
List desktop windows using structured filters instead of shelling out.
Query params:
- `title_contains` (optional substring match)
- `title_regex` (optional case-insensitive regex)
- `process_name` (optional exact process name, e.g. `explorer.exe`)
- `hwnd` (optional exact window handle)
- `visible_only` (bool, default `true`)
```json
{
"ok": true,
"count": 1,
"windows": [
{
"hwnd": 132640,
"title": "WinDirStat",
"class_name": "WinDirStatMainWindow",
"pid": 18420,
"process_name": "windirstat.exe",
"visible": true,
"enabled": true,
"minimized": false,
"maximized": false,
"foreground": true,
"rect": {"x": 194, "y": 116, "width": 1532, "height": 870}
}
]
}
```
Notes:
- Currently supported on Windows hosts only.
- Returns `409` for ambiguous write-target matches when a mutation endpoint would affect multiple windows.
## `POST /windows/action`
Perform a structured window action against exactly one matched window.
```json
{
"action": "focus",
"title_contains": "WinDirStat",
"visible_only": true,
"timeout_ms": 3000
}
```
Supported actions:
- `focus`
- `restore`
- `minimize`
- `maximize`
- `close`
The response includes the matched pre-action window and the final observed window state (or `closed=true` if it disappeared).
## `POST /launch`
Start an app/process without invoking a shell.
```json
{
"executable": "C:/Program Files/WinDirStat/WinDirStat.exe",
"args": [],
"cwd": "C:/Program Files/WinDirStat",
"wait_for_window": true,
"match": {
"title_contains": "WinDirStat",
"visible_only": true
},
"timeout_ms": 8000
}
```
Notes:
- Launch uses direct process execution (`subprocess.Popen`) rather than PowerShell/CMD.
- If `wait_for_window=true`, the server polls for a matching window and returns `window_found`.
- `dry_run=true` returns the resolved argv/cwd without launching.
## `POST /ocr` ## `POST /ocr`
Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes. Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.

View File

@@ -1,8 +1,10 @@
import base64 import base64
import ctypes
import hmac import hmac
import io import io
import os import os
import subprocess import subprocess
import sys
import time import time
import uuid import uuid
from typing import Literal, Optional from typing import Literal, Optional
@@ -168,6 +170,31 @@ class OCRRequest(BaseModel):
return self return self
class WindowQuery(BaseModel):
title_contains: str | None = Field(default=None, max_length=512)
title_regex: str | None = Field(default=None, max_length=512)
process_name: str | None = Field(default=None, max_length=260)
hwnd: int | None = Field(default=None, ge=1)
visible_only: bool = True
class WindowActionRequest(WindowQuery):
action: Literal["focus", "restore", "minimize", "maximize", "close"]
timeout_ms: int = Field(default=3000, ge=0, le=60000)
class LaunchRequest(BaseModel):
executable: str = Field(min_length=1, max_length=2048)
args: list[str] = Field(default_factory=list, max_length=100)
cwd: str | None = None
wait_for_window: bool = False
match: WindowQuery | None = None
timeout_ms: int = Field(default=5000, ge=0, le=120000)
dry_run: bool = False
def _auth(x_clickthrough_token: Optional[str] = Header(default=None)): def _auth(x_clickthrough_token: Optional[str] = Header(default=None)):
token = SETTINGS["token"] token = SETTINGS["token"]
if token and x_clickthrough_token != token: if token and x_clickthrough_token != token:
@@ -456,6 +483,221 @@ def _run_ocr(image, language_hint: str | None, min_confidence: float, offset_x:
return blocks return blocks
def _windows_only(feature: str):
if sys.platform != "win32":
raise HTTPException(status_code=501, detail=f"{feature} is currently supported on Windows hosts only")
def _tasklist_process_name(pid: int) -> str | None:
try:
completed = subprocess.run(
["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"],
capture_output=True,
text=True,
timeout=5,
check=False,
)
except Exception:
return None
line = (completed.stdout or "").strip().splitlines()
if not line:
return None
row = line[0].strip()
if not row or row.startswith("INFO:"):
return None
if row.startswith('"') and '","' in row:
return row.split('","', 1)[0].strip('"')
return None
def _list_windows(query: WindowQuery | None = None) -> list[dict]:
_windows_only("window endpoints")
user32 = ctypes.windll.user32
user32.EnumWindows.restype = ctypes.c_bool
user32.EnumWindows.argtypes = [ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p), ctypes.c_void_p]
user32.IsWindowVisible.argtypes = [ctypes.c_void_p]
user32.IsWindowVisible.restype = ctypes.c_bool
user32.IsWindowEnabled.argtypes = [ctypes.c_void_p]
user32.IsWindowEnabled.restype = ctypes.c_bool
user32.IsIconic.argtypes = [ctypes.c_void_p]
user32.IsIconic.restype = ctypes.c_bool
user32.IsZoomed.argtypes = [ctypes.c_void_p]
user32.IsZoomed.restype = ctypes.c_bool
user32.GetWindowTextLengthW.argtypes = [ctypes.c_void_p]
user32.GetWindowTextLengthW.restype = ctypes.c_int
user32.GetWindowTextW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
user32.GetClassNameW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_int]
user32.GetClassNameW.restype = ctypes.c_int
user32.GetForegroundWindow.restype = ctypes.c_void_p
user32.GetWindowRect.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.wintypes.RECT)]
foreground = int(user32.GetForegroundWindow() or 0)
title_regex = re.compile(query.title_regex, re.IGNORECASE) if query and query.title_regex else None
windows: list[dict] = []
enum_proc = ctypes.WINFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)
def _callback(hwnd, _lparam):
hwnd_int = int(hwnd)
if query and query.hwnd is not None and hwnd_int != query.hwnd:
return True
title_len = user32.GetWindowTextLengthW(hwnd)
title_buf = ctypes.create_unicode_buffer(max(title_len + 1, 1))
user32.GetWindowTextW(hwnd, title_buf, len(title_buf))
title = title_buf.value
visible = bool(user32.IsWindowVisible(hwnd))
if query and query.visible_only and not visible:
return True
class_buf = ctypes.create_unicode_buffer(256)
user32.GetClassNameW(hwnd, class_buf, len(class_buf))
pid = ctypes.wintypes.DWORD()
user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
process_name = _tasklist_process_name(int(pid.value))
rect = ctypes.wintypes.RECT()
user32.GetWindowRect(hwnd, ctypes.byref(rect))
window = {
"hwnd": hwnd_int,
"title": title,
"class_name": class_buf.value,
"pid": int(pid.value),
"process_name": process_name,
"visible": visible,
"enabled": bool(user32.IsWindowEnabled(hwnd)),
"minimized": bool(user32.IsIconic(hwnd)),
"maximized": bool(user32.IsZoomed(hwnd)),
"foreground": hwnd_int == foreground,
"rect": {
"x": int(rect.left),
"y": int(rect.top),
"width": int(rect.right - rect.left),
"height": int(rect.bottom - rect.top),
},
}
if query:
if query.title_contains and query.title_contains.lower() not in title.lower():
return True
if title_regex and not title_regex.search(title):
return True
if query.process_name and (process_name or "").lower() != query.process_name.lower():
return True
windows.append(window)
return True
user32.EnumWindows(enum_proc(_callback), 0)
windows.sort(key=lambda item: (not item["foreground"], item["title"].lower(), item["hwnd"]))
return windows
def _require_window_match(query: WindowQuery) -> dict:
matches = _list_windows(query)
if not matches:
raise HTTPException(status_code=404, detail="no matching window found")
if len(matches) > 1 and query.hwnd is None:
raise HTTPException(
status_code=409,
detail={"message": "multiple windows matched", "matches": matches[:10]},
)
return matches[0]
def _apply_window_action(req: WindowActionRequest) -> dict:
_windows_only("window endpoints")
match = _require_window_match(req)
hwnd = match["hwnd"]
user32 = ctypes.windll.user32
WM_CLOSE = 0x0010
SW_RESTORE = 9
SW_MINIMIZE = 6
SW_MAXIMIZE = 3
if req.action in {"focus", "restore"}:
user32.ShowWindow(hwnd, SW_RESTORE)
ok = bool(user32.SetForegroundWindow(hwnd))
elif req.action == "minimize":
ok = bool(user32.ShowWindow(hwnd, SW_MINIMIZE))
elif req.action == "maximize":
ok = bool(user32.ShowWindow(hwnd, SW_MAXIMIZE))
elif req.action == "close":
ok = bool(user32.PostMessageW(hwnd, WM_CLOSE, 0, 0))
else:
raise HTTPException(status_code=400, detail="unsupported window action")
deadline = time.time() + (req.timeout_ms / 1000.0)
final_match = None
while time.time() <= deadline:
current = _list_windows(WindowQuery(hwnd=hwnd, visible_only=False))
final_match = current[0] if current else None
if req.action == "close" and final_match is None:
break
if req.action in {"focus", "restore"} and final_match and final_match["foreground"] and not final_match["minimized"]:
break
if req.action == "minimize" and final_match and final_match["minimized"]:
break
if req.action == "maximize" and final_match and final_match["maximized"]:
break
time.sleep(0.1)
return {
"ok": ok,
"matched": match,
"window": final_match,
"closed": final_match is None,
}
def _launch_app(req: LaunchRequest) -> dict:
if req.cwd:
cwd = os.path.abspath(req.cwd)
if not os.path.isdir(cwd):
raise HTTPException(status_code=400, detail="cwd does not exist or is not a directory")
else:
cwd = None
argv = [req.executable, *req.args]
if SETTINGS["dry_run"] or req.dry_run:
return {"executed": False, "dry_run": True, "argv": argv, "cwd": cwd}
try:
proc = subprocess.Popen(argv, cwd=cwd)
except FileNotFoundError as exc:
raise HTTPException(status_code=400, detail=f"executable not found: {exc}") from exc
except OSError as exc:
raise HTTPException(status_code=400, detail=f"failed to launch process: {exc}") from exc
result = {
"executed": True,
"dry_run": False,
"argv": argv,
"cwd": cwd,
"pid": proc.pid,
}
if req.wait_for_window:
query = req.match or WindowQuery(process_name=os.path.basename(req.executable), visible_only=True)
deadline = time.time() + (req.timeout_ms / 1000.0)
match = None
while time.time() <= deadline:
matches = _list_windows(query)
if matches:
match = matches[0]
break
time.sleep(0.2)
result["window"] = match
result["window_found"] = match is not None
return result
def _pick_shell(explicit_shell: str | None) -> str: def _pick_shell(explicit_shell: str | None) -> str:
shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip() shell_name = (explicit_shell or SETTINGS["exec_default_shell"] or "powershell").lower().strip()
if shell_name not in {"powershell", "bash", "cmd"}: if shell_name not in {"powershell", "bash", "cmd"}:
@@ -799,6 +1041,54 @@ def exec_command(
} }
@app.get("/windows")
def windows(
title_contains: str | None = None,
title_regex: str | None = None,
process_name: str | None = None,
hwnd: int | None = None,
visible_only: bool = True,
_: None = Depends(_auth),
):
query = WindowQuery(
title_contains=title_contains,
title_regex=title_regex,
process_name=process_name,
hwnd=hwnd,
visible_only=visible_only,
)
matches = _list_windows(query)
return {
"ok": True,
"request_id": _request_id(),
"time_ms": _now_ms(),
"windows": matches,
"count": len(matches),
}
@app.post("/windows/action")
def window_action(req: WindowActionRequest, _: None = Depends(_auth)):
result = _apply_window_action(req)
return {
"ok": True,
"request_id": _request_id(),
"time_ms": _now_ms(),
"result": result,
}
@app.post("/launch")
def launch(req: LaunchRequest, _: None = Depends(_auth)):
result = _launch_app(req)
return {
"ok": True,
"request_id": _request_id(),
"time_ms": _now_ms(),
"result": result,
}
@app.post("/ocr") @app.post("/ocr")
def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)): def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
source = req.mode source = req.mode

View File

@@ -36,6 +36,9 @@ The agent should not assume it can self-install this stack.
- `GET /displays` → detected displays in zero-based API order - `GET /displays` → detected displays in zero-based API order
- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`) - `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`) - `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
- `GET /windows` → discover visible desktop windows and their handles/processes
- `POST /windows/action` → focus/restore/minimize/maximize/close a matched window
- `POST /launch` → start an app/process without dropping to a shell
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...) - `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
- `POST /batch?screen=0` → sequential action list - `POST /batch?screen=0` → sequential action list
@@ -123,11 +126,11 @@ Prefer structured GUI control first:
- `/action` or `/batch` to interact - `/action` or `/batch` to interact
Use `/exec` only when it is the cleanest available tool for the job, for example: Use `/exec` only when it is the cleanest available tool for the job, for example:
- launching an app that is not already visible
- querying machine state that the GUI does not expose well - querying machine state that the GUI does not expose well
- performing an explicit user-requested shell/system task - performing an explicit user-requested shell/system task
- recovering from a blocked GUI flow when normal interaction failed - recovering from a blocked GUI flow when normal interaction failed
Prefer `GET /windows`, `POST /windows/action`, and `POST /launch` for app lifecycle tasks before falling back to `/exec`.
Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly. Avoid using `/exec` for routine in-app clicks, menu navigation, or text entry when the GUI can be driven directly.
## Core workflow (mandatory) ## Core workflow (mandatory)