feat(ocr): add /ocr endpoint for screen, region, and image input
All checks were successful
python-syntax / syntax-check (push) Successful in 5s
python-syntax / syntax-check (pull_request) Successful in 4s

2026-04-06 13:48:33 +02:00
parent 2955426f14
commit 097c6a095c
4 changed files with 240 additions and 0 deletions

@@ -7,6 +7,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
- **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
- **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
- **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
- **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
- **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
- **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
@@ -23,6 +24,8 @@ CLICKTHROUGH_TOKEN=change-me python -m server.app
Server defaults to `127.0.0.1:8123`.
For OCR support, install the native `tesseract` binary on the host (in addition to Python deps).
`python-dotenv` is enabled, so values from a repo-root `.env` file are loaded automatically.
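Because OCR depends on a native binary rather than a Python package, a quick pre-flight check can save a confusing first request. The sketch below is not part of this commit; the helper name `find_tesseract` is hypothetical, and it only assumes the binary is installed under its usual name `tesseract` and is reachable via `PATH`. Whether the server performs a similar check itself is not shown in this diff.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the absolute path of the OCR binary, or None if it is not on PATH.

    Hypothetical helper for illustration; not taken from the server code.
    """
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        # The binary is missing, so OCR requests are unlikely to work yet.
        print("tesseract not found on PATH; install it before using POST /ocr")
    else:
        print(f"tesseract found at {path}")
```

Running this once on the host before starting the server confirms that the native dependency is visible to the Python process.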
## Minimal API flow