feat(ocr): add higher-level text search helpers

2026-05-01 16:23:16 +02:00
parent 8857feaf7b
commit f00c525721
4 changed files with 190 additions and 35 deletions
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 - **Window lifecycle endpoints**: list/focus/restore/minimize/maximize/close windows via `GET /windows` + `POST /windows/action`
 - **Structured launch endpoint**: start an app/process without dropping to a shell via `POST /launch`
 - **Wait/sync endpoint**: poll for text, window, or visual state changes via `POST /wait`
- **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
+- **OCR endpoints**: extract text blocks or search for matching text via `POST /ocr` and `POST /ocr/find`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
 - **Coordinate transform metadata** in visual responses so agents can map grid cells to real pixels
 - **Safety knobs**: token auth, dry-run mode, optional allowed-region restriction
@@ -39,7 +39,7 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 3. Decide cell / target
 4. Optional `POST /zoom?screen=0` for finer targeting
 5. `POST /action?screen=0` to execute
-6. `GET /screen?screen=0` again to verify result
+6. `GET /screen?screen=0` again to verify result, or use `POST /ocr/find` when you need explicit text matching

 Important:
 - `POST /action` expects an `action` plus a `target` object; do not send raw top-level `x` / `y` fields.