feat(ocr): add /ocr endpoint for on-screen text extraction #1

luna · 2026-04-05T20:16:08+02:00

Summary

Add an OCR endpoint so agents can read visible UI text directly from screenshots.

Why

Grid + click works, but text-driven UIs are much more reliable when the agent can read labels/buttons/menus.

Scope

- Add the `/ocr` endpoint
- Input options: full screen, region crop, or provided image bytes
- Output: text blocks with bounding boxes + confidence
- Optional language hint parameter
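
To make the scope concrete, here is a rough sketch of what the request and response shapes could look like. Nothing here is decided: the names `Region`, `OcrRequest`, `TextBlock`, the field names, the `"eng"` language code, and the 0.0 to 1.0 confidence range are all placeholders, and treating the three input options as mutually exclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Pixel rectangle (hypothetical field names)."""
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class OcrRequest:
    """Hypothetical /ocr request; assumes the input options are exclusive."""
    region: Optional[Region] = None   # set for a region crop
    image: Optional[bytes] = None     # set for provided image bytes
    lang: Optional[str] = None        # optional language hint, e.g. "eng"

    def mode(self) -> str:
        # Reject ambiguous requests instead of guessing which input wins.
        if self.region is not None and self.image is not None:
            raise ValueError("pass either region or image, not both")
        if self.image is not None:
            return "image"
        if self.region is not None:
            return "region"
        return "screen"  # neither set: capture the full screen


@dataclass(frozen=True)
class TextBlock:
    """One recognized block of text in the response."""
    text: str
    box: Region
    confidence: float  # assumed range: 0.0 to 1.0
```

With this shape, `OcrRequest()` means a full-screen capture, and `OcrRequest(region=Region(0, 0, 400, 200), lang="eng")` means a cropped read with a language hint.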

Acceptance criteria

- Returns deterministic JSON
- Works on Windows desktop screenshots
- Documented in API.md with examples
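
One possible reading of "deterministic JSON" is that the same set of recognized blocks always serializes to the same bytes. The sketch below assumes that reading; the helper name `to_deterministic_json`, the `"blocks"` wrapper key, the top-to-bottom ordering, and the 4-decimal rounding are all illustrative choices, not requirements from this issue. It only covers serialization; whether the OCR engine itself returns stable results is a separate question.

```python
import json


def to_deterministic_json(blocks):
    """Serialize OCR blocks so equal inputs give byte-identical output.

    Each block is assumed to be a dict with "text", "confidence", and a
    "box" dict holding "x", "y", "width", "height" (hypothetical schema).
    """
    # Fixed reading order: top to bottom, then left to right, then by text.
    ordered = sorted(
        blocks,
        key=lambda b: (b["box"]["y"], b["box"]["x"], b["text"]),
    )
    # Round confidence so tiny float noise does not change the output.
    normalized = [
        {**b, "confidence": round(b["confidence"], 4)} for b in ordered
    ]
    # sort_keys and fixed separators remove key-order and whitespace variance.
    return json.dumps(
        {"blocks": normalized},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
```

Passing the same blocks in a different order should then produce an identical string, which makes the output easy to diff in tests.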

luna referenced this issue

2026-04-05 20:18:18 +02:00

feat(exec): add low-friction shell execution endpoint #6

~~luna referenced this issue 2026-04-06 13:48:48 +02:00~~

feat(ocr): add /ocr endpoint for text extraction #7

luna referenced a pull request that will close this issue

2026-04-06 13:49:08 +02:00

feat(ocr): add /ocr endpoint for text extraction #7

space referenced a pull request that will close this issue

2026-04-06 13:52:08 +02:00

feat(ocr): add /ocr endpoint for text extraction #7

luna closed this issue