feat(ocr): add /ocr endpoint for on-screen text extraction #1

Closed
opened 2026-04-05 20:16:08 +02:00 by luna · 0 comments
Collaborator

Summary

Add an OCR endpoint so agents can read visible UI text directly from screenshots.

Why

Grid + click works, but agents handle text-driven UIs much more reliably when they can read labels, buttons, and menus.

Scope

  • Add the /ocr endpoint
  • Input options: full screen, region crop, or provided image bytes
  • Output: text blocks with bounding boxes + confidence
  • Optional language hint parameter
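To make the scope concrete, here is a minimal sketch of what the request and response shapes could look like. Every name in it (`OcrRequest`, `Region`, `TextBlock`, the `mode` values, `image_base64`, `language`) is a placeholder assumed for illustration, not a decided API.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# All names below are hypothetical; the issue does not fix a schema.


@dataclass(frozen=True)
class Region:
    """Pixel rectangle, origin at the top-left of the screen or image."""
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class OcrRequest:
    mode: str = "screen"                 # "screen", "region", or "image"
    region: Optional[Region] = None      # required when mode == "region"
    image_base64: Optional[str] = None   # required when mode == "image"
    language: Optional[str] = None       # optional language hint, e.g. "en"

    def validate(self) -> None:
        """Raise ValueError if the fields do not match the chosen mode."""
        if self.mode not in ("screen", "region", "image"):
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.mode == "region":
            if self.region is None:
                raise ValueError("mode 'region' requires a region")
            if self.region.width <= 0 or self.region.height <= 0:
                raise ValueError("region must have positive width and height")
        if self.mode == "image" and not self.image_base64:
            raise ValueError("mode 'image' requires image_base64")


@dataclass(frozen=True)
class TextBlock:
    """One recognized piece of text with its location and confidence."""
    text: str
    bbox: Region
    confidence: float  # assumed to be in the range 0.0 to 1.0


# Example: a region request and one block as it might appear in a response.
request = OcrRequest(mode="region", region=Region(0, 0, 400, 120), language="en")
request.validate()
block = TextBlock(text="Save", bbox=Region(32, 48, 60, 18), confidence=0.97)
print(asdict(block))
```

Keeping the three input options behind a single `mode` field is one possible design; separate endpoints or mutually exclusive fields would work as well.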

Acceptance criteria

  • Returns deterministic JSON (the same input image yields the same output)
  • Works on Windows desktop screenshots
  • Documented in API.md with examples
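One way the "deterministic JSON" criterion could be met is sketched below: sort blocks in a fixed reading order, round confidences, and serialize with sorted keys. The function name `to_deterministic_json`, the field names, and the three-decimal rounding are assumptions for illustration; the OCR engine itself would also need to be deterministic for this to hold end to end.

```python
import json
from typing import Any, Dict, List


def to_deterministic_json(blocks: List[Dict[str, Any]]) -> str:
    """Serialize OCR blocks so equal inputs always produce equal strings.

    Hypothetical helper: field names and rounding precision are assumptions.
    """
    # Fixed ordering: top-to-bottom, then left-to-right, then by text as a tiebreaker.
    ordered = sorted(
        blocks,
        key=lambda b: (b["bbox"]["y"], b["bbox"]["x"], b["text"]),
    )
    normalized = [
        {
            "text": b["text"],
            "bbox": {k: int(b["bbox"][k]) for k in ("x", "y", "width", "height")},
            # Rounding hides tiny floating-point differences between runs.
            "confidence": round(float(b["confidence"]), 3),
        }
        for b in ordered
    ]
    # sort_keys and fixed separators remove dict-order and whitespace variation.
    return json.dumps(
        {"blocks": normalized},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )


blocks = [
    {"text": "Cancel", "bbox": {"x": 120, "y": 48, "width": 70, "height": 18}, "confidence": 0.9312},
    {"text": "Save", "bbox": {"x": 32, "y": 48, "width": 60, "height": 18}, "confidence": 0.9744},
]
# The same blocks in a different order serialize to the same string.
assert to_deterministic_json(blocks) == to_deterministic_json(list(reversed(blocks)))
print(to_deterministic_json(blocks))
```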
luna closed this issue 2026-04-06 13:53:01 +02:00
Reference: space/clickthrough#1