feat(ocr): add /ocr endpoint for screen, region, and image input

2026-04-06 13:48:33 +02:00
parent 2955426f14
commit 097c6a095c
4 changed files with 240 additions and 0 deletions
--- a/docs/API.md
+++ b/docs/API.md
@@ -143,6 +143,77 @@ Hotkey:
 }
 ```

+## `POST /ocr`
+
+Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
+
+Body:
+
+```json
+{
+  "mode": "screen",
+  "language_hint": "eng",
+  "min_confidence": 0.4
+}
+```
+
+Modes:
+- `screen` (default): OCR over full captured monitor
+- `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
+- `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
+
+Region mode example:
+
+```json
+{
+  "mode": "region",
+  "region_x": 220,
+  "region_y": 160,
+  "region_width": 900,
+  "region_height": 400,
+  "language_hint": "eng",
+  "min_confidence": 0.5
+}
+```
+
+Image mode example:
+
+```json
+{
+  "mode": "image",
+  "image_base64": "iVBORw0KGgoAAAANSUhEUgAA...",
+  "language_hint": "eng"
+}
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "request_id": "...",
+  "time_ms": 1710000000000,
+  "result": {
+    "mode": "screen",
+    "language_hint": "eng",
+    "min_confidence": 0.4,
+    "region": {"x": 0, "y": 0, "width": 1920, "height": 1080},
+    "blocks": [
+      {
+        "text": "Settings",
+        "confidence": 0.9821,
+        "bbox": {"x": 144, "y": 92, "width": 96, "height": 21}
+      }
+    ]
+  }
+}
+```
+
+Notes:
+- Output is deterministic JSON (stable ordering by top-to-bottom, then left-to-right).
+- `bbox` coordinates are in global screen space for `screen`/`region`, and image-local for `image`.
+- Requires `tesseract` executable plus Python package `pytesseract`.
+
 ## `POST /exec`

 Execute a shell command on the host running Clickthrough.