docs(skill): include OCR endpoint workflow guidance
This commit is contained in:
@@ -1,6 +1,6 @@
|
||||
---
|
||||
name: clickthrough-http-control
|
||||
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
|
||||
description: Control a local computer through the Clickthrough HTTP server using screenshot grids, OCR, zoomed grids, and pointer/keyboard actions. Use when an agent must operate GUI apps by repeatedly capturing the screen, reading visible text, refining target coordinates, and executing precise interactions (click/right-click/double-click/scroll/type/hotkey) with verification.
|
||||
---
|
||||
|
||||
# Clickthrough HTTP Control
|
||||
@@ -35,10 +35,19 @@ The agent should not assume it can self-install this stack.
|
||||
- `GET /health` → server status + safety flags
|
||||
- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
|
||||
- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
|
||||
- `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
|
||||
- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
|
||||
- `POST /batch` → sequential action list
|
||||
- `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
|
||||
|
||||
### OCR usage
|
||||
|
||||
- Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
|
||||
- Use `mode=screen` for discovery, then `mode=region` for precision and speed.
|
||||
- Use `language_hint` when known (for example `eng`) to improve consistency.
|
||||
- Filter noise with `min_confidence` (start around `0.4` and tune per app).
|
||||
- Treat OCR as one signal, not the only signal, before high-impact clicks.
|
||||
|
||||
### Header requirements
|
||||
|
||||
- Always send `x-clickthrough-token` when token auth is enabled.
|
||||
@@ -49,7 +58,7 @@ The agent should not assume it can self-install this stack.
|
||||
1. Call `GET /screen` with coarse grid (e.g., 12x12).
|
||||
2. Identify likely target region and compute an initial confidence score.
|
||||
3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
|
||||
4. **Before any click**, verify target identity (text/icon/location consistency).
|
||||
4. **Before any click**, verify target identity (OCR text/icon/location consistency).
|
||||
5. Execute one minimal action via `POST /action`.
|
||||
6. Re-capture with `GET /screen` and verify the expected state change.
|
||||
7. Repeat until objective is complete.
|
||||
@@ -57,7 +66,7 @@ The agent should not assume it can self-install this stack.
|
||||
## Verify-before-click rules
|
||||
|
||||
- Never click if target identity is ambiguous.
|
||||
- Require at least two matching signals before click (example: expected text + expected UI region).
|
||||
- Require at least two matching signals before click (example: OCR text + expected UI region).
|
||||
- If confidence is low, do not "test click"; zoom and re-localize first.
|
||||
- For high-impact actions (close/delete/send/purchase), use two-phase flow:
|
||||
1) preview intended coordinate + reason
|
||||
|
||||
Reference in New Issue
Block a user