All checks were successful
python-syntax / syntax-check (push) Successful in 7s
2.6 KiB
2.6 KiB
name, description
| name | description |
|---|---|
| clickthrough-http-control | Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops. |
Clickthrough HTTP Control (v2)
Agents do not see live desktop video. They operate on snapshots. Use this loop: observe -> localize -> act -> verify.
Fast defaults
- Start with
POST /v2/observeon a tight region, not full screen. - Set
ocr_modetononeunless text is required immediately. - Use
imagetool localization for icon-heavy or dense controls. - Use
POST /v2/act-verifyinstead of manual sleep/poll loops.
Mandatory image-tool click localization
When OCR is weak or ambiguous, ask image tool for one coordinate in bounds.
Prompt template:
- "Return one click point as JSON
{\"x\":<int>,\"y\":<int>}inside this image (width=W,height=H) for the control."
Rules:
- Ask for one point only.
- Include bounds in the prompt.
- If answer is not parseable
x,y, re-ask once with stricter format. - Send returned point to
POST /v2/localizeviaimage_tool_point.
API playbook
- Observe
POST /v2/observe?screen=0
{
"mode": "region",
"region_x": 820,
"region_y": 420,
"region_width": 700,
"region_height": 420,
"include_image": true,
"ocr_mode": "none"
}
- Localize (choose one)
Text:
POST /v2/localize
{"observation_id":"...","text_query":"Save","text_match":"exact"}
Image-tool point:
POST /v2/localize
{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
- Act
POST /v2/act?screen=0
{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
- Verify
POST /v2/act-verify?screen=0
{
"action":{"action":"click","target":{"resolved_target_id":"..."}},
"condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
"risk_level":"low"
}
Risk policy
- Low risk (navigation, focus, benign clicks): single verification signal.
- High risk (delete/send/purchase/close-lossy): use
risk_level=highand require two checks before act. - Never do speculative repeat clicks; switch strategy after one failed verify.
Anti-latency rules
- Never repeat full-screen OCR by default.
- Re-observe only the active pane/region.
- Prefer keyboard + window APIs for app switching.
- Use OCR on region only and cap area with
max_ocr_area_px.
Setup and auth
- Include
x-clickthrough-tokenwhen token auth is enabled. /execadditionally requiresx-clickthrough-exec-secret.- Validate server first:
GET /health.