This repository was archived on 2026-05-20.
clickthrough/skill/SKILL.md
Paul Wähner aced5be25e
feat: migrate to v2-only API and unified response envelope
2026-05-03 19:11:11 +02:00

name: clickthrough-http-control
description: Drive GUI apps with Clickthrough v2 observe/localize/act APIs. Use image-tool point localization for ambiguous targets and avoid full-screen OCR loops.

Clickthrough HTTP Control (v2)

Agents do not see live desktop video. They operate on snapshots. Use this loop: observe -> localize -> act -> verify.

Fast defaults

  • Start with POST /v2/observe on a tight region, not full screen.
  • Set ocr_mode to none unless text is required immediately.
  • Use image tool localization for icon-heavy or dense controls.
  • Use POST /v2/act-verify instead of manual sleep/poll loops.

Mandatory image-tool click localization

When OCR is weak or ambiguous, ask the image tool for a single coordinate inside the image bounds.

Prompt template:

  • "Return one click point as JSON {\"x\":<int>,\"y\":<int>} inside this image (width=W, height=H) for the control."

Rules:

  • Ask for one point only.
  • Include bounds in the prompt.
  • If the answer does not parse as an x,y point, re-ask once with a stricter format.
  • Send returned point to POST /v2/localize via image_tool_point.
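
The rules above can be sketched in Python as follows. This is a minimal illustration, not part of Clickthrough: `parse_click_point`, `localize_with_image_tool`, and the `ask` callable (a stand-in for whatever invokes the image tool) are hypothetical names, and the sketch assumes coordinates are zero-based pixel offsets within the image.

```python
import json
import re
from typing import Callable, Dict, Optional, Tuple


def parse_click_point(answer: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Return (x, y) if the answer holds one in-bounds JSON point, else None."""
    # Take the first {...} span so that prose around the JSON is tolerated.
    match = re.search(r"\{[^{}]*\}", answer)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    x, y = data.get("x"), data.get("y")
    # Accept plain integers only (bool is a subclass of int, so exclude it).
    for value in (x, y):
        if not isinstance(value, int) or isinstance(value, bool):
            return None
    # Assumption: valid points are 0 <= x < width and 0 <= y < height.
    if not (0 <= x < width and 0 <= y < height):
        return None
    return x, y


def localize_with_image_tool(
    ask: Callable[[str], str], width: int, height: int
) -> Optional[Dict[str, int]]:
    """Ask once, re-ask once with a stricter format, then give up."""
    prompt = (
        'Return one click point as JSON {"x":<int>,"y":<int>} inside this image '
        f"(width={width}, height={height}) for the control."
    )
    strict = prompt + " Reply with the JSON object only, no other text."
    for attempt in (prompt, strict):
        point = parse_click_point(ask(attempt), width, height)
        if point is not None:
            # Shape matches the image_tool_point field sent to POST /v2/localize.
            return {"x": point[0], "y": point[1]}
    return None
```

Returning `None` after the second attempt leaves the choice of fallback (for example, text localization) to the caller rather than looping on the image tool.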

API playbook

  1. Observe
POST /v2/observe?screen=0
{
  "mode": "region",
  "region_x": 820,
  "region_y": 420,
  "region_width": 700,
  "region_height": 420,
  "include_image": true,
  "ocr_mode": "none"
}
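
For illustration, the observe call above could be assembled with the Python standard library roughly as below. The base URL is an assumption (this document does not state the server address), and `build_observe_request` is a hypothetical helper; the request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: the server address is deployment-specific; this is a placeholder.
BASE_URL = "http://127.0.0.1:8000"


def build_observe_request(region, screen=0, base_url=BASE_URL):
    """Build (but do not send) a region-scoped POST /v2/observe request.

    region is an (x, y, width, height) tuple in screen pixels.
    """
    x, y, width, height = region
    body = {
        "mode": "region",
        "region_x": x,
        "region_y": y,
        "region_width": width,
        "region_height": height,
        "include_image": True,
        "ocr_mode": "none",  # fast default: skip OCR unless text is needed
    }
    return urllib.request.Request(
        f"{base_url}/v2/observe?screen={screen}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; the response envelope is not described here, so parsing it is left out.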
  2. Localize (choose one)

Text:

POST /v2/localize
{"observation_id":"...","text_query":"Save","text_match":"exact"}

Image-tool point:

POST /v2/localize
{"observation_id":"...","image_tool_point":{"x":312,"y":188}}
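
A small sketch of the "choose one" rule for the localize body, in Python. `build_localize_body` is a hypothetical helper; rejecting a request that carries both a text query and a point is this sketch's own reading of "choose one", not documented server behaviour.

```python
def build_localize_body(observation_id, text_query=None, image_tool_point=None):
    """Return a POST /v2/localize body using exactly one localization mode."""
    # Exactly one of the two modes must be supplied.
    if (text_query is None) == (image_tool_point is None):
        raise ValueError("pass exactly one of text_query or image_tool_point")
    body = {"observation_id": observation_id}
    if text_query is not None:
        body["text_query"] = text_query
        body["text_match"] = "exact"  # the only match mode shown in the playbook
    else:
        x, y = image_tool_point
        body["image_tool_point"] = {"x": int(x), "y": int(y)}
    return body
```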
  3. Act
POST /v2/act?screen=0
{"action":{"action":"click","target":{"resolved_target_id":"..."}}}
  4. Verify
POST /v2/act-verify?screen=0
{
  "action":{"action":"click","target":{"resolved_target_id":"..."}},
  "condition":{"kind":"visual","state":"change","region_x":820,"region_y":420,"region_width":700,"region_height":420},
  "risk_level":"low"
}
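
The act-verify body can be generated from a resolved target and the observed region, as in the Python sketch below. `build_act_verify_body` is a hypothetical helper, and reusing the observe region as the verification region is an assumption that mirrors the example values above.

```python
def build_act_verify_body(resolved_target_id, region, risk_level="low"):
    """Return a POST /v2/act-verify body: click, then expect a visual change.

    region is the (x, y, width, height) tuple to watch for the change.
    """
    x, y, width, height = region
    return {
        "action": {
            "action": "click",
            "target": {"resolved_target_id": resolved_target_id},
        },
        "condition": {
            "kind": "visual",
            "state": "change",
            "region_x": x,
            "region_y": y,
            "region_width": width,
            "region_height": height,
        },
        "risk_level": risk_level,
    }
```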

Risk policy

  • Low risk (navigation, focus, benign clicks): single verification signal.
  • High risk (delete/send/purchase/close-lossy): use risk_level=high and require two checks before acting.
  • Never do speculative repeat clicks; switch strategy after one failed verify.

Anti-latency rules

  • Never repeat full-screen OCR by default.
  • Re-observe only the active pane/region.
  • Prefer keyboard + window APIs for app switching.
  • Run OCR on a region only, and cap the area with max_ocr_area_px.
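
As a client-side complement to max_ocr_area_px, a region can be shrunk before it is sent. The Python sketch below is an assumption-laden illustration: `shrink_region_to_area` is a hypothetical helper, and shrinking around the region's centre while keeping its aspect ratio is a design choice of this sketch, not something the server is documented to do.

```python
import math


def shrink_region_to_area(region, max_area_px):
    """Shrink (x, y, width, height) around its centre so width*height <= max_area_px."""
    x, y, width, height = region
    area = width * height
    if area <= max_area_px:
        return region  # already within budget
    scale = math.sqrt(max_area_px / area)
    # int() truncates, so the new area cannot exceed the cap.
    new_width = max(1, int(width * scale))
    new_height = max(1, int(height * scale))
    new_x = x + (width - new_width) // 2
    new_y = y + (height - new_height) // 2
    return (new_x, new_y, new_width, new_height)
```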

Setup and auth

  • Include x-clickthrough-token when token auth is enabled.
  • /exec additionally requires x-clickthrough-exec-secret.
  • Validate server first: GET /health.
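
A minimal Python sketch of the header handling and the health check. `build_headers` and `build_health_request` are hypothetical helpers and the base URL is a placeholder; whether GET /health itself requires the token is not stated here, so the token is optional in the sketch.

```python
import urllib.request


def build_headers(token=None, exec_secret=None):
    """Return request headers, adding auth headers only when values are given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["x-clickthrough-token"] = token
    if exec_secret:
        # Only needed for /exec, in addition to the token.
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers


def build_health_request(base_url, token=None):
    """Build (but do not send) the GET /health request used to validate the server."""
    return urllib.request.Request(
        f"{base_url}/health",
        headers=build_headers(token),
        method="GET",
    )
```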