Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer by repeatedly capturing structured screenshots with coordinate-aware grids (see), executing precise mouse/keyboard actions from those coordinates (interact), and optionally running authenticated shell commands for system-level tasks (exec) under a consistent response contract.

Core Methods

POST /see: Capture a full screen or region, optionally with a click-ready grid overlay.
POST /see/zoom: Capture a tighter crop around a point and draw a denser grid for precise targeting.
POST /interact: Perform one mouse or keyboard action (click, scroll, type, hotkey, etc.).
POST /exec: Run PowerShell/Bash/CMD commands when shell-level control is needed.

Why this works for AI agents

Agents do not need live vision; they iterate on snapshots.
Grid metadata bridges image understanding to deterministic click coordinates.
Interaction stays explicit and auditable (one action per request).
A unified response envelope (ok, data, error) reduces agent-side branching.

Minimal Agent Loop

Call see with a coarse grid.
If uncertain, call see/zoom with a denser grid.
Call interact once.
Call see again to verify state change.
Use exec only for explicit shell/system tasks.

Safety and Auth

x-clickthrough-token protects API access when enabled.
x-clickthrough-exec-secret is required for /exec.
Optional dry-run and allowed-region constraints reduce accidental risk.

Docs

API: docs/API.md
Agent procedure: skill/SKILL.md
Coordinate system details: docs/coordinate-system.md