clickthrough/README.md

# Clickthrough

Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer by repeatedly capturing structured screenshots with coordinate-aware grids (`see`), executing precise mouse/keyboard actions from those coordinates (`interact`), and optionally running authenticated shell commands for system-level tasks (`exec`) under a consistent response contract.

## Core Methods

- `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
- `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
- `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
- `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.

## Why this works for AI agents

- Agents do not need live vision; they iterate on snapshots.
- Grid metadata bridges image understanding to deterministic click coordinates.
- Interaction stays explicit and auditable (one action per request).
- A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.

## Minimal Agent Loop

1. Call `see` with a coarse grid.
2. If uncertain, call `see/zoom` with a denser grid.
3. Call `interact` once.
4. Call `see` again to verify state change.
5. Use `exec` only for explicit shell/system tasks.

## Safety and Auth

- `x-clickthrough-token` protects API access when enabled.
- `x-clickthrough-exec-secret` is required for `/exec`.
- Optional dry-run and allowed-region constraints reduce accidental risk.

## Docs

- API: `docs/API.md`
- Agent procedure: `skill/SKILL.md`
- Coordinate system details: `docs/coordinate-system.md`