refactor: simplify to see/interact/exec and split server modules
All checks were successful
python-syntax / syntax-check (push) Successful in 6s
62
README.md
@@ -1,49 +1,37 @@
# Clickthrough

Let an agent interact with a computer over HTTP.
Clickthrough is a lightweight HTTP control layer that lets an AI safely operate a real computer by repeatedly capturing structured screenshots with coordinate-aware grids (`see`), executing precise mouse/keyboard actions from those coordinates (`interact`), and optionally running authenticated shell commands for system-level tasks (`exec`) under a consistent response contract.

## Primary mode (v2)
## Core Methods

Use the v2 contract for faster, less OCR-heavy control loops:
- `POST /v2/observe`
- `POST /v2/localize`
- `POST /v2/act`
- `POST /v2/act-verify`
- `POST /see`: Capture a full screen or region, optionally with a click-ready grid overlay.
- `POST /see/zoom`: Capture a tighter crop around a point and draw a denser grid for precise targeting.
- `POST /interact`: Perform one mouse or keyboard action (`click`, `scroll`, `type`, `hotkey`, etc.).
- `POST /exec`: Run PowerShell/Bash/CMD commands when shell-level control is needed.

This is optimized for agents that cannot directly see the screen and must use screenshot/image tools.
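A minimal sketch of how a client might construct requests for the `see` and `interact` endpoints listed above. It only builds the requests and never sends them. The address `127.0.0.1:8123` and the endpoint paths are quoted from this README; the helper name `build_request` and every payload field (`grid`, `rows`, `cols`, `action`, `x`, `y`) are placeholders assumed for illustration, so `docs/API.md` remains the reference for the real schema.

```python
import json
import urllib.request
from typing import Any, Dict

# Default bind address quoted in this README; adjust for your deployment.
BASE_URL = "http://127.0.0.1:8123"


def build_request(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payloads: the field names below are assumptions, not the documented schema.
see_req = build_request("/see", {"grid": {"rows": 8, "cols": 12}})
click_req = build_request("/interact", {"action": "click", "x": 640, "y": 360})
```

Passing either object to `urllib.request.urlopen` would perform the call against a running server.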
## Why this works for AI agents

## What this provides
- Agents do not need live vision; they iterate on snapshots.
- Grid metadata bridges image understanding to deterministic click coordinates.
- Interaction stays explicit and auditable (one action per request).
- A unified response envelope (`ok`, `data`, `error`) reduces agent-side branching.

- Screen/region capture with optional OCR and timing stats
- Observation IDs for deterministic follow-up localization
- Text localization and image-tool coordinate localization
- Action execution with resolved target IDs
- Risk-aware action+verification defaults
- Unified response envelope across all endpoints
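The unified envelope (`ok`, `data`, `error`) means an agent can funnel every response through one check. The sketch below shows one way to do that. Only the three key names come from this README; the helper `unwrap`, the exception class `ClickthroughError`, and the assumption that `ok` is a boolean are illustrative guesses, and the shape of `error` is treated as opaque.

```python
from typing import Any, Dict


class ClickthroughError(RuntimeError):
    """Raised when a response envelope reports failure (hypothetical helper)."""


def unwrap(envelope: Dict[str, Any]) -> Any:
    """Return `data` from a successful envelope, otherwise raise with `error`."""
    if envelope.get("ok") is True:
        return envelope.get("data")
    # The structure of `error` is not specified here, so it is passed through as text.
    raise ClickthroughError(str(envelope.get("error")))
```

With a helper like this, calling code handles a single exception type instead of branching per endpoint.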
## Minimal Agent Loop

## Quick start
1. Call `see` with a coarse grid.
2. If uncertain, call `see/zoom` with a denser grid.
3. Call `interact` once.
4. Call `see` again to verify state change.
5. Use `exec` only for explicit shell/system tasks.
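Steps 1 to 4 of the loop above can be sketched as a single function with the HTTP transport injected, which keeps the control flow testable without a running server. Everything except the endpoint paths is an assumption: `run_one_step`, the `call`, `is_confident`, and `choose_action` callbacks, and the `grid` payload values are hypothetical names chosen for this example. Step 5 (`exec`) is left out on purpose because it is reserved for explicit shell tasks.

```python
from typing import Any, Callable, Dict

# (endpoint path, JSON payload) -> parsed response data; supplied by the caller.
Call = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_one_step(
    call: Call,
    is_confident: Callable[[Dict[str, Any]], bool],
    choose_action: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run one see -> (zoom) -> interact -> see cycle and return the final view."""
    view = call("/see", {"grid": "coarse"})  # step 1: coarse capture
    if not is_confident(view):
        view = call("/see/zoom", {"grid": "dense"})  # step 2: denser grid if unsure
    call("/interact", choose_action(view))  # step 3: exactly one action
    return call("/see", {"grid": "coarse"})  # step 4: verify the state change
```

The caller decides what "confident" means and how a view maps to an action; the function only enforces the ordering and the one-action-per-cycle rule.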
```bash
cd /root/external-projects/clickthrough
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
CLICKTHROUGH_TOKEN=change-me python -m server.app
```
## Safety and Auth

Server defaults to `127.0.0.1:8123`.
- `x-clickthrough-token` protects API access when enabled.
- `x-clickthrough-exec-secret` is required for `/exec`.
- Optional dry-run and allowed-region constraints reduce accidental risk.
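As an illustration of the two headers named above, the sketch below assembles request headers so that the exec secret is attached only when a caller explicitly provides it. The header names are taken from this README; the function `auth_headers` and its parameters are hypothetical, and how the server validates these values is not covered here.

```python
from typing import Dict, Optional


def auth_headers(
    token: Optional[str] = None,
    exec_secret: Optional[str] = None,
) -> Dict[str, str]:
    """Build request headers; secrets are included only when supplied."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["x-clickthrough-token"] = token
    if exec_secret:
        # Only needed for /exec, so ordinary see/interact calls can omit it.
        headers["x-clickthrough-exec-secret"] = exec_secret
    return headers
```

Keeping the exec secret out of the default headers limits where it travels to the requests that actually need it.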
## Fast control loop
## Docs

1. `POST /v2/observe` on a tight region
2. If OCR is enough, `POST /v2/localize` with `text_query`
3. If ambiguous, ask image tool for one x,y in observation bounds
4. `POST /v2/localize` with `image_tool_point`
5. `POST /v2/act` or `POST /v2/act-verify`
6. Re-observe only changed region

## See docs

- `docs/API.md`
- `skill/SKILL.md`
- `docs/coordinate-system.md`
- API: `docs/API.md`
- Agent procedure: `skill/SKILL.md`
- Coordinate system details: `docs/coordinate-system.md`