# Clickthrough

Let an Agent interact with your Computer.

`Clickthrough` is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:

1. A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
2. A **skill** that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.

## Server surface (FastAPI)

- `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
- `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
- `GET /grid/{grid_id}/summary`: Returns both a heuristic description (`GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
- `GET /grid/{grid_id}/history`: Streams back the action history for that grid so an agent or operator can audit what was done.
- `GET /health`: A minimal health check for deployments.

Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.

## Skill layer (OpenClaw integration)

The `skill/` package wraps the server calls and exposes helpers:

- `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
- `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
- `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
- `ClickthroughAgentRunner` simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.

Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.

## Testing

1. `python3 -m pip install -r requirements.txt`
2. `python3 -m pip install -r requirements-dev.txt`
3. `python3 -m pytest`

The `tests/` suite covers grid construction, the FastAPI surface, and the skill/runner helpers.

## Continuous Integration

`.github/workflows/ci.yml` runs on pushes and PRs:

- Checks out the repo and sets up Python 3.11.
- Installs dependencies (`requirements.txt` + `requirements-dev.txt`).
- Runs `ruff check` over the Python packages.
- Executes `pytest` to keep coverage high.

## Next steps

- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
- Expose a websocket/watch endpoint that streams updated screenshots and invalidates cached `grid_id`s when the scene changes.
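
## Example: grid geometry sketch

The server surface describes `/grid/init` returning cell bounds and `/grid/action` returning computed coordinates for a target cell. The sketch below shows one way that geometry could work: split a screenshot of known pixel size into rows and columns, then resolve a cell ID to its center point. It is an illustration under assumptions, not the project's actual code: `CellBounds`, `build_grid`, `click_point`, and the `r{row}c{col}` ID scheme are all hypothetical names, and the real `VisionGrid` may store bounds differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellBounds:
    """Pixel rectangle of one grid cell (hypothetical shape, not the server schema)."""

    cell_id: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive

    @property
    def center(self) -> tuple[int, int]:
        # Integer midpoint, suitable as a pointer target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def build_grid(width: int, height: int, rows: int, columns: int) -> list[CellBounds]:
    """Split a width x height image into rows x columns cells, row by row."""
    if rows <= 0 or columns <= 0:
        raise ValueError("rows and columns must be positive")
    if width < columns or height < rows:
        raise ValueError("image is too small for the requested grid")

    cells = []
    for r in range(rows):
        # Scaling by index before dividing keeps the last edge exactly on the
        # image border even when the size is not evenly divisible.
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(columns):
            left = c * width // columns
            right = (c + 1) * width // columns
            cells.append(CellBounds(f"r{r}c{c}", left, top, right, bottom))
    return cells


def click_point(cells: list[CellBounds], cell_id: str) -> tuple[int, int]:
    """Resolve a cell ID (assumed 'r<row>c<col>' format) to its center pixel."""
    for cell in cells:
        if cell.cell_id == cell_id:
            return cell.center
    raise KeyError(f"unknown cell: {cell_id}")


grid = build_grid(width=1920, height=1080, rows=3, columns=4)
print(len(grid))                  # 12
print(click_point(grid, "r1c2"))  # (1200, 540)
```

Computing each edge from the cell index, rather than accumulating a fixed cell width, avoids leaving a strip of unaddressable pixels at the right or bottom edge when the screenshot size is not a multiple of the grid dimensions.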