Clickthrough
Let an Agent interact with your Computer.
Clickthrough is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:
- A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
- A skill that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.
Server surface (FastAPI)
- POST /grid/init: Accepts a base64 screenshot plus the requested rows/columns, returns a grid_id, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
- POST /grid/action: Takes a plan (grid_id, optional target cell, and an action like click/drag/type) and returns a structured ActionResult with computed coordinates for tooling to consume.
- GET /grid/{grid_id}/summary: Returns both a heuristic description (GridPlanner) and a rich descriptor so the skill can summarize what it sees.
- GET /grid/{grid_id}/history: Streams back the action history for that grid so an agent or operator can audit what was done.
- GET /health: A minimal health check for deployments.
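To make "cell bounds" and "computed coordinates" concrete, here is a minimal sketch of how a screenshot could be split into cells and how a click target could be derived from one of them. The names CellBounds and build_cells, the r{row}c{col} ID scheme, and the choice of the cell center as the click point are all assumptions for illustration; the server's actual schema and coordinate logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellBounds:
    """Pixel rectangle for one grid cell (hypothetical shape, not the server's schema)."""
    cell_id: str
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center(self) -> tuple[int, int]:
        # Midpoint of the rectangle, a natural target for a click action.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def build_cells(width: int, height: int, rows: int, cols: int) -> list[CellBounds]:
    """Split a width x height screenshot into rows x cols cells in row-major order."""
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    cells = []
    for r in range(rows):
        for c in range(cols):
            # Edges are derived from the index, so the last cell ends exactly
            # at the image edge even when the size is not evenly divisible.
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            cells.append(CellBounds(f"r{r}c{c}", left, top, right, bottom))
    return cells


cells = build_cells(width=1920, height=1080, rows=3, cols=4)
target = cells[5]  # second row, second column
print(target.cell_id, target.center)  # r1c1 (720, 540)
```

Referencing cells by ID keeps the agent's plan independent of pixel values; only the final step needs real coordinates.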
Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each VisionGrid also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
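The per-grid bookkeeping described above could be sketched roughly as follows. GridRecord and GridStore are hypothetical names, and the summary wording is invented for the example; this is not the project's VisionGrid class, only an illustration of keeping history, dimensions, and a memo together under one grid ID in memory.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GridRecord:
    """Hypothetical per-grid state: layout dimensions, memo, and action history."""
    rows: int
    cols: int
    memo: str = ""
    history: list[dict] = field(default_factory=list)

    def summary(self) -> str:
        # Short sentence the skill layer could pass on to the agent.
        text = f"{self.rows}x{self.cols} grid, {len(self.history)} action(s) recorded"
        return f"{text}; memo: {self.memo}" if self.memo else text


class GridStore:
    """In-memory registry keyed by grid ID; contents are lost on restart."""

    def __init__(self) -> None:
        self._grids: dict[str, GridRecord] = {}

    def create(self, rows: int, cols: int) -> str:
        grid_id = uuid.uuid4().hex
        self._grids[grid_id] = GridRecord(rows, cols)
        return grid_id

    def record_action(self, grid_id: str, action: str, cell_id: Optional[str] = None) -> None:
        # Append to the grid's own history so it can be audited later.
        self._grids[grid_id].history.append({"action": action, "cell_id": cell_id})

    def get(self, grid_id: str) -> GridRecord:
        return self._grids[grid_id]


store = GridStore()
gid = store.create(rows=3, cols=4)
store.record_action(gid, "click", cell_id="r1c1")
print(store.get(gid).summary())  # 3x4 grid, 1 action(s) recorded
```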
Skill layer (OpenClaw integration)
The skill/ package wraps the server calls and exposes helpers:
- ClickthroughSkill.describe_grid() builds a grid session and returns the descriptor.
- ClickthroughSkill.plan_action() drives the /grid/action endpoint.
- ClickthroughSkill.grid_summary() and .grid_history() surface the new metadata endpoints.
- ClickthroughAgentRunner simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.
Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.
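The runner loop (choose a cell, submit an action, fetch the summary and history) might look roughly like the sketch below. It is not the ClickthroughAgentRunner implementation: run_once, the Transport callable, and the cell_id and label field names are assumptions, and the HTTP layer is replaced by a fake transport so the example runs without a server. Only the endpoint paths come from the server surface listed earlier.

```python
from typing import Any, Callable, Optional

# A transport takes (method, path, json_body) and returns decoded JSON.
Transport = Callable[[str, str, Optional[dict]], Any]


def run_once(transport: Transport, grid_id: str, cells: list[dict],
             label: Optional[str] = None, action: str = "click") -> dict:
    """One pass of a stub agent loop over an already-initialised grid."""
    # Prefer the cell whose label matches; otherwise fall back to the first cell.
    chosen = next((c for c in cells if label and c.get("label") == label), cells[0])
    plan = {"grid_id": grid_id, "cell_id": chosen["cell_id"], "action": action}
    result = transport("POST", "/grid/action", plan)
    summary = transport("GET", f"/grid/{grid_id}/summary", None)
    history = transport("GET", f"/grid/{grid_id}/history", None)
    return {"result": result, "summary": summary, "history": history}


calls: list[tuple[str, str]] = []


def fake_transport(method: str, path: str, body: Optional[dict]) -> dict:
    # Stand-in for real HTTP: records the call and echoes something back.
    calls.append((method, path))
    return {"echo": body} if method == "POST" else {"path": path}


cells = [{"cell_id": "r0c0", "label": None},
         {"cell_id": "r0c1", "label": "OK button"}]
out = run_once(fake_transport, "demo", cells, label="OK button")
print(out["result"]["echo"]["cell_id"])  # r0c1
```

Injecting the transport keeps the loop testable offline; a real client would pass a function that issues the HTTP requests against the running server.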
Testing
python3 -m pip install -r requirements.txt
python3 -m pip install -r requirements-dev.txt
python3 -m pytest
The tests/ suite covers grid construction, the FastAPI surface, and the skill/runner helpers.
Continuous Integration
.github/workflows/ci.yml runs on pushes and PRs:
- Checks out the repo and sets up Python 3.11.
- Installs dependencies (requirements.txt + requirements-dev.txt).
- Runs ruff check over the Python packages.
- Executes pytest to keep coverage high.
Next steps
- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
- Expose a websocket/watch endpoint that streams updated screenshots and invalidates cached grid_ids when the scene changes.