Clickthrough
Let an agent interact with your computer.
Clickthrough is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:
- A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
- A skill that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.
Server surface (FastAPI)
- POST /grid/init: Accepts a base64 screenshot plus the requested rows/columns; returns a grid_id, cell bounds, and helpful metadata. The grid is stored in memory so the agent can reference cells by ID in later actions.
- POST /grid/action: Takes a plan (grid_id, optional target cell, and an action like click/drag/type) and returns a structured ActionResult with computed coordinates for tooling to consume.
- GET /health: A minimal health check for deployments.
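As a rough illustration of what /grid/init computes, the sketch below splits a screenshot into evenly sized cells and returns their pixel bounds. Function and field names here are hypothetical, not the server's actual code:

```python
# Hypothetical sketch of the cell-bounds computation behind /grid/init.
# The real server's field names and rounding may differ.
from uuid import uuid4

def build_grid(width: int, height: int, rows: int, cols: int) -> dict:
    """Split a width x height screenshot into rows x cols cells with pixel bounds."""
    cell_w, cell_h = width / cols, height / rows
    cells = {}
    for r in range(rows):
        for c in range(cols):
            cells[f"r{r}c{c}"] = {
                "x": round(c * cell_w),
                "y": round(r * cell_h),
                "width": round(cell_w),
                "height": round(cell_h),
            }
    return {"grid_id": str(uuid4()), "rows": rows, "cols": cols, "cells": cells}

grid = build_grid(1920, 1080, rows=4, cols=4)
# each cell of a 4x4 grid over 1920x1080 is 480x270 pixels
```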
The server tracks each grid by a UUID and keeps layout metadata so multiple agents can stay in sync with the same screenshot/scene.
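The UUID-keyed, in-memory tracking could be as simple as the registry sketched below (class and method names are illustrative, not the server's actual implementation):

```python
# Minimal in-memory grid registry sketch; illustrative names only.
import uuid

class GridStore:
    """Maps server-issued UUIDs to grid layout metadata."""

    def __init__(self) -> None:
        self._grids: dict = {}

    def register(self, layout: dict) -> str:
        """Store a layout and return the grid_id agents will reference."""
        grid_id = str(uuid.uuid4())
        self._grids[grid_id] = layout
        return grid_id

    def get(self, grid_id: str) -> dict:
        """Look up a layout by its grid_id (raises KeyError if unknown)."""
        return self._grids[grid_id]

store = GridStore()
gid = store.register({"rows": 3, "cols": 3})
```

Because the store is in-memory, grids vanish on restart, which is what the "persist grid sessions" next step would address.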
Skill layer (OpenClaw integration)
The skill/ package is a placeholder for how an agent action would look in OpenClaw. It wraps the server calls, interprets the grid cells, and exposes helpers such as describe_grid() and plan_action() so future work can plug into the agent toolkit directly.
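A minimal sketch of what those helpers might look like is below. The names describe_grid and plan_action come from the README, but the signatures and payload shape are assumptions:

```python
# Hypothetical skill-layer helpers; signatures and field names are assumed,
# only the helper names (describe_grid, plan_action) come from the project.
from typing import Optional

def plan_action(grid_id: str, cell: str, action: str,
                text: Optional[str] = None) -> dict:
    """Build the JSON payload the skill would POST to /grid/action."""
    payload = {"grid_id": grid_id, "cell": cell, "action": action}
    if text is not None:
        payload["text"] = text  # only needed for type-style actions
    return payload

def describe_grid(layout: dict) -> str:
    """Summarize a grid layout for inclusion in the agent's prompt."""
    return f"{layout['rows']}x{layout['cols']} grid, id={layout['grid_id']}"
```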
Getting started
- Install dependencies: python -m pip install -r requirements.txt
- Run the server: uvicorn server.main:app --reload
- Use the skill helper to bootstrap a grid, or wire the REST endpoints into a higher-level agent.
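Bootstrapping a grid amounts to base64-encoding a screenshot and POSTing it to /grid/init. The sketch below builds that payload; the "image" field name is an assumption, and the POST itself (e.g. via requests) is left as a comment:

```python
# Payload-construction sketch for bootstrapping a grid over REST.
# The "image" field name is assumed; check the server's actual schema.
import base64

def make_init_payload(screenshot: bytes, rows: int, cols: int) -> dict:
    """Encode a raw screenshot for the /grid/init endpoint."""
    return {
        "image": base64.b64encode(screenshot).decode("ascii"),
        "rows": rows,
        "cols": cols,
    }

payload = make_init_payload(b"\x89PNG...", rows=4, cols=4)
# then: requests.post("http://localhost:8000/grid/init", json=payload)
```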
Next steps
- Add real OCR/layout logic so cells understand labels.
- Turn the action planner into a state machine that can focus/double-click/type/drag.
- Persist grid sessions for longer running interactions.
- Ship the OpenClaw skill (skill folder) as a plugin that can call http://localhost:8000 and scaffold the agent's reasoning.
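The focus/double-click/type/drag state machine from the next steps above could start as small as this sketch (states and transition rules are assumptions about the future design, not existing code):

```python
# Sketch of the planned action state machine: a cell must be focused
# before an action fires. States and rules are assumed, not implemented yet.
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    FOCUSED = auto()

ACTIONS = {"click", "double_click", "type", "drag"}

def step(state: State, action: str) -> State:
    """Advance the machine by one action, rejecting out-of-order events."""
    if state is State.IDLE and action == "focus":
        return State.FOCUSED
    if state is State.FOCUSED and action in ACTIONS:
        return State.IDLE  # action completes; return to idle for the next cell
    raise ValueError(f"invalid action {action!r} in state {state.name}")

s = step(State.IDLE, "focus")
s = step(s, "click")
```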