A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
A skill that bundles the HTTP calls and intent construction so the same flow can be wired into OpenClaw later.

Server surface (FastAPI)

POST /grid/init: Accepts a base64 screenshot plus the requested rows/columns and returns a grid_id, the cell bounds, and supporting metadata. The grid is stored in memory so the agent can reference cells by ID in later actions.
POST /grid/action: Takes a plan (grid_id, optional target cell, and an action like click/drag/type) and returns a structured ActionResult with computed coordinates for tooling to consume.
GET /grid/{grid_id}/summary: Returns both a heuristic description (GridPlanner) and a rich descriptor so the skill can summarize what it sees.
GET /grid/{grid_id}/history: Streams back the action history for that grid so an agent or operator can audit what was done.
GET /health: A minimal health check for deployments.
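
To make the cell bounds and computed coordinates more concrete, here is a minimal sketch of how a screenshot could be split into cells and how a click target could be derived from a cell. The names `CellBounds`, `build_cells`, and the `r{row}c{col}` ID scheme are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellBounds:
    """Pixel rectangle of one grid cell (hypothetical shape, not the real schema)."""

    cell_id: str
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center(self) -> tuple[int, int]:
        # Integer midpoint of the cell; a click action would target this point.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def build_cells(width: int, height: int, rows: int, cols: int) -> list[CellBounds]:
    """Split a width x height screenshot into rows x cols cells in row-major order."""
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    cells = []
    for r in range(rows):
        for c in range(cols):
            # Edges are derived from the cell index, so the last row and
            # column always end exactly at the image border.
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            cells.append(CellBounds(f"r{r}c{c}", left, top, right, bottom))
    return cells
```

For a 1920x1080 screenshot split into 3 rows and 4 columns, `build_cells(1920, 1080, 3, 4)[0].center` evaluates to `(240, 180)`.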

Vision metadata is kept per grid: the action history, the layout dimensions, and any appended memo. Each VisionGrid also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
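
The per-grid bookkeeping could look roughly like the sketch below. `GridRecord` and `GridStore` are hypothetical stand-ins (the real VisionGrid class is not shown here), and the summary wording is an assumption.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GridRecord:
    """Hypothetical per-grid metadata: layout, memo, and action history."""

    rows: int
    cols: int
    memo: str = ""
    history: list[dict] = field(default_factory=list)

    def summary(self) -> str:
        # Short sentence the skill layer could pass straight to the agent.
        return f"{self.rows}x{self.cols} grid, {len(self.history)} action(s) recorded"


class GridStore:
    """In-memory store keyed by grid_id; contents are lost on restart."""

    def __init__(self) -> None:
        self._grids: dict[str, GridRecord] = {}

    def create(self, rows: int, cols: int, memo: str = "") -> str:
        grid_id = uuid.uuid4().hex
        self._grids[grid_id] = GridRecord(rows, cols, memo)
        return grid_id

    def record_action(self, grid_id: str, action: str, cell_id: Optional[str] = None) -> None:
        # Raises KeyError for an unknown grid_id.
        self._grids[grid_id].history.append({"action": action, "cell_id": cell_id})

    def get(self, grid_id: str) -> GridRecord:
        return self._grids[grid_id]
```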

Skill layer (OpenClaw integration)

The skill/ package wraps the server calls and exposes helpers:

ClickthroughSkill.describe_grid() builds a grid session and returns the descriptor.
ClickthroughSkill.plan_action() drives the /grid/action endpoint.
ClickthroughSkill.grid_summary() and .grid_history() wrap the /grid/{grid_id}/summary and /grid/{grid_id}/history endpoints.
ClickthroughAgentRunner simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.
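
The runner loop described above can be sketched as follows. The method names come from the list above, but their parameters, the descriptor layout (`grid_id`, `cells`, `label`), and the `FakeSkill` stand-in are assumptions made so the example runs without a server.

```python
from typing import Optional


def choose_cell(cells: list[dict], label: Optional[str] = None) -> dict:
    """Return the first cell whose label matches, falling back to the first cell."""
    if not cells:
        raise ValueError("descriptor contains no cells")
    if label is not None:
        for cell in cells:
            if cell.get("label") == label:
                return cell
    return cells[0]


def run_once(skill, action: str = "click", label: Optional[str] = None) -> dict:
    """One pass of the loop: describe, pick a cell, act, then fetch summary/history."""
    descriptor = skill.describe_grid()
    grid_id = descriptor["grid_id"]
    cell = choose_cell(descriptor["cells"], label)
    result = skill.plan_action(grid_id, cell["cell_id"], action)
    return {
        "result": result,
        "summary": skill.grid_summary(grid_id),
        "history": skill.grid_history(grid_id),
    }


class FakeSkill:
    """In-memory stand-in for ClickthroughSkill (hypothetical signatures)."""

    def __init__(self) -> None:
        self._history: list[dict] = []

    def describe_grid(self) -> dict:
        return {
            "grid_id": "g1",
            "cells": [
                {"cell_id": "r0c0", "label": None},
                {"cell_id": "r0c1", "label": "OK"},
            ],
        }

    def plan_action(self, grid_id: str, cell_id: str, action: str) -> dict:
        entry = {"grid_id": grid_id, "cell_id": cell_id, "action": action}
        self._history.append(entry)
        return entry

    def grid_summary(self, grid_id: str) -> dict:
        return {"grid_id": grid_id, "actions": len(self._history)}

    def grid_history(self, grid_id: str) -> list[dict]:
        return list(self._history)
```

Passing the skill in as an argument keeps the loop testable offline; a real run would hand in a ClickthroughSkill instance pointed at the server instead of the fake.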

Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.

Testing

python3 -m pip install -r requirements.txt
python3 -m pip install -r requirements-dev.txt
python3 -m pytest

The tests/ suite covers grid construction, the FastAPI surface, and the skill/runner helpers.

Continuous Integration

.github/workflows/ci.yml runs on pushes and PRs:

Checks out the repo and sets up Python 3.11.
Installs dependencies (requirements.txt + requirements-dev.txt).
Runs ruff check over the Python packages.
Runs the pytest suite.

Next steps

Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
Persist grids and histories in a lightweight store so long-running sessions survive restarts.
Expose a websocket/watch endpoint that streams updated screenshots and invalidates cached grid_ids when the scene changes.
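
As one possible starting point for the invalidation idea, the sketch below fingerprints each screenshot and drops grid IDs whose fingerprint no longer matches the latest frame. `GridCache` and its methods are hypothetical, and exact byte hashing is an assumption: it treats any pixel change as a scene change, so a real implementation would likely want a more tolerant comparison.

```python
import hashlib


def fingerprint(screenshot: bytes) -> str:
    """Exact-content fingerprint of the raw screenshot bytes."""
    return hashlib.sha256(screenshot).hexdigest()


class GridCache:
    """Tracks which screenshot each grid_id was built from (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_grid: dict[str, str] = {}  # grid_id -> screenshot fingerprint

    def register(self, grid_id: str, screenshot: bytes) -> None:
        self._by_grid[grid_id] = fingerprint(screenshot)

    def is_valid(self, grid_id: str) -> bool:
        return grid_id in self._by_grid

    def invalidate_stale(self, latest_screenshot: bytes) -> list[str]:
        """Drop every grid built from a different screenshot; return the dropped IDs."""
        current = fingerprint(latest_screenshot)
        stale = [gid for gid, fp in self._by_grid.items() if fp != current]
        for gid in stale:
            del self._by_grid[gid]
        return stale
```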