A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
A skill that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.

Server surface (FastAPI)

POST /grid/init: Accepts a base64 screenshot plus the requested rows/columns, returns a grid_id, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
POST /grid/action: Takes a plan (grid_id, optional target cell, and an action like click/drag/type) and returns a structured ActionResult with computed coordinates for tooling to consume.
GET /grid/{grid_id}/summary: Returns both a heuristic description (GridPlanner) and a rich descriptor so the skill can summarize what it sees.
GET /grid/{grid_id}/history: Streams back the action history for that grid so an agent or operator can audit what was done.
POST /grid/{grid_id}/plan: Lets GridPlanner select the target and return a preview action plan without committing to it, so we can inspect coordinates before triggering events.
POST /grid/{grid_id}/refresh + GET /stream/screenshots: Refresh the cached screenshot/metadata and broadcast the updated scene over a websocket so clients can redraw overlays in near real time.
GET /health: A minimal health check for deployments.
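
To make the cell bounds and computed coordinates concrete, here is a minimal sketch of how a screenshot could be split into cells and how a click on a cell could resolve to pixel coordinates. The names CellBounds, build_cells, and click_action, the "r{row}c{col}" cell IDs, and the shape of the returned dictionary are all assumptions for illustration; the server's real response format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellBounds:
    cell_id: str  # hypothetical "r{row}c{col}" naming, not confirmed by the server
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center(self) -> tuple[int, int]:
        # Midpoint of the cell, used as the pointer target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def build_cells(width: int, height: int, rows: int, cols: int) -> dict[str, CellBounds]:
    """Split a width x height screenshot into rows x cols contiguous cells."""
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    cells: dict[str, CellBounds] = {}
    for r in range(rows):
        for c in range(cols):
            # Scaling before dividing keeps neighbouring cells contiguous and
            # makes the last row/column end exactly at the image edge.
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            cell_id = f"r{r}c{c}"
            cells[cell_id] = CellBounds(cell_id, left, top, right, bottom)
    return cells


def click_action(cells: dict[str, CellBounds], cell_id: str) -> dict:
    """Resolve a click on a cell to pixel coordinates (assumed result shape)."""
    x, y = cells[cell_id].center
    return {"action": "click", "cell_id": cell_id, "x": x, "y": y}
```

For a 1920x1080 screenshot split into 3 rows and 4 columns, each cell is 480x360 pixels, and click_action(cells, "r1c2") resolves to the point (1200, 540).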

Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each VisionGrid also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
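
As a rough illustration of the per-grid metadata described above, the sketch below models a grid record with layout dimensions, a memo, an action history, and a one-line textual summary. VisionGridSketch and its field names are hypothetical stand-ins; the real VisionGrid class may store and phrase things differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VisionGridSketch:
    """Hypothetical stand-in for the per-grid metadata record."""

    grid_id: str
    rows: int
    cols: int
    width: int
    height: int
    memo: Optional[str] = None
    history: list[dict] = field(default_factory=list)

    def record(self, action: dict) -> None:
        # Append an executed action so it can be audited later.
        self.history.append(action)

    def summary(self) -> str:
        # Turn the stored metadata into a single sentence for the skill layer.
        parts = [
            f"Grid {self.grid_id}: {self.rows}x{self.cols} cells over a "
            f"{self.width}x{self.height} screenshot",
            f"{len(self.history)} action(s) recorded",
        ]
        if self.memo:
            parts.append(f"memo: {self.memo}")
        return "; ".join(parts) + "."
```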

Skill layer (OpenClaw integration)

The skill/ package wraps the server calls and exposes helpers:

ClickthroughSkill.describe_grid() builds a grid session and returns the descriptor.
ClickthroughSkill.plan_action() drives the /grid/action endpoint.
ClickthroughSkill.plan_with_planner() calls /grid/{grid_id}/plan, so you can preview the GridPlanner suggestion before executing it.
ClickthroughSkill.grid_summary() and .grid_history() surface the new metadata endpoints.
ClickthroughSkill.refresh_grid() pushes a new screenshot and memo, triggering websocket listeners.
ClickthroughAgentRunner simulates a tiny agent loop that asks the planner for a preview, executes the resulting action, and then gathers the summary/history so you can iterate on reasoning loops in tests.
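
The preview, execute, then inspect loop can be sketched against a stub, without a running server. The method names below match the helpers listed above, but their argument lists and return shapes are assumptions, and SkillLike, run_once, and FakeSkill are hypothetical names introduced only for this example.

```python
from typing import Any, Protocol


class SkillLike(Protocol):
    """Assumed call signatures; the real ClickthroughSkill may differ."""

    def plan_with_planner(self, grid_id: str) -> dict[str, Any]: ...
    def plan_action(self, plan: dict[str, Any]) -> dict[str, Any]: ...
    def grid_summary(self, grid_id: str) -> dict[str, Any]: ...
    def grid_history(self, grid_id: str) -> list[dict[str, Any]]: ...


def run_once(skill: SkillLike, grid_id: str) -> dict[str, Any]:
    """One iteration: preview a plan, execute it, then collect summary and history."""
    preview = skill.plan_with_planner(grid_id)
    result = skill.plan_action(preview)
    return {
        "preview": preview,
        "result": result,
        "summary": skill.grid_summary(grid_id),
        "history": skill.grid_history(grid_id),
    }


class FakeSkill:
    """In-memory stub that returns canned data instead of calling the server."""

    def __init__(self) -> None:
        self._history: list[dict[str, Any]] = []

    def plan_with_planner(self, grid_id: str) -> dict[str, Any]:
        return {"grid_id": grid_id, "cell_id": "r0c0", "action": "click"}

    def plan_action(self, plan: dict[str, Any]) -> dict[str, Any]:
        result = {**plan, "x": 10, "y": 10}
        self._history.append(result)
        return result

    def grid_summary(self, grid_id: str) -> dict[str, Any]:
        return {"grid_id": grid_id, "actions": len(self._history)}

    def grid_history(self, grid_id: str) -> list[dict[str, Any]]:
        return list(self._history)
```

Keeping the loop dependent only on a small protocol means the same function could later be driven by the real skill or by a stub in tests.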

Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.

Screenshot streaming

Capture loops can now talk to FastAPI in two ways:

POST /grid/{grid_id}/refresh with fresh base64 screenshots and an optional memo; the server updates the cached grid metadata and broadcasts the change.
Open a websocket connection to /stream/screenshots (optionally passing grid_id as a query parameter) to receive real-time deltas whenever a refresh happens. Clients can use the descriptor/payload to redraw overlays or trigger new planner runs without polling.
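
A capture loop needs two small pieces to use these paths: a refresh request body and the websocket URL. The sketch below builds both with the standard library only. The JSON field names "screenshot" and "memo", the helper names, and the ws://localhost:8000 base address are assumptions; check the server's request model before relying on them.

```python
import base64
import json
from typing import Optional
from urllib.parse import urlencode


def refresh_payload(image_bytes: bytes, memo: Optional[str] = None) -> str:
    """Build a JSON body for POST /grid/{grid_id}/refresh (assumed field names)."""
    body = {"screenshot": base64.b64encode(image_bytes).decode("ascii")}
    if memo is not None:
        body["memo"] = memo
    return json.dumps(body)


def stream_url(base: str, grid_id: Optional[str] = None) -> str:
    """Build the websocket URL, adding grid_id as a query parameter when given."""
    url = base.rstrip("/") + "/stream/screenshots"
    if grid_id is not None:
        url += "?" + urlencode({"grid_id": grid_id})
    return url
```

For example, stream_url("ws://localhost:8000", "abc") yields ws://localhost:8000/stream/screenshots?grid_id=abc, which a websocket client library of your choice can then connect to.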

Testing

python3 -m pip install -r requirements.txt
python3 -m pip install -r requirements-dev.txt
python3 -m pytest

The tests/ suite covers grid construction, the FastAPI surface, and the skill/runner helpers.

Continuous Integration

.github/workflows/ci.yml runs on pushes and PRs:

Checks out the repo and sets up Python 3.11.
Installs dependencies (requirements.txt + requirements-dev.txt).
Runs ruff check over the Python packages.
Executes pytest to catch regressions.

Next steps

Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
Persist grids and histories in a lightweight store so long-running sessions survive restarts.
Extend the websocket stream so that it also invalidates cached grid_ids when the scene changes.
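
For the persistence item, one possible direction is a small SQLite table keyed by grid_id, with metadata and history stored as JSON. This is only a sketch of the idea: the table layout and the open_store, save_grid, and load_grid helpers are hypothetical and not part of the current codebase.

```python
import json
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grids ("
        "grid_id TEXT PRIMARY KEY, meta TEXT NOT NULL, history TEXT NOT NULL)"
    )
    return conn


def save_grid(conn: sqlite3.Connection, grid_id: str, meta: dict, history: list) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO grids (grid_id, meta, history) VALUES (?, ?, ?)",
            (grid_id, json.dumps(meta), json.dumps(history)),
        )


def load_grid(conn: sqlite3.Connection, grid_id: str) -> Optional[tuple[dict, list]]:
    """Return (meta, history) for a grid, or None if it was never saved."""
    row = conn.execute(
        "SELECT meta, history FROM grids WHERE grid_id = ?", (grid_id,)
    ).fetchone()
    if row is None:
        return None
    return json.loads(row[0]), json.loads(row[1])
```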