
Clickthrough

Let an agent interact with your computer.

Clickthrough is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:

  1. A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
  2. A skill that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.

Server surface (FastAPI)

  • POST /grid/init: Accepts a base64 screenshot plus the requested rows/columns, returns a grid_id, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
  • POST /grid/action: Takes a plan (grid_id, optional target cell, and an action like click/drag/type) and returns a structured ActionResult with computed coordinates for tooling to consume.
  • GET /grid/{grid_id}/summary: Returns both a heuristic description (GridPlanner) and a rich descriptor so the skill can summarize what it sees.
  • GET /grid/{grid_id}/history: Streams back the action history for that grid so an agent or operator can audit what was done.
  • POST /grid/{grid_id}/plan: Lets GridPlanner select the target and return a preview action plan without committing to it, so we can inspect coordinates before triggering events.
  • POST /grid/{grid_id}/refresh + GET /stream/screenshots: Refresh the cached screenshot/metadata and broadcast the updated scene over a websocket so clients can redraw overlays in near real time.
  • GET /health: A minimal health check for deployments.
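
The core of /grid/init is splitting the screenshot into cell bounds the agent can reference later. A minimal sketch of that computation (the `CellBounds` fields and `r{row}c{col}` cell IDs are assumptions for illustration, not the server's actual schema):

```python
from dataclasses import dataclass

@dataclass
class CellBounds:
    cell_id: str
    x: int
    y: int
    width: int
    height: int

def grid_cells(width: int, height: int, rows: int, cols: int) -> list[CellBounds]:
    """Split a screenshot into rows x cols cells and return their pixel bounds."""
    cell_w, cell_h = width // cols, height // rows
    cells = []
    for r in range(rows):
        for c in range(cols):
            cells.append(CellBounds(
                cell_id=f"r{r}c{c}",       # hypothetical ID scheme: row r, column c
                x=c * cell_w,
                y=r * cell_h,
                width=cell_w,
                height=cell_h,
            ))
    return cells

# A 4x4 grid over a 1920x1080 screenshot yields 16 cells of 480x270 each.
cells = grid_cells(1920, 1080, rows=4, cols=4)
```

Integer division means trailing pixels on a non-divisible screen are dropped; a real implementation would decide how to absorb that remainder.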

Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each VisionGrid also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.

Skill layer (OpenClaw integration)

The skill/ package wraps the server calls and exposes helpers:

  • ClickthroughSkill.describe_grid() builds a grid session and returns the descriptor.
  • ClickthroughSkill.plan_action() drives the /grid/action endpoint.
  • ClickthroughSkill.plan_with_planner() calls /grid/{grid_id}/plan, so you can preview the GridPlanner suggestion before executing it.
  • ClickthroughSkill.grid_summary() and .grid_history() surface the new metadata endpoints.
  • ClickthroughSkill.refresh_grid() pushes a new screenshot and memo, triggering websocket listeners.
  • ClickthroughAgentRunner simulates a tiny agent loop that asks the planner for a preview, executes the resulting action, and then gathers the summary/history so you can iterate on reasoning loops in tests.
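
The runner's preview/execute/summarize loop can be sketched as a tiny function over injected callables. This is only a shape sketch mirroring the description above, not ClickthroughAgentRunner's real API; the stub lambdas stand in for HTTP calls:

```python
from typing import Callable

def run_once(plan: Callable[[], dict],
             execute: Callable[[dict], dict],
             summarize: Callable[[], str]) -> dict:
    """One agent-loop iteration: preview a plan, execute it, gather a summary."""
    preview = plan()            # stands in for POST /grid/{grid_id}/plan
    result = execute(preview)   # stands in for POST /grid/action
    return {"preview": preview, "result": result, "summary": summarize()}

outcome = run_once(
    plan=lambda: {"action": "click", "cell": "r0c1"},
    execute=lambda p: {"status": "ok", "coords": (720, 135)},
    summarize=lambda: "clicked the search box",
)
```

Injecting the three steps as callables keeps the loop testable without a running server, which is how the stub runner can exercise reasoning loops in tests.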

Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.

Screenshot streaming

Capture loops can talk to the FastAPI server in two ways:

  1. POST /grid/{grid_id}/refresh with fresh base64 screenshots and an optional memo; the server updates the cached grid metadata and broadcasts the change.
  2. Open a websocket to GET /stream/screenshots (optionally passing grid_id as a query param) to receive realtime deltas whenever a refresh happens. Clients can use the descriptor/payload to redraw overlays or trigger new planner runs without polling.
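
Server-side, the refresh/broadcast pairing is a classic fan-out: each websocket connection gets a queue, and a refresh pushes the payload to every queue. A minimal asyncio sketch of that pattern (class and method names are illustrative, not the server's actual code):

```python
import asyncio

class ScreenshotBroadcaster:
    """Fan refresh events out to every connected subscriber (websocket stand-in)."""

    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    async def broadcast(self, payload: dict) -> None:
        for queue in self._subscribers:
            await queue.put(payload)

async def demo() -> dict:
    bus = ScreenshotBroadcaster()
    listener = bus.subscribe()
    # A /grid/{grid_id}/refresh handler would call broadcast() after updating its cache.
    await bus.broadcast({"grid_id": "g1", "memo": "scene changed"})
    return await listener.get()

event = asyncio.run(demo())
```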

Testing

  1. python3 -m pip install -r requirements.txt
  2. python3 -m pip install -r requirements-dev.txt
  3. python3 -m pytest

The tests/ suite covers grid construction, the FastAPI surface, and the skill/runner helpers.

Continuous Integration

.github/workflows/ci.yml runs on pushes and PRs:

  • Checks out the repo and sets up Python 3.11.
  • Installs dependencies (requirements.txt + requirements-dev.txt).
  • Runs ruff check over the Python packages.
  • Executes pytest to run the full suite on every push and PR.
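
Assembled from the bullets above, the workflow is roughly the following sketch; the actual .github/workflows/ci.yml may differ in action versions and step layout:

```yaml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python3 -m pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .
      - run: python3 -m pytest
```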

Next steps

  • Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
  • Persist grids and histories in a lightweight store so long-running sessions survive restarts.
  • Extend the websocket stream to invalidate cached grid_ids automatically when the scene changes.