Add planner previews and streaming

2026-04-05 19:33:24 +02:00
parent b1d2b6b321
commit 1b0b9cfdef
12 changed files with 332 additions and 31 deletions
--- a/README.md
+++ b/README.md
@@ -13,6 +13,8 @@ Let an Agent interact with your Computer.
 - `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
 - `GET /grid/{grid_id}/summary`: Returns both a heuristic description (`GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
 - `GET /grid/{grid_id}/history`: Streams back the action history for that grid so an agent or operator can audit what was done.
+- `POST /grid/{grid_id}/plan`: Lets `GridPlanner` select the target and return a preview action plan without committing to it, so we can inspect coordinates before triggering events.
+- `POST /grid/{grid_id}/refresh` + `GET /stream/screenshots`: Refresh the cached screenshot/metadata and broadcast the updated scene over a websocket so clients can redraw overlays in near real time.
 - `GET /health`: A minimal health check for deployments.

 Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
@@ -23,11 +25,20 @@ The `skill/` package wraps the server calls and exposes helpers:

 - `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
 - `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
+- `ClickthroughSkill.plan_with_planner()` calls `/grid/{grid_id}/plan`, so you can preview the `GridPlanner` suggestion before executing it.
 - `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
- `ClickthroughAgentRunner` simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.
+- `ClickthroughSkill.refresh_grid()` pushes a new screenshot and memo, triggering websocket listeners.
+- `ClickthroughAgentRunner` simulates a tiny agent loop that asks the planner for a preview, executes the resulting action, and then gathers the summary/history so you can iterate on reasoning loops in tests.

 Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.

+## Screenshot streaming
+
+Capture loops can now talk to FastAPI in two ways:
+
+1. POST `/grid/{grid_id}/refresh` with fresh base64 screenshots and an optional memo; the server updates the cached grid metadata and broadcasts the change.
+2. Open a websocket to `GET /stream/screenshots` (optionally passing `grid_id` as a query param) to receive realtime deltas whenever a refresh happens. Clients can use the descriptor/payload to redraw overlays or trigger new planner runs without polling.
+
 ## Testing

 1. `python3 -m pip install -r requirements.txt`