Add grid planner, CI, and tests

2026-04-05 19:27:55 +02:00
parent a2ef50401b
commit b1d2b6b321
16 changed files with 383 additions and 19 deletions
--- a/README.md
+++ b/README.md
@@ -11,23 +11,42 @@ Let an Agent interact with your Computer.

 - `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
 - `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
+- `GET /grid/{grid_id}/summary`: Returns both a heuristic description (from `GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
+- `GET /grid/{grid_id}/history`: Returns the action history for that grid so an agent or operator can audit what was done.
 - `GET /health`: A minimal health check for deployments.

-The server tracks each grid by a UUID and keeps layout metadata so multiple agents can keep in sync with the same screenshot/scene.
+Vision metadata is kept per grid: action history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can describe the scene in sentences directly.

 ## Skill layer (OpenClaw integration)

-The `skill/` package is a placeholder for how an agent action would look in OpenClaw. It wraps the server calls, interprets the grid cells, and exposes helpers such as `describe_grid()` and `plan_action()` so future work can plug into the agent toolkit directly.
+The `skill/` package wraps the server calls and exposes helpers:

-## Getting started
+- `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
+- `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
+- `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
+- `ClickthroughAgentRunner` simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.

-1. Install dependencies: `python -m pip install -r requirements.txt`.
-2. Run the server: `uvicorn server.main:app --reload`.
-3. Use the skill helper to bootstrap a grid, or wire the REST endpoints into a higher-level agent.
+Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.
+
+## Testing
+
+1. `python3 -m pip install -r requirements.txt`
+2. `python3 -m pip install -r requirements-dev.txt`
+3. `python3 -m pytest`
+
+The `tests/` suite covers grid construction, the FastAPI surface, and the skill/runner helpers.
+
+## Continuous Integration
+
+`.github/workflows/ci.yml` runs on pushes and pull requests:
+
+- Checks out the repo and sets up Python 3.11.
+- Installs dependencies (`requirements.txt` + `requirements-dev.txt`).
+- Runs `ruff check` over the Python packages.
+- Runs the `pytest` suite to catch regressions.

 ## Next steps

- Add real OCR/layout logic so cells understand labels.
- Turn the action planner into a state machine that can focus/double-click/type/drag.
- Persist grid sessions for longer running interactions.
- Ship the OpenClaw skill (skill folder) as a plugin that can call `http://localhost:8000` and scaffold the agent’s reasoning.
+- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
+- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
+- Expose a websocket/watch endpoint that streams updated screenshots and invalidates cached `grid_id`s when the scene changes.
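
The README text in this diff says `POST /grid/init` returns cell bounds and `POST /grid/action` returns computed coordinates, but the commit page does not show how either is calculated. As an illustration only, here is a minimal Python sketch of one way it could work, assuming pixel-aligned rectangular cells and click targets at cell centers. `CellBounds`, `build_cells`, and the `r{row}c{col}` ID scheme are hypothetical names, not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellBounds:
    """Hypothetical cell record: pixel bounds, right/bottom edges exclusive."""

    cell_id: str
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center(self) -> tuple[int, int]:
        # Integer midpoint of the cell, usable as a click coordinate.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def build_cells(width: int, height: int, rows: int, columns: int) -> list[CellBounds]:
    """Split a width x height screenshot into rows x columns cells.

    Edges are computed with integer division on the running index, so the
    cells tile the image exactly even when the size is not evenly divisible.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if rows <= 0 or columns <= 0:
        raise ValueError("rows and columns must be positive")

    cells = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(columns):
            left = c * width // columns
            right = (c + 1) * width // columns
            # Assumed ID scheme: row-major "r<row>c<col>".
            cells.append(CellBounds(f"r{r}c{c}", left, top, right, bottom))
    return cells


if __name__ == "__main__":
    grid = build_cells(1920, 1080, rows=3, columns=4)
    target = grid[-1]
    print(target.cell_id, target.center)
```

Computing edges from the running index rather than a fixed cell size avoids accumulating rounding error, so the last column and row always end exactly at the image border.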