reset

2026-04-05 19:48:00 +02:00
parent 48ac9f5d7d
commit 6f9eedcc7a
27 changed files with 1 additions and 1396 deletions
--- a/README.md
+++ b/README.md
@@ -1,69 +1,2 @@
 # Clickthrough
-
-Let an Agent interact with your Computer.
-
-`Clickthrough` is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:
-
-1. A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
-2. A **skill** that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.
-
-## Server surface (FastAPI)
-
- `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
- `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
- `GET /grid/{grid_id}/summary`: Returns both a heuristic description (`GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
- `GET /grid/{grid_id}/history`: Streams back the action history for that grid so an agent or operator can audit what was done.
- `POST /grid/{grid_id}/plan`: Lets `GridPlanner` select the target and return a preview action plan without committing to it, so we can inspect coordinates before triggering events.
- `POST /grid/{grid_id}/refresh` + `GET /stream/screenshots`: Refresh the cached screenshot/metadata and broadcast the updated scene over a websocket so clients can redraw overlays in near real time.
- `GET /health`: A minimal health check for deployments.
-
-Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
-
-## Skill layer (OpenClaw integration)
-
-The `skill/` package wraps the server calls and exposes helpers:
-
- `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
- `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
- `ClickthroughSkill.plan_with_planner()` calls `/grid/{grid_id}/plan`, so you can preview the `GridPlanner` suggestion before executing it.
- `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
- `ClickthroughSkill.refresh_grid()` pushes a new screenshot and memo, triggering websocket listeners.
- `ClickthroughAgentRunner` simulates a tiny agent loop that asks the planner for a preview, executes the resulting action, and then gathers the summary/history so you can iterate on reasoning loops in tests.
-
-Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.
-
-## Screenshot streaming
-
-Capture loops can now talk to FastAPI in two ways:
-
-1. POST `/grid/{grid_id}/refresh` with fresh base64 screenshots and an optional memo; the server updates the cached grid metadata and broadcasts the change.
-2. Open a websocket to `GET /stream/screenshots` (optionally passing `grid_id` as a query param) to receive realtime deltas whenever a refresh happens. Clients can use the descriptor/payload to redraw overlays or trigger new planner runs without polling.
-
-## Testing
-
-1. `python3 -m pip install -r requirements.txt`
-2. `python3 -m pip install -r requirements-dev.txt`
-3. `python3 -m pytest`
-
-The `tests/` suite covers grid construction, the FastAPI surface, and the skill/runner helpers.
-
-## Continuous Integration
-
-`.github/workflows/ci.yml` runs on pushes and PRs:
-
- Checks out the repo and sets up Python 3.11.
- Installs dependencies (`requirements.txt` + `requirements-dev.txt`).
- Runs `ruff check` over the Python packages.
- Executes `pytest` to keep coverage high.
-
-## Control UI
-
- `/ui/` serves a small control panel where you can bootstrap a grid from a base64 screenshot, ask the planner for a preview, execute clicks, refresh the screenshot, and watch the summary/history.
- Most traffic is HTTP: `/grid/init`, `/grid/{id}/plan`, `/grid/{id}/action`, `/grid/{id}/refresh`, `/grid/{id}/summary`, and `/grid/{id}/history`. Only the `/stream/screenshots` websocket pushes updates after a refresh so the overlay redraws.
- The FastAPI root now redirects to `/ui/` when the client assets are present, making the UI a lightweight entry point for demos or manual command-and-control work.
-
-## Next steps
-
- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
- Expand the UI to preview actions visually (perhaps overlaying cells on top of rendered screenshots).
+Let an Agent interact with your Computer.