# Clickthrough
Let an agent interact with your computer.
`Clickthrough` is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:
1. A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
2. A **skill** that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.
## Server surface (FastAPI)
- `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
- `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
- `GET /health`: A minimal health check for deployments.
The server tracks each grid by a UUID and keeps its layout metadata in memory, so multiple agents can stay in sync on the same screenshot/scene.
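The core of the grid bookkeeping is just arithmetic over the screenshot dimensions. The sketch below shows one plausible way the server could compute cell bounds on `/grid/init` and resolve a click target on `/grid/action`; the field names (`grid_id`, `cells`, `x`/`y`/`w`/`h`) and the `rNcM` cell-ID scheme are assumptions for illustration, not the actual API contract.

```python
import uuid

def init_grid(width: int, height: int, rows: int, cols: int) -> dict:
    """Split a width x height screenshot into rows x cols cell bounds.

    Mirrors the kind of payload POST /grid/init might return
    (all field names here are assumptions).
    """
    cell_w, cell_h = width / cols, height / rows
    cells = {}
    for r in range(rows):
        for c in range(cols):
            cells[f"r{r}c{c}"] = {
                "x": round(c * cell_w),
                "y": round(r * cell_h),
                "w": round(cell_w),
                "h": round(cell_h),
            }
    return {"grid_id": str(uuid.uuid4()), "cells": cells}

def plan_click(grid: dict, cell_id: str) -> dict:
    """Resolve a cell to its centre point, roughly what an ActionResult
    for a click action would need to carry."""
    b = grid["cells"][cell_id]
    return {"action": "click", "x": b["x"] + b["w"] // 2, "y": b["y"] + b["h"] // 2}
```

For a 1920x1080 screenshot split 4x4, each cell is 480x270 and a click on the top-left cell resolves to its centre at (240, 135).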
## Skill layer (OpenClaw integration)
The `skill/` package is a placeholder for how an agent action would look in OpenClaw. It wraps the server calls, interprets the grid cells, and exposes helpers such as `describe_grid()` and `plan_action()` so future work can plug into the agent toolkit directly.
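As a rough sketch of what those helpers could look like: `describe_grid()` might render a grid response into text an agent can reason over, and `plan_action()` might build the request body for `/grid/action`. The response shape assumed here (`grid_id` plus a `cells` mapping of bounds) and all field names are illustrative guesses, not the package's actual interface.

```python
def describe_grid(grid: dict) -> str:
    """Render a one-line-per-cell summary for an agent prompt.

    Assumes a /grid/init response shaped like:
    {"grid_id": ..., "cells": {cell_id: {"x", "y", "w", "h"}}}.
    """
    lines = [f"grid {grid['grid_id']} with {len(grid['cells'])} cells:"]
    for cell_id, b in sorted(grid["cells"].items()):
        lines.append(f"  {cell_id}: origin=({b['x']},{b['y']}) size={b['w']}x{b['h']}")
    return "\n".join(lines)

def plan_action(grid: dict, cell_id: str, action: str = "click") -> dict:
    """Build a JSON-ready body for POST /grid/action (field names assumed)."""
    if cell_id not in grid["cells"]:
        raise KeyError(f"unknown cell {cell_id!r}")
    return {"grid_id": grid["grid_id"], "cell": cell_id, "action": action}
```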
## Getting started
1. Install dependencies: `python -m pip install -r requirements.txt`.
2. Run the server: `uvicorn server.main:app --reload`.
3. Use the skill helper to bootstrap a grid, or wire the REST endpoints into a higher-level agent.
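If you go the REST route, the `/grid/init` body is just JSON with a base64-encoded screenshot. A minimal sketch of building that body, assuming the server expects `screenshot`, `rows`, and `cols` fields (the real field names may differ):

```python
import base64
import json

def build_init_request(png_bytes: bytes, rows: int, cols: int) -> str:
    """Serialise a JSON body for POST /grid/init.

    The field names ("screenshot", "rows", "cols") are assumptions
    about the API, not a documented contract.
    """
    return json.dumps({
        "screenshot": base64.b64encode(png_bytes).decode("ascii"),
        "rows": rows,
        "cols": cols,
    })
```

You could then send it with something like `curl -d @body.json -H "Content-Type: application/json" http://localhost:8000/grid/init` against the locally running server.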
## Next steps
- Add real OCR/layout logic so cells understand labels.
- Turn the action planner into a state machine that can focus/double-click/type/drag.
- Persist grid sessions for longer running interactions.
- Ship the OpenClaw skill (the `skill/` folder) as a plugin that can call `http://localhost:8000` and scaffold the agent's reasoning.