Add grid planner, CI, and tests
Some checks failed
CI / test (push) Failing after 1m12s

2026-04-05 19:27:55 +02:00
parent a2ef50401b
commit b1d2b6b321
16 changed files with 383 additions and 19 deletions

23
.github/workflows/ci.yml vendored Normal file

@@ -0,0 +1,23 @@
name: CI
on:
  push: {}
  pull_request: {}
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11
      - name: Install runtime dependencies
        run: python -m pip install --upgrade pip && pip install -r requirements.txt
      - name: Install dev dependencies
        run: pip install -r requirements-dev.txt
      - name: Run lints
        run: ruff check server skill tests
      - name: Run tests
        run: pytest


@@ -11,23 +11,42 @@ Let an Agent interact with your Computer.
 - `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
 - `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
+- `GET /grid/{grid_id}/summary`: Returns both a heuristic description (`GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
+- `GET /grid/{grid_id}/history`: Returns the action history for that grid so an agent or operator can audit what was done.
 - `GET /health`: A minimal health check for deployments.
+The server tracks each grid by a UUID and keeps layout metadata so multiple agents can keep in sync with the same screenshot/scene. Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
 ## Skill layer (OpenClaw integration)
-The `skill/` package is a placeholder for how an agent action would look in OpenClaw. It wraps the server calls, interprets the grid cells, and exposes helpers such as `describe_grid()` and `plan_action()` so future work can plug into the agent toolkit directly.
-## Getting started
-1. Install dependencies: `python -m pip install -r requirements.txt`.
-2. Run the server: `uvicorn server.main:app --reload`.
-3. Use the skill helper to bootstrap a grid, or wire the REST endpoints into a higher-level agent.
+The `skill/` package wraps the server calls and exposes helpers:
+- `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
+- `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
+- `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
+- `ClickthroughAgentRunner` simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.
+Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.
+## Testing
+1. `python3 -m pip install -r requirements.txt`
+2. `python3 -m pip install -r requirements-dev.txt`
+3. `python3 -m pytest`
+The `tests/` suite covers grid construction, the FastAPI surface, and the skill/runner helpers.
+## Continuous Integration
+`.github/workflows/ci.yml` runs on pushes and PRs:
+- Checks out the repo and sets up Python 3.11.
+- Installs dependencies (`requirements.txt` + `requirements-dev.txt`).
+- Runs `ruff check` over the Python packages.
+- Executes `pytest` to run the test suite.
 ## Next steps
-- Add real OCR/layout logic so cells understand labels.
-- Turn the action planner into a state machine that can focus/double-click/type/drag.
-- Persist grid sessions for longer running interactions.
+- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
+- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
+- Expose a websocket/watch endpoint that streams updated screenshots and invalidates cached `grid_id`s when the scene changes.
+- Ship the OpenClaw skill (skill folder) as a plugin that can call `http://localhost:8000` and scaffold the agent's reasoning.
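
The `/grid/init` and `/grid/action` entries above mention cell bounds and computed coordinates without showing the arithmetic. The following is a minimal sketch, assuming equally sized cells laid out row by row and a click that targets the cell center; `cell_bounds` and `cell_center` are hypothetical names for illustration, not functions from this repository, and the real `_build_cells` is not shown in this diff.

```python
from typing import List, Tuple

Bounds = Tuple[float, float, float, float]  # (left, top, right, bottom)


def cell_bounds(width: int, height: int, rows: int, columns: int) -> List[Bounds]:
    """Split a width x height screenshot into equally sized cells, row by row."""
    if rows <= 0 or columns <= 0:
        raise ValueError("rows and columns must be positive")
    cell_w = width / columns
    cell_h = height / rows
    return [
        (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)
        for row in range(rows)
        for col in range(columns)
    ]


def cell_center(bounds: Bounds) -> Tuple[float, float]:
    """Return the midpoint of a cell, a plausible target for a click action."""
    left, top, right, bottom = bounds
    return ((left + right) / 2, (top + bottom) / 2)


# Same dimensions as the test fixture in this commit: 640x480, 3 rows, 3 columns.
cells = cell_bounds(640, 480, rows=3, columns=3)
print(len(cells))
print(cell_center(cells[4]))  # the middle cell of a 3x3 grid
```

Under these assumptions the middle cell's center coincides with the screenshot center, up to floating-point rounding.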

3
pytest.ini Normal file

@@ -0,0 +1,3 @@
[pytest]
testpaths = tests
python_files = test_*.py

2
requirements-dev.txt Normal file

@@ -0,0 +1,2 @@
pytest>=8.0.0
ruff>=0.0.1


@@ -1,4 +1,5 @@
 fastapi>=0.105.2
 uvicorn[standard]>=0.23.2
 pydantic>=2.8.2
-httpx>=0.30.0
+pydantic-settings>=2.5.0
+httpx>=0.28.1

5
ruff.toml Normal file

@@ -0,0 +1,5 @@
line-length = 100
target-version = "py311"
exclude = ["data", "__pycache__"]

[lint]
select = ["E", "F", "I", "S"]


@@ -1,6 +1,7 @@
 from pathlib import Path

-from pydantic import BaseSettings
+from pydantic import ConfigDict
+from pydantic_settings import BaseSettings


 class ServerSettings(BaseSettings):
@@ -10,6 +11,4 @@ class ServerSettings(BaseSettings):
     storage_dir: Path = Path("data/screenshots")
     default_timeout: int = 10

-    class Config:
-        env_prefix = "CLICKTHROUGH_"
-        env_file = ".env"
+    model_config = ConfigDict(env_prefix="CLICKTHROUGH_", env_file=".env")


@@ -1,7 +1,7 @@
 from __future__ import annotations

 from dataclasses import dataclass
-from typing import Dict, Tuple
+from typing import Dict, List, Tuple, Any
 import uuid

 from .actions import ActionEngine
@@ -31,6 +31,7 @@ class VisionGrid:
         self.width = request.width
         self.height = request.height
         self.cells: Dict[str, _StoredCell] = {}
+        self._action_history: List[dict[str, Any]] = []
         self._engine = ActionEngine(self)
         self._build_cells()
@@ -75,7 +76,22 @@ class VisionGrid:
         return cell.center

     def apply_action(self, payload: ActionPayload) -> ActionResult:
-        return self._engine.plan(payload)
+        result = self._engine.plan(payload)
+        self._action_history.append(result.model_dump())
+        return result
+
+    @property
+    def action_history(self) -> List[dict[str, Any]]:
+        return list(self._action_history)
+
+    def summary(self) -> str:
+        last_action = self._action_history[-1] if self._action_history else None
+        last_summary = (
+            f"Last action: {last_action.get('detail')}" if last_action else "No actions recorded yet"
+        )
+        return (
+            f"Grid {self.grid_id} ({self.rows}x{self.columns}) with {len(self.cells)} cells. {last_summary}."
+        )


 class GridManager:
@@ -100,3 +116,9 @@ class GridManager:
             return self._grids[grid_id]
         except KeyError as exc:
             raise KeyError(f"Grid {grid_id} not found") from exc
+
+    def get_history(self, grid_id: str) -> List[dict[str, Any]]:
+        return self.get_grid(grid_id).action_history
+
+    def clear(self) -> None:
+        self._grids.clear()
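
The `action_history` property above returns `list(self._action_history)`, which protects the internal list but not the dictionaries inside it. A small sketch of that behaviour follows; `HistoryLog` is a hypothetical stand-in written for this note, not a class from the repository.

```python
from typing import Any, Dict, List


class HistoryLog:
    """Append-only record that hands out copies (hypothetical stand-in for VisionGrid)."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def record(self, entry: Dict[str, Any]) -> None:
        self._entries.append(entry)

    @property
    def entries(self) -> List[Dict[str, Any]]:
        # list(...) makes a shallow copy: the list is new, the dicts are shared.
        return list(self._entries)


log = HistoryLog()
log.record({"detail": "click"})

snapshot = log.entries
snapshot.append({"detail": "injected"})  # only changes the caller's copy
snapshot[0]["detail"] = "edited"         # changes the shared dict as well

print(len(log.entries), log.entries[0]["detail"])
```

If callers should not be able to alter recorded entries at all, `copy.deepcopy` on the way out is one option, at the cost of copying every entry on each read.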


@@ -3,15 +3,17 @@ from fastapi import FastAPI, HTTPException
 from .config import ServerSettings
 from .grid import GridManager
 from .models import ActionPayload, GridDescriptor, GridInitRequest
+from .planner import GridPlanner

 settings = ServerSettings()
 manager = GridManager(settings)
+planner = GridPlanner()

 app = FastAPI(
     title="Clickthrough",
     description="Grid-aware surface that lets an agent plan clicks, drags, and typing on a fake screenshot",
-    version="0.1.0",
+    version="0.2.0",
 )
@@ -33,3 +35,27 @@ def apply_action(payload: ActionPayload):
     except KeyError as exc:
         raise HTTPException(status_code=404, detail=str(exc)) from exc
     return grid.apply_action(payload)
+
+
+@app.get("/grid/{grid_id}/summary")
+def grid_summary(grid_id: str):
+    try:
+        grid = manager.get_grid(grid_id)
+    except KeyError as exc:
+        raise HTTPException(status_code=404, detail=str(exc)) from exc
+    descriptor = grid.describe()
+    return {
+        "grid_id": grid_id,
+        "summary": planner.describe(descriptor),
+        "details": grid.summary(),
+        "descriptor": descriptor,
+    }
+
+
+@app.get("/grid/{grid_id}/history")
+def grid_history(grid_id: str):
+    try:
+        history = manager.get_history(grid_id)
+    except KeyError as exc:
+        raise HTTPException(status_code=404, detail=str(exc)) from exc
+    return {"grid_id": grid_id, "history": history}

53
server/planner.py Normal file

@@ -0,0 +1,53 @@
from __future__ import annotations

from math import hypot
from typing import Sequence

from .models import GridCellModel, GridDescriptor


class GridPlanner:
    """Helper that picks a grid cell using simple heuristics."""

    def select_cell(
        self, descriptor: GridDescriptor, preferred_label: str | None = None
    ) -> GridCellModel | None:
        if not descriptor.cells:
            return None

        if preferred_label:
            match = self._match_label(descriptor.cells, preferred_label)
            if match:
                return match

        center_point = self._grid_center(descriptor)
        return min(descriptor.cells, key=lambda cell: self._distance(self._cell_center(cell), center_point))

    def describe(self, descriptor: GridDescriptor) -> str:
        cell_count = len(descriptor.cells)
        return (
            f"Grid {descriptor.grid_id} is {descriptor.rows}x{descriptor.columns} with {cell_count} cells."
        )

    def _grid_center(self, descriptor: GridDescriptor) -> tuple[float, float]:
        width = descriptor.metadata.get("width", 0)
        height = descriptor.metadata.get("height", 0)
        return (width / 2, height / 2)

    def _cell_center(self, cell: GridCellModel) -> tuple[float, float]:
        left, top, right, bottom = cell.bounds
        return ((left + right) / 2, (top + bottom) / 2)

    def _distance(
        self, first: tuple[float, float], second: tuple[float, float]
    ) -> float:
        return hypot(first[0] - second[0], first[1] - second[1])

    def _match_label(
        self, cells: Sequence[GridCellModel], label: str
    ) -> GridCellModel | None:
        lowered = label.lower()
        for cell in cells:
            if cell.label and lowered in cell.label.lower():
                return cell
        return None
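
The selection heuristic in `GridPlanner.select_cell` can be exercised without the server models. The sketch below re-states it over plain tuples; `pick_cell` and the `(cell_id, label, bounds)` layout are assumptions made for illustration only. It shows two properties worth knowing: label matching is a case-insensitive substring test that returns the first hit, and the fallback is the cell whose center lies nearest the grid center.

```python
from math import hypot
from typing import List, Optional, Tuple

# Hypothetical flat representation: (cell_id, label, (left, top, right, bottom))
Cell = Tuple[str, Optional[str], Tuple[float, float, float, float]]


def pick_cell(
    cells: List[Cell], width: float, height: float, preferred_label: Optional[str] = None
) -> Optional[Cell]:
    if not cells:
        return None
    if preferred_label:
        wanted = preferred_label.lower()
        for cell in cells:
            # Substring match, first hit wins.
            if cell[1] and wanted in cell[1].lower():
                return cell
    center_x, center_y = width / 2, height / 2

    def distance(cell: Cell) -> float:
        left, top, right, bottom = cell[2]
        return hypot((left + right) / 2 - center_x, (top + bottom) / 2 - center_y)

    # min() returns the first of several equally distant cells.
    return min(cells, key=distance)


cells: List[Cell] = [
    ("a", "Cancel button", (0, 0, 100, 100)),
    ("b", "OK button", (100, 0, 200, 100)),
    ("c", None, (200, 0, 300, 100)),
]
print(pick_cell(cells, 300, 100, "ok")[0])      # label match
print(pick_cell(cells, 300, 100)[0])            # nearest to center
print(pick_cell(cells, 300, 100, "button")[0])  # ambiguous label, first hit
```

Because an ambiguous label such as "button" silently resolves to the first matching cell, callers that need a specific control may want to pass a more distinctive label.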


@@ -1,5 +1,11 @@
 """Utility helpers for the Clickthrough agent skill."""

+from .agent_runner import AgentRunResult, ClickthroughAgentRunner
 from .clickthrough_skill import ActionPlan, ClickthroughSkill

-__all__ = ["ClickthroughSkill", "ActionPlan"]
+__all__ = [
+    "ClickthroughSkill",
+    "ActionPlan",
+    "ClickthroughAgentRunner",
+    "AgentRunResult",
+]

62
skill/agent_runner.py Normal file

@@ -0,0 +1,62 @@
from dataclasses import dataclass
from typing import Any, Dict, Sequence

from .clickthrough_skill import ActionPlan, ClickthroughSkill


@dataclass
class AgentRunResult:
    summary: Dict[str, Any]
    action: Dict[str, Any]
    history: Dict[str, Any]
    grid: Dict[str, Any]


class ClickthroughAgentRunner:
    def __init__(self, skill: ClickthroughSkill) -> None:
        self.skill = skill

    def run_once(
        self,
        screenshot_base64: str,
        width: int,
        height: int,
        rows: int = 4,
        columns: int = 4,
        preferred_label: str | None = None,
        action: str = "click",
        text: str | None = None,
    ) -> AgentRunResult:
        grid = self.skill.describe_grid(
            screenshot_base64=screenshot_base64,
            width=width,
            height=height,
            rows=rows,
            columns=columns,
        )
        cells = grid.get("cells") or []
        target_cell = self._choose_cell(cells, preferred_label)
        plan = ActionPlan(
            grid_id=grid["grid_id"],
            target_cell=target_cell,
            action=action,
            text=text,
        )
        action_result = self.skill.plan_action(plan)
        summary = self.skill.grid_summary(grid["grid_id"])
        history = self.skill.grid_history(grid["grid_id"])
        return AgentRunResult(summary=summary, action=action_result, history=history, grid=grid)

    def _choose_cell(
        self, cells: Sequence[dict[str, Any]], preferred_label: str | None
    ) -> str:
        if not cells:
            raise ValueError("Grid contains no cells")
        if preferred_label:
            search = preferred_label.lower()
            for cell in cells:
                label_value = cell.get("label")
                if label_value and search in label_value.lower():
                    return cell["cell_id"]
        center_index = len(cells) // 2
        return cells[center_index]["cell_id"]
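
`_choose_cell` falls back to `cells[len(cells) // 2]`, which is the middle of the list rather than the middle of the screen. The sketch below shows where that index lands, assuming the server returns cells in row-major order; the diff does not show the ordering, so treat that as an assumption, and `midpoint_cell` as a hypothetical helper.

```python
from typing import Tuple


def midpoint_cell(rows: int, columns: int) -> Tuple[int, int]:
    """Return (row, column) of the list midpoint for a row-major grid."""
    cells = [(row, col) for row in range(rows) for col in range(columns)]
    return cells[len(cells) // 2]


for rows, columns in [(3, 3), (4, 4)]:
    print((rows, columns), midpoint_cell(rows, columns))
```

Under that assumption a 3x3 grid yields the true center cell, while the runner's default 4x4 grid yields row 2, column 0, a cell on the left edge. The geometric fallback in `GridPlanner.select_cell` does not have this property, so the two fallbacks can disagree.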


@@ -50,6 +50,16 @@ class ClickthroughSkill:
         response.raise_for_status()
         return response.json()
+
+    def grid_summary(self, grid_id: str) -> Dict[str, Any]:
+        response = self._client.get(f"/grid/{grid_id}/summary")
+        response.raise_for_status()
+        return response.json()
+
+    def grid_history(self, grid_id: str) -> Dict[str, Any]:
+        response = self._client.get(f"/grid/{grid_id}/history")
+        response.raise_for_status()
+        return response.json()


 if __name__ == "__main__":
     import base64

29
tests/conftest.py Normal file

@@ -0,0 +1,29 @@
import base64

import pytest

from server.main import manager


@pytest.fixture
def fake_screenshot() -> str:
    """Return a reproducible base64 string representing a dummy screenshot."""
    return base64.b64encode(b"clickthrough-dummy").decode()


@pytest.fixture
def default_grid_request(fake_screenshot):
    return {
        "width": 640,
        "height": 480,
        "screenshot_base64": fake_screenshot,
        "rows": 3,
        "columns": 3,
    }


@pytest.fixture(autouse=True)
def reset_manager_state():
    manager._grids.clear()
    yield
    manager._grids.clear()


@@ -0,0 +1,53 @@
from typing import Any, Dict

from skill.agent_runner import ClickthroughAgentRunner
from skill.clickthrough_skill import ActionPlan, ClickthroughSkill


class DummySkill(ClickthroughSkill):
    def __init__(self):
        self.last_plan: ActionPlan | None = None

    def describe_grid(
        self,
        screenshot_base64: str,
        width: int,
        height: int,
        rows: int = 4,
        columns: int = 4,
    ) -> Dict[str, Any]:
        return {
            "grid_id": "dummy-grid",
            "cells": [
                {"cell_id": "dummy-grid-1", "label": "button", "bounds": [0, 0, 100, 100]},
                {"cell_id": "dummy-grid-2", "label": "target", "bounds": [100, 0, 200, 100]},
            ],
        }

    def plan_action(self, plan: ActionPlan) -> Dict[str, Any]:
        self.last_plan = plan
        return {"success": True, "target_cell": plan.target_cell}

    def grid_summary(self, grid_id: str) -> Dict[str, Any]:
        return {"grid_id": grid_id, "summary": "ok"}

    def grid_history(self, grid_id: str) -> Dict[str, Any]:
        return {"grid_id": grid_id, "history": []}


def test_agent_runner_prefers_label():
    runner = ClickthroughAgentRunner(DummySkill())
    result = runner.run_once(
        screenshot_base64="AA==",
        width=120,
        height=80,
        preferred_label="target",
    )
    assert result.action["target_cell"] == "dummy-grid-2"
    assert result.summary["summary"] == "ok"


def test_agent_runner_defaults_to_center():
    runner = ClickthroughAgentRunner(DummySkill())
    result = runner.run_once(screenshot_base64="AA==", width=120, height=80)
    assert result.action["target_cell"] == "dummy-grid-2"

51
tests/test_grid.py Normal file

@@ -0,0 +1,51 @@
from server.config import ServerSettings
from server.grid import GridManager
from server.models import ActionPayload, ActionType, GridInitRequest


def test_grid_creation_respects_dimensions(default_grid_request):
    settings = ServerSettings(grid_rows=2, grid_cols=2)
    manager = GridManager(settings)
    request = GridInitRequest(**default_grid_request)
    grid = manager.create_grid(request)
    descriptor = grid.describe()
    assert descriptor.grid_id
    assert descriptor.rows == 3
    assert descriptor.columns == 3
    assert len(descriptor.cells) == 9
    assert descriptor.metadata.get("width") == 640
    assert descriptor.metadata.get("height") == 480


def test_grid_action_records_history(default_grid_request):
    manager = GridManager(ServerSettings())
    request = GridInitRequest(**default_grid_request)
    grid = manager.create_grid(request)
    descriptor = grid.describe()
    target_cell = descriptor.cells[0].cell_id
    payload = ActionPayload(
        grid_id=descriptor.grid_id,
        action=ActionType.CLICK,
        target_cell=target_cell,
        comment="click test",
    )
    result = grid.apply_action(payload)
    assert result.success
    assert result.coordinates is not None
    assert grid.action_history[-1]["coordinates"] == result.coordinates


def test_manager_get_grid_missing(default_grid_request):
    manager = GridManager(ServerSettings())
    request = GridInitRequest(**default_grid_request)
    _ = manager.create_grid(request)
    try:
        manager.get_grid("does-not-exist")
        found = True
    except KeyError:
        found = False
    assert not found