23
.github/workflows/ci.yml
vendored
Normal file
@@ -0,0 +1,23 @@
name: CI

on:
  push: {}
  pull_request: {}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11
      - name: Install runtime dependencies
        run: python -m pip install --upgrade pip && pip install -r requirements.txt
      - name: Install dev dependencies
        run: pip install -r requirements-dev.txt
      - name: Run lints
        run: ruff check server skill tests
      - name: Run tests
        run: pytest
39
README.md
@@ -11,23 +11,42 @@ Let an Agent interact with your Computer.

- `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
- `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
- `GET /grid/{grid_id}/summary`: Returns both a heuristic description (`GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
- `GET /grid/{grid_id}/history`: Returns the action history for that grid so an agent or operator can audit what was done.
- `GET /health`: A minimal health check for deployments.

The server tracks each grid by a UUID and keeps layout metadata so multiple agents can keep in sync with the same screenshot/scene.
Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.

## Skill layer (OpenClaw integration)

The `skill/` package is a placeholder for how an agent action would look in OpenClaw. It wraps the server calls, interprets the grid cells, and exposes helpers such as `describe_grid()` and `plan_action()` so future work can plug into the agent toolkit directly.
The `skill/` package wraps the server calls and exposes helpers:

## Getting started
- `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
- `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
- `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
- `ClickthroughAgentRunner` simulates a tiny agent loop that chooses a cell (optionally by label), submits an action, and fetches the summary/history.

1. Install dependencies: `python -m pip install -r requirements.txt`.
2. Run the server: `uvicorn server.main:app --reload`.
3. Use the skill helper to bootstrap a grid, or wire the REST endpoints into a higher-level agent.
Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.

## Testing

1. `python3 -m pip install -r requirements.txt`
2. `python3 -m pip install -r requirements-dev.txt`
3. `python3 -m pytest`

The `tests/` suite covers grid construction, the FastAPI surface, and the skill/runner helpers.

## Continuous Integration

`.github/workflows/ci.yml` runs on pushes and PRs:

- Checks out the repo and sets up Python 3.11.
- Installs dependencies (`requirements.txt` + `requirements-dev.txt`).
- Runs `ruff check` over the Python packages.
- Executes `pytest` to run the test suite.

## Next steps

- Add real OCR/layout logic so cells understand labels.
- Turn the action planner into a state machine that can focus/double-click/type/drag.
- Persist grid sessions for longer running interactions.
- Ship the OpenClaw skill (skill folder) as a plugin that can call `http://localhost:8000` and scaffold the agent’s reasoning.
- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
- Expose a websocket/watch endpoint that streams updated screenshots and invalidates cached `grid_id`s when the scene changes.
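The `POST /grid/init` bullet in the README says the server returns cell bounds for the requested rows and columns. A minimal sketch of how such bounds could be derived follows. The function name `build_cell_bounds` and the `r{row}c{col}` ID scheme are hypothetical; the commit does not show how the server names or computes its cells.

```python
from typing import Dict, Tuple

Bounds = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def build_cell_bounds(width: int, height: int, rows: int, columns: int) -> Dict[str, Bounds]:
    """Split a width x height screenshot into rows x columns rectangular cells.

    Hypothetical helper: the cell IDs and rounding rule are assumptions.
    """
    if rows <= 0 or columns <= 0:
        raise ValueError("rows and columns must be positive")
    cells: Dict[str, Bounds] = {}
    for row in range(rows):
        for col in range(columns):
            # Integer division keeps neighbouring edges aligned, and the
            # last cell in each direction ends exactly at width/height.
            left = col * width // columns
            right = (col + 1) * width // columns
            top = row * height // rows
            bottom = (row + 1) * height // rows
            cells[f"r{row}c{col}"] = (left, top, right, bottom)
    return cells


# Same dimensions as the default_grid_request fixture in tests/conftest.py.
bounds = build_cell_bounds(640, 480, rows=3, columns=3)
```

Computing each edge from the cell index, rather than adding a fixed cell width repeatedly, avoids accumulating rounding error when the screenshot size is not divisible by the grid size.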
3
pytest.ini
Normal file
@@ -0,0 +1,3 @@
[pytest]
testpaths = tests
python_files = test_*.py
2
requirements-dev.txt
Normal file
@@ -0,0 +1,2 @@
pytest>=8.0.0
ruff>=0.0.1
@@ -1,4 +1,5 @@
fastapi>=0.105.2
uvicorn[standard]>=0.23.2
pydantic>=2.8.2
httpx>=0.30.0
pydantic-settings>=2.5.0
httpx>=0.28.1
5
ruff.toml
Normal file
@@ -0,0 +1,5 @@
# ruff.toml takes top-level keys; the [tool.ruff] table belongs in pyproject.toml
line-length = 100
select = ["E", "F", "I", "S"]
target-version = "py311"
exclude = ["data", "__pycache__"]
@@ -1,6 +1,7 @@
from pathlib import Path

from pydantic import BaseSettings
from pydantic import ConfigDict
from pydantic_settings import BaseSettings


class ServerSettings(BaseSettings):
@@ -10,6 +11,4 @@ class ServerSettings(BaseSettings):
    storage_dir: Path = Path("data/screenshots")
    default_timeout: int = 10

    class Config:
        env_prefix = "CLICKTHROUGH_"
        env_file = ".env"
    model_config = ConfigDict(env_prefix="CLICKTHROUGH_", env_file=".env")
@@ -1,7 +1,7 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Tuple
from typing import Dict, List, Tuple, Any
import uuid

from .actions import ActionEngine
@@ -31,6 +31,7 @@ class VisionGrid:
        self.width = request.width
        self.height = request.height
        self.cells: Dict[str, _StoredCell] = {}
        self._action_history: List[dict[str, Any]] = []
        self._engine = ActionEngine(self)
        self._build_cells()

@@ -75,7 +76,22 @@ class VisionGrid:
        return cell.center

    def apply_action(self, payload: ActionPayload) -> ActionResult:
        return self._engine.plan(payload)
        result = self._engine.plan(payload)
        self._action_history.append(result.model_dump())
        return result

    @property
    def action_history(self) -> List[dict[str, Any]]:
        return list(self._action_history)

    def summary(self) -> str:
        last_action = self._action_history[-1] if self._action_history else None
        last_summary = (
            f"Last action: {last_action.get('detail')}" if last_action else "No actions recorded yet"
        )
        return (
            f"Grid {self.grid_id} ({self.rows}x{self.columns}) with {len(self.cells)} cells. {last_summary}."
        )


class GridManager:
@@ -100,3 +116,9 @@ class GridManager:
            return self._grids[grid_id]
        except KeyError as exc:
            raise KeyError(f"Grid {grid_id} not found") from exc

    def get_history(self, grid_id: str) -> List[dict[str, Any]]:
        return self.get_grid(grid_id).action_history

    def clear(self) -> None:
        self._grids.clear()
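The `action_history` property in the hunks above returns `list(self._action_history)`, which is a shallow copy. The sketch below illustrates what that does and does not protect against. `HistoryRecorder` is a hypothetical stand-in written for this note, not the actual `VisionGrid` class, and the sample record is made up.

```python
from typing import Any, Dict, List


class HistoryRecorder:
    """Hypothetical stand-in for the history-tracking part of VisionGrid."""

    def __init__(self) -> None:
        self._history: List[Dict[str, Any]] = []

    def record(self, result: Dict[str, Any]) -> None:
        self._history.append(result)

    @property
    def history(self) -> List[Dict[str, Any]]:
        # Shallow copy: callers get a new list, but the dicts inside are shared.
        return list(self._history)


recorder = HistoryRecorder()
recorder.record({"detail": "click", "coordinates": (10, 20)})

snapshot = recorder.history
snapshot.append({"detail": "injected"})  # only changes the caller's copy
snapshot[0]["detail"] = "mutated"        # changes the recorder's entry too
```

If callers should never be able to alter recorded entries, returning `copy.deepcopy(self._history)` would be one option, at the cost of copying every record on each read.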
@@ -3,15 +3,17 @@ from fastapi import FastAPI, HTTPException
from .config import ServerSettings
from .grid import GridManager
from .models import ActionPayload, GridDescriptor, GridInitRequest
from .planner import GridPlanner


settings = ServerSettings()
manager = GridManager(settings)
planner = GridPlanner()

app = FastAPI(
    title="Clickthrough",
    description="Grid-aware surface that lets an agent plan clicks, drags, and typing on a fake screenshot",
    version="0.1.0",
    version="0.2.0",
)


@@ -33,3 +35,27 @@ def apply_action(payload: ActionPayload):
    except KeyError as exc:
        raise HTTPException(status_code=404, detail=str(exc)) from exc
    return grid.apply_action(payload)


@app.get("/grid/{grid_id}/summary")
def grid_summary(grid_id: str):
    try:
        grid = manager.get_grid(grid_id)
    except KeyError as exc:
        raise HTTPException(status_code=404, detail=str(exc)) from exc
    descriptor = grid.describe()
    return {
        "grid_id": grid_id,
        "summary": planner.describe(descriptor),
        "details": grid.summary(),
        "descriptor": descriptor,
    }


@app.get("/grid/{grid_id}/history")
def grid_history(grid_id: str):
    try:
        history = manager.get_history(grid_id)
    except KeyError as exc:
        raise HTTPException(status_code=404, detail=str(exc)) from exc
    return {"grid_id": grid_id, "history": history}
53
server/planner.py
Normal file
@@ -0,0 +1,53 @@
from __future__ import annotations

from math import hypot
from typing import Sequence

from .models import GridCellModel, GridDescriptor


class GridPlanner:
    """Helper that picks a grid cell using simple heuristics."""

    def select_cell(
        self, descriptor: GridDescriptor, preferred_label: str | None = None
    ) -> GridCellModel | None:
        if not descriptor.cells:
            return None

        if preferred_label:
            match = self._match_label(descriptor.cells, preferred_label)
            if match:
                return match

        center_point = self._grid_center(descriptor)
        return min(descriptor.cells, key=lambda cell: self._distance(self._cell_center(cell), center_point))

    def describe(self, descriptor: GridDescriptor) -> str:
        cell_count = len(descriptor.cells)
        return (
            f"Grid {descriptor.grid_id} is {descriptor.rows}x{descriptor.columns} with {cell_count} cells."
        )

    def _grid_center(self, descriptor: GridDescriptor) -> tuple[float, float]:
        width = descriptor.metadata.get("width", 0)
        height = descriptor.metadata.get("height", 0)
        return (width / 2, height / 2)

    def _cell_center(self, cell: GridCellModel) -> tuple[float, float]:
        left, top, right, bottom = cell.bounds
        return ((left + right) / 2, (top + bottom) / 2)

    def _distance(
        self, first: tuple[float, float], second: tuple[float, float]
    ) -> float:
        return hypot(first[0] - second[0], first[1] - second[1])

    def _match_label(
        self, cells: Sequence[GridCellModel], label: str
    ) -> GridCellModel | None:
        lowered = label.lower()
        for cell in cells:
            if cell.label and lowered in cell.label.lower():
                return cell
        return None
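The fallback in `GridPlanner.select_cell` picks the cell whose center lies closest to the center of the screenshot. The same heuristic can be tried in isolation with plain dictionaries, as sketched below. `nearest_to_center`, the dict layout, and the `r{row}c{col}` IDs are illustrative assumptions and do not use the project's `GridCellModel`.

```python
from math import hypot
from typing import Any, Dict, List


def nearest_to_center(cells: List[Dict[str, Any]], width: float, height: float) -> Dict[str, Any]:
    """Return the cell whose midpoint is closest to the screenshot midpoint."""
    cx, cy = width / 2, height / 2

    def distance(cell: Dict[str, Any]) -> float:
        left, top, right, bottom = cell["bounds"]
        return hypot((left + right) / 2 - cx, (top + bottom) / 2 - cy)

    # min() returns the first cell with the smallest distance, so ties
    # resolve in list order.
    return min(cells, key=distance)


# A 3x3 grid of 100-pixel cells on a 300x300 screenshot.
cells = [
    {"cell_id": f"r{r}c{c}", "bounds": (c * 100, r * 100, (c + 1) * 100, (r + 1) * 100)}
    for r in range(3)
    for c in range(3)
]
chosen = nearest_to_center(cells, 300, 300)
```

Here `chosen["cell_id"]` is `"r1c1"`, the middle cell. On a grid with an even number of rows or columns, several cells are equidistant from the center and the first one in list order wins.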
@@ -1,5 +1,11 @@
"""Utility helpers for the Clickthrough agent skill."""

from .agent_runner import AgentRunResult, ClickthroughAgentRunner
from .clickthrough_skill import ActionPlan, ClickthroughSkill

__all__ = ["ClickthroughSkill", "ActionPlan"]
__all__ = [
    "ClickthroughSkill",
    "ActionPlan",
    "ClickthroughAgentRunner",
    "AgentRunResult",
]
62
skill/agent_runner.py
Normal file
@@ -0,0 +1,62 @@
from dataclasses import dataclass
from typing import Any, Dict, Sequence

from .clickthrough_skill import ActionPlan, ClickthroughSkill


@dataclass
class AgentRunResult:
    summary: Dict[str, Any]
    action: Dict[str, Any]
    history: Dict[str, Any]
    grid: Dict[str, Any]


class ClickthroughAgentRunner:
    def __init__(self, skill: ClickthroughSkill) -> None:
        self.skill = skill

    def run_once(
        self,
        screenshot_base64: str,
        width: int,
        height: int,
        rows: int = 4,
        columns: int = 4,
        preferred_label: str | None = None,
        action: str = "click",
        text: str | None = None,
    ) -> AgentRunResult:
        grid = self.skill.describe_grid(
            screenshot_base64=screenshot_base64,
            width=width,
            height=height,
            rows=rows,
            columns=columns,
        )
        cells = grid.get("cells") or []
        target_cell = self._choose_cell(cells, preferred_label)
        plan = ActionPlan(
            grid_id=grid["grid_id"],
            target_cell=target_cell,
            action=action,
            text=text,
        )
        action_result = self.skill.plan_action(plan)
        summary = self.skill.grid_summary(grid["grid_id"])
        history = self.skill.grid_history(grid["grid_id"])
        return AgentRunResult(summary=summary, action=action_result, history=history, grid=grid)

    def _choose_cell(
        self, cells: Sequence[dict[str, Any]], preferred_label: str | None
    ) -> str:
        if not cells:
            raise ValueError("Grid contains no cells")
        if preferred_label:
            search = preferred_label.lower()
            for cell in cells:
                label_value = cell.get("label")
                if label_value and search in label_value.lower():
                    return cell["cell_id"]
        center_index = len(cells) // 2
        return cells[center_index]["cell_id"]
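`_choose_cell` falls back to `cells[len(cells) // 2]`, the middle of the list, which is not necessarily the middle of the screen. The sketch below shows where that index lands, assuming the server returns cells in row-major order; the diff does not confirm that ordering, and `middle_index_cell` is a hypothetical helper written for this note.

```python
from typing import Tuple


def middle_index_cell(rows: int, columns: int) -> Tuple[int, int]:
    """Return (row, column) of the cell at index len(cells) // 2.

    Assumes row-major ordering, i.e. index = row * columns + column.
    """
    index = (rows * columns) // 2
    return divmod(index, columns)


odd_grid = middle_index_cell(3, 3)   # index 4 of 9
even_grid = middle_index_cell(4, 4)  # index 8 of 16, the runner's default size
```

Under that assumption a 3x3 grid yields `(1, 1)`, the true center, while the default 4x4 grid yields `(2, 0)`, the first cell of the third row at the left edge. A distance-based fallback like the one in `GridPlanner.select_cell` would avoid that discrepancy.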
@@ -50,6 +50,16 @@ class ClickthroughSkill:
        response.raise_for_status()
        return response.json()

    def grid_summary(self, grid_id: str) -> Dict[str, Any]:
        response = self._client.get(f"/grid/{grid_id}/summary")
        response.raise_for_status()
        return response.json()

    def grid_history(self, grid_id: str) -> Dict[str, Any]:
        response = self._client.get(f"/grid/{grid_id}/history")
        response.raise_for_status()
        return response.json()


if __name__ == "__main__":
    import base64
29
tests/conftest.py
Normal file
@@ -0,0 +1,29 @@
import base64

import pytest

from server.main import manager


@pytest.fixture
def fake_screenshot() -> str:
    """Return a reproducible base64 string representing a dummy screenshot."""
    return base64.b64encode(b"clickthrough-dummy").decode()


@pytest.fixture
def default_grid_request(fake_screenshot):
    return {
        "width": 640,
        "height": 480,
        "screenshot_base64": fake_screenshot,
        "rows": 3,
        "columns": 3,
    }


@pytest.fixture(autouse=True)
def reset_manager_state():
    manager._grids.clear()
    yield
    manager._grids.clear()
53
tests/test_agent_runner.py
Normal file
@@ -0,0 +1,53 @@
from typing import Any, Dict

from skill.agent_runner import ClickthroughAgentRunner
from skill.clickthrough_skill import ActionPlan, ClickthroughSkill


class DummySkill(ClickthroughSkill):
    def __init__(self):
        self.last_plan: ActionPlan | None = None

    def describe_grid(
        self,
        screenshot_base64: str,
        width: int,
        height: int,
        rows: int = 4,
        columns: int = 4,
    ) -> Dict[str, Any]:
        return {
            "grid_id": "dummy-grid",
            "cells": [
                {"cell_id": "dummy-grid-1", "label": "button", "bounds": [0, 0, 100, 100]},
                {"cell_id": "dummy-grid-2", "label": "target", "bounds": [100, 0, 200, 100]},
            ],
        }

    def plan_action(self, plan: ActionPlan) -> Dict[str, Any]:
        self.last_plan = plan
        return {"success": True, "target_cell": plan.target_cell}

    def grid_summary(self, grid_id: str) -> Dict[str, Any]:
        return {"grid_id": grid_id, "summary": "ok"}

    def grid_history(self, grid_id: str) -> Dict[str, Any]:
        return {"grid_id": grid_id, "history": []}


def test_agent_runner_prefers_label():
    runner = ClickthroughAgentRunner(DummySkill())
    result = runner.run_once(
        screenshot_base64="AA==",
        width=120,
        height=80,
        preferred_label="target",
    )
    assert result.action["target_cell"] == "dummy-grid-2"
    assert result.summary["summary"] == "ok"


def test_agent_runner_defaults_to_center():
    runner = ClickthroughAgentRunner(DummySkill())
    result = runner.run_once(screenshot_base64="AA==", width=120, height=80)
    assert result.action["target_cell"] == "dummy-grid-2"
51
tests/test_grid.py
Normal file
@@ -0,0 +1,51 @@
from server.config import ServerSettings
from server.grid import GridManager
from server.models import ActionPayload, ActionType, GridInitRequest


def test_grid_creation_respects_dimensions(default_grid_request):
    settings = ServerSettings(grid_rows=2, grid_cols=2)
    manager = GridManager(settings)
    request = GridInitRequest(**default_grid_request)
    grid = manager.create_grid(request)

    descriptor = grid.describe()
    assert descriptor.grid_id
    assert descriptor.rows == 3
    assert descriptor.columns == 3
    assert len(descriptor.cells) == 9
    assert descriptor.metadata.get("width") == 640
    assert descriptor.metadata.get("height") == 480


def test_grid_action_records_history(default_grid_request):
    manager = GridManager(ServerSettings())
    request = GridInitRequest(**default_grid_request)
    grid = manager.create_grid(request)
    descriptor = grid.describe()
    target_cell = descriptor.cells[0].cell_id

    payload = ActionPayload(
        grid_id=descriptor.grid_id,
        action=ActionType.CLICK,
        target_cell=target_cell,
        comment="click test",
    )
    result = grid.apply_action(payload)

    assert result.success
    assert result.coordinates is not None
    assert grid.action_history[-1]["coordinates"] == result.coordinates


def test_manager_get_grid_missing(default_grid_request):
    manager = GridManager(ServerSettings())
    request = GridInitRequest(**default_grid_request)
    _ = manager.create_grid(request)

    try:
        manager.get_grid("does-not-exist")
        found = True
    except KeyError:
        found = False
    assert not found