This commit is contained in:
Space-Banane
2026-04-05 19:48:00 +02:00
parent 48ac9f5d7d
commit 6f9eedcc7a
27 changed files with 1 additions and 1396 deletions

View File

@@ -1,23 +0,0 @@
name: CI
on:
push: {}
pull_request: {}
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Install runtime dependencies
run: python -m pip install --upgrade pip && pip install -r requirements.txt
- name: Install dev dependencies
run: pip install -r requirements-dev.txt
- name: Run lints
run: ruff check server skill tests
- name: Run tests
run: pytest

View File

@@ -1,69 +1,2 @@
# Clickthrough # Clickthrough
Let an Agent interact with your Computer. Let an Agent interact with your Computer.
`Clickthrough` is a proof-of-concept bridge between a vision-aware agent and a headless controller. The project is split into two halves:
1. A Python server that accepts a static grid overlay (think of a screenshot broken into cells) and exposes lightweight endpoints to ask questions, plan actions, or even run pointer/keyboard events.
2. A **skill** that bundles the HTTP calls/intent construction so we can hardwire the same flow inside OpenClaw later.
## Server surface (FastAPI)
- `POST /grid/init`: Accepts a base64 screenshot plus the requested rows/columns, returns a `grid_id`, cell bounds, and helpful metadata. The grid is stored in-memory so the agent can reference cells by ID in later actions.
- `POST /grid/action`: Takes a plan (`grid_id`, optional target cell, and an action like `click`/`drag`/`type`) and returns a structured `ActionResult` with computed coordinates for tooling to consume.
- `GET /grid/{grid_id}/summary`: Returns both a heuristic description (`GridPlanner`) and a rich descriptor so the skill can summarize what it sees.
- `GET /grid/{grid_id}/history`: Streams back the action history for that grid so an agent or operator can audit what was done.
- `POST /grid/{grid_id}/plan`: Lets `GridPlanner` select the target and return a preview action plan without committing to it, so we can inspect coordinates before triggering events.
- `POST /grid/{grid_id}/refresh` + `GET /stream/screenshots`: Refresh the cached screenshot/metadata and broadcast the updated scene over a websocket so clients can redraw overlays in near real time.
- `GET /health`: A minimal health check for deployments.
Vision metadata is kept on a per-grid basis, including history, layout dimensions, and any appended memo. Each `VisionGrid` also exposes a short textual summary so the skill layer can turn sensory data into sentences directly.
## Skill layer (OpenClaw integration)
The `skill/` package wraps the server calls and exposes helpers:
- `ClickthroughSkill.describe_grid()` builds a grid session and returns the descriptor.
- `ClickthroughSkill.plan_action()` drives the `/grid/action` endpoint.
- `ClickthroughSkill.plan_with_planner()` calls `/grid/{grid_id}/plan`, so you can preview the `GridPlanner` suggestion before executing it.
- `ClickthroughSkill.grid_summary()` and `.grid_history()` surface the new metadata endpoints.
- `ClickthroughSkill.refresh_grid()` pushes a new screenshot and memo, triggering websocket listeners.
- `ClickthroughAgentRunner` simulates a tiny agent loop that asks the planner for a preview, executes the resulting action, and then gathers the summary/history so you can iterate on reasoning loops in tests.
Future work can swap the stub runner for a full OpenClaw skill that keeps reasoning inside the agent and uses these primitives to steer the mouse/keyboard.
## Screenshot streaming
Capture loops can now talk to FastAPI in two ways:
1. POST `/grid/{grid_id}/refresh` with fresh base64 screenshots and an optional memo; the server updates the cached grid metadata and broadcasts the change.
2. Open a websocket to `GET /stream/screenshots` (optionally passing `grid_id` as a query param) to receive realtime deltas whenever a refresh happens. Clients can use the descriptor/payload to redraw overlays or trigger new planner runs without polling.
## Testing
1. `python3 -m pip install -r requirements.txt`
2. `python3 -m pip install -r requirements-dev.txt`
3. `python3 -m pytest`
The `tests/` suite covers grid construction, the FastAPI surface, and the skill/runner helpers.
## Continuous Integration
`.github/workflows/ci.yml` runs on pushes and PRs:
- Checks out the repo and sets up Python 3.11.
- Installs dependencies (`requirements.txt` + `requirements-dev.txt`).
- Runs `ruff check` over the Python packages.
- Executes `pytest` to keep coverage high.
## Control UI
- `/ui/` serves a small control panel where you can bootstrap a grid from a base64 screenshot, ask the planner for a preview, execute clicks, refresh the screenshot, and watch the summary/history.
- Most traffic is HTTP: `/grid/init`, `/grid/{id}/plan`, `/grid/{id}/action`, `/grid/{id}/refresh`, `/grid/{id}/summary`, and `/grid/{id}/history`. Only the `/stream/screenshots` websocket pushes updates after a refresh so the overlay redraws.
- The FastAPI root now redirects to `/ui/` when the client assets are present, making the UI a lightweight entry point for demos or manual command-and-control work.
## Next steps
- Add OCR or UI heuristics so grid cells have meaningful labels before the agent reasons about them.
- Persist grids and histories in a lightweight store so long-running sessions survive restarts.
- Expand the UI to preview actions visually (perhaps overlaying cells on top of rendered screenshots).

View File

@@ -1,159 +0,0 @@
const gridForm = document.getElementById("grid-form");
const descriptorEl = document.getElementById("descriptor");
const gridMetaEl = document.getElementById("grid-meta");
const summaryEl = document.getElementById("summary");
const historyEl = document.getElementById("history");
const planOutput = document.getElementById("plan-output");
const preferredInput = document.getElementById("preferred-label");
const refreshScreenshot = document.getElementById("refresh-screenshot");
const refreshMemo = document.getElementById("refresh-memo");
const logEl = document.getElementById("ws-log");
let currentGrid = null;
let lastPlan = null;
let ws = null;
let keepAliveId = null;
const log = (message) => {
const timestamp = new Date().toLocaleTimeString();
logEl.textContent = `[${timestamp}] ${message}\n${logEl.textContent}`;
};
const headers = {
"Content-Type": "application/json",
};
const subscribeToGrid = (gridId) => {
if (!gridId) return;
if (ws) {
ws.close();
}
const protocol = window.location.protocol === "https:" ? "wss" : "ws";
ws = new WebSocket(`${protocol}://${window.location.host}/stream/screenshots?grid_id=${gridId}`);
ws.addEventListener("open", () => {
log(`WebSocket listening for grid ${gridId}`);
ws.send("ready");
keepAliveId = setInterval(() => ws.send("ping"), 15000);
});
ws.addEventListener("message", (event) => {
log(`Update received → ${event.data}`);
});
ws.addEventListener("close", () => {
log("WebSocket disconnected");
if (keepAliveId) {
clearInterval(keepAliveId);
keepAliveId = null;
}
});
};
const updateDescriptor = (descriptor) => {
descriptorEl.textContent = JSON.stringify(descriptor, null, 2);
gridMetaEl.textContent = `Grid ${descriptor.grid_id} (${descriptor.rows}x${descriptor.columns}) · ${descriptor.cells.length} cells`;
};
const updateSummary = async () => {
if (!currentGrid) return;
const [summaryResponse, historyResponse] = await Promise.all([
fetch(`/grid/${currentGrid}/summary`),
fetch(`/grid/${currentGrid}/history`),
]);
if (summaryResponse.ok) {
const payload = await summaryResponse.json();
summaryEl.textContent = payload.summary;
}
if (historyResponse.ok) {
const payload = await historyResponse.json();
historyEl.textContent = JSON.stringify(payload.history, null, 2);
}
};
const initGrid = async (event) => {
event.preventDefault();
const formData = new FormData(gridForm);
const payload = {
width: Number(formData.get("width")),
height: Number(formData.get("height")),
rows: Number(formData.get("rows")),
columns: Number(formData.get("columns")),
screenshot_base64: formData.get("screenshot"),
};
const response = await fetch("/grid/init", {
method: "POST",
headers,
body: JSON.stringify(payload),
});
const descriptor = await response.json();
currentGrid = descriptor.grid_id;
updateDescriptor(descriptor);
await updateSummary();
subscribeToGrid(currentGrid);
planOutput.textContent = "Plan preview will appear here.";
log(`Grid ${currentGrid} initialized.`);
};
document.getElementById("plan-button").addEventListener("click", async () => {
if (!currentGrid) {
log("Initialize a grid first.");
return;
}
const response = await fetch(`/grid/${currentGrid}/plan`, {
method: "POST",
headers,
body: JSON.stringify({
preferred_label: preferredInput.value || null,
action: "click",
text: "ui-trigger",
}),
});
const result = await response.json();
lastPlan = result.plan;
planOutput.textContent = JSON.stringify(result, null, 2);
});
document.getElementById("run-action").addEventListener("click", async () => {
if (!lastPlan) {
log("Run the planner first.");
return;
}
const payload = {
grid_id: lastPlan.grid_id,
action: lastPlan.action,
target_cell: lastPlan.target_cell,
text: "from-ui",
comment: "UI action",
};
const response = await fetch("/grid/action", {
method: "POST",
headers,
body: JSON.stringify(payload),
});
const result = await response.json();
log(`Action ${result.detail} at ${result.coordinates}`);
await updateSummary();
});
document.getElementById("refresh-button").addEventListener("click", async () => {
if (!currentGrid) {
log("Start a grid first.");
return;
}
const payload = {
screenshot_base64: refreshScreenshot.value || "",
memo: refreshMemo.value || undefined,
};
const response = await fetch(`/grid/${currentGrid}/refresh`, {
method: "POST",
headers,
body: JSON.stringify(payload),
});
const data = await response.json();
log(`Refresh acknowledged: ${JSON.stringify(data)}`);
});
gridForm.addEventListener("submit", initGrid);

View File

@@ -1,85 +0,0 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Clickthrough Control</title>
<link rel="stylesheet" href="styles.css" />
</head>
<body>
<main>
<header>
<h1>Clickthrough Control Panel</h1>
<p>Most actions use HTTP; screenshots stream over WebSocket when refreshed.</p>
</header>
<section class="card">
<h2>Grid bootstrap</h2>
<form id="grid-form">
<label>
Width
<input type="number" name="width" value="640" min="1" required />
</label>
<label>
Height
<input type="number" name="height" value="480" min="1" required />
</label>
<label>
Rows
<input type="number" name="rows" value="3" min="1" required />
</label>
<label>
Columns
<input type="number" name="columns" value="3" min="1" required />
</label>
<label class="stretch">
Screenshot (base64)
<textarea name="screenshot" id="screenshot" rows="3">AA==</textarea>
</label>
<button type="submit">Init grid</button>
</form>
</section>
<section class="card" id="grid-details">
<h2>Grid status</h2>
<div id="grid-meta">No grid yet.</div>
<pre id="descriptor" class="monospace"></pre>
</section>
<section class="card">
<h2>Planner &amp; Actions</h2>
<label>
Preferred label
<input type="text" id="preferred-label" placeholder="button" />
</label>
<div class="button-row">
<button type="button" id="plan-button">Preview plan</button>
<button type="button" id="run-action">Run action</button>
</div>
<pre id="plan-output" class="monospace">Plan preview will appear here.</pre>
</section>
<section class="card">
<h2>Refresh Screenshot</h2>
<textarea id="refresh-screenshot" rows="3" placeholder="Paste new base64 screenshot"></textarea>
<label>
Memo
<input type="text" id="refresh-memo" placeholder="Describe the scene" />
</label>
<button type="button" id="refresh-button">Refresh grid</button>
<p class="note">Refresh triggers /stream/screenshots so the UI can redraw.</p>
</section>
<section class="card">
<h2>Summary &amp; history</h2>
<pre id="summary" class="monospace">No data yet.</pre>
<pre id="history" class="monospace">History will show here.</pre>
</section>
<section class="card">
<h2>Websocket log</h2>
<pre id="ws-log" class="monospace">Waiting for updates…</pre>
</section>
</main>
<script type="module" src="app.js"></script>
</body>
</html>

View File

@@ -1,108 +0,0 @@
* {
box-sizing: border-box;
}
body {
font-family: "Inter", "Segoe UI", system-ui, sans-serif;
margin: 0;
background: #121212;
color: #f5f5f5;
}
main {
max-width: 960px;
margin: 0 auto;
padding: 24px;
}
header {
text-align: center;
margin-bottom: 24px;
}
header h1 {
margin-bottom: 8px;
}
.card {
background: #1f1f1f;
padding: 16px;
border-radius: 16px;
margin-bottom: 16px;
box-shadow: 0 20px 45px rgba(0, 0, 0, 0.35);
}
label {
display: block;
margin-bottom: 12px;
}
label input,
label textarea {
width: 100%;
border-radius: 10px;
border: 1px solid #333;
background: #0f0f0f;
color: #f1f1f1;
padding: 8px 12px;
margin-top: 4px;
font-family: inherit;
}
textarea {
font-family: inherit;
}
button {
background: linear-gradient(135deg, #6d7cff, #3b82f6);
border: none;
padding: 10px 20px;
color: white;
border-radius: 999px;
font-weight: 600;
cursor: pointer;
transition: transform 0.15s ease;
}
button:hover {
transform: translateY(-1px);
}
.button-row {
display: flex;
gap: 12px;
flex-wrap: wrap;
margin-bottom: 12px;
}
.monospace {
background: #0c0c0c;
border-radius: 12px;
padding: 12px;
border: 1px solid #333;
min-height: 80px;
}
.note {
font-size: 0.9rem;
margin-top: 8px;
color: #b0b0b0;
}
@media (min-width: 768px) {
label {
display: flex;
gap: 12px;
align-items: center;
}
label input,
label textarea {
width: auto;
flex: 1;
}
.stretch textarea {
width: 100%;
}
}

View File

@@ -1,3 +0,0 @@
[pytest]
testpaths = tests
python_files = test_*.py

View File

@@ -1,2 +0,0 @@
pytest>=8.0.0
ruff>=0.0.1

View File

@@ -1,5 +0,0 @@
fastapi>=0.105.2
uvicorn[standard]>=0.23.2
pydantic>=2.8.2
pydantic-settings>=2.5.0
httpx>=0.28.1

View File

@@ -1,5 +0,0 @@
[tool.ruff]
line-length = 100
select = ["E", "F", "I", "S"]
target-version = "py311"
exclude = ["data", "__pycache__"]

View File

@@ -1 +0,0 @@
from .main import app # noqa: F401

View File

@@ -1,34 +0,0 @@
from __future__ import annotations
from typing import Tuple
from .models import ActionPayload, ActionResult
class ActionEngine:
def __init__(self, grid) -> None:
self.grid = grid
def plan(self, payload: ActionPayload) -> ActionResult:
coords = self._resolve_coords(payload.target_cell)
detail = self._describe(payload, coords)
return ActionResult(
success=True,
detail=detail,
coordinates=coords,
payload={"comment": payload.comment or "", "text": payload.text or ""},
)
def _resolve_coords(self, target_cell: str | None) -> Tuple[int, int] | None:
if not target_cell:
return None
return self.grid.resolve_cell_center(target_cell)
def _describe(
self, payload: ActionPayload, coords: Tuple[int, int] | None
) -> str:
cell_info = payload.target_cell or "free space"
location = f"@{cell_info}" if coords else "(no target)"
action_hint = payload.action.value
extra = f" text='{payload.text}'" if payload.text else ""
return f"Plan {action_hint} {location}{extra}"

View File

@@ -1,14 +0,0 @@
from pathlib import Path
from pydantic import ConfigDict
from pydantic_settings import BaseSettings
class ServerSettings(BaseSettings):
grid_rows: int = 4
grid_cols: int = 4
cell_margin_px: int = 4
storage_dir: Path = Path("data/screenshots")
default_timeout: int = 10
model_config = ConfigDict(env_prefix="CLICKTHROUGH_", env_file=".env")

View File

@@ -1,136 +0,0 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple
import uuid
from .actions import ActionEngine
from .config import ServerSettings
from .models import (
ActionPayload,
ActionResult,
GridCellModel,
GridDescriptor,
GridInitRequest,
)
@dataclass
class _StoredCell:
model: GridCellModel
center: Tuple[int, int]
class VisionGrid:
def __init__(self, request: GridInitRequest, grid_id: str, rows: int, columns: int):
self.grid_id = grid_id
self.screenshot = request.screenshot_base64
self.memo = request.memo
self.rows = rows
self.columns = columns
self.width = request.width
self.height = request.height
self.cells: Dict[str, _StoredCell] = {}
self._action_history: List[dict[str, Any]] = []
self._engine = ActionEngine(self)
self._build_cells()
def _build_cells(self, margin: int = 4) -> None:
cell_width = max(1, self.width // self.columns)
cell_height = max(1, self.height // self.rows)
for row in range(self.rows):
for col in range(self.columns):
left = col * cell_width + margin
top = row * cell_height + margin
right = min(self.width - margin, (col + 1) * cell_width - margin)
bottom = min(self.height - margin, (row + 1) * cell_height - margin)
cell_id = f"{self.grid_id}-{row}-{col}"
bounds = (left, top, right, bottom)
center = ((left + right) // 2, (top + bottom) // 2)
cell = GridCellModel(
cell_id=cell_id,
row=row,
column=col,
bounds=bounds,
)
self.cells[cell_id] = _StoredCell(model=cell, center=center)
def describe(self) -> GridDescriptor:
return GridDescriptor(
grid_id=self.grid_id,
rows=self.rows,
columns=self.columns,
cells=[cell.model for cell in self.cells.values()],
metadata=self.metadata,
)
@property
def metadata(self) -> Dict[str, Any]:
return {
"memo": self.memo or "",
"width": self.width,
"height": self.height,
}
def resolve_cell_center(self, cell_id: str) -> Tuple[int, int]:
cell = self.cells.get(cell_id)
if not cell:
raise KeyError(f"Unknown cell {cell_id}")
return cell.center
def preview_action(self, payload: ActionPayload) -> ActionResult:
return self._engine.plan(payload)
def apply_action(self, payload: ActionPayload) -> ActionResult:
result = self._engine.plan(payload)
self._action_history.append(result.model_dump())
return result
def update_screenshot(self, screenshot_base64: str, memo: str | None = None) -> None:
self.screenshot = screenshot_base64
if memo:
self.memo = memo
@property
def action_history(self) -> List[dict[str, Any]]:
return list(self._action_history)
def summary(self) -> str:
last_action = self._action_history[-1] if self._action_history else None
last_summary = (
f"Last action: {last_action.get('detail')}" if last_action else "No actions recorded yet"
)
return (
f"Grid {self.grid_id} ({self.rows}x{self.columns}) with {len(self.cells)} cells. {last_summary}."
)
class GridManager:
def __init__(self, settings: ServerSettings):
self.settings = settings
self._grids: Dict[str, VisionGrid] = {}
@property
def grid_count(self) -> int:
return len(self._grids)
def create_grid(self, request: GridInitRequest) -> VisionGrid:
rows = request.rows or self.settings.grid_rows
columns = request.columns or self.settings.grid_cols
grid_id = uuid.uuid4().hex
grid = VisionGrid(request, grid_id, rows, columns)
self._grids[grid_id] = grid
return grid
def get_grid(self, grid_id: str) -> VisionGrid:
try:
return self._grids[grid_id]
except KeyError as exc:
raise KeyError(f"Grid {grid_id} not found") from exc
def get_history(self, grid_id: str) -> List[dict[str, Any]]:
return self.get_grid(grid_id).action_history
def clear(self) -> None:
self._grids.clear()

View File

@@ -1,133 +0,0 @@
import time
from pathlib import Path
from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
from fastapi.responses import RedirectResponse
from fastapi.staticfiles import StaticFiles
from .config import ServerSettings
from .grid import GridManager
from .models import (
ActionPayload,
GridDescriptor,
GridInitRequest,
GridPlanRequest,
GridRefreshRequest,
)
from .planner import GridPlanner
from .streamer import ScreenshotStreamer
settings = ServerSettings()
manager = GridManager(settings)
planner = GridPlanner()
streamer = ScreenshotStreamer()
app = FastAPI(
title="Clickthrough",
description="Grid-aware surface that lets an agent plan clicks, drags, and typing on a fake screenshot",
version="0.3.0",
)
client_dir = Path(__file__).resolve().parent.parent / "client"
if client_dir.exists():
app.mount("/ui", StaticFiles(directory=str(client_dir), html=True), name="ui")
@app.get("/")
async def root():
if client_dir.exists():
return RedirectResponse("/ui/")
return {"status": "ok", "grid_count": manager.grid_count}
@app.get("/health")
def health_check() -> dict[str, str]:
return {"status": "ok", "grid_count": str(manager.grid_count)}
@app.post("/grid/init", response_model=GridDescriptor)
def init_grid(request: GridInitRequest) -> GridDescriptor:
grid = manager.create_grid(request)
return grid.describe()
@app.post("/grid/action")
def apply_action(payload: ActionPayload):
try:
grid = manager.get_grid(payload.grid_id)
except KeyError as exc:
raise HTTPException(status_code=404, detail=str(exc)) from exc
return grid.apply_action(payload)
@app.get("/grid/{grid_id}/summary")
def grid_summary(grid_id: str):
try:
grid = manager.get_grid(grid_id)
except KeyError as exc:
raise HTTPException(status_code=404, detail=str(exc)) from exc
descriptor = grid.describe()
return {
"grid_id": grid_id,
"summary": planner.describe(descriptor),
"details": grid.summary(),
"descriptor": descriptor,
}
@app.get("/grid/{grid_id}/history")
def grid_history(grid_id: str):
try:
history = manager.get_history(grid_id)
except KeyError as exc:
raise HTTPException(status_code=404, detail=str(exc)) from exc
return {"grid_id": grid_id, "history": history}
@app.post("/grid/{grid_id}/plan")
def plan_grid(grid_id: str, request: GridPlanRequest):
try:
grid = manager.get_grid(grid_id)
except KeyError as exc:
raise HTTPException(status_code=404, detail=str(exc)) from exc
descriptor = grid.describe()
payload = planner.build_payload(
descriptor,
action=request.action,
preferred_label=request.preferred_label,
text=request.text,
comment=request.comment,
)
result = grid.preview_action(payload)
return {"plan": payload.model_dump(), "result": result, "descriptor": descriptor}
@app.post("/grid/{grid_id}/refresh")
async def refresh_grid(grid_id: str, payload: GridRefreshRequest):
try:
grid = manager.get_grid(grid_id)
except KeyError as exc:
raise HTTPException(status_code=404, detail=str(exc)) from exc
grid.update_screenshot(payload.screenshot_base64, payload.memo)
descriptor = grid.describe()
await streamer.broadcast(
grid_id,
{
"grid_id": grid_id,
"timestamp": time.time(),
"descriptor": descriptor,
"screenshot_base64": payload.screenshot_base64,
},
)
return {"status": "updated", "grid_id": grid_id}
@app.websocket("/stream/screenshots")
async def stream_screenshots(websocket: WebSocket, grid_id: str | None = None):
key = await streamer.connect(websocket, grid_id)
try:
while True:
await websocket.receive_text()
except WebSocketDisconnect:
streamer.disconnect(websocket, key)

View File

@@ -1,67 +0,0 @@
from __future__ import annotations
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple
from pydantic import BaseModel, Field
class ActionType(str, Enum):
CLICK = "click"
DOUBLE_CLICK = "double_click"
DRAG = "drag"
TYPE = "type"
SCROLL = "scroll"
class GridInitRequest(BaseModel):
width: int
height: int
screenshot_base64: str
rows: Optional[int] = None
columns: Optional[int] = None
memo: Optional[str] = None
class GridCellModel(BaseModel):
cell_id: str
row: int
column: int
bounds: Tuple[int, int, int, int]
label: Optional[str] = None
class GridDescriptor(BaseModel):
grid_id: str
rows: int
columns: int
cells: List[GridCellModel]
metadata: Dict[str, Any] = Field(default_factory=dict)
class ActionPayload(BaseModel):
grid_id: str
action: ActionType
target_cell: Optional[str] = None
text: Optional[str] = None
comment: Optional[str] = None
data: Dict[str, Any] = Field(default_factory=dict)
class ActionResult(BaseModel):
success: bool
detail: str
coordinates: Optional[Tuple[int, int]] = None
payload: Dict[str, Any] = Field(default_factory=dict)
class GridPlanRequest(BaseModel):
preferred_label: Optional[str] = None
action: ActionType = ActionType.CLICK
text: Optional[str] = None
comment: Optional[str] = None
class GridRefreshRequest(BaseModel):
screenshot_base64: str
memo: Optional[str] = None

View File

@@ -1,70 +0,0 @@
from __future__ import annotations
from math import hypot
from typing import Sequence
from .models import ActionPayload, ActionType, GridCellModel, GridDescriptor
class GridPlanner:
"""Helper that picks a grid cell using simple heuristics."""
def select_cell(
self, descriptor: GridDescriptor, preferred_label: str | None = None
) -> GridCellModel | None:
if not descriptor.cells:
return None
if preferred_label:
match = self._match_label(descriptor.cells, preferred_label)
if match:
return match
center_point = self._grid_center(descriptor)
return min(descriptor.cells, key=lambda cell: self._distance(self._cell_center(cell), center_point))
def build_payload(
self,
descriptor: GridDescriptor,
action: ActionType = ActionType.CLICK,
preferred_label: str | None = None,
text: str | None = None,
comment: str | None = None,
) -> ActionPayload:
target = self.select_cell(descriptor, preferred_label)
return ActionPayload(
grid_id=descriptor.grid_id,
action=action,
target_cell=target.cell_id if target else None,
text=text,
comment=comment,
)
def describe(self, descriptor: GridDescriptor) -> str:
cell_count = len(descriptor.cells)
return (
f"Grid {descriptor.grid_id} is {descriptor.rows}x{descriptor.columns} with {cell_count} cells."
)
def _grid_center(self, descriptor: GridDescriptor) -> tuple[float, float]:
width = descriptor.metadata.get("width", 0)
height = descriptor.metadata.get("height", 0)
return (width / 2, height / 2)
def _cell_center(self, cell: GridCellModel) -> tuple[float, float]:
left, top, right, bottom = cell.bounds
return ((left + right) / 2, (top + bottom) / 2)
def _distance(
self, first: tuple[float, float], second: tuple[float, float]
) -> float:
return hypot(first[0] - second[0], first[1] - second[1])
def _match_label(
self, cells: Sequence[GridCellModel], label: str
) -> GridCellModel | None:
lowered = label.lower()
for cell in cells:
if cell.label and lowered in cell.label.lower():
return cell
return None

View File

@@ -1,38 +0,0 @@
from __future__ import annotations
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List
from fastapi import WebSocket
from websockets.exceptions import ConnectionClosedError
class ScreenshotStreamer:
"""Keeps websocket listeners and pushes screenshot updates."""
def __init__(self) -> None:
self._listeners: DefaultDict[str, List[WebSocket]] = defaultdict(list)
async def connect(self, websocket: WebSocket, grid_id: str | None = None) -> str:
await websocket.accept()
key = grid_id or "*"
self._listeners[key].append(websocket)
return key
def disconnect(self, websocket: WebSocket, grid_key: str | None = None) -> None:
key = grid_key or "*"
sockets = self._listeners.get(key)
if not sockets:
return
if websocket in sockets:
sockets.remove(websocket)
if not sockets:
self._listeners.pop(key, None)
async def broadcast(self, grid_id: str, payload: Dict[str, Any]) -> None:
listeners = list(self._listeners.get(grid_id, [])) + list(self._listeners.get("*", []))
for websocket in listeners:
try:
await websocket.send_json(payload)
except (ConnectionClosedError, RuntimeError):
self.disconnect(websocket, grid_id)

View File

@@ -1,11 +0,0 @@
"""Utility helpers for the Clickthrough agent skill."""
from .agent_runner import AgentRunResult, ClickthroughAgentRunner
from .clickthrough_skill import ActionPlan, ClickthroughSkill
__all__ = [
"ClickthroughSkill",
"ActionPlan",
"ClickthroughAgentRunner",
"AgentRunResult",
]

View File

@@ -1,60 +0,0 @@
from dataclasses import dataclass
from typing import Any, Dict
from .clickthrough_skill import ActionPlan, ClickthroughSkill
@dataclass
class AgentRunResult:
summary: Dict[str, Any]
action: Dict[str, Any]
history: Dict[str, Any]
grid: Dict[str, Any]
plan_preview: Dict[str, Any]
class ClickthroughAgentRunner:
def __init__(self, skill: ClickthroughSkill) -> None:
self.skill = skill
def run_once(
self,
screenshot_base64: str,
width: int,
height: int,
rows: int = 4,
columns: int = 4,
preferred_label: str | None = None,
action: str = "click",
text: str | None = None,
) -> AgentRunResult:
grid = self.skill.describe_grid(
screenshot_base64=screenshot_base64,
width=width,
height=height,
rows=rows,
columns=columns,
)
plan_response = self.skill.plan_with_planner(
grid_id=grid["grid_id"],
preferred_label=preferred_label,
action=action,
text=text,
)
plan_payload = plan_response["plan"]
plan = ActionPlan(
grid_id=plan_payload["grid_id"],
target_cell=plan_payload.get("target_cell"),
action=plan_payload["action"],
text=plan_payload.get("text"),
)
action_result = self.skill.plan_action(plan)
summary = self.skill.grid_summary(grid["grid_id"])
history = self.skill.grid_history(grid["grid_id"])
return AgentRunResult(
summary=summary,
action=action_result,
history=history,
grid=grid,
plan_preview=plan_response,
)

View File

@@ -1,98 +0,0 @@
from dataclasses import dataclass
from typing import Any, Dict
import httpx
@dataclass
class ActionPlan:
grid_id: str
target_cell: str | None
action: str
text: str | None = None
class ClickthroughSkill:
"""Lightweight wrapper around the Clickthrough HTTP API."""
def __init__(self, server_url: str = "http://localhost:8000") -> None:
self._client = httpx.Client(base_url=server_url, timeout=10)
def describe_grid(
self,
screenshot_base64: str,
width: int,
height: int,
rows: int = 4,
columns: int = 4,
) -> Dict[str, Any]:
payload = {
"width": width,
"height": height,
"rows": rows,
"columns": columns,
"screenshot_base64": screenshot_base64,
"memo": "agent-powered grid",
}
response = self._client.post("/grid/init", json=payload)
response.raise_for_status()
return response.json()
def plan_action(self, plan: ActionPlan) -> Dict[str, Any]:
payload = {
"grid_id": plan.grid_id,
"action": plan.action,
"target_cell": plan.target_cell,
"text": plan.text,
"comment": "skill-generated plan",
}
response = self._client.post("/grid/action", json=payload)
response.raise_for_status()
return response.json()
def grid_summary(self, grid_id: str) -> Dict[str, Any]:
response = self._client.get(f"/grid/{grid_id}/summary")
response.raise_for_status()
return response.json()
def grid_history(self, grid_id: str) -> Dict[str, Any]:
response = self._client.get(f"/grid/{grid_id}/history")
response.raise_for_status()
return response.json()
def plan_with_planner(
self,
grid_id: str,
preferred_label: str | None = None,
action: str = "click",
text: str | None = None,
comment: str | None = None,
) -> Dict[str, Any]:
payload = {
"preferred_label": preferred_label,
"action": action,
"text": text,
"comment": comment or "planner-generated",
}
response = self._client.post(f"/grid/{grid_id}/plan", json=payload)
response.raise_for_status()
return response.json()
def refresh_grid(self, grid_id: str, screenshot_base64: str, memo: str | None = None) -> Dict[str, Any]:
payload = {"screenshot_base64": screenshot_base64, "memo": memo}
response = self._client.post(f"/grid/{grid_id}/refresh", json=payload)
response.raise_for_status()
return response.json()
if __name__ == "__main__":
import base64
dummy = base64.b64encode(b"fake-screenshot").decode()
skill = ClickthroughSkill()
grid = skill.describe_grid(dummy, width=800, height=600)
print("Grid cells:", len(grid.get("cells", [])))
if grid.get("cells"):
first_cell = grid["cells"][0]["cell_id"]
result = skill.plan_action(ActionPlan(grid_id=grid["grid_id"], target_cell=first_cell, action="click"))
print("Action result:", result)

View File

@@ -1,29 +0,0 @@
import base64
import pytest
from server.main import manager
@pytest.fixture
def fake_screenshot() -> str:
"""Return a reproducible base64 string representing a dummy screenshot."""
return base64.b64encode(b"clickthrough-dummy").decode()
@pytest.fixture
def default_grid_request(fake_screenshot):
return {
"width": 640,
"height": 480,
"screenshot_base64": fake_screenshot,
"rows": 3,
"columns": 3,
}
@pytest.fixture(autouse=True)
def reset_manager_state():
manager._grids.clear()
yield
manager._grids.clear()

View File

@@ -1,79 +0,0 @@
from typing import Any, Dict
from skill.agent_runner import ClickthroughAgentRunner
from skill.clickthrough_skill import ActionPlan, ClickthroughSkill
class DummySkill(ClickthroughSkill):
def __init__(self):
self.last_plan: ActionPlan | None = None
def describe_grid(
self,
screenshot_base64: str,
width: int,
height: int,
rows: int = 4,
columns: int = 4,
) -> Dict[str, Any]:
return {
"grid_id": "dummy-grid",
"cells": [
{"cell_id": "dummy-grid-1", "label": "button", "bounds": [0, 0, 100, 100]},
{"cell_id": "dummy-grid-2", "label": "target", "bounds": [100, 0, 200, 100]},
],
}
def plan_with_planner(
self,
grid_id: str,
preferred_label: str | None = None,
action: str = "click",
text: str | None = None,
comment: str | None = None,
) -> Dict[str, Any]:
cells = ["dummy-grid-1", "dummy-grid-2"]
if preferred_label == "target":
target = "dummy-grid-2"
else:
target = cells[len(cells) // 2]
plan = {
"grid_id": grid_id,
"target_cell": target,
"action": action,
"text": text,
"comment": comment,
}
return {
"plan": plan,
"result": {"success": True, "detail": "preview"},
"descriptor": {"grid_id": grid_id},
}
def plan_action(self, plan: ActionPlan) -> Dict[str, Any]:
self.last_plan = plan
return {"success": True, "target_cell": plan.target_cell}
def grid_summary(self, grid_id: str) -> Dict[str, Any]:
return {"grid_id": grid_id, "summary": "ok"}
def grid_history(self, grid_id: str) -> Dict[str, Any]:
return {"grid_id": grid_id, "history": []}
def test_agent_runner_prefers_label():
runner = ClickthroughAgentRunner(DummySkill())
result = runner.run_once(
screenshot_base64="AA==",
width=120,
height=80,
preferred_label="target",
)
assert result.action["target_cell"] == "dummy-grid-2"
assert result.summary["summary"] == "ok"
def test_agent_runner_defaults_to_center():
runner = ClickthroughAgentRunner(DummySkill())
result = runner.run_once(screenshot_base64="AA==", width=120, height=80)
assert result.action["target_cell"] == "dummy-grid-2"

View File

@@ -1,32 +0,0 @@
from fastapi.testclient import TestClient
from server.main import app, manager
test_client = TestClient(app)
def test_plan_endpoint(default_grid_request):
init_response = test_client.post("/grid/init", json=default_grid_request)
grid_id = init_response.json()["grid_id"]
plan_response = test_client.post(
f"/grid/{grid_id}/plan",
json={"preferred_label": None, "action": "click", "text": "hello"},
)
assert plan_response.status_code == 200
payload = plan_response.json()
assert payload["plan"]["grid_id"] == grid_id
assert payload["result"]["success"]
def test_refresh_endpoint(default_grid_request):
init_response = test_client.post("/grid/init", json=default_grid_request)
grid_id = init_response.json()["grid_id"]
refresh_response = test_client.post(
f"/grid/{grid_id}/refresh", json={"screenshot_base64": "AAA", "memo": "updated"}
)
assert refresh_response.status_code == 200
grid = manager.get_grid(grid_id)
assert grid.screenshot == "AAA"
assert grid.memo == "updated"

View File

@@ -1,51 +0,0 @@
from server.config import ServerSettings
from server.grid import GridManager
from server.models import ActionPayload, ActionType, GridInitRequest
def test_grid_creation_respects_dimensions(default_grid_request):
settings = ServerSettings(grid_rows=2, grid_cols=2)
manager = GridManager(settings)
request = GridInitRequest(**default_grid_request)
grid = manager.create_grid(request)
descriptor = grid.describe()
assert descriptor.grid_id
assert descriptor.rows == 3
assert descriptor.columns == 3
assert len(descriptor.cells) == 9
assert descriptor.metadata.get("width") == 640
assert descriptor.metadata.get("height") == 480
def test_grid_action_records_history(default_grid_request):
manager = GridManager(ServerSettings())
request = GridInitRequest(**default_grid_request)
grid = manager.create_grid(request)
descriptor = grid.describe()
target_cell = descriptor.cells[0].cell_id
payload = ActionPayload(
grid_id=descriptor.grid_id,
action=ActionType.CLICK,
target_cell=target_cell,
comment="click test",
)
result = grid.apply_action(payload)
assert result.success
assert result.coordinates is not None
assert grid.action_history[-1]["coordinates"] == result.coordinates
def test_manager_get_grid_missing(default_grid_request):
manager = GridManager(ServerSettings())
request = GridInitRequest(**default_grid_request)
_ = manager.create_grid(request)
try:
manager.get_grid("does-not-exist")
found = True
except KeyError:
found = False
assert not found

View File

@@ -1,32 +0,0 @@
from server.config import ServerSettings
from server.grid import GridManager
from server.planner import GridPlanner
from server.models import ActionType, GridInitRequest
def test_planner_preferred_label(default_grid_request):
settings = ServerSettings()
manager = GridManager(settings)
request = GridInitRequest(**default_grid_request)
grid = manager.create_grid(request)
descriptor = grid.describe()
descriptor.cells[0].label = "target"
planner = GridPlanner()
payload = planner.build_payload(descriptor, preferred_label="target", action=ActionType.CLICK)
assert payload.target_cell == descriptor.cells[0].cell_id
def test_planner_falls_back_to_center(default_grid_request):
settings = ServerSettings()
manager = GridManager(settings)
request = GridInitRequest(**default_grid_request)
grid = manager.create_grid(request)
descriptor = grid.describe()
planner = GridPlanner()
payload = planner.build_payload(descriptor, action=ActionType.CLICK)
assert payload.target_cell is not None
assert payload.grid_id == descriptor.grid_id

View File

@@ -1,41 +0,0 @@
import asyncio
from server.streamer import ScreenshotStreamer
class DummyWebSocket:
def __init__(self):
self.sent = []
self.accepted = False
async def accept(self) -> None:
self.accepted = True
async def send_json(self, payload):
self.sent.append(payload)
def test_streamer_broadcasts_to_grid():
streamer = ScreenshotStreamer()
socket = DummyWebSocket()
async def scenario():
key = await streamer.connect(socket, "grid-123")
await streamer.broadcast("grid-123", {"frame": 1})
streamer.disconnect(socket, key)
asyncio.run(scenario())
assert socket.sent == [{"frame": 1}]
def test_streamer_wildcard_listener_receives_updates():
streamer = ScreenshotStreamer()
socket = DummyWebSocket()
async def scenario():
key = await streamer.connect(socket, None)
await streamer.broadcast("grid-456", {"frame": 2})
streamer.disconnect(socket, key)
asyncio.run(scenario())
assert socket.sent == [{"frame": 2}]

View File

@@ -1,12 +0,0 @@
from fastapi.testclient import TestClient
from server.main import app
test_client = TestClient(app)
def test_ui_root_serves_index():
response = test_client.get("/ui/")
assert response.status_code == 200
assert "Clickthrough Control" in response.text