Compare commits

..

5 Commits

Author SHA1 Message Date
Space-Banane
8126b57404 Add lightweight analytics dashboard
All checks were successful
CI / test (push) Successful in 7s
CI / test (pull_request) Successful in 7s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-27 22:34:26 +02:00
Space-Banane
cceed18cf1 feat: (literally) "enhance" functionality with new parameters and improved image processing
All checks were successful
CI / test (push) Successful in 7s
2026-05-27 22:14:32 +02:00
Space-Banane
880468ef02 Mark completed P1 TODO items as done 2026-05-27 22:05:57 +02:00
Space-Banane
b05a7be668 Compact screenshot context every 4 steps by default 2026-05-27 22:04:15 +02:00
Space-Banane
0c019474af Default model reasoning effort to medium 2026-05-27 22:02:20 +02:00
15 changed files with 977 additions and 25 deletions

View File

@@ -156,13 +156,21 @@ Each job payload includes:
- Read-only dashboard (no run controls)
- Requires token input
- Live updates via `/ws`
- Analytics dashboards for success rate by objective category and daily averages
- Set `DISABLE_UI=true` to disable UI
### Analytics API
- `GET /api/analytics`
- Returns objective-category success rates plus average steps/cost over time
## Agent Instructions (Practical)
- Prefer `execute_command` for deterministic actions (opening URLs, filesystem checks).
- Use `see_screen` before UI interaction.
- Use `enhance` when text is unclear.
- Use `enhance` before clicking small/ambiguous targets; prefer `region="small"` for compact controls.
- Use `enhance` `mode="text"` for tiny labels/text, or `mode="ui"` for general UI.
- Optionally set `enhance` `scale` (2-6) for tighter zoom control.
- Use `press_key` for non-text keys (Enter, Tab, arrows, Escape).
- For shortcuts, use one `press_key` call with combo syntax (example: `win+r`).
- Use `click` offsets via `offset_up/down/left/right` and optional `sleep_after_seconds`.

View File

@@ -37,6 +37,14 @@ Keyboard combo rule:
- For shortcuts, use one `press_key` call with combo syntax, for example: `win+r`, `ctrl+shift+esc`.
- Do not split modifier combos into separate calls.
Enhance-first click rule:
- Before clicking small buttons/icons, dense UI, or ambiguous targets, call `enhance` first.
- Preferred preset for tiny controls: `enhance(coordinate, region="small", mode="ui")`.
- For tiny labels/text: use `mode="text"` to improve readability.
- Optional zoom control: set `scale` from `2` to `6` (defaults are tuned by region).
- After checking the enhanced image, click using the same target coordinate (or a small directional offset if needed).
Verification rule:
- Before `task_complete`, verify actual on-screen content matches the expected outcome.

View File

@@ -9,7 +9,7 @@ import traceback
from typing import Any, Callable
from openai import OpenAI
from PIL import Image, ImageEnhance, ImageFilter, ImageOps
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter, ImageOps
from .models import AgentResult, RunArtifacts, RuntimeOptions, UsageSummary
from .pricing import estimate_cost_usd
@@ -34,7 +34,8 @@ Rules:
- launching apps or running terminal checks
3) For UI tasks, inspect with see_screen before clicking/typing.
4) Coordinates are absolute screen pixels (x, y) from top-left.
5) Use enhance(coordinate) when text/UI is unclear.
5) Use enhance before risky clicks: small buttons/icons, dense UI, or when target confidence is below high.
5a) For tiny controls use enhance(coordinate, region="small", mode="ui"). For tiny text use mode="text".
6) For keyboard-heavy interactions, prefer press_key for special keys.
6a) For key combinations, call press_key once with combo syntax (example: "win+r", "ctrl+shift+esc"). Do not split modifier combos across separate calls.
7) You may call multiple tools in one step. If needed, do click then sleep.
@@ -76,11 +77,14 @@ class ScreenJobAgent:
self.final_data: Any | None = None
self.previous_response_id: str | None = None
self.usage = UsageSummary()
self.objective = ""
self.last_screen_data_url: str | None = None
self.last_screen_meta: dict[str, Any] | None = None
self.click_history: list[tuple[int, int, float]] = []
self.disabled_tools = {tool.strip() for tool in (options.disable_tools or set()) if tool.strip()}
self.recent_tool_summaries: list[str] = []
self.last_context_compact_step = 0
def _emit(self, event_type: str, payload: dict[str, Any]) -> None:
if self.event_callback is None:
@@ -192,7 +196,10 @@ class ScreenJobAgent:
{
"type": "function",
"name": "enhance",
"description": "Create enhanced zoom around a coordinate for readability.",
"description": (
"Create enhanced zoom around a coordinate for readability and precise targeting. "
"Prefer this before clicking tiny or ambiguous UI targets."
),
"parameters": {
"type": "object",
"properties": {
@@ -204,7 +211,19 @@ class ScreenJobAgent:
},
"required": ["x", "y"],
"additionalProperties": False,
}
},
"region": {
"type": "string",
"enum": ["small", "medium", "large"],
},
"mode": {
"type": "string",
"enum": ["ui", "text"],
},
"scale": {
"type": ["integer", "string"],
"description": "Zoom factor from 2 to 6. Defaults by region.",
},
},
"required": ["coordinate"],
"additionalProperties": False,
@@ -352,6 +371,23 @@ class ScreenJobAgent:
sec = max_seconds
return sec
def _parse_int(self, value: Any, default: int = 0) -> int:
if value is None:
return default
if isinstance(value, bool):
return int(value)
if isinstance(value, int):
return value
if isinstance(value, float):
return int(round(value))
text = str(value).strip()
if not text:
return default
try:
return int(float(text))
except Exception: # noqa: BLE001
return default
def _tool_see_screen(self, _: dict[str, Any]) -> dict[str, Any]:
image, meta = self._capture_screen(with_grid=True)
out_path = self.artifacts.shots_dir / f"screen_step_{self.step:03d}.png"
@@ -369,34 +405,106 @@ class ScreenJobAgent:
def _tool_enhance(self, args: dict[str, Any]) -> dict[str, Any]:
coord = args.get("coordinate") or {}
x = int(coord.get("x", 0))
y = int(coord.get("y", 0))
requested_x = self._parse_int(coord.get("x", 0), default=0)
requested_y = self._parse_int(coord.get("y", 0), default=0)
region = str(args.get("region", "small") or "small").strip().lower()
mode = str(args.get("mode", "ui") or "ui").strip().lower()
if region not in {"small", "medium", "large"}:
region = "small"
if mode not in {"ui", "text"}:
mode = "ui"
region_half_by_preset = {
"small": 96,
"medium": 160,
"large": 240,
}
default_scale_by_region = {
"small": 4,
"medium": 3,
"large": 2,
}
raw_scale = self._parse_int(args.get("scale"), default=0)
scale = raw_scale if raw_scale > 0 else default_scale_by_region[region]
scale = clamp(scale, 2, 6)
base, base_meta = self._capture_screen(with_grid=False)
width, height = base.size
region_half = 180
left = clamp(x - region_half, 0, width - 1)
top = clamp(y - region_half, 0, height - 1)
right = clamp(x + region_half, left + 1, width)
bottom = clamp(y + region_half, top + 1, height)
source_x = clamp(requested_x, 0, max(0, width - 1))
source_y = clamp(requested_y, 0, max(0, height - 1))
region_half = region_half_by_preset[region]
left = clamp(source_x - region_half, 0, width - 1)
top = clamp(source_y - region_half, 0, height - 1)
right = clamp(source_x + region_half, left + 1, width)
bottom = clamp(source_y + region_half, top + 1, height)
crop = base.crop((left, top, right, bottom))
upscaled = crop.resize((crop.width * 2, crop.height * 2), Image.Resampling.BICUBIC)
enhanced = ImageOps.autocontrast(upscaled)
enhanced = ImageEnhance.Sharpness(enhanced).enhance(2.0)
enhanced = ImageEnhance.Contrast(enhanced).enhance(1.25)
enhanced = enhanced.filter(ImageFilter.UnsharpMask(radius=1.8, percent=180, threshold=2))
out_w = max(2, crop.width * scale)
out_h = max(2, crop.height * scale)
upscaled = crop.resize((out_w, out_h), Image.Resampling.LANCZOS)
out_path = self.artifacts.enhance_dir / f"enhance_step_{self.step:03d}_{x}_{y}.png"
if mode == "text":
text_view = ImageOps.grayscale(upscaled)
text_view = ImageOps.autocontrast(text_view, cutoff=1)
text_view = ImageOps.equalize(text_view)
text_view = ImageEnhance.Contrast(text_view).enhance(1.35)
text_view = ImageEnhance.Sharpness(text_view).enhance(2.1)
processed = text_view.filter(ImageFilter.UnsharpMask(radius=1.2, percent=160, threshold=1)).convert("RGB")
else:
ui_view = ImageOps.autocontrast(upscaled, cutoff=1)
ui_view = ImageEnhance.Contrast(ui_view).enhance(1.2)
ui_view = ImageEnhance.Sharpness(ui_view).enhance(1.8)
processed = ui_view.filter(ImageFilter.UnsharpMask(radius=1.4, percent=150, threshold=2)).convert("RGB")
edges = upscaled.convert("L").filter(ImageFilter.FIND_EDGES)
edges = ImageOps.autocontrast(edges, cutoff=4)
edge_overlay = ImageOps.colorize(edges, black=(0, 0, 0), white=(60, 220, 255))
enhanced = Image.blend(processed, edge_overlay, alpha=0.18)
cx = clamp((source_x - left) * scale, 0, max(0, enhanced.width - 1))
cy = clamp((source_y - top) * scale, 0, max(0, enhanced.height - 1))
draw = ImageDraw.Draw(enhanced)
draw.rectangle([0, 0, enhanced.width - 1, enhanced.height - 1], outline=(255, 80, 80), width=2)
ring_radius = max(10, int(6 * scale / 2))
arm_len = max(14, int(9 * scale / 2))
gap = max(4, int(2 * scale / 2))
line_width = max(2, int(scale / 2))
draw.ellipse(
[cx - ring_radius, cy - ring_radius, cx + ring_radius, cy + ring_radius],
outline=(255, 80, 80),
width=line_width,
)
draw.line([(max(0, cx - arm_len), cy), (max(0, cx - gap), cy)], fill=(255, 80, 80), width=line_width)
draw.line(
[(min(enhanced.width - 1, cx + gap), cy), (min(enhanced.width - 1, cx + arm_len), cy)],
fill=(255, 80, 80),
width=line_width,
)
draw.line([(cx, max(0, cy - arm_len)), (cx, max(0, cy - gap))], fill=(255, 80, 80), width=line_width)
draw.line(
[(cx, min(enhanced.height - 1, cy + gap)), (cx, min(enhanced.height - 1, cy + arm_len))],
fill=(255, 80, 80),
width=line_width,
)
out_path = self.artifacts.enhance_dir / (
f"enhance_step_{self.step:03d}_{source_x}_{source_y}_{region}_{mode}_x{scale}.png"
)
self._save_image(enhanced, out_path)
data_url = image_to_data_url(enhanced, "PNG")
meta = {
"captured_at": utc_now_iso(),
"source_coord": {"x": x, "y": y},
"requested_coord": {"x": requested_x, "y": requested_y},
"source_coord": {"x": source_x, "y": source_y},
"source_box": {"left": left, "top": top, "right": right, "bottom": bottom},
"scale": 2,
"region": region,
"mode": mode,
"scale": scale,
"path": str(out_path.resolve()),
"size": {"width": enhanced.width, "height": enhanced.height},
"target_pixel": {"x": cx, "y": cy},
"screen_size": {"width": width, "height": height},
"base_capture_meta": base_meta,
}
@@ -628,6 +736,9 @@ class ScreenJobAgent:
return {"_raw": raw}
def _call_model(self, input_items: list[dict[str, Any]]) -> Any:
effort = str(self.options.reasoning_effort or "medium").strip().lower()
if effort not in {"low", "medium", "high"}:
effort = "medium"
return self.client.responses.create(
model=self.options.model,
instructions=SYSTEM_PROMPT,
@@ -636,9 +747,85 @@ class ScreenJobAgent:
previous_response_id=self.previous_response_id,
parallel_tool_calls=True,
max_tool_calls=8,
reasoning={"effort": effort},
)
def _record_tool_summary(self, tool_name: str, result: dict[str, Any]) -> None:
ok = bool(result.get("ok"))
status = "ok" if ok else "fail"
summary = f"step={self.step} tool={tool_name} status={status}"
if tool_name == "click":
clicked = result.get("clicked") if isinstance(result.get("clicked"), dict) else {}
x = clicked.get("x")
y = clicked.get("y")
if isinstance(x, int) and isinstance(y, int):
summary = f"{summary} at=({x},{y})"
elif tool_name == "type":
typed_length = int(result.get("typed_length", 0) or 0)
summary = f"{summary} typed_length={typed_length}"
elif tool_name == "press_key":
key = str(result.get("key") or "").strip()
if key:
summary = f"{summary} key={key}"
elif tool_name == "execute_command":
exit_code = result.get("exit_code")
if exit_code is not None:
summary = f"{summary} exit_code={exit_code}"
elif tool_name in {"see_screen", "enhance"}:
meta = result.get("meta") if isinstance(result.get("meta"), dict) else {}
path = str(meta.get("path") or result.get("path") or "").strip()
if path:
summary = f"{summary} image={path}"
if not ok:
error_text = str(result.get("error") or "").strip()
if error_text:
summary = f"{summary} error={error_text[:140]}"
self.recent_tool_summaries.append(summary)
self.recent_tool_summaries = self.recent_tool_summaries[-20:]
def _should_compact_context(self) -> bool:
interval = max(0, int(self.options.screen_context_decay_steps or 0))
if interval <= 0:
return False
if self.previous_response_id is None:
return False
return (self.step - self.last_context_compact_step) >= interval
def _build_compacted_pending_input(self) -> list[dict[str, Any]]:
recent = self.recent_tool_summaries[-8:]
lines = "\n".join(f"- {line}" for line in recent) if recent else "- No recent tool activity."
content = (
"Context compaction activated to decay stale screenshots and reduce token usage.\n"
f"JOB: {self.objective}\n"
f"Current step: {self.step}\n"
"Recent tool activity:\n"
f"{lines}\n"
"Continue execution from the latest screen state. "
"Use tools only, and finish with task_complete when done."
)
compacted_input: list[dict[str, Any]] = [
{
"role": "user",
"content": [
{
"type": "input_text",
"text": content,
}
],
}
]
if self.last_screen_data_url and self.last_screen_meta:
compacted_input.append(
self._build_visual_message(
"Current screen after context compaction",
self.last_screen_data_url,
self.last_screen_meta,
)
)
return compacted_input
def run(self, job: str) -> AgentResult:
self.objective = job
started_at = time.time()
self.logger.info("Starting run_id=%s model=%s", self.artifacts.run_id, self.options.model)
self.logger.info("Job: %s", job)
@@ -648,6 +835,8 @@ class ScreenJobAgent:
{
"run_id": self.artifacts.run_id,
"model": self.options.model,
"reasoning_effort": self.options.reasoning_effort,
"screen_context_decay_steps": self.options.screen_context_decay_steps,
"objective": job,
"disabled_tools": sorted(self.disabled_tools),
},
@@ -664,6 +853,8 @@ class ScreenJobAgent:
f"JOB: {job}\n"
"You are in an action loop. Prefer execute_command for deterministic actions. "
"For modifier shortcuts, use a single press_key combo (example: win+r). "
"Before clicking tiny buttons/icons or dense UI areas, call enhance first "
"(use region='small'; use mode='text' for tiny text labels). "
"You can return multiple tool calls in one step (example: click then sleep). "
"When done call task_complete(return=..., data=...). "
"Before task_complete, verify the screen content is what was expected "
@@ -692,6 +883,19 @@ class ScreenJobAgent:
self.step += 1
self.logger.info("---- Agent step %d/%d ----", self.step, self.options.max_steps)
self._emit("step_started", {"step": self.step, "max_steps": self.options.max_steps})
if self._should_compact_context():
self.previous_response_id = None
pending_input = self._build_compacted_pending_input()
self.last_context_compact_step = self.step
self.logger.info("Compacted model context at step %d.", self.step)
self._emit(
"context_compacted",
{
"step": self.step,
"decay_steps": self.options.screen_context_decay_steps,
"recent_tool_summaries": self.recent_tool_summaries[-8:],
},
)
try:
response = self._call_model(pending_input)
self._register_usage(response)
@@ -720,6 +924,8 @@ class ScreenJobAgent:
"text": (
"No function call was returned. Continue by using tools. "
"Use one press_key call for key combos like win+r. "
"Prefer enhance before clicking small/unclear targets "
"(region='small', mode='ui' or 'text'). "
"You may call multiple tools in one step. "
"Before task_complete, verify expected screen content with see_screen/enhance "
"and include observed_result in data. "
@@ -763,6 +969,7 @@ class ScreenJobAgent:
name,
json.dumps(result, ensure_ascii=False)[:2500],
)
self._record_tool_summary(name, result)
self._emit("tool_result", {"step": self.step, "tool": name, "result": result})
next_input.append(
{

View File

@@ -28,6 +28,18 @@ def build_parser() -> argparse.ArgumentParser:
parser.add_argument("--command-timeout", type=int, default=45, help="Timeout in seconds for execute_command.")
parser.add_argument("--type-interval", type=float, default=0.02, help="Seconds between typed characters.")
parser.add_argument("--click-pause", type=float, default=0.10, help="Mouse move duration before click.")
parser.add_argument(
"--reasoning-effort",
choices=["low", "medium", "high"],
default="medium",
help="Reasoning effort passed to the model.",
)
parser.add_argument(
"--screen-context-decay-steps",
type=int,
default=4,
help="Compact model context every N steps to decay old screenshots (0 disables).",
)
parser.add_argument("--disable-tool", action="append", default=[], help="Disable a tool by name.")
parser.add_argument("--skip-safety-check", action="store_true", help="Bypass pre-flight safety check.")
parser.add_argument("--no-failsafe", action="store_true", help="Disable PyAutoGUI fail-safe.")
@@ -78,6 +90,8 @@ def main(argv: list[str] | None = None) -> int:
command_timeout=args.command_timeout,
type_interval=args.type_interval,
click_pause=args.click_pause,
reasoning_effort=args.reasoning_effort,
screen_context_decay_steps=max(0, int(args.screen_context_decay_steps)),
disable_tools=set(disabled_tools),
)
try:

View File

@@ -58,4 +58,6 @@ class RuntimeOptions:
command_timeout: int = 45
type_interval: float = 0.02
click_pause: float = 0.10
reasoning_effort: str = "medium"
screen_context_decay_steps: int = 4
disable_tools: set[str] | None = None

View File

@@ -16,6 +16,7 @@ from .config import AppConfig, load_app_config
from .storage import HistoryDB
from .task_manager import JobManager
from .ui import monitoring_js_path, monitoring_page_html
from .utils import utc_now_iso
class CreateJobRequest(BaseModel):
@@ -25,6 +26,8 @@ class CreateJobRequest(BaseModel):
command_timeout: int = Field(45, ge=1, le=600)
type_interval: float = Field(0.02, ge=0.0, le=1.0)
click_pause: float = Field(0.10, ge=0.0, le=2.0)
reasoning_effort: str = Field("medium", pattern="^(low|medium|high)$")
screen_context_decay_steps: int = Field(4, ge=0, le=50)
disabled_tools: list[str] = Field(default_factory=list)
safety_override: bool = False
no_failsafe: bool = False
@@ -301,6 +304,8 @@ def create_app(config: AppConfig | None = None) -> FastAPI:
command_timeout=payload.command_timeout,
type_interval=payload.type_interval,
click_pause=payload.click_pause,
reasoning_effort=payload.reasoning_effort,
screen_context_decay_steps=payload.screen_context_decay_steps,
disabled_tools=payload.disabled_tools,
safety_override=payload.safety_override,
no_failsafe=payload.no_failsafe,
@@ -382,6 +387,12 @@ def create_app(config: AppConfig | None = None) -> FastAPI:
def stats(_: None = Depends(require_token)) -> dict[str, Any]:
return manager.stats()
@app.get("/api/analytics")
def analytics(_: None = Depends(require_token)) -> dict[str, Any]:
payload = manager.analytics()
payload["generated_at"] = utc_now_iso()
return payload
if not app_config.disable_ui:
@app.get("/", response_class=HTMLResponse)
def ui_root() -> str:

View File

@@ -7,6 +7,39 @@ from pathlib import Path
from typing import Any
_TERMINAL_STATUSES = {"completed", "failed", "cancelled"}
_CATEGORY_RULES: tuple[tuple[str, tuple[str, ...]], ...] = (
(
"Browser / web",
("browser", "website", "webpage", "chrome", "url", "amazon", "google", "login", "shopping", "checkout", "orders"),
),
(
"Files / terminal",
("file", "folder", "directory", "terminal", "shell", "command", "cli", "script", "git", "repo", "install", "pip", "npm", "powershell", "bash"),
),
(
"Writing / docs",
("write", "summary", "summarize", "document", "docs", "report", "email", "message", "readme", "markdown", "note", "proposal"),
),
(
"Data / analysis",
("data", "analysis", "analyze", "csv", "spreadsheet", "sheet", "table", "chart", "dashboard", "metric", "metrics", "sql"),
),
(
"Development / ops",
("code", "bug", "fix", "test", "debug", "api", "backend", "frontend", "database", "deploy", "docker", "service", "build"),
),
)
def _objective_category(objective: str) -> str:
text = objective.lower()
for category, keywords in _CATEGORY_RULES:
if any(keyword in text for keyword in keywords):
return category
return "Other"
class HistoryDB:
def __init__(self, db_path: Path) -> None:
self.db_path = db_path
@@ -184,6 +217,131 @@ class HistoryDB:
).fetchone()
return dict(totals) if totals else {}
def analytics(self) -> dict[str, Any]:
with self._connect() as conn:
rows = conn.execute(
"""
SELECT job_id, objective, status, steps, estimated_cost_usd, created_at
FROM jobs
ORDER BY created_at ASC, job_id ASC
"""
).fetchall()
total_jobs = 0
finished_jobs = 0
completed_jobs = 0
failed_jobs = 0
cancelled_jobs = 0
steps_sum = 0
steps_count = 0
cost_sum = 0.0
cost_count = 0
by_category: dict[str, dict[str, Any]] = {}
by_day: dict[str, dict[str, Any]] = {}
def _bucket(target: dict[str, dict[str, Any]], key: str) -> dict[str, Any]:
bucket = target.setdefault(
key,
{
"label": key,
"total_jobs": 0,
"finished_jobs": 0,
"completed_jobs": 0,
"failed_jobs": 0,
"cancelled_jobs": 0,
"steps_sum": 0,
"steps_count": 0,
"cost_sum": 0.0,
"cost_count": 0,
},
)
return bucket
for row in rows:
total_jobs += 1
status = str(row["status"] or "")
finished = status in _TERMINAL_STATUSES
completed = status == "completed"
objective = str(row["objective"] or "")
category = _objective_category(objective)
created_at = str(row["created_at"] or "")
day = created_at[:10] if len(created_at) >= 10 else created_at or "unknown"
category_bucket = _bucket(by_category, category)
day_bucket = _bucket(by_day, day)
for bucket in (category_bucket, day_bucket):
bucket["total_jobs"] += 1
if not finished:
continue
finished_jobs += 1
if completed:
completed_jobs += 1
elif status == "failed":
failed_jobs += 1
elif status == "cancelled":
cancelled_jobs += 1
steps = row["steps"]
if steps is not None:
step_value = int(steps)
steps_sum += step_value
steps_count += 1
for bucket in (category_bucket, day_bucket):
bucket["steps_sum"] += step_value
bucket["steps_count"] += 1
estimated_cost = row["estimated_cost_usd"]
if estimated_cost is not None:
cost_value = float(estimated_cost)
cost_sum += cost_value
cost_count += 1
for bucket in (category_bucket, day_bucket):
bucket["cost_sum"] += cost_value
bucket["cost_count"] += 1
for bucket in (category_bucket, day_bucket):
bucket["finished_jobs"] += 1
if completed:
bucket["completed_jobs"] += 1
elif status == "failed":
bucket["failed_jobs"] += 1
elif status == "cancelled":
bucket["cancelled_jobs"] += 1
def _finalize(bucket: dict[str, Any]) -> dict[str, Any]:
finished = bucket["finished_jobs"]
return {
"label": bucket["label"],
"total_jobs": bucket["total_jobs"],
"finished_jobs": finished,
"completed_jobs": bucket["completed_jobs"],
"failed_jobs": bucket["failed_jobs"],
"cancelled_jobs": bucket["cancelled_jobs"],
"success_rate": round((bucket["completed_jobs"] / finished) * 100, 2) if finished else 0.0,
"avg_steps": round(bucket["steps_sum"] / bucket["steps_count"], 2) if bucket["steps_count"] else None,
"avg_cost_usd": round(bucket["cost_sum"] / bucket["cost_count"], 6) if bucket["cost_count"] else None,
}
category_rows = [_finalize(bucket) for bucket in by_category.values()]
category_rows.sort(key=lambda item: (-item["success_rate"], item["label"]))
day_rows = [_finalize(bucket) for bucket in by_day.values()]
day_rows.sort(key=lambda item: item["label"])
return {
"total_jobs": total_jobs,
"finished_jobs": finished_jobs,
"completed_jobs": completed_jobs,
"failed_jobs": failed_jobs,
"cancelled_jobs": cancelled_jobs,
"success_rate": round((completed_jobs / finished_jobs) * 100, 2) if finished_jobs else 0.0,
"avg_steps": round(steps_sum / steps_count, 2) if steps_count else None,
"avg_cost_usd": round(cost_sum / cost_count, 6) if cost_count else None,
"by_category": category_rows,
"timeline": day_rows,
}
def _row_to_job(self, row: sqlite3.Row) -> dict[str, Any]:
disabled_tools: list[str] = []
try:

View File

@@ -48,6 +48,8 @@ class JobManager:
command_timeout: int = 45,
type_interval: float = 0.02,
click_pause: float = 0.10,
reasoning_effort: str = "medium",
screen_context_decay_steps: int = 4,
disabled_tools: list[str] | None = None,
safety_override: bool = False,
no_failsafe: bool = False,
@@ -93,6 +95,8 @@ class JobManager:
"command_timeout": command_timeout,
"type_interval": type_interval,
"click_pause": click_pause,
"reasoning_effort": reasoning_effort,
"screen_context_decay_steps": screen_context_decay_steps,
"no_failsafe": no_failsafe,
"cancel_event": cancel_event,
},
@@ -121,6 +125,8 @@ class JobManager:
command_timeout: int,
type_interval: float,
click_pause: float,
reasoning_effort: str,
screen_context_decay_steps: int,
no_failsafe: bool,
cancel_event: threading.Event,
) -> None:
@@ -218,6 +224,8 @@ class JobManager:
command_timeout=command_timeout,
type_interval=type_interval,
click_pause=click_pause,
reasoning_effort=reasoning_effort,
screen_context_decay_steps=max(0, int(screen_context_decay_steps)),
disable_tools=set(disabled_tools),
)
try:
@@ -343,6 +351,9 @@ class JobManager:
stats["live_running_threads"] = sum(1 for job in self._running.values() if job.thread.is_alive())
return stats
def analytics(self) -> dict[str, Any]:
return self.db.analytics()
def _normalize_job_payload(self, job: dict[str, Any]) -> dict[str, Any]:
response = job.get("response")
if not isinstance(response, dict):

View File

@@ -21,6 +21,30 @@
<section class="grid grid-cols-2 md:grid-cols-6 gap-3" id="stats"></section>
<section class="space-y-3">
<div class="flex items-center justify-between gap-3">
<h2 class="font-semibold">Analytics</h2>
<div id="analyticsMeta" class="text-[11px] text-slate-400"></div>
</div>
<div id="analyticsSummary" class="grid grid-cols-2 md:grid-cols-4 gap-3"></div>
<div class="grid grid-cols-1 xl:grid-cols-2 gap-4">
<div class="bg-slate-900/70 border border-slate-800 rounded-xl p-4 space-y-3">
<div class="flex items-center justify-between gap-3">
<h3 class="font-semibold text-sm">Success by Objective Category</h3>
<div id="analyticsCategorySummary" class="text-[11px] text-slate-400"></div>
</div>
<div id="analyticsCategories" class="space-y-3"></div>
</div>
<div class="bg-slate-900/70 border border-slate-800 rounded-xl p-4 space-y-3">
<div class="flex items-center justify-between gap-3">
<h3 class="font-semibold text-sm">Avg Steps / Cost Over Time</h3>
<div id="analyticsTrendSummary" class="text-[11px] text-slate-400"></div>
</div>
<div id="analyticsTrends" class="space-y-4"></div>
</div>
</div>
</section>
<section class="grid grid-cols-1 lg:grid-cols-5 gap-4">
<div class="lg:col-span-2 bg-slate-900/70 border border-slate-800 rounded-xl p-4">
<div class="flex items-center justify-between mb-3">

View File

@@ -17,6 +17,12 @@ const replayPrevBtn = document.getElementById("replayPrevBtn");
const replayNextBtn = document.getElementById("replayNextBtn");
const replaySpeedEl = document.getElementById("replaySpeed");
const replaySeekEl = document.getElementById("replaySeek");
const analyticsMetaEl = document.getElementById("analyticsMeta");
const analyticsSummaryEl = document.getElementById("analyticsSummary");
const analyticsCategorySummaryEl = document.getElementById("analyticsCategorySummary");
const analyticsCategoriesEl = document.getElementById("analyticsCategories");
const analyticsTrendSummaryEl = document.getElementById("analyticsTrendSummary");
const analyticsTrendsEl = document.getElementById("analyticsTrends");
const state = {
token: localStorage.getItem("screenjob_token") || "",
@@ -35,6 +41,7 @@ const state = {
}
};
const manuallyClosedSockets = new WeakSet();
const analyticsRefreshEvents = new Set(["job_finished", "job_failed", "job_rejected"]);
tokenInput.value = state.token;
function authHeaders() {
@@ -66,6 +73,197 @@ function renderStats(stats) {
`).join("");
}
function escapeHtml(value) {
return String(value ?? "").replace(/[&<>"']/g, (ch) => ({
"&": "&amp;",
"<": "&lt;",
">": "&gt;",
'"': "&quot;",
"'": "&#39;"
})[ch]);
}
function formatNumber(value, digits = 2) {
const num = Number(value);
return Number.isFinite(num) ? num.toFixed(digits) : "—";
}
function formatCurrency(value, digits = 6) {
const num = Number(value);
return Number.isFinite(num) ? `$${num.toFixed(digits)}` : "—";
}
function formatPercent(value) {
const num = Number(value);
return Number.isFinite(num) ? `${num.toFixed(1)}%` : "—";
}
function formatDateLabel(value) {
const dt = new Date(value);
if (Number.isNaN(dt.getTime())) return String(value || "—");
return dt.toLocaleDateString(undefined, { month: "short", day: "numeric" });
}
function renderMetricCard(label, value) {
return `
<div class="bg-slate-950 border border-slate-800 rounded-xl p-3">
<div class="text-[11px] uppercase tracking-wide text-slate-400">${escapeHtml(label)}</div>
<div class="text-xl font-semibold mt-1">${escapeHtml(value)}</div>
</div>
`;
}
function renderLineChart(title, points, options = {}) {
const color = options.color || "#22d3ee";
const valueLabel = options.valueLabel || "";
const sourcePoints = Array.isArray(points)
? points.filter((point) => Number.isFinite(Number(point.value)))
: [];
if (!sourcePoints.length) {
return `
<div class="rounded-lg border border-slate-800 bg-slate-950/70 p-3">
<div class="flex items-center justify-between gap-3">
<div>
<div class="text-xs text-slate-400">${escapeHtml(title)}</div>
<div class="text-sm text-slate-200 font-semibold">No data yet</div>
</div>
</div>
</div>
`;
}
const width = 640;
const height = 220;
const margin = { top: 20, right: 18, bottom: 34, left: 44 };
const values = sourcePoints.map((point) => Number(point.value));
const minValue = Math.min(...values);
const maxValue = Math.max(...values);
const span = maxValue - minValue || 1;
const chartWidth = width - margin.left - margin.right;
const chartHeight = height - margin.top - margin.bottom;
const xStep = sourcePoints.length > 1 ? chartWidth / (sourcePoints.length - 1) : 0;
const coords = sourcePoints.map((point, index) => ({
x: margin.left + (index * xStep),
y: margin.top + ((maxValue - Number(point.value)) / span) * chartHeight,
}));
const linePath = coords.map((point, index) => `${index === 0 ? "M" : "L"} ${point.x} ${point.y}`).join(" ");
const baseline = height - margin.bottom;
const midIndex = Math.floor(sourcePoints.length / 2);
const xLabels = [
{ index: 0, label: sourcePoints[0].label },
{ index: midIndex, label: sourcePoints[midIndex].label },
{ index: sourcePoints.length - 1, label: sourcePoints[sourcePoints.length - 1].label },
].filter((item, index, array) => item.label && array.findIndex((candidate) => candidate.index === item.index) === index);
const minLabel = options.formatValue ? options.formatValue(minValue) : formatNumber(minValue, 2);
const maxLabel = options.formatValue ? options.formatValue(maxValue) : formatNumber(maxValue, 2);
const latest = sourcePoints[sourcePoints.length - 1];
const latestValue = options.formatValue ? options.formatValue(latest.value) : formatNumber(latest.value, 2);
return `
<div class="rounded-lg border border-slate-800 bg-slate-950/70 p-3 space-y-2">
<div class="flex items-center justify-between gap-3">
<div>
<div class="text-xs text-slate-400">${escapeHtml(title)}</div>
<div class="text-sm text-slate-200 font-semibold">${escapeHtml(latestValue)}${valueLabel ? ` <span class="text-slate-500 font-normal">${escapeHtml(valueLabel)}</span>` : ""}</div>
</div>
<div class="text-[11px] text-slate-400 text-right">
<div>${escapeHtml(sourcePoints.length)} points</div>
<div>${escapeHtml(minLabel)} - ${escapeHtml(maxLabel)}</div>
</div>
</div>
<svg viewBox="0 0 ${width} ${height}" class="w-full h-56">
${Array.from({ length: 4 }, (_, idx) => {
const y = margin.top + (chartHeight / 3) * idx;
return `<line x1="${margin.left}" y1="${y}" x2="${width - margin.right}" y2="${y}" stroke="rgba(51, 65, 85, 0.7)" stroke-width="1" />`;
}).join("")}
<line x1="${margin.left}" y1="${baseline}" x2="${width - margin.right}" y2="${baseline}" stroke="rgba(71, 85, 105, 0.8)" stroke-width="1.5" />
<path d="${linePath}" fill="none" stroke="${color}" stroke-width="3" stroke-linecap="round" stroke-linejoin="round" />
${coords.map((point) => `
<circle cx="${point.x}" cy="${point.y}" r="4.5" fill="${color}" />
`).join("")}
<text x="${margin.left - 8}" y="${margin.top + 4}" text-anchor="end" class="fill-slate-400 text-[10px]">${escapeHtml(maxLabel)}</text>
<text x="${margin.left - 8}" y="${baseline}" text-anchor="end" class="fill-slate-400 text-[10px]">${escapeHtml(minLabel)}</text>
${xLabels.map((item) => `
<text x="${coords[item.index].x}" y="${height - 10}" text-anchor="middle" class="fill-slate-500 text-[10px]">${escapeHtml(formatDateLabel(item.label))}</text>
`).join("")}
</svg>
</div>
`;
}
function renderAnalytics(payload) {
const analytics = payload || {};
const categories = Array.isArray(analytics.by_category) ? analytics.by_category : [];
const timeline = Array.isArray(analytics.timeline) ? analytics.timeline : [];
const finishedCategories = categories.filter((row) => Number(row.finished_jobs || 0) > 0);
if (analyticsMetaEl) {
analyticsMetaEl.textContent = analytics.generated_at
? `Updated ${new Date(analytics.generated_at).toLocaleString()}`
: "Historical snapshot";
}
analyticsSummaryEl.innerHTML = [
renderMetricCard("Finished Jobs", analytics.finished_jobs || 0),
renderMetricCard("Success Rate", formatPercent(analytics.success_rate)),
renderMetricCard("Avg Steps", formatNumber(analytics.avg_steps, 1)),
renderMetricCard("Avg Cost", formatCurrency(analytics.avg_cost_usd)),
].join("");
analyticsCategorySummaryEl.textContent = finishedCategories.length
? `${finishedCategories.length} categories`
: "No finished jobs yet";
if (finishedCategories.length) {
analyticsCategoriesEl.innerHTML = finishedCategories.map((row) => {
const successRate = Number(row.success_rate || 0);
const completed = Number(row.completed_jobs || 0);
const finished = Number(row.finished_jobs || 0);
const total = Number(row.total_jobs || 0);
const avgSteps = row.avg_steps == null ? "—" : formatNumber(row.avg_steps, 1);
const avgCost = row.avg_cost_usd == null ? "—" : formatCurrency(row.avg_cost_usd);
return `
<div class="rounded-lg border border-slate-800 bg-slate-950/70 p-3 space-y-2">
<div class="flex items-start justify-between gap-3">
<div>
<div class="font-medium">${escapeHtml(row.label || "Other")}</div>
<div class="text-[11px] text-slate-400">${finished} finished · ${completed} completed · ${total} total</div>
</div>
<div class="text-right">
<div class="text-base font-semibold">${formatPercent(successRate)}</div>
<div class="text-[11px] text-slate-500">success rate</div>
</div>
</div>
<div class="h-2 rounded bg-slate-800 overflow-hidden">
<div class="h-full rounded bg-cyan-400" style="width: ${Math.max(0, Math.min(successRate, 100))}%"></div>
</div>
<div class="grid grid-cols-2 gap-2 text-[11px] text-slate-300">
<div>Avg steps: ${escapeHtml(avgSteps)}</div>
<div>Avg cost: ${escapeHtml(avgCost)}</div>
</div>
</div>
`;
}).join("");
} else {
analyticsCategoriesEl.innerHTML = `
<div class="rounded-lg border border-dashed border-slate-800 bg-slate-950/70 p-4 text-sm text-slate-400">
No finished jobs yet.
</div>
`;
}
analyticsTrendSummaryEl.textContent = timeline.length ? `${timeline.length} days` : "No daily data yet";
analyticsTrendsEl.innerHTML = [
renderLineChart("Average steps per day", timeline.map((row) => ({ label: row.label, value: row.avg_steps })), { color: "#38bdf8" }),
renderLineChart("Average cost per day", timeline.map((row) => ({ label: row.label, value: row.avg_cost_usd })), {
color: "#34d399",
valueLabel: "USD",
formatValue: (value) => formatCurrency(value),
}),
].join("");
}
function renderJobs() {
jobListEl.innerHTML = state.jobs.map((job) => {
const active = job.job_id === state.selectedJobId;
@@ -310,6 +508,11 @@ async function refreshStats() {
renderStats(payload);
}
async function refreshAnalytics() {
const payload = await api("/api/analytics");
renderAnalytics(payload);
}
async function refreshJobDetail() {
if (!state.selectedJobId) return;
const [job, events, replay] = await Promise.all([
@@ -345,6 +548,9 @@ function connectWs() {
}
await refreshJobs();
await refreshStats();
if (analyticsRefreshEvents.has(payload.event_type)) {
await refreshAnalytics();
}
} catch (err) {
console.error(err);
}
@@ -362,6 +568,7 @@ function connectWs() {
async function fullRefresh() {
await refreshJobs();
await refreshStats();
await refreshAnalytics();
await refreshJobDetail();
}

View File

@@ -91,6 +91,41 @@ def test_click_supports_directional_offsets(tmp_path: Path, monkeypatch) -> None
assert click_result["clicked"] == {"x": 110, "y": 102}
def test_enhance_defaults_to_small_ui_preset(tmp_path: Path, monkeypatch) -> None:
agent = _build_agent(tmp_path, monkeypatch)
result = agent._tool_enhance({"coordinate": {"x": 100, "y": 120}})
assert result["ok"] is True
meta = result["meta"]
assert meta["region"] == "small"
assert meta["mode"] == "ui"
assert meta["scale"] == 4
assert Path(meta["path"]).exists()
assert meta["target_pixel"]["x"] >= 0
assert meta["target_pixel"]["y"] >= 0
def test_enhance_supports_text_mode_and_scale_clamp(tmp_path: Path, monkeypatch) -> None:
agent = _build_agent(tmp_path, monkeypatch)
result = agent._tool_enhance(
{
"coordinate": {"x": -99, "y": 9999},
"region": "medium",
"mode": "text",
"scale": 99,
}
)
assert result["ok"] is True
meta = result["meta"]
assert meta["region"] == "medium"
assert meta["mode"] == "text"
assert meta["scale"] == 6
assert meta["requested_coord"] == {"x": -99, "y": 9999}
assert meta["source_coord"] == {"x": 0, "y": 719}
assert Path(meta["path"]).exists()
def test_press_key_supports_hotkey_combo(tmp_path: Path, monkeypatch) -> None:
agent = _build_agent(tmp_path, monkeypatch)
result = agent._tool_press_key({"key": "meta+r"})
@@ -98,3 +133,21 @@ def test_press_key_supports_hotkey_combo(tmp_path: Path, monkeypatch) -> None:
assert result["key"] == "win+r"
assert result["message"] == "Key combo executed."
assert agent_module.pyautogui.last_hotkey == ("win", "r")
def test_context_compaction_trigger_and_payload(tmp_path: Path, monkeypatch) -> None:
agent = _build_agent(tmp_path, monkeypatch)
agent.objective = "Open settings app"
agent.previous_response_id = "resp_123"
agent.step = 4
agent.last_context_compact_step = 0
agent.options.screen_context_decay_steps = 4
agent.recent_tool_summaries = ["step=1 tool=see_screen status=ok"]
agent.last_screen_data_url = "data:image/png;base64,abc"
agent.last_screen_meta = {"width": 1280, "height": 720, "path": "C:/tmp/frame.png"}
assert agent._should_compact_context() is True
compacted = agent._build_compacted_pending_input()
assert len(compacted) == 2
assert "Context compaction activated" in compacted[0]["content"][0]["text"]
assert "Open settings app" in compacted[0]["content"][0]["text"]

View File

@@ -29,7 +29,10 @@ def test_cli_emits_structured_return_and_data(monkeypatch: Any, capsys, tmp_path
def fake_assess_task_safety(*_args, **_kwargs):
return True, "safe", {"safe": True}
captured_kwargs: dict[str, Any] = {}
def fake_run_job(*_args, **_kwargs):
captured_kwargs.update(_kwargs)
result = AgentResult(
completed=True,
result="Done",
@@ -66,3 +69,5 @@ def test_cli_emits_structured_return_and_data(monkeypatch: Any, capsys, tmp_path
assert payload["response"]["data"] == "file1.txt\nfile2.txt"
assert payload["return"] == "Task completed successfully"
assert payload["data"] == "file1.txt\nfile2.txt"
assert captured_kwargs["options"].reasoning_effort == "medium"
assert captured_kwargs["options"].screen_context_decay_steps == 4

View File

@@ -9,6 +9,24 @@ import src.server as server_module
from src.config import AppConfig
_TERMINAL_STATUSES = {"completed", "failed", "cancelled"}
def _objective_category(objective: str) -> str:
text = objective.lower()
if any(keyword in text for keyword in ("browser", "website", "amazon", "google", "login", "shopping", "checkout", "orders")):
return "Browser / web"
if any(keyword in text for keyword in ("file", "folder", "directory", "terminal", "shell", "command", "cli", "script", "git", "repo", "install", "pip", "npm")):
return "Files / terminal"
if any(keyword in text for keyword in ("write", "summary", "document", "docs", "report", "email", "message", "readme", "markdown")):
return "Writing / docs"
if any(keyword in text for keyword in ("data", "analysis", "csv", "spreadsheet", "sheet", "table", "chart", "dashboard", "metric", "sql")):
return "Data / analysis"
if any(keyword in text for keyword in ("code", "bug", "fix", "test", "debug", "api", "backend", "frontend", "database", "deploy", "docker", "service", "build")):
return "Development / ops"
return "Other"
class FakeJobManager:
def __init__(self, *, config: AppConfig, db: Any, broadcast: Any = None) -> None:
self.config = config
@@ -26,6 +44,8 @@ class FakeJobManager:
command_timeout: int = 45,
type_interval: float = 0.02,
click_pause: float = 0.10,
reasoning_effort: str = "medium",
screen_context_decay_steps: int = 4,
disabled_tools: list[str] | None = None,
safety_override: bool = False,
no_failsafe: bool = False,
@@ -37,6 +57,7 @@ class FakeJobManager:
artifacts_dir.mkdir(parents=True, exist_ok=True)
screenshot_path = artifacts_dir / "screen_step_001.png"
screenshot_path.write_bytes(b"not-a-real-png")
created_at = f"2026-05-27T00:00:{self._counter:02d}Z"
self.last_submit_payload = {
"objective": objective,
"model": selected_model,
@@ -46,6 +67,8 @@ class FakeJobManager:
"command_timeout": command_timeout,
"type_interval": type_interval,
"click_pause": click_pause,
"reasoning_effort": reasoning_effort,
"screen_context_decay_steps": screen_context_decay_steps,
"no_failsafe": no_failsafe,
}
self._jobs[job_id] = {
@@ -53,6 +76,10 @@ class FakeJobManager:
"objective": objective,
"model": selected_model,
"status": "running",
"created_at": created_at,
"started_at": created_at,
"ended_at": None,
"steps": 1,
"result": "Running",
"response": {"return": "Running", "data": None},
"return": "Running",
@@ -145,6 +172,114 @@ class FakeJobManager:
"live_running_threads": 0,
}
def analytics(self) -> dict[str, Any]:
by_category: dict[str, dict[str, Any]] = {}
by_day: dict[str, dict[str, Any]] = {}
def bucket(target: dict[str, dict[str, Any]], key: str) -> dict[str, Any]:
return target.setdefault(
key,
{
"label": key,
"total_jobs": 0,
"finished_jobs": 0,
"completed_jobs": 0,
"failed_jobs": 0,
"cancelled_jobs": 0,
"steps_sum": 0,
"steps_count": 0,
"cost_sum": 0.0,
"cost_count": 0,
},
)
total_jobs = 0
finished_jobs = 0
completed_jobs = 0
failed_jobs = 0
cancelled_jobs = 0
steps_sum = 0
steps_count = 0
cost_sum = 0.0
cost_count = 0
for job in self._jobs.values():
total_jobs += 1
status = str(job.get("status") or "")
finished = status in _TERMINAL_STATUSES
category = _objective_category(str(job.get("objective") or ""))
day = str(job.get("created_at") or "")[:10] or "unknown"
category_bucket = bucket(by_category, category)
day_bucket = bucket(by_day, day)
for item in (category_bucket, day_bucket):
item["total_jobs"] += 1
if not finished:
continue
finished_jobs += 1
if status == "completed":
completed_jobs += 1
elif status == "failed":
failed_jobs += 1
elif status == "cancelled":
cancelled_jobs += 1
steps_raw = job.get("steps")
if steps_raw is not None:
steps = int(steps_raw)
steps_sum += steps
steps_count += 1
for item in (category_bucket, day_bucket):
item["steps_sum"] += steps
item["steps_count"] += 1
estimated_cost_raw = (job.get("usage") or {}).get("estimated_cost_usd")
if estimated_cost_raw is not None:
estimated_cost = float(estimated_cost_raw)
cost_sum += estimated_cost
cost_count += 1
for item in (category_bucket, day_bucket):
item["cost_sum"] += estimated_cost
item["cost_count"] += 1
for item in (category_bucket, day_bucket):
item["finished_jobs"] += 1
if status == "completed":
item["completed_jobs"] += 1
elif status == "failed":
item["failed_jobs"] += 1
elif status == "cancelled":
item["cancelled_jobs"] += 1
def finalize(item: dict[str, Any]) -> dict[str, Any]:
finished = item["finished_jobs"]
return {
"label": item["label"],
"total_jobs": item["total_jobs"],
"finished_jobs": finished,
"completed_jobs": item["completed_jobs"],
"failed_jobs": item["failed_jobs"],
"cancelled_jobs": item["cancelled_jobs"],
"success_rate": round((item["completed_jobs"] / finished) * 100, 2) if finished else 0.0,
"avg_steps": round(item["steps_sum"] / item["steps_count"], 2) if item["steps_count"] else None,
"avg_cost_usd": round(item["cost_sum"] / item["cost_count"], 6) if item["cost_count"] else None,
}
return {
"total_jobs": total_jobs,
"finished_jobs": finished_jobs,
"completed_jobs": completed_jobs,
"failed_jobs": failed_jobs,
"cancelled_jobs": cancelled_jobs,
"success_rate": round((completed_jobs / finished_jobs) * 100, 2) if finished_jobs else 0.0,
"avg_steps": round(steps_sum / steps_count, 2) if steps_count else None,
"avg_cost_usd": round(cost_sum / cost_count, 6) if cost_count else None,
"by_category": sorted((finalize(item) for item in by_category.values()), key=lambda item: (-item["success_rate"], item["label"])),
"timeline": sorted((finalize(item) for item in by_day.values()), key=lambda item: item["label"]),
}
def _build_app(tmp_path: Path, monkeypatch: Any, disable_ui: bool = False):
monkeypatch.setattr(server_module, "JobManager", FakeJobManager)
@@ -189,6 +324,8 @@ def test_create_job_returns_only_job_id_and_defaults_model(tmp_path: Path, monke
manager = app.state.manager
assert manager.last_submit_payload["model"] == "gpt-5.4-mini"
assert manager.last_submit_payload["disabled_tools"] == ["click"]
assert manager.last_submit_payload["reasoning_effort"] == "medium"
assert manager.last_submit_payload["screen_context_decay_steps"] == 4
status_res = client.get(f"/api/jobs/{job_id}/status", headers=headers)
assert status_res.status_code == 200
@@ -270,12 +407,67 @@ def test_replay_endpoint_skips_visual_paths_outside_artifacts(tmp_path: Path, mo
assert payload["total_frames"] == 1
def test_analytics_endpoint_groups_by_category_and_time(tmp_path: Path, monkeypatch: Any) -> None:
app, _ = _build_app(tmp_path, monkeypatch, disable_ui=False)
manager = app.state.manager
client = TestClient(app)
headers = {"Authorization": "Bearer test_token"}
browser_completed = client.post("/api/jobs", headers=headers, json={"job": "Open amazon.de and checkout"}).json()["job_id"]
browser_failed = client.post("/api/jobs", headers=headers, json={"job": "Open website and login"}).json()["job_id"]
terminal_completed = client.post("/api/jobs", headers=headers, json={"job": "Run a shell command to inspect files"}).json()["job_id"]
manager._jobs[browser_completed].update(
status="completed",
ended_at="2026-05-27T00:10:00Z",
steps=4,
created_at="2026-05-27T00:00:01Z",
usage={**manager._jobs[browser_completed]["usage"], "estimated_cost_usd": 0.12},
)
manager._jobs[browser_failed].update(
status="failed",
ended_at="2026-05-28T00:10:00Z",
steps=6,
created_at="2026-05-28T00:00:01Z",
usage={**manager._jobs[browser_failed]["usage"], "estimated_cost_usd": 0.24},
)
manager._jobs[terminal_completed].update(
status="completed",
ended_at="2026-05-28T00:15:00Z",
steps=10,
created_at="2026-05-28T00:00:02Z",
usage={**manager._jobs[terminal_completed]["usage"], "estimated_cost_usd": 0.05},
)
analytics = client.get("/api/analytics", headers=headers)
assert analytics.status_code == 200
payload = analytics.json()
assert payload["total_jobs"] == 3
assert payload["finished_jobs"] == 3
assert payload["completed_jobs"] == 2
assert payload["failed_jobs"] == 1
assert payload["success_rate"] == 66.67
assert payload["avg_steps"] == 6.67
assert payload["avg_cost_usd"] == 0.136667
browser = next(row for row in payload["by_category"] if row["label"] == "Browser / web")
terminal = next(row for row in payload["by_category"] if row["label"] == "Files / terminal")
assert browser["finished_jobs"] == 2
assert browser["success_rate"] == 50.0
assert browser["avg_steps"] == 5.0
assert terminal["success_rate"] == 100.0
assert [row["label"] for row in payload["timeline"]] == ["2026-05-27", "2026-05-28"]
def test_ui_toggle(tmp_path: Path, monkeypatch: Any) -> None:
app_enabled, _ = _build_app(tmp_path / "enabled", monkeypatch, disable_ui=False)
client_enabled = TestClient(app_enabled)
root_enabled = client_enabled.get("/")
assert root_enabled.status_code == 200
assert "ScreenJob Monitor" in root_enabled.text
assert "Success by Objective Category" in root_enabled.text
js_enabled = client_enabled.get("/ui/monitoring.js")
assert js_enabled.status_code == 200
assert "const tokenInput" in js_enabled.text

View File

@@ -72,3 +72,55 @@ def test_storage_response_fallback_uses_result_when_json_missing(tmp_path: Path)
assert job is not None
assert job["response"]["return"] == "Legacy result string"
assert job["response"]["data"] is None
def test_history_db_analytics_groups_by_category_and_day(tmp_path: Path) -> None:
db = HistoryDB(tmp_path / "screenjob_test_analytics.db")
db.create_job(
job_id="job_browser_ok",
objective="Open amazon.de and checkout",
model="gpt-5.4-mini",
created_at="2026-05-27T00:00:01Z",
safety_override=False,
disabled_tools=[],
)
db.update_job("job_browser_ok", status="completed", steps=4, estimated_cost_usd=0.12)
db.create_job(
job_id="job_browser_fail",
objective="Open website and login",
model="gpt-5.4-mini",
created_at="2026-05-28T00:00:01Z",
safety_override=False,
disabled_tools=[],
)
db.update_job("job_browser_fail", status="failed", steps=6, estimated_cost_usd=0.24)
db.create_job(
job_id="job_terminal_ok",
objective="Run a shell command to inspect files",
model="gpt-5.4-mini",
created_at="2026-05-28T00:00:02Z",
safety_override=False,
disabled_tools=[],
)
db.update_job("job_terminal_ok", status="completed", steps=10, estimated_cost_usd=0.05)
analytics = db.analytics()
assert analytics["total_jobs"] == 3
assert analytics["finished_jobs"] == 3
assert analytics["completed_jobs"] == 2
assert analytics["failed_jobs"] == 1
assert analytics["success_rate"] == 66.67
assert analytics["avg_steps"] == 6.67
assert analytics["avg_cost_usd"] == 0.136667
browser = next(row for row in analytics["by_category"] if row["label"] == "Browser / web")
terminal = next(row for row in analytics["by_category"] if row["label"] == "Files / terminal")
assert browser["finished_jobs"] == 2
assert browser["success_rate"] == 50.0
assert browser["avg_steps"] == 5.0
assert terminal["success_rate"] == 100.0
assert [row["label"] for row in analytics["timeline"]] == ["2026-05-27", "2026-05-28"]

10
todo.md
View File

@@ -4,12 +4,12 @@
- [Bug] Enforce single active desktop-control run (or a strict queue) so concurrent jobs cannot fight over the same mouse/keyboard/screen session.
- [Bug] Fix run artifact collisions in `setup_artifacts()` (`run_id` is second-granularity, so two jobs in the same second can share/overwrite the same directory).
- [Bug] Remove global logger handler clobbering in `setup_logger()` (`logging.getLogger("screenjob").handlers.clear()` breaks concurrent runs and can redirect logs to the wrong file).
- [Bug] More consistent clicks and more uses of enhance images.
- [x] More consistent clicks and more uses of enhance images.
## P1
- [Idea] Move ui.py into a seperate html file and js file.
- [Idea] Think harder using effort "medium" by default.
- [Idea] Decay old screenshots after 3 to 5 steps to save (1) tokens and (2) brain fuck in the agents.
- [x] Move ui.py into a seperate html file and js file.
- [x] Think harder using effort "medium" by default.
- [x] Decay old screenshots after 3 to 5 steps to save (1) tokens and (2) brain fuck in the agents.
- [Bug] Validate `disabled_tools` against an allowlist and disallow disabling critical completion flow (`task_complete`) to avoid guaranteed step-limit failures.
- [Bug] Improve `execute_command` cancellation/timeout handling to terminate full process trees, not only the parent shell process.
@@ -20,4 +20,4 @@
## P3
- [x] Add Replay Mode; Ability to replay a session by reconstructing the screen from screenshots and overlaying tool calls and click and type events.
- [Idea] Add lightweight analytics dashboards (success rate by objective category, avg steps/cost over time).
- [x] Add lightweight analytics dashboards (success rate by objective category, avg steps/cost over time).