
# ScreenJob

ScreenJob is an autonomous desktop-and-terminal execution service.
It lets an LLM use controlled local tools (screen, mouse, keyboard, clipboard, shell) to complete GUI-heavy tasks on a real computer.

## What It Solves

- Runs agent-driven tasks that require a graphical interface.
- Exposes both CLI and HTTP API modes.
- Stores job history and events in SQLite.
- Streams live monitoring updates over WebSocket.
- Returns structured agent output as:
  - `return`: human-readable completion message
  - `data`: structured payload (for example command output)

## Core Features

- Hybrid control model: screenshot grounding plus Windows-native window, dialog, and UI-element helpers when available
- Tool-based agent loop:
  - Shell and vision: `execute_command`, `see_screen`, `enhance`
  - Windows: `list_windows`, `find_window`, `focus_window`, `close_window`, `wait_for_window`, `wait_for_focus_change`, `get_active_window`
  - Dialogs: `detect_dialog`, `dialog_action`, `dialog_set_filename`, `wait_for_dialog_close`
  - UI elements: `list_ui_elements`, `invoke_ui_element`, `set_ui_element_value`, `select_ui_element`, `wait_for_ui_element`
  - Mouse: `click`, `scroll`, `drag`, `move_mouse`, `get_cursor_position`
  - Keyboard and clipboard: `type`, `press_key`, `clipboard_get`, `clipboard_set`
  - Control: `sleep`, `task_complete`
- Safety pre-check with override support
- Per-job tool disable list
- Live/final usage and cost estimates
- Read-only Tailwind monitoring UI
- Persistent job and event history

## Project Layout

```text
main.py
screenjob.py
requirements.txt
start_backend.ps1
src/
  agent.py
  app_main.py
  cli.py
  config.py
  models.py
  pricing.py
  runtime.py
  safety.py
  server.py
  storage.py
  task_manager.py
  ui.py
  utils.py
tests/
  test_agent_tools.py
  test_pricing.py
  test_server_api.py
  test_storage.py
.gitea/workflows/ci.yml
```

## Setup

1. Install Python 3.11+.
2. Install dependencies:

```powershell
pip install -r requirements.txt
```

3. Create a `.env` file in the project root:

```env
OPENAI_API_KEY=...
SCREENJOB_TOKEN=choose_a_strong_token

# Optional
SCREENJOB_DEFAULT_MODEL=gpt-5.4-mini
SCREENJOB_SAFETY_MODEL=gpt-5.4-mini
SCREENJOB_HOST=127.0.0.1
SCREENJOB_PORT=8787
DISABLE_UI=false
```
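
The tray icon and any client scripts need the same host and port that the server reads from `.env`. The following is a minimal sketch of deriving the base URL from that file. `parse_env` and `base_url` are hypothetical helpers, not part of ScreenJob; the `http` scheme and the fallback values (copied from the sample above) are assumptions.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def base_url(settings: dict[str, str]) -> str:
    """Build the server base URL from parsed settings."""
    # Assumption: fall back to the values shown in the sample .env.
    host = settings.get("SCREENJOB_HOST", "127.0.0.1")
    port = settings.get("SCREENJOB_PORT", "8787")
    return f"http://{host}:{port}"


sample = """
SCREENJOB_TOKEN=choose_a_strong_token

# Optional
SCREENJOB_HOST=127.0.0.1
SCREENJOB_PORT=8787
"""
print(base_url(parse_env(sample)))
```

This parser handles only plain `KEY=VALUE` lines; quoted values or inline comments would need a fuller `.env` parser.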

## Usage

### CLI

```powershell
python main.py run "Open amazon.de and go to my orders"
```

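CLI JSON output includes both the legacy `result` string and the structured `response` object, whose `return` and `data` fields are also aliased at the top level: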

```json
{
  "completed": true,
  "result": "Task completed successfully",
  "response": {
    "return": "Task completed successfully",
    "data": "file1.txt\nfile2.txt"
  },
  "return": "Task completed successfully",
  "data": "file1.txt\nfile2.txt"
}
```

### Server

```powershell
python main.py server
```

Or use the PowerShell launcher:

```powershell
.\start_backend.ps1
```

### Backend Startup

For screenshot-driven automation, start the backend in the logged-in user session.
That gives `pyautogui` access to the interactive desktop, which Windows services do not have.
If you previously installed the legacy service, remove it once from an elevated PowerShell session with `.\uninstall_backend_service.ps1`.

Install a sign-in launcher for the current user:

```powershell
.\install_backend_service.ps1
```

Install it for all users:

```powershell
.\install_backend_service.ps1 -AllUsers
```

Start it immediately after installing:

```powershell
.\install_backend_service.ps1 -StartNow
```

Remove the launcher:

```powershell
.\uninstall_backend_service.ps1
```

The launcher runs `start_backend.ps1` hidden via `start_backend_hidden.vbs`.
If you need to start the backend manually, run:

```powershell
.\start_backend.ps1
```

The legacy Windows service host remains in the tree for reference, but it is not the recommended path for GUI tasks.

### System Tray Icon (Windows)

Start tray icon now:

```powershell
powershell -NoProfile -ExecutionPolicy Bypass -STA -File .\screenjob_tray.ps1
```

Install startup shortcut (current user):

```powershell
.\install_tray_startup_shortcut.ps1
```

Install startup shortcut for all users:

```powershell
.\install_tray_startup_shortcut.ps1 -AllUsers
```

Remove startup shortcut:

```powershell
.\install_tray_startup_shortcut.ps1 -Remove
```

Tray menu actions (the service controls apply to the legacy Windows service host):

- Refresh service status
- Start/Stop/Restart service (prompts for admin/UAC)
- Open dashboard URL from `.env` `SCREENJOB_HOST` / `SCREENJOB_PORT`
- Open service logs folder
- Exit tray icon process

### Authentication

All API routes require the token, passed in one of these ways:

- `Authorization: Bearer <SCREENJOB_TOKEN>`
- `X-ScreenJob-Token: <SCREENJOB_TOKEN>`
- Query fallback `?token=` (mainly for UI/websocket/artifact fetch)
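
The three token placements above can be sketched with the standard library alone. The helper names are hypothetical, and the `ws://` scheme in the usage lines is an assumption about how the `/ws` endpoint is reached.

```python
import urllib.parse


def bearer_headers(token: str) -> dict[str, str]:
    """Token in the standard Authorization header."""
    return {"Authorization": f"Bearer {token}"}


def custom_headers(token: str) -> dict[str, str]:
    """Token in the ScreenJob-specific header."""
    return {"X-ScreenJob-Token": token}


def with_query_token(url: str, token: str) -> str:
    """Append ?token=... while preserving any existing query string."""
    parts = urllib.parse.urlsplit(url)
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token))
    new_query = urllib.parse.urlencode(query)
    return urllib.parse.urlunsplit(parts._replace(query=new_query))


token = "choose_a_strong_token"
print(bearer_headers(token))
print(with_query_token("ws://127.0.0.1:8787/ws", token))
```

Headers keep the token out of URLs and access logs, so the query form is best reserved for clients that cannot set headers.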

### Create Job

`POST /api/jobs`

```json
{
  "job": "run \"ls -a\" in C:/Users/username/Documents and return output",
  "model": "gpt-5.4-mini",
  "native_automation_mode": "prefer",
  "dialog_timeout_seconds": 12,
  "focus_timeout_seconds": 8,
  "ui_element_timeout_seconds": 8,
  "max_retries_per_surface": 3,
  "disabled_tools": [],
  "safety_override": false
}
```

Response:

```json
{ "job_id": "job_..." }
```
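
A client call for this endpoint might look like the sketch below, which builds the request with `urllib.request` and leaves the network call commented out so it runs without a server. `build_create_job_request` is a hypothetical helper; the base URL follows the sample `.env`, and which body fields are optional is an assumption.

```python
import json
import urllib.request


def build_create_job_request(
    base_url: str, token: str, job: str, **options
) -> urllib.request.Request:
    """Build a POST /api/jobs request with bearer authentication."""
    body = {"job": job, **options}
    return urllib.request.Request(
        f"{base_url}/api/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_job_request(
    "http://127.0.0.1:8787",
    "choose_a_strong_token",
    "open Notepad and type hello",
    native_automation_mode="prefer",
    disabled_tools=[],
    safety_override=False,
)
print(request.get_method(), request.full_url)

# With a running server, send it and read the job id:
# with urllib.request.urlopen(request) as resp:
#     job_id = json.load(resp)["job_id"]
```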

### Job Status / History

- `GET /api/jobs/{job_id}`
- `GET /api/jobs/{job_id}/status`
- `GET /api/jobs/{job_id}/events`
- `GET /api/jobs`
- `POST /api/jobs/{job_id}/cancel`
- `GET /api/stats`

Each job payload includes:

- `result` (compat string)
- `response.return`
- `response.data`
- top-level `return` and `data` aliases
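
Clients that do not use the WebSocket feed can poll the status endpoint instead. The sketch below keeps the HTTP call and the completion check injectable, because this README does not list the status payload's fields. `wait_for_job` is a hypothetical helper, and the `completed` flag in the demo is borrowed from the CLI output, so the API payload may differ.

```python
import time
from typing import Callable


def wait_for_job(
    fetch: Callable[[], dict],
    is_done: Callable[[dict], bool],
    interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until is_done(payload) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch()
        if is_done(payload):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)


# Canned payloads stand in for GET /api/jobs/{job_id}/status responses.
canned = iter([
    {"completed": False},
    {"completed": True, "result": "Task completed successfully"},
])
final = wait_for_job(
    fetch=lambda: next(canned),
    is_done=lambda payload: payload.get("completed") is True,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final["result"])
```

In real use, `fetch` would issue the authenticated `GET` request and decode the JSON body.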

### Monitoring UI

- URL: `/`
- Read-only dashboard (no run controls)
- Requires token input
- Live updates via `/ws`
- Analytics dashboards for success rate by objective category and daily averages
- Set `DISABLE_UI=true` to disable UI

### Analytics API

- `GET /api/analytics`
- Returns objective-category success rates plus average steps/cost over time

## Agent Instructions (Practical)

- Prefer `execute_command` for deterministic actions (opening URLs, filesystem checks).
- First classify the current Windows surface, then choose the control channel.
- Prefer native window/dialog/element tools for focus changes, file pickers, modal confirmations, and browser-owned dialogs when available.
- Use `see_screen` before UI interaction.
- Use `enhance` before clicking small/ambiguous targets; prefer `region="small"` for compact controls.
- Call `enhance` with `mode="text"` for tiny labels/text, or `mode="ui"` for general UI.
- Optionally pass `scale` (2-6) to `enhance` for tighter zoom control.
- Use `list_windows`, `find_window`, `focus_window`, and `wait_for_focus_change` instead of blind Alt+Tab retries.
- Use `detect_dialog`, `dialog_set_filename`, `dialog_action`, and `wait_for_dialog_close` for native open/save/confirm flows.
- Use `list_ui_elements`, `invoke_ui_element`, `set_ui_element_value`, `select_ui_element`, and `wait_for_ui_element` when controls are exposed natively.
- Use `press_key` for non-text keys (Enter, Tab, arrows, Escape).
- For shortcuts, use one `press_key` call with combo syntax (example: `win+r`).
- Use `click` offsets via `offset_up/down/left/right`; set `button` and `click_count` there instead of inventing one-off click tools.
- Use `move_mouse` when you need hover-only behavior and `drag` for slider, selection, or window moves.
- Use `scroll` for vertical navigation; positive amounts scroll up and negative amounts scroll down.
- Use `clipboard_get` / `clipboard_set` for copy-paste workflows, `get_cursor_position` for cursor inspection, and `get_active_window` before interacting with uncertain focus.
- If native automation is unavailable or disabled, ScreenJob falls back to screenshots plus mouse/keyboard control and emits fallback events.
- When done, call:
  - `task_complete(return="...", data=...)`
- Before `task_complete`, verify expected on-screen content with `see_screen` (and `enhance` if needed), and include an `observed_result` summary in `data`.

Per-job `disabled_tools` must match the built-in tool allowlist. `task_complete` cannot be disabled.
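
The rule above can be checked on the client before submitting a job. This is a minimal sketch: the tool names are copied from the Core Features list, while `validate_disabled_tools` and its error messages are hypothetical and do not reflect how the server reports a rejection.

```python
BUILT_IN_TOOLS = frozenset({
    "execute_command", "see_screen", "enhance",
    "list_windows", "find_window", "focus_window", "close_window",
    "wait_for_window", "wait_for_focus_change",
    "detect_dialog", "dialog_action", "dialog_set_filename",
    "wait_for_dialog_close",
    "list_ui_elements", "invoke_ui_element", "set_ui_element_value",
    "select_ui_element", "wait_for_ui_element",
    "click", "scroll", "drag", "move_mouse", "type", "press_key",
    "clipboard_get", "clipboard_set", "get_cursor_position",
    "get_active_window", "sleep", "task_complete",
})
PROTECTED_TOOLS = frozenset({"task_complete"})


def validate_disabled_tools(disabled: list[str]) -> list[str]:
    """Return a sorted, de-duplicated list or raise ValueError."""
    requested = set(disabled)
    unknown = sorted(requested - BUILT_IN_TOOLS)
    if unknown:
        raise ValueError(f"unknown tools: {', '.join(unknown)}")
    protected = sorted(requested & PROTECTED_TOOLS)
    if protected:
        raise ValueError(f"cannot disable: {', '.join(protected)}")
    return sorted(requested)


print(validate_disabled_tools(["drag", "click", "drag"]))
```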

`data` should contain useful structured output for the requester (text, object, list, etc.).

## Verification

Local:

```powershell
pytest -q
```

CI:

- `.gitea/workflows/ci.yml` runs compile checks and tests on every push and pull request.

## Compatibility Entry Point

- `python screenjob.py "<job>"` remains supported as a wrapper around `main.py`.

## License

Apache License 2.0. See `LICENSE`.