# ScreenJob

ScreenJob is an autonomous desktop-and-terminal execution service. It lets an LLM use controlled local tools (screen, mouse, keyboard, clipboard, shell) to complete GUI-heavy tasks on a real computer.

## What It Solves

- Runs agent-driven tasks that require a graphical interface.
- Exposes both CLI and HTTP API modes.
- Stores job history and events in SQLite.
- Streams live monitoring updates over WebSocket.
- Returns structured agent output as:
  - `return`: human-readable completion message
  - `data`: structured payload (for example command output)

## Core Features

- Hybrid control model: screenshot grounding plus Windows-native window, dialog, and UI-element helpers when available
- Tool-based agent loop (`execute_command`, `see_screen`, `enhance`, `list_windows`, `find_window`, `focus_window`, `close_window`, `wait_for_window`, `wait_for_focus_change`, `detect_dialog`, `dialog_action`, `dialog_set_filename`, `wait_for_dialog_close`, `list_ui_elements`, `invoke_ui_element`, `set_ui_element_value`, `select_ui_element`, `wait_for_ui_element`, `click`, `scroll`, `drag`, `move_mouse`, `type`, `press_key`, `clipboard_get`, `clipboard_set`, `get_cursor_position`, `get_active_window`, `sleep`, `task_complete`)
- Safety pre-check with override support
- Per-job tool disable list
- Live/final usage and cost estimates
- Read-only Tailwind monitoring UI
- Persistent job and event history

## Project Layout

```text
main.py
screenjob.py
requirements.txt
start_backend.ps1
src/
  agent.py
  app_main.py
  cli.py
  config.py
  models.py
  pricing.py
  runtime.py
  safety.py
  server.py
  storage.py
  task_manager.py
  ui.py
  utils.py
tests/
  test_agent_tools.py
  test_pricing.py
  test_server_api.py
  test_storage.py
.gitea/workflows/ci.yml
```

## Setup

1. Install Python 3.11+.
2. Install dependencies:

```powershell
pip install -r requirements.txt
```

3. Create `.env` in project root:

```env
OPENAI_API_KEY=...
SCREENJOB_TOKEN=choose_a_strong_token
# Optional
SCREENJOB_DEFAULT_MODEL=gpt-5.4-mini
SCREENJOB_SAFETY_MODEL=gpt-5.4-mini
SCREENJOB_HOST=127.0.0.1
SCREENJOB_PORT=8787
DISABLE_UI=false
```

## Usage

### CLI

```powershell
python main.py run "Open amazon.de and go to my orders"
```

CLI JSON output includes both legacy and structured fields:

```json
{
  "completed": true,
  "result": "Task completed successfully",
  "response": {
    "return": "Task completed successfully",
    "data": "file1.txt\nfile2.txt"
  },
  "return": "Task completed successfully",
  "data": "file1.txt\nfile2.txt"
}
```

### Server

```powershell
python main.py server
```

Or use the PowerShell launcher:

```powershell
.\start_backend.ps1
```

### Backend Startup

For screenshot-driven automation, start the backend in the logged-in user session. That gives `pyautogui` access to the interactive desktop, which Windows services do not.

If you previously installed the legacy service, remove it once from an elevated PowerShell session with `.\uninstall_backend_service.ps1`.

Install a sign-in launcher for the current user:

```powershell
.\install_backend_service.ps1
```

Install it for all users:

```powershell
.\install_backend_service.ps1 -AllUsers
```

Start it immediately after installing:

```powershell
.\install_backend_service.ps1 -StartNow
```

Remove the launcher:

```powershell
.\uninstall_backend_service.ps1
```

The launcher runs `start_backend.ps1` hidden via `start_backend_hidden.vbs`. If you need to start the backend manually, run:

```powershell
.\start_backend.ps1
```

The legacy Windows service host remains in the tree for reference, but it is not the recommended path for GUI tasks.
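
After starting the backend by any of the routes above, it can be useful to confirm that something is actually listening on the configured address. The following is a minimal sketch, not part of ScreenJob: `parse_env` and `is_listening` are hypothetical helpers, the parser assumes `.env` holds plain `KEY=VALUE` lines as in the Setup example, and the fallback values `127.0.0.1` and `8787` are simply copied from that example rather than confirmed defaults.

```python
import socket
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text(encoding="utf-8")) if env_path.exists() else {}
    # Fallbacks mirror the example .env; they are assumptions, not documented defaults.
    host = env.get("SCREENJOB_HOST", "127.0.0.1")
    port = int(env.get("SCREENJOB_PORT", "8787"))
    state = "reachable" if is_listening(host, port) else "not reachable"
    print(f"Backend at http://{host}:{port}/ is {state}")
```

A successful TCP connection only shows that a process is bound to the port; it does not prove the process is ScreenJob or that the token is valid.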
### System Tray Icon (Windows)

Start tray icon now:

```powershell
powershell -NoProfile -ExecutionPolicy Bypass -STA -File .\screenjob_tray.ps1
```

Install startup shortcut (current user):

```powershell
.\install_tray_startup_shortcut.ps1
```

Install startup shortcut for all users:

```powershell
.\install_tray_startup_shortcut.ps1 -AllUsers
```

Remove startup shortcut:

```powershell
.\install_tray_startup_shortcut.ps1 -Remove
```

Tray menu actions (the service controls apply to the legacy Windows service host):

- Refresh service status
- Start/Stop/Restart service (prompts for admin/UAC)
- Open dashboard URL from `.env` `SCREENJOB_HOST` / `SCREENJOB_PORT`
- Open service logs folder
- Exit tray icon process

Auth for all API routes:

- `Authorization: Bearer <token>`
- `X-ScreenJob-Token: <token>`
- Query fallback `?token=<token>` (mainly for UI/websocket/artifact fetch)

### Create Job

`POST /api/jobs`

```json
{
  "job": "run \"ls -a\" in C:/Users/username/Documents and return output",
  "model": "gpt-5.4-mini",
  "native_automation_mode": "prefer",
  "dialog_timeout_seconds": 12,
  "focus_timeout_seconds": 8,
  "ui_element_timeout_seconds": 8,
  "max_retries_per_surface": 3,
  "disabled_tools": [],
  "safety_override": false
}
```

Response:

```json
{
  "job_id": "job_..."
}
```

### Job Status / History

- `GET /api/jobs/{job_id}`
- `GET /api/jobs/{job_id}/status`
- `GET /api/jobs/{job_id}/events`
- `GET /api/jobs`
- `POST /api/jobs/{job_id}/cancel`
- `GET /api/stats`

Each job payload includes:

- `result` (compat string)
- `response.return`
- `response.data`
- top-level `return` and `data` aliases

### Monitoring UI

- URL: `/`
- Read-only dashboard (no run controls)
- Requires token input
- Live updates via `/ws`
- Analytics dashboards for success rate by objective category and daily averages
- Set `DISABLE_UI=true` to disable UI

### Analytics API

- `GET /api/analytics`
- Returns objective-category success rates plus average steps/cost over time

## Agent Instructions (Practical)

- Prefer `execute_command` for deterministic actions (opening URLs, filesystem checks).
- First classify the current Windows surface, then choose the control channel.
- Prefer native window/dialog/element tools for focus changes, file pickers, modal confirmations, and browser-owned dialogs when available.
- Use `see_screen` before UI interaction.
- Use `enhance` before clicking small/ambiguous targets; prefer `region="small"` for compact controls.
- Use `enhance` `mode="text"` for tiny labels/text, or `mode="ui"` for general UI.
- Optionally set `enhance` `scale` (2-6) for tighter zoom control.
- Use `list_windows`, `find_window`, `focus_window`, and `wait_for_focus_change` instead of blind Alt+Tab retries.
- Use `detect_dialog`, `dialog_set_filename`, `dialog_action`, and `wait_for_dialog_close` for native open/save/confirm flows.
- Use `list_ui_elements`, `invoke_ui_element`, `set_ui_element_value`, `select_ui_element`, and `wait_for_ui_element` when controls are exposed natively.
- Use `press_key` for non-text keys (Enter, Tab, arrows, Escape).
- For shortcuts, use one `press_key` call with combo syntax (example: `win+r`).
- Use `click` offsets via `offset_up/down/left/right`; set `button` and `click_count` there instead of inventing one-off click tools.
- Use `move_mouse` when you need hover-only behavior and `drag` for slider, selection, or window moves.
- Use `scroll` for vertical navigation; positive amounts scroll up and negative amounts scroll down.
- Use `clipboard_get` / `clipboard_set` for copy-paste workflows, `get_cursor_position` for cursor inspection, and `get_active_window` before interacting with uncertain focus.
- If native automation is unavailable or disabled, ScreenJob falls back to screenshots plus mouse/keyboard control and emits fallback events.
- When done, call:
  - `task_complete(return="...", data=...)`
- Before `task_complete`, verify expected on-screen content with `see_screen` (and `enhance` if needed), and include an `observed_result` summary in `data`.

Per-job `disabled_tools` must match the built-in tool allowlist. `task_complete` cannot be disabled.

`data` should contain useful structured output for the requester (text, object, list, etc.).

## Verification

Local:

```powershell
pytest -q
```

CI:

- `.gitea/workflows/ci.yml` runs compile checks + tests on push/PR.

## Compatibility Entry Point

- `python screenjob.py "<job>"` remains supported as a wrapper to `main.py`.

## License

Apache License 2.0. See `LICENSE`.
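
## Appendix: Client Sketch

As a rough illustration of the HTTP API described earlier (`POST /api/jobs` with Bearer auth, and job payloads that carry `response.return`, `response.data`, top-level aliases, and the legacy `result` string), the sketch below builds a create-job request and reads a result out of a payload. It is an assumption-laden example, not ScreenJob code: `build_create_job_request` and `extract_result` are hypothetical helper names, the JSON `Content-Type` header is assumed, and the base URL reuses the host and port from the example `.env`.

```python
import json
import urllib.request


def build_create_job_request(base_url: str, token: str, job: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /api/jobs request with Bearer auth."""
    body = json.dumps({"job": job, **options}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/jobs",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed; the README only shows a JSON body
        },
    )


def extract_result(payload: dict) -> tuple:
    """Return (message, data): prefer `response`, then top-level aliases, then legacy `result`."""
    response = payload.get("response") or {}
    message = response.get("return", payload.get("return", payload.get("result")))
    data = response.get("data", payload.get("data"))
    return message, data


if __name__ == "__main__":
    request = build_create_job_request(
        "http://127.0.0.1:8787",  # host/port copied from the example .env
        "choose_a_strong_token",
        'run "ls -a" in C:/Users/username/Documents and return output',
        disabled_tools=[],
        safety_override=False,
    )
    print(request.get_method(), request.full_url)
    # To actually submit the job (requires a running backend):
    # with urllib.request.urlopen(request) as resp:
    #     job_id = json.load(resp)["job_id"]
```

Reading the structured fields first and falling back to `result` keeps a client working against both the legacy and the structured output shapes.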