Space-Banane 79c9e98842 Switch backend startup to interactive session

2026-05-31 20:43:36 +02:00

8.0 KiB

ScreenJob

ScreenJob is an autonomous desktop-and-terminal execution service.
It lets an LLM use controlled local tools (screen, mouse, keyboard, clipboard, shell) to complete GUI-heavy tasks on a real computer.

What It Solves

Runs agent-driven tasks that require a graphical interface.
Exposes both CLI and HTTP API modes.
Stores job history and events in SQLite.
Streams live monitoring updates over WebSocket.
Returns structured agent output as:
- return: human-readable completion message
- data: structured payload (for example command output)

Core Features

Hybrid control model: screenshot grounding plus Windows-native window, dialog, and UI-element helpers when available
Tool-based agent loop (execute_command, see_screen, enhance, list_windows, find_window, focus_window, close_window, wait_for_window, wait_for_focus_change, detect_dialog, dialog_action, dialog_set_filename, wait_for_dialog_close, list_ui_elements, invoke_ui_element, set_ui_element_value, select_ui_element, wait_for_ui_element, click, scroll, drag, move_mouse, type, press_key, clipboard_get, clipboard_set, get_cursor_position, get_active_window, sleep, task_complete)
Safety pre-check with override support
Per-job tool disable list
Live/final usage and cost estimates
Read-only Tailwind monitoring UI
Persistent job and event history

Project Layout

main.py
screenjob.py
requirements.txt
start_backend.ps1
src/
  agent.py
  app_main.py
  cli.py
  config.py
  models.py
  pricing.py
  runtime.py
  safety.py
  server.py
  storage.py
  task_manager.py
  ui.py
  utils.py
tests/
  test_agent_tools.py
  test_pricing.py
  test_server_api.py
  test_storage.py
.gitea/workflows/ci.yml

Setup

Install Python 3.11+.
Install dependencies:

pip install -r requirements.txt

Create .env in project root:

OPENAI_API_KEY=...
SCREENJOB_TOKEN=choose_a_strong_token

# Optional
SCREENJOB_DEFAULT_MODEL=gpt-5.4-mini
SCREENJOB_SAFETY_MODEL=gpt-5.4-mini
SCREENJOB_HOST=127.0.0.1
SCREENJOB_PORT=8787
DISABLE_UI=false
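
A minimal sketch of how these variables could be read in Python. It assumes the optional values shown above act as defaults when a variable is unset, and it reads from the process environment rather than parsing .env itself. Settings and load_settings are hypothetical names for illustration, not the contents of src/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env."""
    openai_api_key: str
    token: str
    default_model: str
    safety_model: str
    host: str
    port: int
    disable_ui: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment-like mapping."""
    # The two required keys: fail early with a clear message if one is missing.
    missing = [k for k in ("OPENAI_API_KEY", "SCREENJOB_TOKEN") if not env.get(k)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        token=env["SCREENJOB_TOKEN"],
        # Assumption: the example values in the README double as defaults.
        default_model=env.get("SCREENJOB_DEFAULT_MODEL", "gpt-5.4-mini"),
        safety_model=env.get("SCREENJOB_SAFETY_MODEL", "gpt-5.4-mini"),
        host=env.get("SCREENJOB_HOST", "127.0.0.1"),
        port=int(env.get("SCREENJOB_PORT", "8787")),
        disable_ui=env.get("DISABLE_UI", "false").strip().lower() == "true",
    )
```

Passing a plain dict instead of os.environ keeps the helper easy to test.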

Usage

CLI

python main.py run "Open amazon.de and go to my orders"

CLI JSON output includes both legacy and structured fields:

{
  "completed": true,
  "result": "Task completed successfully",
  "response": {
    "return": "Task completed successfully",
    "data": "file1.txt\nfile2.txt"
  },
  "return": "Task completed successfully",
  "data": "file1.txt\nfile2.txt"
}
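
Because the same values appear under several keys, a consumer can prefer the structured response object and fall back to the legacy fields. The sketch below shows one way to do that; extract_result is a hypothetical helper, and the fallback order is an assumption rather than documented behavior.

```python
from typing import Any, Dict, Optional, Tuple


def extract_result(payload: Dict[str, Any]) -> Tuple[Optional[str], Any]:
    """Return (message, data), preferring the structured fields."""
    response = payload.get("response") or {}
    # Assumed precedence: response.return, then top-level return, then result.
    message = response.get("return", payload.get("return", payload.get("result")))
    data = response.get("data", payload.get("data"))
    return message, data


sample = {
    "completed": True,
    "result": "Task completed successfully",
    "response": {
        "return": "Task completed successfully",
        "data": "file1.txt\nfile2.txt",
    },
}
message, data = extract_result(sample)
print(message)            # Task completed successfully
print(data.splitlines())  # ['file1.txt', 'file2.txt']
```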

Server

python main.py server

Or use the PowerShell launcher:

.\start_backend.ps1

Backend Startup

For screenshot-driven automation, start the backend in the logged-in user's session. That gives pyautogui access to the interactive desktop, which Windows services do not have. If you previously installed the legacy service, remove it once by running .\uninstall_backend_service.ps1 from an elevated PowerShell session.

Install a sign-in launcher for the current user:

.\install_backend_service.ps1

Install it for all users:

.\install_backend_service.ps1 -AllUsers

Start it immediately after installing:

.\install_backend_service.ps1 -StartNow

Remove the launcher:

.\uninstall_backend_service.ps1

The launcher runs start_backend.ps1 hidden via start_backend_hidden.vbs. If you need to start the backend manually, run:

.\start_backend.ps1

The legacy Windows service host remains in the tree for reference, but it is not the recommended path for GUI tasks.

System Tray Icon (Windows)

Start tray icon now:

powershell -NoProfile -ExecutionPolicy Bypass -STA -File .\screenjob_tray.ps1

Install startup shortcut (current user):

.\install_tray_startup_shortcut.ps1

Install startup shortcut for all users:

.\install_tray_startup_shortcut.ps1 -AllUsers

Remove startup shortcut:

.\install_tray_startup_shortcut.ps1 -Remove

Tray menu actions (the service controls apply to the legacy Windows service host):

Refresh service status
Start/Stop/Restart service (prompts for admin/UAC)
Open dashboard URL from .env SCREENJOB_HOST / SCREENJOB_PORT
Open service logs folder
Exit tray icon process

Authentication is required on all API routes. The token is accepted via:

Authorization: Bearer <SCREENJOB_TOKEN>
X-ScreenJob-Token: <SCREENJOB_TOKEN>
Query fallback ?token= (mainly for UI/websocket/artifact fetch)

Create Job

POST /api/jobs

{
  "job": "run \"ls -a\" in C:/Users/username/Documents and return output",
  "model": "gpt-5.4-mini",
  "native_automation_mode": "prefer",
  "dialog_timeout_seconds": 12,
  "focus_timeout_seconds": 8,
  "ui_element_timeout_seconds": 8,
  "max_retries_per_surface": 3,
  "disabled_tools": [],
  "safety_override": false
}

Response:

{ "job_id": "job_..." }
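
A minimal client sketch for this endpoint using only the Python standard library. It sends the token in the Authorization header described above. The helper names are hypothetical, and the Content-Type header is an assumption about what the server expects for a JSON body.

```python
import json
import urllib.request
from typing import Any


def build_create_job_request(
    base_url: str, token: str, job: str, **options: Any
) -> urllib.request.Request:
    """Prepare (but do not send) a POST /api/jobs request."""
    body = {"job": job, **options}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            # Assumption: the server accepts a standard JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_job(base_url: str, token: str, job: str, **options: Any) -> str:
    """Send the request and return the job_id from the response."""
    request = build_create_job_request(base_url, token, job, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["job_id"]
```

With a server running, create_job("http://127.0.0.1:8787", token, "Open Notepad", disabled_tools=["drag"]) would return the new job ID. Optional fields from the request body above are passed as keyword arguments.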

Job Status / History

GET /api/jobs/{job_id}
GET /api/jobs/{job_id}/status
GET /api/jobs/{job_id}/events
GET /api/jobs
POST /api/jobs/{job_id}/cancel
GET /api/stats

Each job payload includes:

result (legacy compatibility string)
response.return
response.data
top-level return and data aliases
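
A client that does not use the WebSocket feed can poll the status route until the job finishes. The sketch below keeps the HTTP call and the "finished" test as caller-supplied functions, because this README does not specify which field or values mark a terminal state. wait_for_job and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    is_finished: Callable[[Dict[str, Any]], bool],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_job repeatedly until is_finished(job) is true.

    fetch_job would typically wrap GET /api/jobs/{job_id}/status.
    Raises TimeoutError if the job is still running at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if is_finished(job):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within the timeout")
        sleep(interval)
```

The sleep parameter exists so the loop can be exercised in tests without real delays.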

Monitoring UI

URL: /
Read-only dashboard (no run controls)
Requires token input
Live updates via /ws
Analytics dashboards for success rate by objective category and daily averages
Set DISABLE_UI=true to disable UI
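
For clients that want the same live feed as the dashboard, the WebSocket address can be built from the host, port, and the ?token= query fallback described earlier. This is a sketch only: monitor_ws_url is a hypothetical helper, and the ws/wss scheme choice is an assumption about the deployment.

```python
from urllib.parse import urlencode


def monitor_ws_url(
    host: str = "127.0.0.1",
    port: int = 8787,
    token: str = "",
    secure: bool = False,
) -> str:
    """Build the /ws URL with the token passed as a query parameter."""
    scheme = "wss" if secure else "ws"
    # urlencode escapes characters that are not safe inside a query string.
    return f"{scheme}://{host}:{port}/ws?{urlencode({'token': token})}"


print(monitor_ws_url(token="choose_a_strong_token"))
# ws://127.0.0.1:8787/ws?token=choose_a_strong_token
```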

Analytics API

GET /api/analytics
Returns objective-category success rates plus average steps/cost over time

Agent Instructions (Practical)

Prefer execute_command for deterministic actions (opening URLs, filesystem checks).
First classify the current Windows surface, then choose the control channel.
Prefer native window/dialog/element tools for focus changes, file pickers, modal confirmations, and browser-owned dialogs when available.
Use see_screen before UI interaction.
Use enhance before clicking small/ambiguous targets; prefer region="small" for compact controls.
Use enhance mode="text" for tiny labels/text, or mode="ui" for general UI.
Optionally set enhance scale (2-6) for tighter zoom control.
Use list_windows, find_window, focus_window, and wait_for_focus_change instead of blind Alt+Tab retries.
Use detect_dialog, dialog_set_filename, dialog_action, and wait_for_dialog_close for native open/save/confirm flows.
Use list_ui_elements, invoke_ui_element, set_ui_element_value, select_ui_element, and wait_for_ui_element when controls are exposed natively.
Use press_key for non-text keys (Enter, Tab, arrows, Escape).
For shortcuts, use one press_key call with combo syntax (example: win+r).
Use click offsets via offset_up/down/left/right; set button and click_count there instead of inventing one-off click tools.
Use move_mouse when you need hover-only behavior and drag for slider, selection, or window moves.
Use scroll for vertical navigation; positive amounts scroll up and negative amounts scroll down.
Use clipboard_get / clipboard_set for copy-paste workflows, get_cursor_position for cursor inspection, and get_active_window before interacting with uncertain focus.
If native automation is unavailable or disabled, ScreenJob falls back to screenshots plus mouse/keyboard control and emits fallback events.
When done, call:
- task_complete(return="...", data=...)
Before task_complete, verify expected on-screen content with see_screen (and enhance if needed), and include an observed_result summary in data.

Per-job disabled_tools must match the built-in tool allowlist. task_complete cannot be disabled.
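
The rule above can be sketched as a small validation step. The tool names are copied from the Core Features list; validate_disabled_tools and its error messages are hypothetical and do not reflect the actual server code.

```python
from typing import Iterable, List

# Tool names as listed under Core Features.
ALLOWED_TOOLS = frozenset({
    "execute_command", "see_screen", "enhance", "list_windows", "find_window",
    "focus_window", "close_window", "wait_for_window", "wait_for_focus_change",
    "detect_dialog", "dialog_action", "dialog_set_filename",
    "wait_for_dialog_close", "list_ui_elements", "invoke_ui_element",
    "set_ui_element_value", "select_ui_element", "wait_for_ui_element",
    "click", "scroll", "drag", "move_mouse", "type", "press_key",
    "clipboard_get", "clipboard_set", "get_cursor_position",
    "get_active_window", "sleep", "task_complete",
})


def validate_disabled_tools(disabled: Iterable[str]) -> List[str]:
    """Return a sorted, de-duplicated disable list or raise ValueError."""
    requested = set(disabled)
    # The agent must always be able to finish, so task_complete stays enabled.
    if "task_complete" in requested:
        raise ValueError("task_complete cannot be disabled")
    unknown = sorted(requested - ALLOWED_TOOLS)
    if unknown:
        raise ValueError("unknown tools: " + ", ".join(unknown))
    return sorted(requested)
```

For example, validate_disabled_tools(["drag", "click"]) returns ["click", "drag"], while a list containing task_complete or an unrecognized name raises ValueError.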

data should contain useful structured output for the requester (text, object, list, etc.).

Verification

Local:

pytest -q

CI:

.gitea/workflows/ci.yml runs compile checks and tests on push and pull request.

Compatibility Entry Point

python screenjob.py "<job>" remains supported as a wrapper around main.py.

License

Apache License 2.0. See LICENSE.
