ScreenJob
ScreenJob is an autonomous desktop-and-terminal execution service.
It lets an LLM use controlled local tools (screen, mouse, keyboard, clipboard, shell) to complete GUI-heavy tasks on a real computer.
What It Solves
- Runs agent-driven tasks that require a graphical interface.
- Exposes both CLI and HTTP API modes.
- Stores job history and events in SQLite.
- Streams live monitoring updates over WebSocket.
- Returns structured agent output as:
  - return: human-readable completion message
  - data: structured payload (for example command output)
Core Features
- Hybrid control model: screenshot grounding plus Windows-native window, dialog, and UI-element helpers when available
- Tool-based agent loop (execute_command, see_screen, enhance, list_windows, find_window, focus_window, close_window, wait_for_window, wait_for_focus_change, detect_dialog, dialog_action, dialog_set_filename, wait_for_dialog_close, list_ui_elements, invoke_ui_element, set_ui_element_value, select_ui_element, wait_for_ui_element, click, scroll, drag, move_mouse, type, press_key, clipboard_get, clipboard_set, get_cursor_position, get_active_window, sleep, task_complete)
- Safety pre-check with override support
- Per-job tool disable list
- Live/final usage and cost estimates
- Read-only Tailwind monitoring UI
- Persistent job and event history
Project Layout
main.py
screenjob.py
requirements.txt
start_backend.ps1
src/
  agent.py
  app_main.py
  cli.py
  config.py
  models.py
  pricing.py
  runtime.py
  safety.py
  server.py
  storage.py
  task_manager.py
  ui.py
  utils.py
tests/
  test_agent_tools.py
  test_pricing.py
  test_server_api.py
  test_storage.py
.gitea/workflows/ci.yml
Setup
- Install Python 3.11+.
- Install dependencies:
pip install -r requirements.txt
- Create .env in project root:
OPENAI_API_KEY=...
SCREENJOB_TOKEN=choose_a_strong_token
# Optional
SCREENJOB_DEFAULT_MODEL=gpt-5.4-mini
SCREENJOB_SAFETY_MODEL=gpt-5.4-mini
SCREENJOB_HOST=127.0.0.1
SCREENJOB_PORT=8787
DISABLE_UI=false
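The settings above can be read from the environment roughly as follows. This is a minimal sketch, not the project's config.py: the load_settings helper is hypothetical, the fallback host and port simply mirror the sample values shown above (the real defaults may differ), and treating SCREENJOB_TOKEN as mandatory is an assumption based on it being listed above the optional block.

```python
import os


def load_settings(env=None):
    """Collect ScreenJob-related settings from an environment mapping.

    Hypothetical helper for illustration; not taken from src/config.py.
    """
    env = os.environ if env is None else env
    token = env.get("SCREENJOB_TOKEN", "")
    if not token:
        # Assumption: the token is required because it is not marked optional.
        raise ValueError("SCREENJOB_TOKEN must be set")
    return {
        "token": token,
        # Fallbacks mirror the sample .env values; real defaults may differ.
        "host": env.get("SCREENJOB_HOST", "127.0.0.1"),
        "port": int(env.get("SCREENJOB_PORT", "8787")),
        "disable_ui": env.get("DISABLE_UI", "false").strip().lower() == "true",
    }


settings = load_settings({"SCREENJOB_TOKEN": "choose_a_strong_token"})
print(settings["host"], settings["port"], settings["disable_ui"])
# 127.0.0.1 8787 False
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.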
Usage
CLI
python main.py run "Open amazon.de and go to my orders"
CLI JSON output includes both legacy and structured fields:
{
"completed": true,
"result": "Task completed successfully",
"response": {
"return": "Task completed successfully",
"data": "file1.txt\nfile2.txt"
},
"return": "Task completed successfully",
"data": "file1.txt\nfile2.txt"
}
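A caller that consumes this output may want to tolerate both shapes. The sketch below is one possible reader, assuming the JSON arrives as a single string (for example captured stdout); parse_cli_output is a hypothetical name, and the fallback order (response, then top-level aliases, then legacy result) is an illustrative choice rather than documented behavior.

```python
import json


def parse_cli_output(stdout_text):
    """Return (message, data) from ScreenJob CLI JSON output."""
    payload = json.loads(stdout_text)
    response = payload.get("response") or {}
    # Prefer the structured fields, then the top-level aliases, then "result".
    message = response.get("return", payload.get("return", payload.get("result")))
    data = response.get("data", payload.get("data"))
    return message, data


sample = (
    '{"completed": true, "result": "Task completed successfully", '
    '"response": {"return": "Task completed successfully", '
    '"data": "file1.txt\\nfile2.txt"}}'
)
message, data = parse_cli_output(sample)
print(message)
# Task completed successfully
print(data.splitlines())
# ['file1.txt', 'file2.txt']
```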
Server
python main.py server
Or use the PowerShell launcher:
.\start_backend.ps1
Backend Startup
For screenshot-driven automation, start the backend in the logged-in user session.
Running there gives pyautogui access to the interactive desktop, which Windows services do not have.
If you previously installed the legacy service, remove it once from an elevated PowerShell session with .\uninstall_backend_service.ps1.
Install a sign-in launcher for the current user:
.\install_backend_service.ps1
Install it for all users:
.\install_backend_service.ps1 -AllUsers
Start it immediately after installing:
.\install_backend_service.ps1 -StartNow
Remove the launcher:
.\uninstall_backend_service.ps1
The launcher runs start_backend.ps1 hidden via start_backend_hidden.vbs.
If you need to start the backend manually, run:
.\start_backend.ps1
The legacy Windows service host remains in the tree for reference, but it is not the recommended path for GUI tasks.
System Tray Icon (Windows)
Start tray icon now:
powershell -NoProfile -ExecutionPolicy Bypass -STA -File .\screenjob_tray.ps1
Install startup shortcut (current user):
.\install_tray_startup_shortcut.ps1
Install startup shortcut for all users:
.\install_tray_startup_shortcut.ps1 -AllUsers
Remove startup shortcut:
.\install_tray_startup_shortcut.ps1 -Remove
Tray menu actions:
- The service controls are for the legacy Windows service host.
- Refresh service status
- Start/Stop/Restart service (prompts for admin/UAC)
- Open dashboard URL from .env SCREENJOB_HOST/SCREENJOB_PORT
- Open service logs folder
- Exit tray icon process
Auth for all API routes:
- Authorization: Bearer <SCREENJOB_TOKEN>
- X-ScreenJob-Token: <SCREENJOB_TOKEN>
- Query fallback ?token= (mainly for UI/websocket/artifact fetch)
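The two header forms listed above can be produced by a small client-side helper. This is only a sketch: auth_headers and its style argument are hypothetical names, and the header names are copied from the list above.

```python
def auth_headers(token, style="bearer"):
    """Build request headers for one of the documented header auth forms."""
    if style == "bearer":
        return {"Authorization": f"Bearer {token}"}
    if style == "header":
        return {"X-ScreenJob-Token": token}
    raise ValueError(f"unknown auth style: {style!r}")


print(auth_headers("abc"))
# {'Authorization': 'Bearer abc'}
print(auth_headers("abc", style="header"))
# {'X-ScreenJob-Token': 'abc'}
```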
Create Job
POST /api/jobs
{
"job": "run \"ls -a\" in C:/Users/username/Documents and return output",
"model": "gpt-5.4-mini",
"native_automation_mode": "prefer",
"dialog_timeout_seconds": 12,
"focus_timeout_seconds": 8,
"ui_element_timeout_seconds": 8,
"max_retries_per_surface": 3,
"disabled_tools": [],
"safety_override": false
}
Response:
{ "job_id": "job_..." }
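A job can be submitted from Python with the standard library alone. The sketch below assumes the server is reachable at a base URL you supply and uses Bearer auth; build_create_job_request and create_job are hypothetical helper names, and only the job_id response field shown above is relied on.

```python
import json
import urllib.request


def build_create_job_request(base_url, token, job, **options):
    """Prepare (but do not send) a POST /api/jobs request."""
    body = {"job": job, **options}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_job(request, timeout=30):
    """Send the prepared request and return the job_id from the response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["job_id"]


req = build_create_job_request(
    "http://127.0.0.1:8787",
    "choose_a_strong_token",
    'run "ls -a" in C:/Users/username/Documents and return output',
    disabled_tools=[],
    safety_override=False,
)
print(req.get_method(), req.full_url)
# POST http://127.0.0.1:8787/api/jobs
```

Building the request separately from sending it makes the payload easy to inspect; call create_job(req) once a backend is actually running.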
Job Status / History
- GET /api/jobs/{job_id}
- GET /api/jobs/{job_id}/status
- GET /api/jobs/{job_id}/events
- GET /api/jobs
- POST /api/jobs/{job_id}/cancel
- GET /api/stats
Each job payload includes:
- result (compat string)
- response.return
- response.data
- top-level return and data aliases
Monitoring UI
- URL: /
- Read-only dashboard (no run controls)
- Requires token input
- Live updates via /ws
- Analytics dashboards for success rate by objective category and daily averages
- Set DISABLE_UI=true to disable UI
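A custom monitoring client would connect to the same /ws endpoint, passing the token through the ?token= query fallback. The sketch below only builds the URL; monitor_ws_url is a hypothetical helper, the wss option assumes a TLS-terminating proxy in front of the server, and the message format sent over the socket is not covered here.

```python
from urllib.parse import urlencode, urlunsplit


def monitor_ws_url(host, port, token, secure=False):
    """Build the WebSocket URL for live updates, using the token query fallback."""
    scheme = "wss" if secure else "ws"
    query = urlencode({"token": token})  # also percent-encodes unusual characters
    return urlunsplit((scheme, f"{host}:{port}", "/ws", query, ""))


print(monitor_ws_url("127.0.0.1", 8787, "secret"))
# ws://127.0.0.1:8787/ws?token=secret
```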
Analytics API
- GET /api/analytics
- Returns objective-category success rates plus average steps/cost over time
Agent Instructions (Practical)
- Prefer execute_command for deterministic actions (opening URLs, filesystem checks).
- First classify the current Windows surface, then choose the control channel.
- Prefer native window/dialog/element tools for focus changes, file pickers, modal confirmations, and browser-owned dialogs when available.
- Use see_screen before UI interaction.
- Use enhance before clicking small/ambiguous targets; prefer region="small" for compact controls.
- Use enhance mode="text" for tiny labels/text, or mode="ui" for general UI.
- Optionally set enhance scale (2-6) for tighter zoom control.
- Use list_windows, find_window, focus_window, and wait_for_focus_change instead of blind Alt+Tab retries.
- Use detect_dialog, dialog_set_filename, dialog_action, and wait_for_dialog_close for native open/save/confirm flows.
- Use list_ui_elements, invoke_ui_element, set_ui_element_value, select_ui_element, and wait_for_ui_element when controls are exposed natively.
- Use press_key for non-text keys (Enter, Tab, arrows, Escape).
- For shortcuts, use one press_key call with combo syntax (example: win+r).
- Use click offsets via offset_up/down/left/right; set button and click_count there instead of inventing one-off click tools.
- Use move_mouse when you need hover-only behavior and drag for slider, selection, or window moves.
- Use scroll for vertical navigation; positive amounts scroll up and negative amounts scroll down.
- Use clipboard_get/clipboard_set for copy-paste workflows, get_cursor_position for cursor inspection, and get_active_window before interacting with uncertain focus.
- If native automation is unavailable or disabled, ScreenJob falls back to screenshots plus mouse/keyboard control and emits fallback events.
- When done, call:
task_complete(return="...", data=...)
- Before task_complete, verify expected on-screen content with see_screen (and enhance if needed), and include an observed_result summary in data.
Per-job disabled_tools must match the built-in tool allowlist. task_complete cannot be disabled.
data should contain useful structured output for the requester (text, object, list, etc.).
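The disabled_tools rule can be checked on the client before a job is submitted. The sketch below copies the tool names from the Core Features list; validate_disabled_tools is a hypothetical helper, and raising an error for unknown names is an assumption, since the server's actual handling of invalid entries is not described here.

```python
# Tool names copied from the Core Features list.
BUILTIN_TOOLS = frozenset({
    "execute_command", "see_screen", "enhance", "list_windows", "find_window",
    "focus_window", "close_window", "wait_for_window", "wait_for_focus_change",
    "detect_dialog", "dialog_action", "dialog_set_filename",
    "wait_for_dialog_close", "list_ui_elements", "invoke_ui_element",
    "set_ui_element_value", "select_ui_element", "wait_for_ui_element",
    "click", "scroll", "drag", "move_mouse", "type", "press_key",
    "clipboard_get", "clipboard_set", "get_cursor_position",
    "get_active_window", "sleep", "task_complete",
})


def validate_disabled_tools(disabled_tools):
    """Return a sorted, de-duplicated disable list or raise ValueError."""
    requested = set(disabled_tools)
    unknown = sorted(requested - BUILTIN_TOOLS)
    if unknown:
        raise ValueError(f"not in the built-in tool allowlist: {unknown}")
    if "task_complete" in requested:
        raise ValueError("task_complete cannot be disabled")
    return sorted(requested)


print(validate_disabled_tools(["drag", "clipboard_set", "drag"]))
# ['clipboard_set', 'drag']
```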
Verification
Local:
pytest -q
CI:
.gitea/workflows/ci.yml runs compile checks + tests on push/PR.
Compatibility Entry Point
python screenjob.py "<job>" remains supported as a wrapper to main.py.
License
Apache License 2.0. See LICENSE.