Switch backend startup to interactive session

2026-05-31 20:43:01 +02:00
parent a521142b89
commit 79c9e98842
7 changed files with 795 additions and 137 deletions
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # ScreenJob

 ScreenJob is an autonomous desktop-and-terminal execution service.  
-It lets an LLM use controlled local tools (screen, click, type, shell) to complete GUI-heavy tasks on a real computer.
+It lets an LLM use controlled local tools (screen, mouse, keyboard, clipboard, shell) to complete GUI-heavy tasks on a real computer.

 ## What It Solves

@@ -15,7 +15,8 @@ It lets an LLM use controlled local tools (screen, click, type, shell) to comple

 ## Core Features

-- Tool-based agent loop (`execute_command`, `see_screen`, `enhance`, `click`, `type`, `press_key`, `sleep`, `task_complete`)
+- Hybrid control model: screenshot grounding plus Windows-native window, dialog, and UI-element helpers when available
+- Tool-based agent loop (`execute_command`, `see_screen`, `enhance`, `list_windows`, `find_window`, `focus_window`, `close_window`, `wait_for_window`, `wait_for_focus_change`, `detect_dialog`, `dialog_action`, `dialog_set_filename`, `wait_for_dialog_close`, `list_ui_elements`, `invoke_ui_element`, `set_ui_element_value`, `select_ui_element`, `wait_for_ui_element`, `click`, `scroll`, `drag`, `move_mouse`, `type`, `press_key`, `clipboard_get`, `clipboard_set`, `get_cursor_position`, `get_active_window`, `sleep`, `task_complete`)
 - Safety pre-check with override support
 - Per-job tool disable list
 - Live/final usage and cost estimates
@@ -109,43 +110,45 @@ Or use the PowerShell launcher:
 .\start_backend.ps1
 ```

-### Windows Service
+### Backend Startup

-Run these from an elevated PowerShell session (Run as Administrator):
-Requires .NET SDK 10+ (installer publishes a native service host executable).
+For screenshot-driven automation, start the backend in the logged-in user session.
+That gives `pyautogui` access to the interactive desktop, which Windows services do not.
+If you previously installed the legacy service, remove it once from an elevated PowerShell session with `.\uninstall_backend_service.ps1`.

-Install and start at boot:
+Install a sign-in launcher for the current user:

 ```powershell
-.\install_backend_service.ps1 -ForceReinstall -StartAfterInstall -DelayedAutoStart
+.\install_backend_service.ps1
 ```

-Check status:
+Install it for all users:

 ```powershell
-Get-Service -Name ScreenJobBackend
+.\install_backend_service.ps1 -AllUsers
 ```

-Stop/start manually:
+Start it immediately after installing:

 ```powershell
-Stop-Service -Name ScreenJobBackend
-Start-Service -Name ScreenJobBackend
+.\install_backend_service.ps1 -StartNow
 ```

-Uninstall:
+Remove the launcher:

 ```powershell
 .\uninstall_backend_service.ps1
 ```

-Service logs are written to:
+The launcher runs `start_backend.ps1` hidden via `start_backend_hidden.vbs`.
+If you need to start the backend manually, run:

-```text
-screenjob_runs/service/backend-service.stdout.log
-screenjob_runs/service/backend-service.stderr.log
+```powershell
+.\start_backend.ps1
 ```

+The legacy Windows service host remains in the tree for reference, but it is not the recommended path for GUI tasks.
+
 ### System Tray Icon (Windows)

 Start tray icon now:
@@ -174,6 +177,7 @@ Remove startup shortcut:

 Tray menu actions:

+- The service controls are for the legacy Windows service host.
 - Refresh service status
 - Start/Stop/Restart service (prompts for admin/UAC)
 - Open dashboard URL from `.env` `SCREENJOB_HOST` / `SCREENJOB_PORT`
@@ -194,6 +198,11 @@ Auth for all API routes:
 {
  "job": "run \"ls -a\" in C:/Users/username/Documents and return output",
  "model": "gpt-5.4-mini",
+  "native_automation_mode": "prefer",
+  "dialog_timeout_seconds": 12,
+  "focus_timeout_seconds": 8,
+  "ui_element_timeout_seconds": 8,
+  "max_retries_per_surface": 3,
  "disabled_tools": [],
  "safety_override": false
 }
@@ -238,17 +247,28 @@ Each job payload includes:
 ## Agent Instructions (Practical)

 - Prefer `execute_command` for deterministic actions (opening URLs, filesystem checks).
+- First classify the current Windows surface, then choose the control channel.
+- Prefer native window/dialog/element tools for focus changes, file pickers, modal confirmations, and browser-owned dialogs when available.
 - Use `see_screen` before UI interaction.
 - Use `enhance` before clicking small/ambiguous targets; prefer `region="small"` for compact controls.
 - Use `enhance` `mode="text"` for tiny labels/text, or `mode="ui"` for general UI.
 - Optionally set `enhance` `scale` (2-6) for tighter zoom control.
+- Use `list_windows`, `find_window`, `focus_window`, and `wait_for_focus_change` instead of blind Alt+Tab retries.
+- Use `detect_dialog`, `dialog_set_filename`, `dialog_action`, and `wait_for_dialog_close` for native open/save/confirm flows.
+- Use `list_ui_elements`, `invoke_ui_element`, `set_ui_element_value`, `select_ui_element`, and `wait_for_ui_element` when controls are exposed natively.
 - Use `press_key` for non-text keys (Enter, Tab, arrows, Escape).
 - For shortcuts, use one `press_key` call with combo syntax (example: `win+r`).
-- Use `click` offsets via `offset_up/down/left/right` and optional `sleep_after_seconds`.
+- Use `click` offsets via `offset_up/down/left/right`; set `button` and `click_count` there instead of inventing one-off click tools.
+- Use `move_mouse` when you need hover-only behavior and `drag` for slider, selection, or window moves.
+- Use `scroll` for vertical navigation; positive amounts scroll up and negative amounts scroll down.
+- Use `clipboard_get` / `clipboard_set` for copy-paste workflows, `get_cursor_position` for cursor inspection, and `get_active_window` before interacting with uncertain focus.
+- If native automation is unavailable or disabled, ScreenJob falls back to screenshots plus mouse/keyboard control and emits fallback events.
 - When done, call:
  - `task_complete(return="...", data=...)`
 - Before `task_complete`, verify expected on-screen content with `see_screen` (and `enhance` if needed), and include an `observed_result` summary in `data`.

+Per-job `disabled_tools` must match the built-in tool allowlist. `task_complete` cannot be disabled.
+
 `data` should contain useful structured output for the requester (text, object, list, etc.).

 ## Verification