ScreenJob Skill (OpenClaw Agents)

What ScreenJob Solves

ScreenJob lets an agent execute tasks that require a real desktop UI plus terminal access, with structured tool calls and job tracking.

Main Features

Hybrid control model: screenshot grounding plus Windows-native window/dialog/element helpers when available
Screen perception (see_screen, enhance)
Mouse/keyboard control (click, type, press_key)
Native window/dialog control (list_windows, find_window, focus_window, detect_dialog, dialog_action, dialog_set_filename, list_ui_elements)
Terminal execution (execute_command, sleep)
Structured completion payload (task_complete(return=..., data=...))
Safety gate, auth, history, and live monitoring

Important Environment Note

ScreenJob runs on a separate computer (the human/operator machine), not inside the agent's own runtime environment.

Why It Is Useful

Agents can use ScreenJob to launch and control GUI workflows, including orchestrating other GUI agents/tools on a human computer.

Example Tasks

Open amazon.de and buy a USB-C to USB-C cable for 10 EUR or less.
Open google.com, go to my account, and change my profile picture to a provided image URL.
Run ls -a in C:/Users/username/Documents and return the output in data.

Practical Usage

Submit a job via the CLI or the API.
The agent runs its tool loop until it calls task_complete.
Read the final response.return and response.data from the job status.

Keyboard combo rule:

For shortcuts, use one press_key call with combo syntax, for example: win+r, ctrl+shift+esc.
Do not split modifier combos into separate calls.
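
As a minimal sketch of the rule above, the helper below joins key names into one combo string, so a shortcut is always sent as a single press_key argument. The function name key_combo and the lowercase normalization are assumptions made for illustration. They are not part of ScreenJob.

```python
def key_combo(*keys: str) -> str:
    """Join key names into one combo string, e.g. 'ctrl+shift+esc'.

    Hypothetical helper: the result is meant to be passed to a single
    press_key call instead of one call per key.
    """
    parts = [k.strip().lower() for k in keys]
    # Reject empty input and names that already contain the separator.
    if not parts or any(not p or "+" in p for p in parts):
        raise ValueError("each key must be a non-empty name without '+'")
    return "+".join(parts)


if __name__ == "__main__":
    print(key_combo("win", "r"))
    print(key_combo("Ctrl", "Shift", "Esc"))
```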

Enhance-first click rule:

Before clicking small buttons/icons, dense UI, or ambiguous targets, call enhance first.
Preferred preset for tiny controls: enhance(coordinate, region="small", mode="ui").
For tiny labels/text: use mode="text" to improve readability.
Optional zoom control: set scale from 2 to 6 (defaults are tuned by region).
After checking the enhanced image, click using the same target coordinate (or a small directional offset if needed).
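
The presets above can be sketched as a small argument builder. This is an illustration only: the dictionary shape, the (x, y) coordinate format, and the function name enhance_args are assumptions. The only values taken from this document are region="small", mode="ui" or mode="text", and the scale range 2 to 6.

```python
from typing import Optional, Tuple


def enhance_args(
    coordinate: Tuple[int, int],
    mode: str = "ui",
    region: str = "small",
    scale: Optional[int] = None,
) -> dict:
    """Build arguments for an enhance call (hypothetical payload shape)."""
    # This sketch only accepts the two modes named in this document.
    if mode not in ("ui", "text"):
        raise ValueError("mode must be 'ui' or 'text' in this sketch")
    args = {"coordinate": list(coordinate), "region": region, "mode": mode}
    # scale is optional; when omitted, the region's tuned default applies.
    if scale is not None:
        if not 2 <= scale <= 6:
            raise ValueError("scale must be between 2 and 6")
        args["scale"] = scale
    return args


if __name__ == "__main__":
    print(enhance_args((640, 412)))                        # tiny control
    print(enhance_args((640, 412), mode="text", scale=4))  # tiny label
```

Leaving scale unset by default keeps the region-tuned zoom in effect, and the same coordinate can be reused for the click that follows.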

Windows-native routing rule:

First classify whether the current surface is a normal app window, browser window, #32770 dialog, Explorer file picker, or another system surface.
Prefer native window/dialog/element tools for focus changes, save/open dialogs, modal confirmations, and exposed controls.
Fall back to screenshots plus mouse/keyboard only when native automation is unavailable or the UI is custom-drawn.

Verification rule:

Before task_complete, verify actual on-screen content matches the expected outcome.
Use see_screen (and enhance if needed) for this check.
Include a concise observed_result in data when completing the task.

Patience / rerun rule:

If a job is still running, do not assume it is stuck just because it looks slow, repetitive, or token-heavy.
Prefer waiting longer and checking for a final status/result before starting a replacement run.
Only restart or replace a running job when there is clear evidence it is failed, irrecoverably stuck, or the user explicitly asks for a restart.
If you do replace a run, say why in one short sentence and reference the specific blocker you observed.

API Quick Reference

Base URL:

http://127.0.0.1:8787 (default)

Auth (required on all /api/* routes):

Authorization: Bearer <SCREENJOB_TOKEN>
or X-ScreenJob-Token: <SCREENJOB_TOKEN>

Create a job:

POST /api/jobs
Body:

{
  "job": "Open amazon.de and go to my orders",
  "model": "gpt-5.4-mini",
  "disabled_tools": [],
  "safety_override": false
}

Response:

{ "job_id": "job_..." }
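
A minimal Python sketch of the create call, using only the standard library. It assumes the default base URL and the Bearer header shown above. The function names and the Content-Type header are assumptions, not documented behavior.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8787"  # default from this document


def build_create_job_request(token, job, model="gpt-5.4-mini", base_url=BASE_URL):
    """Build the POST /api/jobs request without sending it."""
    body = {
        "job": job,
        "model": model,
        "disabled_tools": [],
        "safety_override": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the server expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_job(token, job):
    """Send the request and return the job_id from the response."""
    request = build_create_job_request(token, job)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["job_id"]
```

Calling create_job("<SCREENJOB_TOKEN>", "Open amazon.de and go to my orders") against a running server should return the job_id string from the response above.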

Check progress/result:

GET /api/jobs/{job_id}
GET /api/jobs/{job_id}/status
GET /api/jobs/{job_id}/events
GET /api/jobs
POST /api/jobs/{job_id}/cancel
GET /api/stats
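
In line with the patience rule, a caller might poll GET /api/jobs/{job_id} with a generous time budget instead of starting a replacement run. The sketch below is one way to do that. Only the "completed" status appears in this document, so the "failed" and "cancelled" names, the interval, and the budget are assumptions to adjust.

```python
import json
import time
import urllib.request


def fetch_job(job_id, token, base_url="http://127.0.0.1:8787"):
    """GET /api/jobs/{job_id} and return the decoded JSON payload."""
    request = urllib.request.Request(
        f"{base_url}/api/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(
    fetch_status,
    terminal=("completed", "failed", "cancelled"),  # last two are assumed names
    interval=5.0,
    max_wait=1800.0,
    sleep=time.sleep,
):
    """Poll until the job reaches a terminal status or the budget runs out."""
    waited = 0.0
    while True:
        payload = fetch_status()
        if payload.get("status") in terminal:
            return payload
        if waited >= max_wait:
            raise TimeoutError(f"job still not finished after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Usage would look like wait_for_job(lambda: fetch_job(job_id, token)). Passing the fetch function in keeps the loop testable without a server.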

Result contract in job payload:

{
  "status": "completed",
  "response": {
    "return": "Task completed successfully",
    "data": "file1.txt\nfile2.txt"
  },
  "return": "Task completed successfully",
  "data": "file1.txt\nfile2.txt"
}
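
The payload above carries return and data both inside response and at the top level. A small reader, sketched here under the assumption that the nested response object is preferred when present, could look like this (extract_result is a hypothetical name):

```python
def extract_result(payload: dict):
    """Return (return_text, data) from a job payload.

    Prefers the nested "response" object and falls back to the
    top-level fields when it is missing.
    """
    response = payload.get("response") or {}
    return_text = response.get("return", payload.get("return"))
    data = response.get("data", payload.get("data"))
    return return_text, data


if __name__ == "__main__":
    sample = {
        "status": "completed",
        "response": {
            "return": "Task completed successfully",
            "data": "file1.txt\nfile2.txt",
        },
        "return": "Task completed successfully",
        "data": "file1.txt\nfile2.txt",
    }
    text, data = extract_result(sample)
    print(text)
    print(data.splitlines())
```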

Artifacts (screenshots/enhanced images):

GET /api/jobs/{job_id}/artifact?path=<absolute_artifact_path>&token=<SCREENJOB_TOKEN>
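
Because the artifact path is absolute and may contain colons, slashes, or spaces, it should be URL-encoded before it goes into the query string. A minimal sketch follows. The function name and the example path are hypothetical.

```python
from urllib.parse import quote, urlencode


def artifact_url(job_id, artifact_path, token, base_url="http://127.0.0.1:8787"):
    """Build the artifact download URL with an encoded path and token."""
    # urlencode percent-encodes ':' and '/' and turns spaces into '+'.
    query = urlencode({"path": artifact_path, "token": token})
    return f"{base_url}/api/jobs/{quote(job_id, safe='')}/artifact?{query}"


if __name__ == "__main__":
    # Hypothetical job id and artifact path, for illustration only.
    print(artifact_url("job_1", "C:/tmp/a b.png", "<SCREENJOB_TOKEN>"))
```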
