diff --git a/README.md b/README.md
index 8714d73..4658a10 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@ Let an Agent interact with your computer over HTTP, with grid-aware screenshots
 
 - **Visual endpoints**: full-screen capture with optional grid overlay and labeled cells (`asImage=true` can return raw image bytes)
 - **Zoom endpoint**: crop around a point with denser grid for fine targeting (`asImage=true` supported)
+- **Multi-display support**: list displays with `GET /displays` and select one with `?screen=0`, `?screen=1`, ...
 - **Action endpoints**: move/click/right-click/double-click/middle-click/scroll/type/hotkey
 - **OCR endpoint**: extract text blocks with bounding boxes via `POST /ocr`
 - **Command execution endpoint**: run PowerShell/Bash/CMD commands via `POST /exec`
@@ -30,11 +31,12 @@ For OCR support, install the native `tesseract` binary on the host (in addition
 
 ## Minimal API flow
 
-1. `GET /screen` with grid
-2. Decide cell / target
-3. Optional `POST /zoom` for finer targeting
-4. `POST /action` to execute
-5. `GET /screen` again to verify result
+1. `GET /displays` if you need a non-primary monitor
+2. `GET /screen?screen=0` with grid
+3. Decide cell / target
+4. Optional `POST /zoom?screen=0` for finer targeting
+5. `POST /action?screen=0` to execute
+6. `GET /screen?screen=0` again to verify result
 
 See:
 - `docs/API.md`
diff --git a/docs/API.md b/docs/API.md
index dbfb58e..26e10af 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -12,19 +12,39 @@ x-clickthrough-token: <token>
 
 Returns status and runtime safety flags, including `exec` capability config.
 
+## `GET /displays`
+
+Returns detected displays in API screen order.
+
+```json
+{
+  "ok": true,
+  "default_screen": 0,
+  "displays": [
+    {"screen": 0, "mss_index": 1, "primary": true, "x": 0, "y": 0, "width": 1920, "height": 1080},
+    {"screen": 1, "mss_index": 2, "primary": false, "x": 1920, "y": 0, "width": 1920, "height": 1080}
+  ]
+}
+```
+
+`screen` is zero-based. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend.
+Invalid `screen` values fall back to `0`.
+
 ## `GET /screen`
 
 Query params:
 
+- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
 - `with_grid` (bool, default `true`)
 - `grid_rows` (int, default env or `12`)
 - `grid_cols` (int, default env or `12`)
 - `include_labels` (bool, default `true`)
 - `image_format` (`png`|`jpeg`, default `png`)
 - `jpeg_quality` (1-100, default `85`)
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 
-Default response includes base64 image and metadata (`meta.region`, optional `meta.grid`).
+Default response includes base64 image and metadata (`meta.region`, `meta.screen`, `meta.displays`, optional `meta.grid`).
+`meta.region` uses global desktop coordinates.
 
 ## `POST /zoom`
 
@@ -47,14 +67,21 @@ Body:
 
 Query params:
 
-- `asImage` (bool, default `false`) — if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
+- `screen` (int, default `0`) - zero-based display selector; invalid values fall back to `0`
+- `asImage` (bool, default `false`) - if `true`, return raw image bytes only (`image/png` or `image/jpeg`)
 
-Default response returns cropped image + region metadata in global pixel coordinates.
+Default response returns cropped image + region metadata in global pixel coordinates. `center_x` and `center_y` are also global coordinates; use the selected display's `meta.region` from `/screen?screen=X` as the coordinate base.
 
 ## `POST /action`
 
 Body: one action.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector reported in the response metadata (`screen`, `display`); it does not offset pointer coordinates; invalid values fall back to `0`
+
+Pointer coordinates remain global desktop coordinates. For multi-display actions, first capture `/screen?screen=X` and use that response's `meta.region` or grid metadata to compute the target.
+
 ### Pointer target modes
 
 #### Pixel target
@@ -147,6 +174,10 @@ Hotkey:
 
 Extract visible text from either a full screenshot, a region crop, or caller-provided image bytes.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector for `mode=screen` and `mode=region`; invalid values fall back to `0`
+
 Body:
 
 ```json
@@ -158,7 +189,7 @@ Body:
 ```
 
 Modes:
-- `screen` (default): OCR over full captured monitor
+- `screen` (default): OCR over full selected monitor
 - `region`: OCR over explicit region (`region_x`, `region_y`, `region_width`, `region_height`)
 - `image`: OCR over provided `image_base64` (supports plain base64 or data URL)
 
@@ -246,6 +277,10 @@ Response includes `stdout`, `stderr`, `exit_code`, timeout state, and execution
 
 Runs multiple `action` payloads sequentially.
 
+Query params:
+
+- `screen` (int, default `0`) - zero-based display selector reported in each action's response metadata; invalid values fall back to `0`
+
 ```json
 {
   "actions": [
diff --git a/docs/coordinate-system.md b/docs/coordinate-system.md
index 07d5f34..047af84 100644
--- a/docs/coordinate-system.md
+++ b/docs/coordinate-system.md
@@ -1,6 +1,8 @@
 # Coordinate System
 
-All interactions ultimately execute in **global pixel coordinates** of the primary monitor.
+All interactions ultimately execute in **global desktop pixel coordinates**.
+
+Use `GET /displays` to list available displays. `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch` accept `?screen=X`, where `X` is a zero-based display index. `screen=0` is the primary display when detectable, falling back to the first monitor reported by the capture backend. Invalid `screen` values fall back to `0`.
 
 ## Regions
 
@@ -12,6 +14,12 @@ Visual endpoints return a `region` object:
 
 This describes where the image sits in global desktop space.
 
+For a second display to the right of the primary display, `GET /screen?screen=1` might return:
+
+```json
+{"x": 1920, "y": 0, "width": 1920, "height": 1080}
+```
+
 ## Grid indexing
 
 - Rows/cols are **zero-based**
@@ -35,7 +43,7 @@ Interpretation:
 
 ## Recommended agent loop
 
-1. Capture `/screen` with coarse grid
+1. Capture `/screen?screen=0` with coarse grid, or choose another display with `/screen?screen=1`
 2. Find candidate cell
 3. If uncertain, use `/zoom` around candidate
 4. Convert target to grid action
diff --git a/examples/quickstart.py b/examples/quickstart.py
index 876d9d1..5aba923 100644
--- a/examples/quickstart.py
+++ b/examples/quickstart.py
@@ -5,6 +5,7 @@ import requests
 
 BASE_URL = os.getenv("CLICKTHROUGH_URL", "http://127.0.0.1:8123")
 TOKEN = os.getenv("CLICKTHROUGH_TOKEN", "")
+SCREEN = int(os.getenv("CLICKTHROUGH_SCREEN", "0"))
 
 headers = {}
 if TOKEN:
@@ -16,10 +17,14 @@ def main():
     r.raise_for_status()
     print("health:", r.json())
 
+    d = requests.get(f"{BASE_URL}/displays", headers=headers, timeout=10)
+    d.raise_for_status()
+    print("displays:", d.json().get("displays", []))
+
     s = requests.get(
         f"{BASE_URL}/screen",
         headers=headers,
-        params={"with_grid": True, "grid_rows": 12, "grid_cols": 12},
+        params={"screen": SCREEN, "with_grid": True, "grid_rows": 12, "grid_cols": 12},
         timeout=30,
     )
     s.raise_for_status()
diff --git a/server/app.py b/server/app.py
index 5726da2..fe19ce1 100644
--- a/server/app.py
+++ b/server/app.py
@@ -192,13 +192,73 @@ def _import_capture_libs():
         raise HTTPException(status_code=500, detail=f"capture backend unavailable: {exc}") from exc
 
 
-def _capture_screen():
+def _display_region(mon: dict, screen: int, mss_index: int, primary: bool) -> dict:
+    return {
+        "screen": screen,
+        "mss_index": mss_index,
+        "primary": primary,
+        "x": mon["left"],
+        "y": mon["top"],
+        "width": mon["width"],
+        "height": mon["height"],
+    }
+
+
+def _ordered_displays(sct) -> list[dict]:
+    raw_monitors = list(enumerate(sct.monitors[1:], start=1))
+    if not raw_monitors:
+        raise HTTPException(status_code=500, detail="no displays detected")
+
+    primary_pos = next(
+        (idx for idx, (_, mon) in enumerate(raw_monitors) if mon["left"] == 0 and mon["top"] == 0),
+        0,
+    )
+    ordered = [raw_monitors[primary_pos]] + [
+        item for idx, item in enumerate(raw_monitors) if idx != primary_pos
+    ]
+    return [
+        _display_region(mon, screen=index, mss_index=mss_index, primary=(index == 0))
+        for index, (mss_index, mon) in enumerate(ordered)
+    ]
+
+
+def _get_displays() -> list[dict]:
+    _, _, mss = _import_capture_libs()
+    with mss.mss() as sct:
+        return _ordered_displays(sct)
+
+
+def _select_display(screen: int) -> tuple[dict, list[dict], dict]:
+    displays = _get_displays()
+    selected = displays[screen] if 0 <= screen < len(displays) else displays[0]
+    selection = {
+        "requested": screen,
+        "selected": selected["screen"],
+        "fallback": selected["screen"] != screen,
+    }
+    return selected, displays, selection
+
+
+def _capture_screen(screen: int = 0):
     Image, _, mss = _import_capture_libs()
     with mss.mss() as sct:
-        mon = sct.monitors[1]
-        shot = sct.grab(mon)
+        displays = _ordered_displays(sct)
+        mon = displays[screen] if 0 <= screen < len(displays) else displays[0]
+        shot = sct.grab(
+            {
+                "left": mon["x"],
+                "top": mon["y"],
+                "width": mon["width"],
+                "height": mon["height"],
+            }
+        )
         image = Image.frombytes("RGB", shot.size, shot.rgb)
-        return image, {"x": mon["left"], "y": mon["top"], "width": mon["width"], "height": mon["height"]}
+        selection = {
+            "requested": screen,
+            "selected": mon["screen"],
+            "fallback": mon["screen"] != screen,
+        }
+        return image, mon, displays, selection
 
 
 def _serialize_image(image, image_format: str, jpeg_quality: int) -> bytes:
@@ -503,8 +563,9 @@ def _exec_command(req: ExecRequest) -> dict:
     }
 
 
-def _exec_action(req: ActionRequest) -> dict:
+def _exec_action(req: ActionRequest, screen: int = 0) -> dict:
     run_dry = SETTINGS["dry_run"] or req.dry_run
+    selected_display, displays, screen_selection = _select_display(screen)
 
     pyautogui = None if run_dry else _import_input_lib()
     resolved_target = None
@@ -561,6 +622,8 @@ def _exec_action(req: ActionRequest) -> dict:
         "action": req.action,
         "executed": not run_dry,
         "dry_run": run_dry,
+        "screen": screen_selection,
+        "display": selected_display,
         "resolved_target": resolved_target,
     }
 
@@ -585,6 +648,18 @@ def health(_: None = Depends(_auth)):
     }
 
 
+@app.get("/displays")
+def displays(_: None = Depends(_auth)):
+    detected = _get_displays()
+    return {
+        "ok": True,
+        "request_id": _request_id(),
+        "time_ms": _now_ms(),
+        "displays": detected,
+        "default_screen": 0,
+    }
+
+
 @app.get("/screen")
 def screen(
     with_grid: bool = True,
@@ -594,6 +669,7 @@ def screen(
     image_format: Literal["png", "jpeg"] = "png",
     jpeg_quality: int = 85,
     asImage: bool = False,
+    screen: int = 0,
     _: None = Depends(_auth),
 ):
     req = ScreenRequest(
@@ -605,8 +681,8 @@ def screen(
         jpeg_quality=jpeg_quality,
     )
 
-    base_img, mon = _capture_screen()
-    meta = {"region": mon}
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
+    meta = {"region": mon, "screen": screen_selection, "displays": displays}
     out_img = base_img
 
     if req.with_grid:
@@ -634,8 +710,8 @@ def screen(
 
 
 @app.post("/zoom")
-def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
-    base_img, mon = _capture_screen()
+def zoom(req: ZoomRequest, asImage: bool = False, screen: int = 0, _: None = Depends(_auth)):
+    base_img, mon, displays, screen_selection = _capture_screen(screen)
 
     cx = req.center_x - mon["x"]
     cy = req.center_y - mon["y"]
@@ -655,6 +731,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 
     meta = {
         "source_monitor": mon,
+        "screen": screen_selection,
+        "displays": displays,
         "region": {
             "x": region_x,
             "y": region_y,
@@ -690,8 +768,8 @@ def zoom(req: ZoomRequest, asImage: bool = False, _: None = Depends(_auth)):
 
 
 @app.post("/action")
-def action(req: ActionRequest, _: None = Depends(_auth)):
-    result = _exec_action(req)
+def action(req: ActionRequest, screen: int = 0, _: None = Depends(_auth)):
+    result = _exec_action(req, screen)
     return {
         "ok": True,
         "request_id": _request_id(),
@@ -722,14 +800,14 @@ def exec_command(
 
 
 @app.post("/ocr")
-def ocr(req: OCRRequest, _: None = Depends(_auth)):
+def ocr(req: OCRRequest, screen: int = 0, _: None = Depends(_auth)):
     source = req.mode
     if source == "image":
         image = _decode_image_base64(req.image_base64 or "")
         region = {"x": 0, "y": 0, "width": image.size[0], "height": image.size[1]}
         blocks = _run_ocr(image, req.language_hint, req.min_confidence, 0, 0)
     else:
-        base_img, mon = _capture_screen()
+        base_img, mon, displays, screen_selection = _capture_screen(screen)
         if source == "screen":
             image = base_img
             region = {"x": mon["x"], "y": mon["y"], "width": mon["width"], "height": mon["height"]}
@@ -762,6 +840,8 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
         "time_ms": _now_ms(),
         "result": {
             "mode": source,
+            "screen": screen_selection if source != "image" else None,
+            "display": mon if source != "image" else None,
             "language_hint": req.language_hint,
             "min_confidence": req.min_confidence,
             "region": region,
@@ -771,11 +851,11 @@ def ocr(req: OCRRequest, _: None = Depends(_auth)):
 
 
 @app.post("/batch")
-def batch(req: BatchRequest, _: None = Depends(_auth)):
+def batch(req: BatchRequest, screen: int = 0, _: None = Depends(_auth)):
     results = []
     for index, item in enumerate(req.actions):
         try:
-            item_result = _exec_action(item)
+            item_result = _exec_action(item, screen)
             results.append({"index": index, "ok": True, "result": item_result})
         except Exception as exc:
             results.append({"index": index, "ok": False, "error": str(exc)})
diff --git a/skill/SKILL.md b/skill/SKILL.md
index fc11c9a..6a06972 100644
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@@ -33,13 +33,20 @@ The agent should not assume it can self-install this stack.
 ## Mini API map
 
 - `GET /health` → server status + safety flags
-- `GET /screen` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
-- `POST /zoom` → cropped screenshot around point/region (also supports `asImage=true`)
+- `GET /displays` → detected displays in zero-based API order
+- `GET /screen?screen=0` → full screenshot (JSON with base64 by default, or raw image with `asImage=true`)
+- `POST /zoom?screen=0` → cropped screenshot around point/region (also supports `asImage=true`)
 - `POST /ocr` → text extraction with bounding boxes from full screen, region, or provided image bytes
-- `POST /action` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
-- `POST /batch` → sequential action list
+- `POST /action?screen=0` → single interaction (`move`, `click`, `scroll`, `type`, `hotkey`, ...)
+- `POST /batch?screen=0` → sequential action list
 - `POST /exec` → PowerShell/Bash/CMD command execution (requires configured exec secret + header)
 
+### Display selection
+
+- Use `GET /displays` before operating on multi-monitor systems.
+- Use `?screen=X` on `/screen`, `/zoom`, `/ocr`, `/action`, and `/batch`; invalid values fall back to `screen=0`.
+- Treat returned `region` and OCR bounding boxes as global desktop coordinates, not screen-local coordinates.
+
 ### OCR usage
 
 - Prefer `POST /ocr` when targeting text-heavy UI (menus, labels, buttons, dialogs).
@@ -55,7 +62,7 @@ The agent should not assume it can self-install this stack.
 
 ## Core workflow (mandatory)
 
-1. Call `GET /screen` with coarse grid (e.g., 12x12).
+1. Call `GET /screen?screen=0` with coarse grid (e.g., 12x12), or pass another `screen` index to target a different display.
 2. Identify likely target region and compute an initial confidence score.
 3. If confidence < 0.85, call `POST /zoom` with denser grid (e.g., 20x20) and re-evaluate.
 4. **Before any click**, verify target identity (OCR text/icon/location consistency).