Add stuck-job recovery for running jobs #1

luna · 2026-05-22T23:11:08+02:00

luna commented

2026-05-22 23:11:08 +02:00

Implements lease-timeout recovery for stuck running jobs so crashed workers do not deadlock the queue.

What changed

Added JOB_LEASE_TIMEOUT_SECONDS (default 300)
Added STUCK_JOB_RECOVERY_ACTION (requeue or fail, default requeue)
Worker now recovers stale running jobs before claiming next queued job
Marks stale running run attempts as failed with timeout reason
Added transition tests for both requeue and fail recovery actions
Added env examples for both new settings

Verification

python3 -m compileall -q src tests
Could not run pytest in this environment because pytest is not installed.

Implements lease-timeout recovery for stuck `running` jobs so crashed workers do not deadlock the queue. ### What changed - Added `JOB_LEASE_TIMEOUT_SECONDS` (default `300`) - Added `STUCK_JOB_RECOVERY_ACTION` (`requeue` or `fail`, default `requeue`) - Worker now recovers stale running jobs before claiming next queued job - Marks stale running run attempts as failed with timeout reason - Added transition tests for both requeue and fail recovery actions - Added env examples for both new settings ### Verification - `python3 -m compileall -q src tests` - Could not run pytest in this environment because `pytest` is not installed.

luna commented

2026-05-22 23:26:06 +02:00

follow-up pushed in 1fb67b6:

added MAX_STUCK_JOB_RETRIES (default 1)
with STUCK_JOB_RECOVERY_ACTION=requeue, a stuck job is retried once, then failed on next timeout
when retries are exhausted, bot posts a PR error comment and stops requeueing that job

this addresses the infinite-requeue concern.

follow-up pushed in `1fb67b6`: - added `MAX_STUCK_JOB_RETRIES` (default `1`) - with `STUCK_JOB_RECOVERY_ACTION=requeue`, a stuck job is retried once, then failed on next timeout - when retries are exhausted, bot posts a PR error comment and stops requeueing that job this addresses the infinite-requeue concern.

space commented

2026-05-22 23:46:02 +02:00

@codex review "can you also checkout how big the merge conflict is? rate it one to ten"

gitea-codex commented

2026-05-22 23:46:03 +02:00

First-time contributor

👀 Codex review queued for commit 1fb67b6.

👀 Codex review queued for commit `1fb67b6`.

gitea-codex commented

2026-05-22 23:47:16 +02:00

First-time contributor

Review Result

I found one correctness issue introduced by this PR.

Findings

1) Lease timeout default conflicts with allowed review runtime (High)

JOB_LEASE_TIMEOUT_SECONDS is added with default 300 in .env.example.
Existing default MAX_REVIEW_MINUTES=10 permits up to 600 seconds of normal runtime.
recover_stuck_jobs treats any running job older than lease timeout as stuck, then requeues/fails it.

This means valid jobs running between 5 and 10 minutes can be incorrectly recovered, leading to duplicate work or premature failure.

Recommendation: enforce JOB_LEASE_TIMEOUT_SECONDS >= MAX_REVIEW_MINUTES * 60 (+ safety buffer) via config validation, and/or increase the default lease timeout accordingly.

Merge Conflict Size (requested)

Based on the scope and touch points in this PR (job state machine + dispatcher + tests), estimated conflict risk is 4/10: moderate in worker/job-flow areas, low elsewhere.

Note: model gpt-5.3-codex, input 13651, output 1386 tokens used.

## Review Result I found one correctness issue introduced by this PR. ## Findings ### 1) Lease timeout default conflicts with allowed review runtime (High) - `JOB_LEASE_TIMEOUT_SECONDS` is added with default `300` in `.env.example`. - Existing default `MAX_REVIEW_MINUTES=10` permits up to `600` seconds of normal runtime. - `recover_stuck_jobs` treats any running job older than lease timeout as stuck, then requeues/fails it. This means valid jobs running between 5 and 10 minutes can be incorrectly recovered, leading to duplicate work or premature failure. **Recommendation:** enforce `JOB_LEASE_TIMEOUT_SECONDS >= MAX_REVIEW_MINUTES * 60 (+ safety buffer)` via config validation, and/or increase the default lease timeout accordingly. ## Merge Conflict Size (requested) Based on the scope and touch points in this PR (job state machine + dispatcher + tests), estimated conflict risk is **4/10**: moderate in worker/job-flow areas, low elsewhere. _Note: model `gpt-5.3-codex`, input `13651`, output `1386` tokens used._

luna added 3 commits 2026-05-22 23:51:04 +02:00

Add stuck running job recovery with lease timeout dfd67c1cce

Cap stuck-job requeue retries and notify on exhaustion d24b4f4f79

Validate lease timeout against max review runtime

ci / test (pull_request) Successful in 30s

Details

ci / publish (pull_request) Has been skipped

Details

c73aadc660

luna force-pushed feat/stuck-job-recovery from 1fb67b630d to c73aadc660

2026-05-22 23:51:04 +02:00

Compare

luna commented

2026-05-22 23:51:09 +02:00

@codex review

gitea-codex commented

2026-05-22 23:51:11 +02:00

First-time contributor

👀 Codex review queued for commit c73aadc.

👀 Codex review queued for commit `c73aadc`.

gitea-codex commented

2026-05-22 23:52:35 +02:00

First-time contributor

PR Review Summary

The stuck-job recovery addition is directionally good and tests cover core requeue/fail behavior. I found two correctness/configuration issues that should be addressed before merge.

Findings

1) Lease-timeout retry accounting can be bypassed in stale-state edge cases

File: src/gitea_codex_bot/services/jobs.py

recover_stuck_jobs only writes the timeout failure to ReviewRun when latest_run.status == running.

Because retry exhaustion is computed from failed runs with the timeout prefix, a running job whose latest run is already non-running (partial/crashed state transition, manual DB repair, etc.) will not record this timeout event. That can allow repeated requeues without advancing timeout-retry history.

Suggested fix: Always persist a timeout failure run (or move retry tracking to a dedicated job-level counter).

2) Negative retry config is accepted

File: src/gitea_codex_bot/config.py

MAX_STUCK_JOB_RETRIES is currently any integer. Negative values are accepted and cause immediate failure on first recovery attempt (lease_retries_used >= max_retries is always true for negative thresholds).

Suggested fix: Validate MAX_STUCK_JOB_RETRIES >= 0 and add a test for invalid negative values.

Requested Follow-up

Add a regression test for stale job recovery when latest run is not running.
Add config validation/test for negative MAX_STUCK_JOB_RETRIES.

Note: model gpt-5.3-codex, input 17899, output 1713 tokens used.

## PR Review Summary The stuck-job recovery addition is directionally good and tests cover core requeue/fail behavior. I found two correctness/configuration issues that should be addressed before merge. ## Findings ### 1) Lease-timeout retry accounting can be bypassed in stale-state edge cases **File:** `src/gitea_codex_bot/services/jobs.py` `recover_stuck_jobs` only writes the timeout failure to `ReviewRun` when `latest_run.status == running`. Because retry exhaustion is computed from failed runs with the timeout prefix, a `running` job whose latest run is already non-running (partial/crashed state transition, manual DB repair, etc.) will not record this timeout event. That can allow repeated requeues without advancing timeout-retry history. **Suggested fix:** Always persist a timeout failure run (or move retry tracking to a dedicated job-level counter). ### 2) Negative retry config is accepted **File:** `src/gitea_codex_bot/config.py` `MAX_STUCK_JOB_RETRIES` is currently any integer. Negative values are accepted and cause immediate failure on first recovery attempt (`lease_retries_used >= max_retries` is always true for negative thresholds). **Suggested fix:** Validate `MAX_STUCK_JOB_RETRIES >= 0` and add a test for invalid negative values. ## Requested Follow-up - Add a regression test for stale job recovery when latest run is not `running`. - Add config validation/test for negative `MAX_STUCK_JOB_RETRIES`. _Note: model `gpt-5.3-codex`, input `17899`, output `1713` tokens used._

luna added 1 commit 2026-05-22 23:54:57 +02:00

Harden stuck-job retry accounting and retry config validation

ci / test (pull_request) Successful in 27s

Details

ci / publish (pull_request) Has been skipped

Details

9028d5b56c

luna commented

2026-05-22 23:55:03 +02:00

@codex review

gitea-codex commented

2026-05-22 23:55:06 +02:00

First-time contributor

👀 Codex review queued for commit 9028d5b.

👀 Codex review queued for commit `9028d5b`.

gitea-codex commented

2026-05-22 23:55:49 +02:00

First-time contributor

Review Result

This PR introduces useful stuck-job recovery behavior, but there is one correctness issue that should be addressed before merge.

Findings

High: Requeued timed-out jobs can still be completed by the original worker

In recover_stuck_jobs, timed-out running jobs are moved back to queued for retry, but there is no lease/version guard to prevent the original (timed-out) worker from later calling finish_job successfully.

That creates a race:

worker A runs job J slowly
dispatcher times out J and requeues it
worker B claims J and processes it
worker A eventually returns and calls finish_job

Without a lease/token check, stale completion from worker A can overwrite job state and produce duplicate/conflicting PR comments.

Suggested fix:

Add a lease identity (lease_version or strict run_id) set at claim time.
Require that identity in finish_job update conditions.
Invalidate/rotate the lease during stuck-job recovery so late completions from old attempts are rejected.

Notes

Tests added in this PR cover recovery transitions well, but there is currently no test for this late-finisher race after requeue.

Note: model gpt-5.3-codex, input 14353, output 1549 tokens used.

## Review Result This PR introduces useful stuck-job recovery behavior, but there is one correctness issue that should be addressed before merge. ## Findings ### High: Requeued timed-out jobs can still be completed by the original worker In `recover_stuck_jobs`, timed-out running jobs are moved back to `queued` for retry, but there is no lease/version guard to prevent the original (timed-out) worker from later calling `finish_job` successfully. That creates a race: - worker A runs job J slowly - dispatcher times out J and requeues it - worker B claims J and processes it - worker A eventually returns and calls `finish_job` Without a lease/token check, stale completion from worker A can overwrite job state and produce duplicate/conflicting PR comments. Suggested fix: - Add a lease identity (`lease_version` or strict `run_id`) set at claim time. - Require that identity in `finish_job` update conditions. - Invalidate/rotate the lease during stuck-job recovery so late completions from old attempts are rejected. ## Notes Tests added in this PR cover recovery transitions well, but there is currently no test for this late-finisher race after requeue. _Note: model `gpt-5.3-codex`, input `14353`, output `1549` tokens used._

space commented

2026-05-23 00:01:21 +02:00

this pr has eaten so many tokens holy shit, i aborted luna, this is annoying me

space closed this pull request

2026-05-23 00:01:21 +02:00

ci / test (pull_request) Successful in 27s

Details

ci / publish (pull_request) Has been skipped

Details

Pull request closed

This pull request cannot be reopened because the branch was deleted.

Sign in to join this conversation.

3 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: space/gitea-codex#1