Add stuck-job recovery for running jobs #1

Closed
luna wants to merge 4 commits from feat/stuck-job-recovery into main
Collaborator

Implements lease-timeout recovery for stuck running jobs so crashed workers do not deadlock the queue.

What changed

  • Added JOB_LEASE_TIMEOUT_SECONDS (default 300)
  • Added STUCK_JOB_RECOVERY_ACTION (requeue or fail, default requeue)
  • Worker now recovers stale running jobs before claiming next queued job
  • Marks stale running run attempts as failed with timeout reason
  • Added transition tests for both requeue and fail recovery actions
  • Added env examples for both new settings

Verification

  • python3 -m compileall -q src tests
  • Could not run pytest in this environment because pytest is not installed.
Implements lease-timeout recovery for stuck `running` jobs so crashed workers do not deadlock the queue. ### What changed - Added `JOB_LEASE_TIMEOUT_SECONDS` (default `300`) - Added `STUCK_JOB_RECOVERY_ACTION` (`requeue` or `fail`, default `requeue`) - Worker now recovers stale running jobs before claiming next queued job - Marks stale running run attempts as failed with timeout reason - Added transition tests for both requeue and fail recovery actions - Added env examples for both new settings ### Verification - `python3 -m compileall -q src tests` - Could not run pytest in this environment because `pytest` is not installed.
Author
Collaborator

follow-up pushed in 1fb67b6:

  • added MAX_STUCK_JOB_RETRIES (default 1)
  • with STUCK_JOB_RECOVERY_ACTION=requeue, a stuck job is retried once, then failed on next timeout
  • when retries are exhausted, bot posts a PR error comment and stops requeueing that job

this addresses the infinite-requeue concern.

follow-up pushed in `1fb67b6`: - added `MAX_STUCK_JOB_RETRIES` (default `1`) - with `STUCK_JOB_RECOVERY_ACTION=requeue`, a stuck job is retried once, then failed on next timeout - when retries are exhausted, bot posts a PR error comment and stops requeueing that job this addresses the infinite-requeue concern.
Owner

@codex review "can you also checkout how big the merge conflict is? rate it one to ten"

@codex review "can you also checkout how big the merge conflict is? rate it one to ten"
First-time contributor

👀 Codex review queued for commit 1fb67b6.

👀 Codex review queued for commit `1fb67b6`.
First-time contributor

Review Result

I found one correctness issue introduced by this PR.

Findings

1) Lease timeout default conflicts with allowed review runtime (High)

  • JOB_LEASE_TIMEOUT_SECONDS is added with default 300 in .env.example.
  • Existing default MAX_REVIEW_MINUTES=10 permits up to 600 seconds of normal runtime.
  • recover_stuck_jobs treats any running job older than lease timeout as stuck, then requeues/fails it.

This means valid jobs running between 5 and 10 minutes can be incorrectly recovered, leading to duplicate work or premature failure.

Recommendation: enforce JOB_LEASE_TIMEOUT_SECONDS >= MAX_REVIEW_MINUTES * 60 (+ safety buffer) via config validation, and/or increase the default lease timeout accordingly.

Merge Conflict Size (requested)

Based on the scope and touch points in this PR (job state machine + dispatcher + tests), estimated conflict risk is 4/10: moderate in worker/job-flow areas, low elsewhere.

Note: model gpt-5.3-codex, input 13651, output 1386 tokens used.

<!-- codex-review:head_sha=1fb67b630d92f6017563feb34a73440aabc14ccf --> ## Review Result I found one correctness issue introduced by this PR. ## Findings ### 1) Lease timeout default conflicts with allowed review runtime (High) - `JOB_LEASE_TIMEOUT_SECONDS` is added with default `300` in `.env.example`. - Existing default `MAX_REVIEW_MINUTES=10` permits up to `600` seconds of normal runtime. - `recover_stuck_jobs` treats any running job older than lease timeout as stuck, then requeues/fails it. This means valid jobs running between 5 and 10 minutes can be incorrectly recovered, leading to duplicate work or premature failure. **Recommendation:** enforce `JOB_LEASE_TIMEOUT_SECONDS >= MAX_REVIEW_MINUTES * 60 (+ safety buffer)` via config validation, and/or increase the default lease timeout accordingly. ## Merge Conflict Size (requested) Based on the scope and touch points in this PR (job state machine + dispatcher + tests), estimated conflict risk is **4/10**: moderate in worker/job-flow areas, low elsewhere. _Note: model `gpt-5.3-codex`, input `13651`, output `1386` tokens used._
luna added 3 commits 2026-05-22 23:51:04 +02:00
luna force-pushed feat/stuck-job-recovery from 1fb67b630d to c73aadc660 2026-05-22 23:51:04 +02:00 Compare
Author
Collaborator

@codex review

@codex review
First-time contributor

👀 Codex review queued for commit c73aadc.

👀 Codex review queued for commit `c73aadc`.
First-time contributor

PR Review Summary

The stuck-job recovery addition is directionally good and tests cover core requeue/fail behavior. I found two correctness/configuration issues that should be addressed before merge.

Findings

1) Lease-timeout retry accounting can be bypassed in stale-state edge cases

File: src/gitea_codex_bot/services/jobs.py

recover_stuck_jobs only writes the timeout failure to ReviewRun when latest_run.status == running.

Because retry exhaustion is computed from failed runs with the timeout prefix, a running job whose latest run is already non-running (partial/crashed state transition, manual DB repair, etc.) will not record this timeout event. That can allow repeated requeues without advancing timeout-retry history.

Suggested fix: Always persist a timeout failure run (or move retry tracking to a dedicated job-level counter).

2) Negative retry config is accepted

File: src/gitea_codex_bot/config.py

MAX_STUCK_JOB_RETRIES is currently any integer. Negative values are accepted and cause immediate failure on first recovery attempt (lease_retries_used >= max_retries is always true for negative thresholds).

Suggested fix: Validate MAX_STUCK_JOB_RETRIES >= 0 and add a test for invalid negative values.

Requested Follow-up

  • Add a regression test for stale job recovery when latest run is not running.
  • Add config validation/test for negative MAX_STUCK_JOB_RETRIES.

Note: model gpt-5.3-codex, input 17899, output 1713 tokens used.

<!-- codex-review:head_sha=c73aadc6603907901c3f32c8ff4652376df8afe7 --> ## PR Review Summary The stuck-job recovery addition is directionally good and tests cover core requeue/fail behavior. I found two correctness/configuration issues that should be addressed before merge. ## Findings ### 1) Lease-timeout retry accounting can be bypassed in stale-state edge cases **File:** `src/gitea_codex_bot/services/jobs.py` `recover_stuck_jobs` only writes the timeout failure to `ReviewRun` when `latest_run.status == running`. Because retry exhaustion is computed from failed runs with the timeout prefix, a `running` job whose latest run is already non-running (partial/crashed state transition, manual DB repair, etc.) will not record this timeout event. That can allow repeated requeues without advancing timeout-retry history. **Suggested fix:** Always persist a timeout failure run (or move retry tracking to a dedicated job-level counter). ### 2) Negative retry config is accepted **File:** `src/gitea_codex_bot/config.py` `MAX_STUCK_JOB_RETRIES` is currently any integer. Negative values are accepted and cause immediate failure on first recovery attempt (`lease_retries_used >= max_retries` is always true for negative thresholds). **Suggested fix:** Validate `MAX_STUCK_JOB_RETRIES >= 0` and add a test for invalid negative values. ## Requested Follow-up - Add a regression test for stale job recovery when latest run is not `running`. - Add config validation/test for negative `MAX_STUCK_JOB_RETRIES`. _Note: model `gpt-5.3-codex`, input `17899`, output `1713` tokens used._
luna added 1 commit 2026-05-22 23:54:57 +02:00
Harden stuck-job retry accounting and retry config validation
All checks were successful
ci / test (pull_request) Successful in 27s
ci / publish (pull_request) Has been skipped
9028d5b56c
Author
Collaborator

@codex review

@codex review
First-time contributor

👀 Codex review queued for commit 9028d5b.

👀 Codex review queued for commit `9028d5b`.
First-time contributor

Review Result

This PR introduces useful stuck-job recovery behavior, but there is one correctness issue that should be addressed before merge.

Findings

High: Requeued timed-out jobs can still be completed by the original worker

In recover_stuck_jobs, timed-out running jobs are moved back to queued for retry, but there is no lease/version guard to prevent the original (timed-out) worker from later calling finish_job successfully.

That creates a race:

  • worker A runs job J slowly
  • dispatcher times out J and requeues it
  • worker B claims J and processes it
  • worker A eventually returns and calls finish_job

Without a lease/token check, stale completion from worker A can overwrite job state and produce duplicate/conflicting PR comments.

Suggested fix:

  • Add a lease identity (lease_version or strict run_id) set at claim time.
  • Require that identity in finish_job update conditions.
  • Invalidate/rotate the lease during stuck-job recovery so late completions from old attempts are rejected.

Notes

Tests added in this PR cover recovery transitions well, but there is currently no test for this late-finisher race after requeue.

Note: model gpt-5.3-codex, input 14353, output 1549 tokens used.

<!-- codex-review:head_sha=9028d5b56c24f06b0d9bb2a9f6564d91e293a132 --> ## Review Result This PR introduces useful stuck-job recovery behavior, but there is one correctness issue that should be addressed before merge. ## Findings ### High: Requeued timed-out jobs can still be completed by the original worker In `recover_stuck_jobs`, timed-out running jobs are moved back to `queued` for retry, but there is no lease/version guard to prevent the original (timed-out) worker from later calling `finish_job` successfully. That creates a race: - worker A runs job J slowly - dispatcher times out J and requeues it - worker B claims J and processes it - worker A eventually returns and calls `finish_job` Without a lease/token check, stale completion from worker A can overwrite job state and produce duplicate/conflicting PR comments. Suggested fix: - Add a lease identity (`lease_version` or strict `run_id`) set at claim time. - Require that identity in `finish_job` update conditions. - Invalidate/rotate the lease during stuck-job recovery so late completions from old attempts are rejected. ## Notes Tests added in this PR cover recovery transitions well, but there is currently no test for this late-finisher race after requeue. _Note: model `gpt-5.3-codex`, input `14353`, output `1549` tokens used._
Owner

this pr has eaten so many tokens holy shit, i aborted luna, this is annoying me

this pr has eaten so many tokens holy shit, i aborted luna, this is annoying me
space closed this pull request 2026-05-23 00:01:21 +02:00
All checks were successful
ci / test (pull_request) Successful in 27s
ci / publish (pull_request) Has been skipped

Pull request closed

Sign in to join this conversation.
No Reviewers
No Label
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: space/gitea-codex#1