Add stuck-job recovery for running jobs #1
Implements lease-timeout recovery for stuck `running` jobs so crashed workers do not deadlock the queue.

What changed
- `JOB_LEASE_TIMEOUT_SECONDS` (default `300`)
- `STUCK_JOB_RECOVERY_ACTION` (`requeue` or `fail`, default `requeue`)

Verification
- `python3 -m compileall -q src tests`
- `pytest` is not installed.

Follow-up pushed in `1fb67b6`:
- `MAX_STUCK_JOB_RETRIES` (default `1`)
- With `STUCK_JOB_RECOVERY_ACTION=requeue`, a stuck job is retried once, then failed on the next timeout

This addresses the infinite-requeue concern.
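For readers skimming the thread, the behavior described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's code: the `Job` dataclass, its `timeout_retries` field, and the function signature are assumptions made for the example; only the three setting names and their defaults come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical in-memory stand-in for a queue row.
    id: int
    status: str              # "queued", "running", or "failed"
    started_at: float        # epoch seconds when the job was claimed
    timeout_retries: int = 0


def recover_stuck_jobs_sketch(
    jobs: List[Job],
    now: float,
    lease_timeout: float = 300,   # JOB_LEASE_TIMEOUT_SECONDS
    action: str = "requeue",      # STUCK_JOB_RECOVERY_ACTION
    max_retries: int = 1,         # MAX_STUCK_JOB_RETRIES
) -> List[int]:
    """Requeue or fail running jobs whose lease expired; return their ids."""
    recovered = []
    for job in jobs:
        if job.status != "running" or now - job.started_at <= lease_timeout:
            continue  # not running, or still within its lease
        if action == "requeue" and job.timeout_retries < max_retries:
            job.timeout_retries += 1
            job.status = "queued"
        else:
            # Either action == "fail" or the retry budget is spent.
            job.status = "failed"
        recovered.append(job.id)
    return recovered
```

With the defaults, a job that times out once goes back to `queued`; if it is claimed again and times out a second time, it is marked `failed`.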
@codex review "can you also check out how big the merge conflict is? rate it one to ten"
👀 Codex review queued for commit `1fb67b6`.

Review Result
I found one correctness issue introduced by this PR.
Findings
1) Lease timeout default conflicts with allowed review runtime (High)
- `JOB_LEASE_TIMEOUT_SECONDS` is added with default `300` in `.env.example`.
- `MAX_REVIEW_MINUTES=10` permits up to `600` seconds of normal runtime.
- `recover_stuck_jobs` treats any running job older than the lease timeout as stuck, then requeues/fails it.

This means valid jobs running between 5 and 10 minutes can be incorrectly recovered, leading to duplicate work or premature failure.

Recommendation: enforce `JOB_LEASE_TIMEOUT_SECONDS >= MAX_REVIEW_MINUTES * 60 (+ safety buffer)` via config validation, and/or increase the default lease timeout accordingly.

Merge Conflict Size (requested)
Based on the scope and touch points in this PR (job state machine + dispatcher + tests), estimated conflict risk is 4/10: moderate in worker/job-flow areas, low elsewhere.
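The config check recommended in finding 1 could look something like the sketch below. The function name, its parameters, and the 60-second buffer are assumptions for illustration; the review only asks that the lease timeout be at least `MAX_REVIEW_MINUTES * 60` plus some safety margin.

```python
def validate_lease_timeout(
    lease_timeout_seconds: int,
    max_review_minutes: int,
    buffer_seconds: int = 60,  # assumed safety margin, not from the PR
) -> None:
    """Raise ValueError if a healthy review could outlive its lease."""
    minimum = max_review_minutes * 60 + buffer_seconds
    if lease_timeout_seconds < minimum:
        raise ValueError(
            f"JOB_LEASE_TIMEOUT_SECONDS={lease_timeout_seconds} is below "
            f"the minimum of {minimum} implied by "
            f"MAX_REVIEW_MINUTES={max_review_minutes}"
        )
```

Under these assumptions, the current defaults (`300` and `10`) would be rejected at startup, while a lease timeout of `660` or more would pass.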
Note: model `gpt-5.3-codex`, input `13651`, output `1386` tokens used.

`1fb67b630d` to `c73aadc660`

@codex review
👀 Codex review queued for commit `c73aadc`.

PR Review Summary
The stuck-job recovery addition is directionally good and tests cover core requeue/fail behavior. I found two correctness/configuration issues that should be addressed before merge.
Findings
1) Lease-timeout retry accounting can be bypassed in stale-state edge cases
File: `src/gitea_codex_bot/services/jobs.py`

`recover_stuck_jobs` only writes the timeout failure to `ReviewRun` when `latest_run.status == running`.

Because retry exhaustion is computed from failed runs with the timeout prefix, a `running` job whose latest run is already non-running (partial/crashed state transition, manual DB repair, etc.) will not record this timeout event. That can allow repeated requeues without advancing timeout-retry history.

Suggested fix: Always persist a timeout failure run (or move retry tracking to a dedicated job-level counter).
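A rough sketch of the first suggested fix (always persisting a timeout failure run) is shown below. The dataclasses, the `TIMEOUT_PREFIX` value, and the helper names are hypothetical stand-ins; only `ReviewRun` and the idea of counting failed runs by a timeout prefix come from the finding.

```python
from dataclasses import dataclass, field
from typing import List

TIMEOUT_PREFIX = "lease timeout"  # hypothetical marker for timeout failures


@dataclass
class ReviewRun:
    status: str
    error: str = ""


@dataclass
class Job:
    status: str
    runs: List[ReviewRun] = field(default_factory=list)


def timeout_retries_used(job: Job) -> int:
    """Count failed runs whose error carries the timeout prefix."""
    return sum(
        1 for run in job.runs
        if run.status == "failed" and run.error.startswith(TIMEOUT_PREFIX)
    )


def recover(job: Job, max_retries: int = 1) -> None:
    """Record the timeout unconditionally, then requeue or fail the job."""
    used = timeout_retries_used(job)
    latest = job.runs[-1] if job.runs else None
    if latest is not None and latest.status == "running":
        latest.status = "failed"
        latest.error = f"{TIMEOUT_PREFIX}: worker did not finish"
    else:
        # Stale state: no running run to mark, so append one so the
        # timeout still shows up in the retry history.
        job.runs.append(
            ReviewRun("failed", f"{TIMEOUT_PREFIX}: no running run found")
        )
    job.status = "queued" if used < max_retries else "failed"
```

In this sketch, a `running` job whose latest run already failed for another reason still gains a timeout entry on its first recovery, so its second timeout exhausts the budget instead of requeueing again.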
2) Negative retry config is accepted
File: `src/gitea_codex_bot/config.py`

`MAX_STUCK_JOB_RETRIES` is currently any integer. Negative values are accepted and cause immediate failure on first recovery attempt (`lease_retries_used >= max_retries` is always true for negative thresholds).

Suggested fix: Validate `MAX_STUCK_JOB_RETRIES >= 0` and add a test for invalid negative values.

Requested Follow-up
- … `running`.
- … `MAX_STUCK_JOB_RETRIES`.

Note: model `gpt-5.3-codex`, input `17899`, output `1713` tokens used.

@codex review
👀 Codex review queued for commit `9028d5b`.

Review Result
This PR introduces useful stuck-job recovery behavior, but there is one correctness issue that should be addressed before merge.
Findings
High: Requeued timed-out jobs can still be completed by the original worker
In `recover_stuck_jobs`, timed-out running jobs are moved back to `queued` for retry, but there is no lease/version guard to prevent the original (timed-out) worker from later calling `finish_job` successfully.

That creates a race:
- … `finish_job`

Without a lease/token check, stale completion from worker A can overwrite job state and produce duplicate/conflicting PR comments.

Suggested fix:
- … `lease_version` or strict `run_id`) set at claim time.
- … `finish_job` update conditions.

Notes
Tests added in this PR cover recovery transitions well, but there is currently no test for this late-finisher race after requeue.
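One way to picture both the guard and the missing test is the in-memory sketch below. It assumes a `lease_version` integer bumped at claim time and checked in `finish_job`; the `Job` dataclass and function signatures are illustrative only, and in a real database the check would presumably be part of the update's condition rather than a Python `if`.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical in-memory job; field names are assumptions.
    status: str = "queued"
    lease_version: int = 0
    result: str = ""


def claim_job(job: Job) -> int:
    """Mark the job running and hand the worker a fresh lease token."""
    job.status = "running"
    job.lease_version += 1
    return job.lease_version


def requeue_stuck_job(job: Job) -> None:
    """Recovery path: put a timed-out job back on the queue."""
    job.status = "queued"


def finish_job(job: Job, lease_version: int, result: str) -> bool:
    """Complete the job only if the caller still holds the current lease."""
    if job.status != "running" or job.lease_version != lease_version:
        return False  # stale worker: ignore its completion
    job.status = "done"
    job.result = result
    return True


# Late-finisher scenario: worker A times out, worker B reclaims the job.
job = Job()
token_a = claim_job(job)
requeue_stuck_job(job)
token_b = claim_job(job)
stale_accepted = finish_job(job, token_a, "result from A")
fresh_accepted = finish_job(job, token_b, "result from B")
```

Here `stale_accepted` is `False` and `fresh_accepted` is `True`, so only worker B's result is stored. A test for the race could assert exactly that.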
Note: model `gpt-5.3-codex`, input `14353`, output `1549` tokens used.

this pr has eaten so many tokens holy shit, i aborted luna, this is annoying me
Pull request closed