
fix(cleanupFailedJobs): delete orphaned pending jobs in addition to failed ones #561

Merged
kelvinkipruto merged 4 commits into main from ft/cleanup-orphaned-jobs on Mar 10, 2026

Conversation

@kelvinkipruto
Contributor

Description

Previously, the cleanup task only removed failed jobs older than the retention window. This left orphaned pending jobs (never attempted, not scheduled, and older than retention) to accumulate indefinitely, especially in unused queues like "default".

Now the task also deletes these orphaned jobs, using parallel counting and deletion for efficiency. The output now distinguishes between deleted failed and deleted orphaned job counts.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

…ailed ones

@vercel

vercel bot commented Mar 10, 2026

The latest updates on your projects.

Project | Deployment | Actions | Updated (UTC)
PromiseTracker | Ready | Preview, Comment | Mar 10, 2026 2:50pm


@kelvinkipruto
Contributor Author

@claude Review

@claude

claude bot commented Mar 10, 2026

Claude finished @kelvinkipruto's task in 1m 37s (View job)


Code Review

  • Read changed files
  • Analyze code for bugs, performance, and correctness
  • Post review feedback

Overall the change is well-structured and the logic is sound. A few things worth noting:


Correctness

The orphanedWhere query correctly identifies stranded jobs using:

  • processing: false — not actively running
  • hasError: false — no error state
  • totalTried: 0 — never attempted
  • createdAt < cutoffDate — older than retention window
  • waitUntil doesn't exist or is in the past — not scheduled for the future

This is a correct and safe set of conditions. A job with a future waitUntil (even if old) won't be deleted, which is the right call.
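As a standalone sketch of these conditions (plain TypeScript with no Payload dependency; the field names mirror the payload-jobs fields listed above), the predicate would look like:

```typescript
// Minimal job shape covering only the fields the predicate inspects.
interface JobRow {
  processing: boolean;
  hasError: boolean;
  totalTried: number;
  createdAt: Date;
  waitUntil?: Date; // absent means "not scheduled"
}

// A job is orphaned when it was never attempted, is not running or errored,
// predates the retention cutoff, and is not scheduled for the future.
function isOrphaned(job: JobRow, cutoffDate: Date, now: Date): boolean {
  return (
    !job.processing &&
    !job.hasError &&
    job.totalTried === 0 &&
    job.createdAt < cutoffDate &&
    (job.waitUntil === undefined || job.waitUntil < now)
  );
}

const now = new Date("2026-03-10T14:00:00Z");
const cutoff = new Date("2026-03-09T14:00:00Z"); // 24h retention

// Old, never-attempted job with no schedule: orphaned.
const stale: JobRow = { processing: false, hasError: false, totalTried: 0, createdAt: new Date("2026-03-01T00:00:00Z") };
console.log(isOrphaned(stale, cutoff, now)); // true

// Same job but scheduled for the future: kept.
const scheduled: JobRow = { ...stale, waitUntil: new Date("2026-04-01T00:00:00Z") };
console.log(isOrphaned(scheduled, cutoff, now)); // false
```

The second call is exactly the case the review highlights: an old job with a future waitUntil survives cleanup.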


Issues

1. Short-circuit && pattern in Promise.all is misleading (cleanupFailedJobs.ts:72-85)

await Promise.all([
  failedCount && payload.delete(...),   // evaluates to `0` when falsy
  orphanedCount && payload.delete(...), // evaluates to `0` when falsy
]);

When failedCount is 0, the expression evaluates to 0 (not a Promise). Promise.all wraps non-Promise values via Promise.resolve(), so this is technically correct — but it's a subtle pattern that could confuse readers and could mask bugs if the type of failedCount ever changes. Consider using an explicit conditional:

const deleteOps: Promise<unknown>[] = [];
if (failedCount) deleteOps.push(payload.delete({ collection: "payload-jobs", where: failedWhere, overrideAccess: true }));
if (orphanedCount) deleteOps.push(payload.delete({ collection: "payload-jobs", where: orphanedWhere, overrideAccess: true }));
await Promise.all(deleteOps);

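The short-circuit subtlety can be demonstrated standalone, with a hypothetical fakeDelete standing in for payload.delete (nothing here is Payload's actual API):

```typescript
// Hypothetical stand-in for payload.delete(...): resolves once "deleted".
const fakeDelete = async (label: string): Promise<string> => `deleted:${label}`;

const failedCount: number = 0; // nothing to delete
const orphanedCount: number = 3; // something to delete

// The short-circuit pattern: when the count is 0, the expression is the
// *number* 0, not a Promise.
const shortCircuited = failedCount && fakeDelete("failed");
console.log(typeof shortCircuited); // "number"

// Promise.all still works, because it Promise.resolve()s non-thenables —
// but the 0 leaks into the resolved results array.
Promise.all([
  failedCount && fakeDelete("failed"),
  orphanedCount && fakeDelete("orphaned"),
]).then((results) => console.log(results)); // [0, "deleted:orphaned"]

// The explicit-conditional version keeps the array homogeneous.
const deleteOps: Promise<string>[] = [];
if (failedCount) deleteOps.push(fakeDelete("failed"));
if (orphanedCount) deleteOps.push(fakeDelete("orphaned"));
Promise.all(deleteOps).then((results) => console.log(results)); // ["deleted:orphaned"]
```

The explicit version also means anything iterating the resolved array can't trip over a stray 0.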


2. Breaking change in output shape (cleanupFailedJobs.ts:94-96)

The output changed from { deleted: number } to { deletedFailed: number, deletedOrphaned: number }. Since TaskCleanupFailedJobs.output is typed as unknown in payload-types.ts, TypeScript won't catch downstream consumers. If any monitoring, alerting, or logging dashboards read the deleted field from job output, they would silently receive undefined. Worth confirming no external consumers rely on the old shape.
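If old consumers must keep working, they can normalize both shapes defensively. A sketch (OldOutput/NewOutput are illustrative names; the real output is typed unknown, as noted above):

```typescript
// Shapes before and after this PR.
type OldOutput = { deleted: number };
type NewOutput = { deletedFailed: number; deletedOrphaned: number };

// Normalize either shape to a single total instead of silently
// reading `undefined` from the renamed field.
function totalDeleted(output: unknown): number {
  const o = output as Partial<OldOutput & NewOutput>;
  if (typeof o?.deleted === "number") return o.deleted;
  return (o?.deletedFailed ?? 0) + (o?.deletedOrphaned ?? 0);
}

console.log(totalDeleted({ deleted: 7 })); // 7
console.log(totalDeleted({ deletedFailed: 4, deletedOrphaned: 3 })); // 7
```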


Minor Notes

  • The now variable (line 28) is used exclusively for the waitUntil < now condition. The naming is clear, though a comment like // Current time for waitUntil comparison could help distinguish it from cutoffDate for future readers.
  • Using createdAt for orphaned jobs vs. completedAt for failed jobs is the correct distinction — orphaned jobs were never completed, so this is intentional and fine.

Summary: The core logic is correct and the parallel count + delete approach is a good pattern. The main actionable items are the && short-circuit style issue and confirming no consumers break on the output shape change.


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af1f771996

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Add uploadError field to extraction objects to store permanent failure reasons.
Extractions with uploadError set are excluded from upload retries and marked as
permanently failed in the admin UI. Update completion logic to handle mixed
success/failure states and provide detailed status in Airtable.
@kelvinkipruto
Contributor Author

@claude Review

@kelvinkipruto
Contributor Author

@codex Review

@claude

claude bot commented Mar 10, 2026

Claude finished @kelvinkipruto's task in 1m 48s (View job)


Code Review

  • Read changed files
  • Analyze code for bugs, performance, and correctness
  • Post review feedback

This PR touches two independent concerns: (1) orphaned job cleanup in cleanupFailedJobs.ts and (2) permanent failure tracking in uploadToMeedan.ts. Both changes are well-motivated, but each has issues worth discussing.


cleanupFailedJobs.ts

P1 — Orphaned job deletion is not scoped by queue

The orphanedWhere predicate deletes any untried, non-processing job older than the retention window, across all queues. This is the same concern raised by the Codex reviewer.

The problem: if a legitimate queue (not just the "default" queue) experiences a worker outage longer than the retention window (24h by default), every unprocessed job in that queue would be permanently deleted rather than executed when the worker comes back. The comment says "jobs queued into a queue that has no worker (e.g. the 'default' queue)", but the code doesn't enforce this scope.

Consider adding a queue filter to restrict orphan cleanup to known unserved queues:

const orphanedWhere: Where = {
  and: [
    { queue: { equals: "default" } }, // or a list of known unserved queues
    { processing: { equals: false } },
    { hasError: { equals: false } },
    { totalTried: { equals: 0 } },
    { createdAt: { less_than: cutoffDate.toISOString() } },
    {
      or: [
        { waitUntil: { exists: false } },
        { waitUntil: { less_than: now } },
      ],
    },
  ],
};

Or expose the queue names as an environment variable (e.g. PAYLOAD_JOBS_ORPHANED_QUEUES=default,legacy) so it can be configured per deployment.

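The environment-variable option could be parsed defensively, falling back to the "default" queue. A sketch — the variable name PAYLOAD_JOBS_ORPHANED_QUEUES is a suggestion, not an existing setting:

```typescript
// Hypothetical setting: comma-separated queue names that have no worker
// and are therefore safe to purge of orphaned jobs.
function parseOrphanedQueues(
  raw: string | undefined,
  fallback: string[] = ["default"],
): string[] {
  if (!raw) return fallback;
  const queues = raw
    .split(",")
    .map((q) => q.trim())
    .filter((q) => q.length > 0);
  return queues.length > 0 ? queues : fallback;
}

console.log(parseOrphanedQueues("default, legacy")); // ["default", "legacy"]
console.log(parseOrphanedQueues(undefined)); // ["default"]
```

The result would feed a clause like `queue: { in: parseOrphanedQueues(process.env.PAYLOAD_JOBS_ORPHANED_QUEUES) }` in the Where object, assuming Payload's `in` operator accepts an array here.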

Minor — Count/delete race window

The count and delete happen in separate operations. Between the count and the delete, new orphaned jobs could be queued, or the cleanup could run concurrently. This only affects the accuracy of the logged count — the delete itself is still safe. This is acceptable in practice, just worth noting in a comment.

Minor — Output shape is a breaking change

Confirmed in this PR: the output changes from { deleted: number } to { deletedFailed: number, deletedOrphaned: number }. Since TaskCleanupFailedJobs.output is typed unknown in payload-types.ts, TypeScript won't catch downstream consumers reading deleted. Ensure no monitoring, alerting, or admin UI reads the old field.


uploadToMeedan.ts

Behavioral change — setDocumentFailedStatus is no longer called on extraction failures

Previously, when docFailedExtractions > 0, the code called setDocumentFailedStatus(...) in Airtable and then continued to the next doc (skipping the markDocumentAsProcessedWhenReady call). The document was explicitly marked as failed in Airtable.

In this PR, setDocumentFailedStatus is removed from this path entirely. Now, failed extractions are persisted to the DB and the document is marked complete with a partial-success status like "Uploaded to Meedan (2 out of 5 failed)".

This is a product decision rather than a bug, but the trade-off is: Airtable operators no longer see a "Failed: ..." status for documents with failed extractions — they see a success variant instead. If Airtable workflows depend on the Failed: prefix to triage issues, this change could cause failures to go unnoticed.

Edge case: if all extractions for a document fail, the document would be marked "Uploaded to Meedan (5 out of 5 failed)". The label "Uploaded to Meedan" with a 100% failure rate is misleading. Consider a different status like "Failed: all extractions failed" for the total-failure case.

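One way to sketch the suggested distinction (buildUploadStatus is a hypothetical helper, not the PR's actual code):

```typescript
// Distinguish total failure from partial success when building the
// Airtable status string, so a 100% failure rate is never labeled
// as an upload.
function buildUploadStatus(failed: number, total: number): string {
  if (total > 0 && failed === total) return "Failed: all extractions failed";
  if (failed > 0) return `Uploaded to Meedan (${failed} out of ${total} failed)`;
  return "Uploaded to Meedan";
}

console.log(buildUploadStatus(5, 5)); // "Failed: all extractions failed"
console.log(buildUploadStatus(2, 5)); // "Uploaded to Meedan (2 out of 5 failed)"
console.log(buildUploadStatus(0, 5)); // "Uploaded to Meedan"
```

Keeping the Failed: prefix for the total-failure case would also preserve any Airtable triage workflows that key on it.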

Good — hasPendingExtractions and hasHandledExtractions logic is correct

The updated predicates correctly account for uploadError when determining whether extractions are pending or handled. The canMarkDocumentAsProcessed check will correctly wait until all extractions are either uploaded or permanently failed before marking a document complete.

Good — Concurrent update risk is low

The payload.update inside the extraction catch block updates the full extractions array based on the in-memory doc.extractions. If two extraction tasks for the same doc ran concurrently, one could overwrite the other's update. But since the task loops sequentially over extractions for a given doc, this race doesn't apply within a single task run. The only risk is across concurrent task runs for the same doc, which is unlikely given the task structure.


Summary

Issue | Severity | File
Orphaned job deletion not scoped by queue | High | cleanupFailedJobs.ts:41
All-failed case shows misleading "Uploaded to Meedan" status | Medium | uploadToMeedan.ts:590
setDocumentFailedStatus removal changes Airtable failure visibility | Medium (product decision) | uploadToMeedan.ts:575
Output shape deleted → deletedFailed/deletedOrphaned breaks consumers | Low | cleanupFailedJobs.ts:93

The most critical item to address before merge is the queue-scoping of orphaned job deletion — without it, any active queue that's down for more than 24 hours will have its jobs silently deleted.



@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc0ab211b9


…tatus

When a document has no extractions to upload, calculate a more accurate status message based on existing failures. Also, fetch a fresh document before updating extraction errors to avoid overwriting concurrent updates from other loop iterations.
@kelvinkipruto kelvinkipruto merged commit 76a14d4 into main Mar 10, 2026
5 of 6 checks passed
@kelvinkipruto kelvinkipruto deleted the ft/cleanup-orphaned-jobs branch March 10, 2026 14:49
