
fix(cleanupFailedJobs): delete orphaned pending jobs in addition to failed ones #561

Merged
kelvinkipruto merged 4 commits into main from ft/cleanup-orphaned-jobs on Mar 10, 2026

Conversation

@kelvinkipruto
Contributor

Description

Previously, the cleanup task only removed failed jobs older than the retention window. This left orphaned pending jobs (never attempted, not scheduled, and older than retention) to accumulate indefinitely, especially in unused queues like "default".

Now the task also deletes these orphaned jobs, using parallel counting and deletion for efficiency. The output now distinguishes between deleted failed and deleted orphaned job counts.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

…ailed ones

@vercel

vercel bot commented Mar 10, 2026

The latest updates on your projects.

Project | Deployment | Actions | Updated (UTC)
PromiseTracker | Ready | Preview, Comment | Mar 10, 2026 2:50pm


@kelvinkipruto
Contributor Author

@claude Review

@claude

claude bot commented Mar 10, 2026

Claude finished @kelvinkipruto's task in 1m 37s (View job)


Code Review

  • Read changed files
  • Analyze code for bugs, performance, and correctness
  • Post review feedback

Overall the change is well-structured and the logic is sound. A few things worth noting:


Correctness

The orphanedWhere query correctly identifies stranded jobs using:

  • processing: false — not actively running
  • hasError: false — no error state
  • totalTried: 0 — never attempted
  • createdAt < cutoffDate — older than retention window
  • waitUntil doesn't exist or is in the past — not scheduled for the future

This is a correct and safe set of conditions. A job with a future waitUntil (even if old) won't be deleted, which is the right call.
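As a standalone sketch of these conditions (plain TypeScript with no Payload dependency; the field names mirror the payload-jobs fields listed above), the predicate would look like:

```typescript
// Minimal job shape covering only the fields the predicate inspects.
interface JobRow {
  processing: boolean;
  hasError: boolean;
  totalTried: number;
  createdAt: Date;
  waitUntil?: Date; // absent means "not scheduled"
}

// A job is orphaned when it was never attempted, is not running or errored,
// predates the retention cutoff, and is not scheduled for the future.
function isOrphaned(job: JobRow, cutoffDate: Date, now: Date): boolean {
  return (
    !job.processing &&
    !job.hasError &&
    job.totalTried === 0 &&
    job.createdAt < cutoffDate &&
    (job.waitUntil === undefined || job.waitUntil < now)
  );
}

const now = new Date("2026-03-10T14:00:00Z");
const cutoff = new Date("2026-03-09T14:00:00Z"); // 24h retention

// Old, never-attempted job with no schedule: orphaned.
const stale: JobRow = { processing: false, hasError: false, totalTried: 0, createdAt: new Date("2026-03-01T00:00:00Z") };
console.log(isOrphaned(stale, cutoff, now)); // true

// Same job but scheduled for the future: kept.
const scheduled: JobRow = { ...stale, waitUntil: new Date("2026-04-01T00:00:00Z") };
console.log(isOrphaned(scheduled, cutoff, now)); // false
```

The second call is exactly the case the review highlights: an old job with a future waitUntil survives cleanup.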


Issues

1. Short-circuit && pattern in Promise.all is misleading (cleanupFailedJobs.ts:72-85)

await Promise.all([
  failedCount && payload.delete(...),   // evaluates to `0` when falsy
  orphanedCount && payload.delete(...), // evaluates to `0` when falsy
]);

When failedCount is 0, the expression evaluates to 0 (not a Promise). Promise.all wraps non-Promise values via Promise.resolve(), so this is technically correct — but it's a subtle pattern that could confuse readers and could mask bugs if the type of failedCount ever changes. Consider using an explicit conditional:

const deleteOps: Promise<unknown>[] = [];
if (failedCount) deleteOps.push(payload.delete({ collection: "payload-jobs", where: failedWhere, overrideAccess: true }));
if (orphanedCount) deleteOps.push(payload.delete({ collection: "payload-jobs", where: orphanedWhere, overrideAccess: true }));
await Promise.all(deleteOps);

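The short-circuit subtlety can be demonstrated standalone, with a hypothetical fakeDelete standing in for payload.delete (nothing here is Payload's actual API):

```typescript
// Hypothetical stand-in for payload.delete(...): resolves once "deleted".
const fakeDelete = async (label: string): Promise<string> => `deleted:${label}`;

const failedCount: number = 0; // nothing to delete
const orphanedCount: number = 3; // something to delete

// The short-circuit pattern: when the count is 0, the expression is the
// *number* 0, not a Promise.
const shortCircuited = failedCount && fakeDelete("failed");
console.log(typeof shortCircuited); // "number"

// Promise.all still works, because it Promise.resolve()s non-thenables —
// but the 0 leaks into the resolved results array.
Promise.all([
  failedCount && fakeDelete("failed"),
  orphanedCount && fakeDelete("orphaned"),
]).then((results) => console.log(results)); // [0, "deleted:orphaned"]

// The explicit-conditional version keeps the array homogeneous.
const deleteOps: Promise<string>[] = [];
if (failedCount) deleteOps.push(fakeDelete("failed"));
if (orphanedCount) deleteOps.push(fakeDelete("orphaned"));
Promise.all(deleteOps).then((results) => console.log(results)); // ["deleted:orphaned"]
```

The explicit version also means anything iterating the resolved array can't trip over a stray 0.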


2. Breaking change in output shape (cleanupFailedJobs.ts:94-96)

The output changed from { deleted: number } to { deletedFailed: number, deletedOrphaned: number }. Since TaskCleanupFailedJobs.output is typed as unknown in payload-types.ts, TypeScript won't catch downstream consumers. If any monitoring, alerting, or logging dashboards read the deleted field from job output, they would silently receive undefined. Worth confirming no external consumers rely on the old shape.
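If old consumers must keep working, they can normalize both shapes defensively. A sketch (OldOutput/NewOutput are illustrative names; the real output is typed unknown, as noted above):

```typescript
// Shapes before and after this PR.
type OldOutput = { deleted: number };
type NewOutput = { deletedFailed: number; deletedOrphaned: number };

// Normalize either shape to a single total instead of silently
// reading `undefined` from the renamed field.
function totalDeleted(output: unknown): number {
  const o = output as Partial<OldOutput & NewOutput>;
  if (typeof o?.deleted === "number") return o.deleted;
  return (o?.deletedFailed ?? 0) + (o?.deletedOrphaned ?? 0);
}

console.log(totalDeleted({ deleted: 7 })); // 7
console.log(totalDeleted({ deletedFailed: 4, deletedOrphaned: 3 })); // 7
```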


Minor Notes

  • The now variable (line 28) is used exclusively for the waitUntil < now condition. The naming is clear, though a comment like // Current time for waitUntil comparison could help distinguish it from cutoffDate for future readers.
  • Using createdAt for orphaned jobs vs. completedAt for failed jobs is the correct distinction — orphaned jobs were never completed, so this is intentional and fine.

Summary: The core logic is correct and the parallel count + delete approach is a good pattern. The main actionable items are the && short-circuit style issue and confirming no consumers break on the output shape change.


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af1f771996

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Add uploadError field to extraction objects to store permanent failure reasons.
Extractions with uploadError set are excluded from upload retries and marked as
permanently failed in the admin UI. Update completion logic to handle mixed
success/failure states and provide detailed status in Airtable.
@kelvinkipruto
Contributor Author

@claude Review

@kelvinkipruto
Contributor Author

@codex Review

@claude

claude bot commented Mar 10, 2026

Claude finished @kelvinkipruto's task in 1m 48s (View job)


Code Review

  • Read changed files
  • Analyze code for bugs, performance, and correctness
  • Post review feedback

This PR touches two independent concerns: (1) orphaned job cleanup in cleanupFailedJobs.ts and (2) permanent failure tracking in uploadToMeedan.ts. Both changes are well-motivated, but each has issues worth discussing.


cleanupFailedJobs.ts

P1 — Orphaned job deletion is not scoped by queue

The orphanedWhere predicate deletes any untried, non-processing job older than the retention window, across all queues. This is the same concern raised by the Codex reviewer.

The problem: if a legitimate queue (not just the "default" queue) experiences a worker outage longer than the retention window (24h by default), every unprocessed job in that queue would be permanently deleted rather than executed when the worker comes back. The comment says "jobs queued into a queue that has no worker (e.g. the 'default' queue)", but the code doesn't enforce this scope.

Consider adding a queue filter to restrict orphan cleanup to known unserved queues:

const orphanedWhere: Where = {
  and: [
    { queue: { equals: "default" } }, // or a list of known unserved queues
    { processing: { equals: false } },
    { hasError: { equals: false } },
    { totalTried: { equals: 0 } },
    { createdAt: { less_than: cutoffDate.toISOString() } },
    {
      or: [
        { waitUntil: { exists: false } },
        { waitUntil: { less_than: now } },
      ],
    },
  ],
};

Or expose the queue names as an environment variable (e.g. PAYLOAD_JOBS_ORPHANED_QUEUES=default,legacy) so it can be configured per deployment.

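The environment-variable option could be parsed defensively, falling back to the "default" queue. A sketch — the variable name PAYLOAD_JOBS_ORPHANED_QUEUES is a suggestion, not an existing setting:

```typescript
// Hypothetical setting: comma-separated queue names that have no worker
// and are therefore safe to purge of orphaned jobs.
function parseOrphanedQueues(
  raw: string | undefined,
  fallback: string[] = ["default"],
): string[] {
  if (!raw) return fallback;
  const queues = raw
    .split(",")
    .map((q) => q.trim())
    .filter((q) => q.length > 0);
  return queues.length > 0 ? queues : fallback;
}

console.log(parseOrphanedQueues("default, legacy")); // ["default", "legacy"]
console.log(parseOrphanedQueues(undefined)); // ["default"]
```

The result would feed a clause like `queue: { in: parseOrphanedQueues(process.env.PAYLOAD_JOBS_ORPHANED_QUEUES) }` in the Where object, assuming Payload's `in` operator accepts an array here.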

Minor — Count/delete race window

The count and delete happen in separate operations. Between the count and the delete, new orphaned jobs could be queued, or the cleanup could run concurrently. This only affects the accuracy of the logged count — the delete itself is still safe. This is acceptable in practice, just worth noting in a comment.

Minor — Output shape is a breaking change

Confirmed in this PR: the output changes from { deleted: number } to { deletedFailed: number, deletedOrphaned: number }. Since TaskCleanupFailedJobs.output is typed unknown in payload-types.ts, TypeScript won't catch downstream consumers reading deleted. Ensure no monitoring, alerting, or admin UI reads the old field.


uploadToMeedan.ts

Behavioral change — setDocumentFailedStatus is no longer called on extraction failures

Previously, when docFailedExtractions > 0, the code called setDocumentFailedStatus(...) in Airtable and then continued to the next doc (skipping the markDocumentAsProcessedWhenReady call). The document was explicitly marked as failed in Airtable.

In this PR, setDocumentFailedStatus is removed from this path entirely. Now, failed extractions are persisted to the DB and the document is marked complete with a partial-success status like "Uploaded to Meedan (2 out of 5 failed)".

This is a product decision rather than a bug, but the trade-off is: Airtable operators no longer see a "Failed: ..." status for documents with failed extractions — they see a success variant instead. If Airtable workflows depend on the Failed: prefix to triage issues, this change could cause failures to go unnoticed.

Edge case: if all extractions for a document fail, the document would be marked "Uploaded to Meedan (5 out of 5 failed)". The label "Uploaded to Meedan" with a 100% failure rate is misleading. Consider a different status like "Failed: all extractions failed" for the total-failure case.

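One way to sketch the suggested distinction (buildUploadStatus is a hypothetical helper, not the PR's actual code):

```typescript
// Distinguish total failure from partial success when building the
// Airtable status string, so a 100% failure rate is never labeled
// as an upload.
function buildUploadStatus(failed: number, total: number): string {
  if (total > 0 && failed === total) return "Failed: all extractions failed";
  if (failed > 0) return `Uploaded to Meedan (${failed} out of ${total} failed)`;
  return "Uploaded to Meedan";
}

console.log(buildUploadStatus(5, 5)); // "Failed: all extractions failed"
console.log(buildUploadStatus(2, 5)); // "Uploaded to Meedan (2 out of 5 failed)"
console.log(buildUploadStatus(0, 5)); // "Uploaded to Meedan"
```

Keeping the Failed: prefix for the total-failure case would also preserve any Airtable triage workflows that key on it.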

Good — hasPendingExtractions and hasHandledExtractions logic is correct

The updated predicates correctly account for uploadError when determining whether extractions are pending or handled. The canMarkDocumentAsProcessed check will correctly wait until all extractions are either uploaded or permanently failed before marking a document complete.

Good — Concurrent update risk is low

The payload.update inside the extraction catch block updates the full extractions array based on the in-memory doc.extractions. If two extraction tasks for the same doc ran concurrently, one could overwrite the other's update. But since the task loops sequentially over extractions for a given doc, this race doesn't apply within a single task run. The only risk is across concurrent task runs for the same doc, which is unlikely given the task structure.


Summary

Issue | Severity | File
Orphaned job deletion not scoped by queue | High | cleanupFailedJobs.ts:41
All-failed case shows misleading "Uploaded to Meedan" status | Medium | uploadToMeedan.ts:590
setDocumentFailedStatus removal changes Airtable failure visibility | Medium (product decision) | uploadToMeedan.ts:575
Output shape deleted → deletedFailed/deletedOrphaned breaks consumers | Low | cleanupFailedJobs.ts:93

The most critical item to address before merge is the queue-scoping of orphaned job deletion — without it, any active queue that's down for more than 24 hours will have its jobs silently deleted.



@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc0ab211b9


…tatus

When a document has no extractions to upload, calculate a more accurate status message based on existing failures. Also, fetch a fresh document before updating extraction errors to avoid overwriting concurrent updates from other loop iterations.
@kelvinkipruto kelvinkipruto merged commit 76a14d4 into main Mar 10, 2026
5 of 6 checks passed
@kelvinkipruto kelvinkipruto deleted the ft/cleanup-orphaned-jobs branch March 10, 2026 14:49
