jumski commented Jan 12, 2026

Add automatic requeue for stalled tasks via cron job

This PR implements a system to automatically detect and requeue tasks that have stalled due to worker crashes or other issues. Key features:

  • Added a requeue_stalled_tasks() function that identifies tasks stuck in 'started' status beyond their timeout window
  • Tasks can be requeued up to 3 times before being marked as failed
  • Added tracking columns to the step_tasks table: requeued_count and last_requeued_at
  • Implemented a configurable cron job via setup_requeue_stalled_tasks_cron() that runs every 15 seconds by default
  • Added a comprehensive test suite covering basic requeuing, max requeue limits, and multi-flow scenarios
  • Increased default visibility timeout in edge-worker from 2 to 5 seconds for better reliability

This enhancement improves system resilience by ensuring tasks don't remain stuck when workers crash unexpectedly, addressing issue #586.
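
To make the requeue rule above concrete, here is a minimal in-memory sketch in TypeScript. It is not the PR's implementation, which lives in a Postgres function. The `StepTask` shape, the `timeoutSeconds` field, and the choice to clear `startedAt` on requeue are assumptions made for illustration; only the 'started' status check, the limit of 3 requeues, and the `requeued_count` / `last_requeued_at` tracking come from the description.

```typescript
// Hypothetical in-memory model of a step task; field names are illustrative.
type TaskStatus = 'queued' | 'started' | 'completed' | 'failed';

interface StepTask {
  status: TaskStatus;
  startedAt: Date | null;
  timeoutSeconds: number; // assumed per-task timeout window
  requeuedCount: number; // mirrors the requeued_count column
  lastRequeuedAt: Date | null; // mirrors the last_requeued_at column
}

const MAX_REQUEUES = 3; // limit stated in the PR description

// Scans tasks, requeues stalled ones, fails those past the limit.
// Returns the number of tasks requeued in this pass.
function requeueStalledTasks(
  tasks: StepTask[],
  now: Date,
  maxRequeues: number = MAX_REQUEUES
): number {
  let requeued = 0;
  for (const task of tasks) {
    // Only tasks currently marked 'started' can be stalled.
    if (task.status !== 'started' || task.startedAt === null) continue;

    // A task is stalled once its timeout window has fully elapsed.
    const deadline = task.startedAt.getTime() + task.timeoutSeconds * 1000;
    if (now.getTime() <= deadline) continue;

    if (task.requeuedCount >= maxRequeues) {
      // Requeue budget exhausted: give up on the task.
      task.status = 'failed';
    } else {
      // Put the task back in the queue and record the requeue.
      task.status = 'queued';
      task.startedAt = null; // assumption: a requeued task has no start time
      task.requeuedCount += 1;
      task.lastRequeuedAt = now;
      requeued += 1;
    }
  }
  return requeued;
}
```

In this sketch, each cron tick corresponds to one call to `requeueStalledTasks` with the current time. A task that stalls a fourth time, after three requeues, is marked failed instead of being queued again.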

changeset-bot bot commented Jan 12, 2026

🦋 Changeset detected

Latest commit: 9083fa2

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages:

| Name | Type |
| --- | --- |
| @pgflow/core | Patch |
| @pgflow/edge-worker | Patch |
| pgflow | Patch |
| @pgflow/client | Patch |
| @pgflow/dsl | Patch |

jumski commented Jan 12, 2026

This stack of pull requests is managed by Graphite.

nx-cloud bot commented Jan 12, 2026

CI Pipeline Execution for commit 9083fa2

| Command | Status | Duration |
| --- | --- | --- |
| nx affected -t lint typecheck test --parallel -... | ❌ Failed | 1m 47s |
| nx run edge-worker:test:integration | ✅ Succeeded | 5m 13s |
| nx run client:e2e | ✅ Succeeded | 2m 51s |
| nx run core:pgtap | ✅ Succeeded | 1m 45s |
| nx run edge-worker:e2e | ✅ Succeeded | 52s |
| nx run cli:e2e | ✅ Succeeded | 7s |

☁️ Nx Cloud last updated this comment at 2026-01-14 07:35:41 UTC

jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from 7dabec6 to b3bc1e9 on January 12, 2026 09:54
… logic

- Introduced requeued_count and last_requeued_at columns to step_tasks table
- Developed requeue_stalled_tasks function to requeue or fail stalled tasks based on max requeues
- Created setup_requeue_stalled_tasks_cron function to schedule automatic requeue checks
- Updated migration scripts to include new columns and functions
- Added comprehensive tests for requeue behavior, max requeue limit, and cron setup

jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from b3bc1e9 to 9083fa2 on January 14, 2026 07:27