update task_resume_workflows to also resume processes in CREATED/RESUMED status #984

Open
wants to merge 5 commits into main from 898-improve-worker-process

Conversation

Contributor

@tjeerddie commented Jul 2, 2025

  • improve start_process to include the logic that is shared by both executors.
  • move threadpool executor functions to its own file.
    • celery: orchestrator/services/celery.py -> orchestrator/services/executors/celery.py
    • thread: orchestrator/services/processes.py -> orchestrator/services/executors/threadpool.py
      • thread_start_process
      • thread_resume_process
      • thread_validate_workflow
      • THREADPOOL_EXECUTION_CONTEXT
  • move can_be_resumed from api/processes.py to services/processes.py.
  • add can_be_resumed check in _celery_set_process_status_resumed so only correct statuses can be changed.
  • add a process status check in thread_resume_process so the stored input state is only retrieved when the process status is SUSPENDED (see the sketch after this list).
  • improve workflow removed error messages.
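
A minimal sketch of the thread_resume_process status check described above — the helper, the signature, and the import paths here are illustrative assumptions, not the PR's actual code:

```python
from orchestrator.db import ProcessTable          # import path assumed
from orchestrator.workflow import ProcessStatus   # import path assumed


def _load_suspended_input(process: ProcessTable) -> list[dict]:
    """Hypothetical helper: fetch the user input that was saved when the process suspended."""
    return []  # placeholder; the real lookup reads the stored process state


def thread_resume_process(process: ProcessTable, user_inputs: list[dict] | None = None) -> None:
    # Only a SUSPENDED process has saved user input to pick up again; for the
    # other resumable statuses the engine continues with an empty input list.
    if user_inputs is None:
        if process.last_status == ProcessStatus.SUSPENDED:
            user_inputs = _load_suspended_input(process)
        else:
            user_inputs = []
    # ... hand the process and inputs over to the workflow engine as before (omitted)
```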

updated testing:

  • add unit tests for threadpool and celery executor functions.
  • add unit tests for resume_process.
  • improve unit tests for start_process.
  • add 2 tests that validate the happy flow of start_process and resume_process.
  • change the API "resume workflow with incorrect status" test to check all incorrect statuses.
  • update task_resume_workflow tests.

Related: #898

@tjeerddie force-pushed the 898-improve-worker-process branch from ff408fb to 1ac7b2d on July 2, 2025 14:57

codspeed-hq bot commented Jul 2, 2025

CodSpeed Performance Report

Merging #984 will not alter performance

Comparing 898-improve-worker-process (7a8f80a) with main (3ab3628)

Summary

✅ 12 untouched benchmarks


codecov bot commented Jul 3, 2025

Codecov Report

Attention: Patch coverage is 77.67857% with 25 lines in your changes missing coverage. Please review.

Project coverage is 84.88%. Comparing base (3ab3628) to head (7a8f80a).

Files with missing lines Patch % Lines
orchestrator/services/tasks.py 17.64% 14 Missing ⚠️
orchestrator/services/processes.py 77.77% 4 Missing ⚠️
orchestrator/workflows/tasks/resume_workflows.py 81.81% 3 Missing and 1 partial ⚠️
orchestrator/services/executors/celery.py 75.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #984      +/-   ##
==========================================
+ Coverage   84.40%   84.88%   +0.47%     
==========================================
  Files         213      214       +1     
  Lines       10359    10376      +17     
  Branches     1019     1020       +1     
==========================================
+ Hits         8744     8808      +64     
+ Misses       1345     1295      -50     
- Partials      270      273       +3     


@Mark90 self-requested a review July 9, 2025 08:28
Comment on lines +33 to +34
created_process_ids = get_process_ids_by_process_statuses([ProcessStatus.CREATED])
resumed_process_ids = get_process_ids_by_process_statuses([ProcessStatus.RESUMED])
Contributor

I'm not completely sure if this is the way to go.

Imagine this scenario:

  • There are separate queues for 1) new workflows (status CREATED) 2) resumed workflows (status RESUMED)
  • Each queue has 50 workflows queued
  • For each queue there are 5 workers churning through queued workflows which they have put on status RUNNING
  • When task_resume_workflows runs it finds 45 processes on CREATED and 45 on RESUMED so it calls resume_process() for each of them

Unless I'm missing something, this would re-queue 90 processes that don't need to be re-queued.

In thread_start_process() you already changed it to set the workflow to RUNNING before executing any step. This means that the chance of the celery worker going AWOL before setting it to RUNNING has become much smaller, so that is already much better. If a worker dies while a process is running, it can be resumed by an engine restart.

So, I don't think we should resume created/resumed workflows in this task.
What do you think?

Contributor Author

I don't think the queue will get clogged, since the 45 re-added processes would all get killed with a no-op: each process already ran before and won't have the correct status to run again.

Contributor

tl;dr

While this can serve as an emergency "I absolutely want to start everything now" procedure, there are potential race conditions. Because of those, this workflow cannot be run automatically; it must (IMO) only be run when there is zero activity on the current workers and all the queues are empty.

Long read

For the CREATED scenario:

  • new process X was previously started and has status CREATED
  • it is present in the New Workflows queue, let's call the task NF_X
  • the NF queue is full so it isn't picked up yet
  • task_resume_workflows runs, sees X having status CREATED and runs processes.resume_process()
    • _celery_resume_process() sets X's status from CREATED to RESUMED
    • it is now also added to the Resume Workflows queue, let's call that task RF_X
  • worker 1 picks up NF_X, asserts status is CREATED, finds out it isn't
    • fails with an error
  • worker 2 picks up RF_X, asserts status is RESUMED, continues
    • thread_resume_process() sets the status from RESUMED to RUNNING

It does not matter if NF_X is picked up before or after RF_X; the resume task will always be the one to go through.

While not pretty given the errors that will be shown, the result is that RF_X runs correctly, so an argument can be made that the ends justify the means.
However, there is a race condition (⚡) because the status update to RESUMED can be done after a worker picked up the CREATED task and set the status to RUNNING. This will (likely?) be corrected by the next step update, but until then it leaves the process in an incorrect status (not to mention the worker dying at this point..)

For the RESUMED scenario:

  • process Y has previously been resumed and has status RESUMED
  • it is present in the Resume Workflows queue as RF_Y_1
  • the RF queue is full so it isn't picked up yet
  • task_resume_workflows runs, sees Y having status RESUMED and runs processes.resume_process()
    • _celery_resume_process() sets Y's status from RESUMED to RESUMED
    • it is now again added to the RF queue as RF_Y_2
  • worker 1 picks up RF_Y_1, asserts status is RESUMED, continues
    • thread_resume_process() sets the status from RESUMED to RUNNING
  • worker 2 picks up RF_Y_2, asserts status is RESUMED, finds out it isn't
    • fails with an error

This is a bit trickier as there are multiple race conditions (⚡).
The first one (the status update from RESUMED to RESUMED) can happen after a worker has picked up the RESUMED task and set the status to RUNNING. Same as for CREATED, this will likely be corrected by the next step update, but it leaves the process in an incorrect status until then.
The second one is worse: there is a sequence of events in which both worker 1 and worker 2 start process Y because they asserted it had status RESUMED. Both of them will start executing steps. We definitely do not want this.

Contributor Author

I think this is fixed by limiting which statuses can be changed to RESUMED by _celery_resume_process. It should only change the status from SUSPENDED, WAITING, FAILED, AWAITING_CALLBACK, API_UNAVAILABLE and INCONSISTENT_DATA, so it can never incorrectly change the status, removing the race condition.
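
A minimal sketch of that guard, assuming the status names listed above; the import paths and exact function shape are assumptions for illustration, not the PR's actual implementation:

```python
from orchestrator.db import ProcessTable          # import path assumed
from orchestrator.workflow import ProcessStatus   # import path assumed

# Statuses from which a process may legitimately be moved to RESUMED.
RESUMABLE_STATUSES = {
    ProcessStatus.SUSPENDED,
    ProcessStatus.WAITING,
    ProcessStatus.FAILED,
    ProcessStatus.AWAITING_CALLBACK,
    ProcessStatus.API_UNAVAILABLE,
    ProcessStatus.INCONSISTENT_DATA,
}


def can_be_resumed(status: ProcessStatus) -> bool:
    return status in RESUMABLE_STATUSES


def _celery_set_process_status_resumed(process: ProcessTable) -> None:
    # Refuse to touch CREATED/RUNNING/RESUMED processes, so this task can never
    # overwrite the status of a process a worker is about to run or already running.
    if not can_be_resumed(process.last_status):
        return
    process.last_status = ProcessStatus.RESUMED
    # ... persist the change and enqueue the resume task as before (omitted)
```

The idea is that, combined with the status assertion the workers already do, any extra queue entries produced by task_resume_workflows become harmless no-ops instead of racing status updates.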

@tjeerddie force-pushed the 898-improve-worker-process branch 2 times, most recently from ecdc192 to 32bd94a on July 14, 2025 13:12
update task_resume_workflows to also resume processes in CREATED/RESUMED status

- improve start_process to include logic that happens in both executors.
- move threadpool executor functions to its own file.
- Add more info about the added graph flow in docs.
- Remove `time_limit` from celery NEW_TASK.
- Change the "workflow cannot be resumed" error to be more specific
- Remove duplicate auth check in `start_process`; it's already done in `create_process`.
- Move the save-input step to resume_process in processes.py; removes an unnecessary state fetch in resume_task/resume_workflow
- add 2 tests that validate the happy flow of start_process and resume_process.
- add unit tests for resume_process
- improve unit tests for start_process
- move can_be_resumed from api/processes.py to services/processes.py.
- improve workflow removed error messages.
- revert resume_process incorrect return type.
- add can_be_resumed check in _celery_set_process_status_resumed so only correct statuses can be changed.
- add process status check in thread_resume_process to only retrieve input state on suspended status.
- change api resume workflow with incorrect status test to check all incorrect statuses
- update task_resume_workflow tests.
@tjeerddie force-pushed the 898-improve-worker-process branch from e4ce69d to 7a8f80a on July 21, 2025 08:13