update task_resume_workflows to also resume processes in CREATED/RESUMED status #984
base: main
Conversation
(force-pushed from `ff408fb` to `1ac7b2d`)
CodSpeed Performance Report: Merging #984 will not alter performance.
Codecov Report
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #984      +/-   ##
==========================================
+ Coverage   84.40%   84.88%   +0.47%
==========================================
  Files         213      214       +1
  Lines       10359    10376      +17
  Branches     1019     1020       +1
==========================================
+ Hits         8744     8808      +64
+ Misses       1345     1295      -50
- Partials      270      273       +3
==========================================
```

View full report in Codecov by Sentry.
```python
created_process_ids = get_process_ids_by_process_statuses([ProcessStatus.CREATED])
resumed_process_ids = get_process_ids_by_process_statuses([ProcessStatus.RESUMED])
```
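For context, a hedged sketch of how these two lookups presumably feed the task's re-enqueue loop; the stubs and the loop shape are assumptions, not the PR's actual code:

```python
from enum import Enum

class ProcessStatus(str, Enum):
    # Stub of the relevant statuses; the real enum lives in orchestrator-core.
    CREATED = "created"
    RESUMED = "resumed"

def get_process_ids_by_process_statuses(statuses):
    # Stand-in for the real database query.
    return []

def resume_process(process_id):
    # Stand-in for processes.resume_process(), which re-enqueues the process.
    pass

def task_resume_workflows_sketch() -> None:
    """Assumed shape: collect processes stuck in CREATED/RESUMED, resume each."""
    created_process_ids = get_process_ids_by_process_statuses([ProcessStatus.CREATED])
    resumed_process_ids = get_process_ids_by_process_statuses([ProcessStatus.RESUMED])
    for process_id in [*created_process_ids, *resumed_process_ids]:
        resume_process(process_id)
```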
I'm not completely sure if this is the way to go.
Imagine this scenario:
- There are separate queues for 1) new workflows (status CREATED) 2) resumed workflows (status RESUMED)
- Each queue has 50 workflows queued
- For each queue there are 5 workers churning through queued workflows which they have put on status RUNNING
- When `task_resume_workflows` runs, it finds 45 processes on CREATED and 45 on RESUMED, so it calls `resume_process()` for each of them
Unless I'm missing something, this would re-queue 90 processes that don't need to be re-queued.
In `thread_start_process()` you already changed it to set the workflow to RUNNING before executing any step. This means the chance of the celery worker going AWOL before setting it to RUNNING has become much smaller, so that is already much better. If a worker dies while a process is running, the process can be resumed by an engine restart.
So, I don't think we should resume created/resumed workflows in this task.
What do you think?
I don't think the queue will get clogged: the 45 re-added processes would all end in a no-op, since each process already ran before and won't have the correct status to run again.
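This argument relies on the workers asserting the status before running. A minimal sketch of such a guard, with hypothetical names rather than the actual orchestrator code:

```python
from types import SimpleNamespace

def handle_queued_task(process, expected_status: str) -> bool:
    """Hypothetical worker-side guard: a re-queued task becomes a no-op
    when the stored status no longer matches what it was enqueued for."""
    if process.status != expected_status:
        return False  # process already progressed; drop the task
    process.status = "running"
    # ... execute workflow steps ...
    return True

# A re-queued duplicate sees status "running", not "resumed", and no-ops:
p = SimpleNamespace(status="running")
assert handle_queued_task(p, "resumed") is False
```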
tl;dr
While this can serve as an emergency "I absolutely want to start everything now" procedure, there are potential race conditions. Because of those, this task cannot be run automatically; it must (IMO) only be run when there is zero activity on the current workers and all the queues are empty.
Long read
For the `CREATED` scenario:

- new process X was previously started and has status `CREATED`
- it is present in the New Workflows queue; let's call the task `NF_X`
- the NF queue is full, so it isn't picked up yet
- `task_resume_workflows` runs, sees X having status `CREATED`, and runs `processes.resume_process()`
- `_celery_resume_process()` sets X's status from `CREATED` to `RESUMED` ⚡
- it is now also added to the Resume Workflows queue; let's call that task `RF_X`
- worker 1 picks up `NF_X`, asserts the status is `CREATED`, finds out it isn't, and fails with an error
- worker 2 picks up `RF_X`, asserts the status is `RESUMED`, and continues
- `thread_resume_process()` sets the status from `RESUMED` to `RUNNING`
It does not matter whether `NF_X` is picked up before or after `RF_X`; the resume task will always be the one to go through.

While not pretty, given the errors that will be shown, the result is that `RF_X` runs correctly, so an argument can be made that the end justifies the means.

However, there is a race condition (⚡): the status update to `RESUMED` can happen after a worker has picked up the `CREATED` task and set the status to `RUNNING`. This will (likely?) be corrected by the next step update, but until then it leaves the process in an incorrect status (not to mention the worker dying at this point..). The sketch below shows the window.
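A minimal sketch of that window, with hypothetical stand-in names; the point is only that the check and the write are two separate, non-atomic steps:

```python
from dataclasses import dataclass

@dataclass
class Process:
    process_id: str
    status: str

def celery_resume_sketch(process: Process) -> None:
    """Illustrates the ⚡ window: the CREATED observation and the write to
    RESUMED are separate steps with no lock or conditional update."""
    # 1. task_resume_workflows observed status CREATED when it built its list
    # ⚡ a worker may pick up NF_X right here and set the status to RUNNING
    process.status = "resumed"  # 2. this unconditional write then clobbers RUNNING
```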
For the `RESUMED` scenario:

- process Y has previously been resumed and has status `RESUMED`
- it is present in the Resume Workflows queue as `RF_Y_1`
- the RF queue is full, so it isn't picked up yet
- `task_resume_workflows` runs, sees Y having status `RESUMED`, and runs `processes.resume_process()`
- `_celery_resume_process()` sets Y's status from `RESUMED` to `RESUMED` ⚡
- it is now again added to the RF queue as `RF_Y_2`
- worker 1 picks up `RF_Y_1`, asserts the status is `RESUMED`, and continues
- `thread_resume_process()` sets the status from `RESUMED` to `RUNNING` ⚡
- worker 2 picks up `RF_Y_2`, asserts the status is `RESUMED`, finds out it isn't, and fails with an error
This is a bit trickier, as there are multiple race conditions (⚡).

The first one (setting the status from `RESUMED` to `RESUMED`) can happen after a worker has picked up the `RESUMED` task and set the status to `RUNNING`. As with the CREATED scenario, this will likely be corrected by the next step update, but it leaves the process in an incorrect status until then.

The second one is worse: there is a sequence of events in which both worker 1 and worker 2 start process Y, because both asserted it had status `RESUMED`. Both of them will start executing steps. We definitely do not want this; the toy demo below reproduces the interleaving.
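Everything in this demo is a stand-in, not orchestrator code; a barrier widens the race window so the double-start happens deterministically:

```python
import threading
from dataclasses import dataclass, field

@dataclass
class Process:
    status: str = "resumed"
    started_by: list = field(default_factory=list)  # list.append is atomic in CPython

def worker(name: str, p: Process, barrier: threading.Barrier) -> None:
    # Check-then-act without atomicity: both workers pass the status check
    # before either one writes RUNNING.
    if p.status == "resumed":
        barrier.wait()             # both checks happen before any write
        p.status = "running"
        p.started_by.append(name)  # both workers "start" executing steps

p = Process()
barrier = threading.Barrier(2)
threads = [threading.Thread(target=worker, args=(f"worker {i}", p, barrier)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(p.started_by)  # ['worker 1', 'worker 2']: process Y was started twice
```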
I think this is fixed by limiting which statuses can be changed to `RESUMED` by `_celery_resume_process`. It should only change the status from `SUSPENDED`, `WAITING`, `FAILED`, `AWAITING_CALLBACK`, `API_UNAVAILABLE`, and `INCONSISTENT_DATA`, so it can never incorrectly change the status, removing the race condition.
(force-pushed from `ecdc192` to `32bd94a`)
…MED status
- improve `start_process` to include logic that happens in both executors
- move threadpool executor functions to their own file
- add more info about the added graph flow in docs
- remove `time_limit` from celery NEW_TASK
- change the "workflow cannot be resumed" error to be more specific
- remove duplicate auth check in `start_process`; it's already done in `create_process`
- move the save-input step to `resume_process` in processes.py, removing an unnecessary state fetch in resume_task/resume_workflow
- add 2 tests that validate the happy flow of `start_process` and `resume_process`
- add unit tests for `resume_process`
- improve unit tests for `start_process`
- move `can_be_resumed` from api/processes.py to services/processes.py
- improve "workflow removed" error messages
- revert `resume_process` incorrect return type
- add a `can_be_resumed` check in `_celery_set_process_status_resumed` so only correct statuses can be changed
- add a process status check in `thread_resume_process` to only retrieve input state on suspended status
- change the "api resume workflow with incorrect status" test to check all incorrect statuses
- update `task_resume_workflow` tests
(force-pushed from `e4ce69d` to `7a8f80a`)
- `orchestrator/services/celery.py` -> `orchestrator/services/executors/celery.py`
- `orchestrator/services/processes.py` -> `orchestrator/services/executors/threadpool.py`
- `thread_start_process`
- `thread_resume_process`
- `thread_validate_workflow`
- `THREADPOOL_EXECUTION_CONTEXT`
updated testing:
Related: #898