fix(scheduler): ignore stale executor success after defer reschedule#66431
fix(scheduler): ignore stale executor success after defer reschedule#66431Pedrinhonitz wants to merge 9 commits into
Conversation
When a trigger moves a deferred task back to scheduled before the scheduler processes the executor success from the worker defer exit, treat it as benign (same try_number, next_method set) instead of state mismatch failure. Closes apache#66374 Co-authored-by: Cursor <cursoragent@cursor.com>
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
|
jscheffl
left a comment
There was a problem hiding this comment.
Looks good for me. Thanks for the fix!
As I am not 100% deep in the scheduler feel not comfortable to mege, so if there is a second pair of eyes from reviewer would be cool, else LGTM!
|
Perfect! I'm also eagerly awaiting another review and am open to other suggestions for resolving this issue. Thank you for your review. |
|
Thanks for your fix! I'm just curious, is it possible that the scheduler picks up the TI very quickly and sets its state to |
|
Probably needs @ashb eyes to confirm |
ashb
left a comment
There was a problem hiding this comment.
This entire executor event thing has always been overly complex and confusing.
If the event from the executor is success do we ever need to do anything? If the executor says it ran successfully (which means the "supervisor" ran okay, it says nothing about the task) can't we simply ignore the event? We only need to be concerned with executor failures as the task may not have been able to report it's failure?
|
I agree that this whole process is difficult to understand. This PR focuses on the false mismatch of issue #66374, which occurs when IT is already scheduled with next_method after the trigger. A more comprehensive approach, such as "ignoring all executor SUCCESSES," would require a check to ensure that each consumer of these events is not unexpectedly affected. I am not entirely familiar with this control and have minimal knowledge about it; this was my first contribution, but your comment about the scheduler not needing to process SUCCESS makes sense to me, as that is the expected behavior. I apologize for my English. (I'm translating some points that I didn't understand, and I may have misinterpreted something.) |
|
@ashb as I assume even if it is hard still OK to merge? I assume this also fixed one erro we had in production after applying the hotfix locally. So I think this is good. At least as a fix. General improvements welcome but would prevent waiting longer. |
Summary
Fixes a race where the trigger resumes a deferred task to
scheduledbefore the scheduler processes the executorSUCCESSfrom the worker defer exit. The scheduler then treatedscheduledvs executorsuccessas a state mismatch and failed the TI (#66374).Changes
process_executor_events, treat executorSUCCESSwhile the TI isSCHEDULEDwithnext_methodset (same try) as a stale defer-exit event, consistent with the existing “requeued after defer” intent.test_process_executor_events_stale_success_when_scheduled_after_defer(positive + negative withoutnext_method).66374.bugfix.rst.Testing
Was generative AI tooling used to co-author this PR?
Generated-by: Cursor following the guidelines