Fix max_active_runs lost during DAG serialisation when value equals schema default #65310
seruman wants to merge 30 commits into
Conversation
Fix max_active_runs lost during DAG serialisation when value equals schema default

The serialisation optimisation from apache#55849 strips DAG fields that match their schema.json default. For `max_active_runs`, `max_active_tasks`, and `max_consecutive_failed_dag_runs` this is wrong because their runtime defaults come from `airflow.cfg`, not the schema. When a user explicitly sets `max_active_runs=16` and the config has `max_active_runs_per_dag=1`, the value gets stripped and the `dag` table ends up with `1`. Skip the schema-default exclusion for these three config-driven fields so they always survive serialisation.
@seruman Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.
See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.
Quick follow-up to the triage comment above — one clarification on the "Unresolved review comments" item: Once you believe a thread has been addressed — whether by pushing a fix, or by replying in-thread with an explanation of why the suggestion doesn't apply — please mark the thread as resolved yourself by clicking the "Resolve conversation" button at the bottom of each thread. Reviewers don't auto-close their own threads, so an addressed-but-unresolved thread reads as "still waiting on the author" and keeps the PR from moving forward. The author doing the resolve-click is the expected convention on this project.
@potiuk I think the triage tool false-flags self-reviews; I marked it as resolved. I just wanted reviewers to comment on it.
Fix is correct and the bug is real. Before merging, worth weighing against extending the existing `client_defaults` mechanism instead. The gap this PR papers over: #55849 added DAG-level schema-default exclusion without extending `client_defaults` to cover DAG-level fields. Concrete sketch:
```python
DAG_DEFAULTS = {
    "max_active_runs": ("core", "max_active_runs_per_dag"),
    "max_active_tasks": ("core", "max_active_tasks_per_dag"),
    "max_consecutive_failed_dag_runs": ("core", "max_consecutive_failed_dag_runs_per_dag"),
    "catchup": ("scheduler", "catchup_by_default"),
    "disable_bundle_versioning": ("dag_processor", "disable_bundle_versioning"),
}
```
```python
json_dict["client_defaults"] = {
    "tasks": OperatorSerialization.generate_client_defaults(),
    "dag": DagSerialization.generate_client_defaults(),
}
```
Verified locally; every scenario round-trips correctly (cfg ∈ {1, 16} × user_set ∈ {None, 1, 16, 42}).
Why this beats the carve-out:

Tradeoff: ~30-40 line diff vs the current 13. Larger surface, but it lands on the architecture already in place, and the one-time `dag_hash` change is paid either way.
```python
# the hardcoded schema default because the schema default (e.g. 16) may differ
# from the runtime config value (e.g. 1). Excluding them loses explicitly-set
# values that happen to equal the schema default.
_CONFIG_DRIVEN_FIELDS = frozenset(
```
nit: lift to module scope. As written, this frozenset is rebuilt on every `_is_excluded` call (which fires once per DAG attribute per serialization). Move it next to the other module-level constants.
```python
        "downstream_task_ids": [],
    },
    "is_paused_upon_creation": False,
    "max_active_runs": 16,
```
The literal 16 here only holds while no test has overridden `[core] max_active_runs_per_dag`. Wrap the test body in `conf_vars(...)` like `test_max_active_runs_equal_to_schema_default_not_overridden_by_conf` already does, so the assertion is self-pinning. Same for the new `test_config_driven_dag_fields_always_serialized` below -- worth pinning the cfg there too.
```diff
 )
 dag_schema_defaults = cls.get_schema_defaults("dag")
-if attrname in dag_schema_defaults:
+if attrname in dag_schema_defaults and attrname not in _CONFIG_DRIVEN_FIELDS:
```
As mentioned, I'd prefer the solution in #65310 (comment) with `client_defaults` instead.
Yes, that sounds better; that's why I wanted to point it out in #65310 (comment).
How does 10d06aa look?
Force-pushed b826307 to e7e330d
Damn, I failed to rebase 🤦 sorry for the noise. Edit: I did a bad rebase, which added a bunch of new reviewers via codeowners. I'm honestly sorry for this.
Force-pushed e7e330d to 10d06aa
Walking back my earlier suggestion. The `client_defaults` route adds roughly 50B per DAG to every serialised payload and a second defaults mechanism to maintain, which is heavier than this bug warrants. Simplest fix: keep the schema.json changes (drop the `client_defaults` machinery), so there is nothing left to carve out. The PR's original carve-out approach was actually closer to right than this rewrite. Apologies for the detour 🤦🏻‍♂️.
@kaxil yeah, that makes much more sense. I do not think 50B/DAG would be much of an issue (at least in my case), but the abstraction was feeling heavy. Had to add an exclusion list as well, otherwise Python-side defaults would diverge from the schema side. No need for an apology, TIL how Airflow serializes DAGs 🎉
@seruman — Your unresolved review thread(s) from @ephraimbuddy, @kaxil appear to have been addressed (post-review commits and/or in-thread replies on every thread, with the latest commit pushed after the most recent thread). I've added the @ephraimbuddy, @kaxil — could you take another look when you have a chance? If you agree the feedback was addressed, please mark the threads as resolved so the queue signal stays accurate. If a thread still needs work, please reply in-line — @seruman will follow up.
ephraimbuddy
left a comment
A few suggestions on the tests — nothing blocking. Two are about widening coverage so the fix doesn't regress for the sibling fields; the rest are tightening / parametrisation / docstring tweaks.
One extra point that I couldn't leave inline because the line sits outside the diff hunk:
improvement: test_dag_schema_defaults_optimization (around the for field in DagSerialization.get_schema_defaults("dag").keys() loop, ~L3831) is now weaker than its surrounding comments suggest. After this PR, that loop iterates only over the fields that still have a schema default (fail_fast, render_template_as_native_obj, callback flags). It no longer asserts anything about catchup, max_active_runs, max_active_tasks, max_consecutive_failed_dag_runs, or disable_bundle_versioning — yet the DAG above is still constructed with catchup=False, max_active_runs=16, max_active_tasks=16, max_consecutive_failed_dag_runs=0, disable_bundle_versioning=False under a comment that reads "These should match schema defaults and be excluded". That comment is now wrong, and assert deserialized_dag.max_active_runs == 16 succeeds for the opposite reason it used to (the value is on the wire now, not restored from a schema default). Minimal fix: update the comment, and add a positive for field in (...): assert field in dag_data block to lock in the new contract right next to the test that locks in the old one.
Drafted-by: Claude Code (Opus 4.7); reviewed by @ephraimbuddy before posting
When a DAG explicitly sets `max_active_runs=16` and `airflow.cfg` has `max_active_runs_per_dag = 1`, the `dag` table ends up with `1`. Setting it to `17` or any other value that isn't `16` works fine.

The serialisation optimisation from #55849 strips DAG fields that match their `schema.json` default. This works for static defaults like `catchup=False`, but `max_active_runs`, `max_active_tasks`, and `max_consecutive_failed_dag_runs` get their defaults from `airflow.cfg` at parse time, not from the schema.

When the user's explicit value happens to equal the schema default (16), it gets stripped, `LazyDeserializedDAG` returns `None`, and `collection.py` falls back to whatever `airflow.cfg` says. The fix skips the schema-default exclusion for these three config-driven fields so they always survive serialisation.
After deploying this, the first parse cycle will produce a slightly different serialised payload for every DAG (three extra int fields), which means a one-time `dag_hash` change and a new `DagVersion` for DAGs that have running task instances.

closes: #65307
related: #57604
related: #56646
Was generative AI tooling used to co-author this PR?
Generated-by: pi (Claude Opus 4.6) following the guidelines
Note
Prompted it along the lines of: when DAGs and config are configured like this, I observe this behaviour, and the rows in the metadata DB look like this; and shared my suspicion about the hard-coded default 16 in the schema so it could point me to the relevant paths. After the proposed fix and unit tests, I went over the code and tests to make sure it is correct and aligns with the rest of the project, and checked whether there are alternative solutions. Spawned Airflow with `breeze` and tested the same exact scenarios we had in the real deployment to verify the new behaviour, along with the `dag_hash` change I mentioned above.

In case of a significant change, add a newsfragment file named `{pr_number}.significant.rst` in `airflow-core/newsfragments`. You can add this file in a follow-up commit after the PR is created so you know the PR number.
breeze, tested the same exact scenarios we had in the real deployment to verify the new behaviour along with thedag_haschange I mentioned above.{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.