Fix DAG-level on_failure_callback not firing #63692

Open
Sathvik-Chowdary-Veerapaneni wants to merge 14 commits into apache:main from Sathvik-Chowdary-Veerapaneni:dag-failure-callback-not-firing-63374

@Sathvik-Chowdary-Veerapaneni Sathvik-Chowdary-Veerapaneni commented Mar 16, 2026

Fixed DAG-level on_failure_callback not firing.

When the scheduler builds a DagCallbackRequest, DagRunContext needs to read DagRun ORM relationship data. In the reported failure path, SQLAlchemy can raise exceptions other than DetachedInstanceError (for example InvalidRequestError). Because only DetachedInstanceError was handled, the callback request was never produced.

Changes:

  • Catch SQLAlchemyError while building DagRunContext and reload the DagRun from the DB before collecting context.
  • Keep the callback request using full server-provided context.
  • Keep scheduler callback logging at debug.
  • Avoid extra DAG processor bundle checks; callback fetching is already scoped by bundle.
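
A minimal sketch of the first change above, written without Airflow or SQLAlchemy installed. The exception classes are stand-ins that mirror the real hierarchy (in SQLAlchemy, both DetachedInstanceError and InvalidRequestError derive from SQLAlchemyError); FakeDagRun, collect_events, and the reload callable are hypothetical names for illustration, not the code in this PR.

```python
# Stand-ins mirroring SQLAlchemy's exception hierarchy (assumption: the real
# code would import these from sqlalchemy.exc / sqlalchemy.orm.exc instead).
class SQLAlchemyError(Exception):
    pass


class InvalidRequestError(SQLAlchemyError):
    pass


class DetachedInstanceError(SQLAlchemyError):
    pass


class FakeDagRun:
    """Hypothetical stand-in for an ORM DagRun with a lazy relationship."""

    def __init__(self, run_id, events=None, broken_with=None):
        self.run_id = run_id
        self._events = list(events or [])
        self._broken_with = broken_with

    @property
    def consumed_asset_events(self):
        # Simulate a relationship that cannot be loaded on a stale object.
        if self._broken_with is not None:
            raise self._broken_with("relationship cannot be loaded")
        return self._events


def collect_events(dag_run, reload):
    """Read relationship data; on any SQLAlchemy error, reload and retry once."""
    try:
        return list(dag_run.consumed_asset_events)
    except SQLAlchemyError:
        # Catching the base class also covers InvalidRequestError, which an
        # `except DetachedInstanceError` clause would let propagate.
        fresh = reload(dag_run.run_id)
        return list(fresh.consumed_asset_events)


stale = FakeDagRun("run_1", broken_with=InvalidRequestError)
fresh = FakeDagRun("run_1", events=["asset_event_1"])
events = collect_events(stale, reload=lambda run_id: fresh)
```

If the reloaded object also fails, the error propagates, so no callback is sent with partial context.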

Tests:

  • airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_update_state_with_handle_callback_failure
  • airflow-core/tests/unit/dag_processing/test_manager.py::TestDagFileProcessorManager::test_fetch_callbacks_ignores_other_bundles

closes #63374

boring-cyborg Bot commented Mar 16, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg Bot added area:DAG-processing area:Scheduler including HA (high availability) scheduler labels Mar 16, 2026
Previously only DetachedInstanceError was caught when accessing
consumed_asset_events on ORM DagRun objects. Other SQLAlchemy
exceptions (e.g. InvalidRequestError) crashed the scheduler.

closes: apache#63374
DagRunContext creation could crash when ORM relationship access
failed, preventing the callback from being produced entirely.
The callback is now sent with minimal context on failure.
@Sathvik-Chowdary-Veerapaneni Sathvik-Chowdary-Veerapaneni force-pushed the dag-failure-callback-not-firing-63374 branch from f49ef66 to dfd11a3 Compare March 27, 2026 22:32
@eladkal eladkal added this to the Airflow 3.2.0 milestone Mar 27, 2026
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Mar 27, 2026
@eladkal eladkal requested a review from vatsrahul1001 March 27, 2026 22:33
Comment thread airflow-core/src/airflow/callbacks/callback_requests.py Outdated
@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Apr 2, 2026
@eladkal eladkal added the backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch label Apr 3, 2026
Comment thread airflow-core/newsfragments/63692.bugfix.rst Outdated
Member

Is this related?

Member

Do we want this in info? Wondering if debug is enough.
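
For context on the info-versus-debug question, here is a small sketch using Python's standard logging module. The logger name and both messages are hypothetical, not taken from the scheduler. It shows that a debug record is dropped when the logger's effective level is INFO, which is why moving the callback request log to debug keeps it out of default scheduler output.

```python
import io
import logging

# Capture output in memory so the effect of the level is easy to inspect.
stream = io.StringIO()
log = logging.getLogger("scheduler_sketch")  # hypothetical logger name
log.propagate = False
log.addHandler(logging.StreamHandler(stream))
log.setLevel(logging.INFO)

# Dropped: DEBUG is below the logger's INFO threshold.
log.debug("Created callback request for dag_id=%s", "example_dag")
# Emitted: INFO meets the threshold.
log.info("DagRun finished for dag_id=%s", "example_dag")

output = stream.getvalue()
```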

Comment on lines +1339 to +1346
    except Exception:
        self.log.exception(
            "Failed to build DagRunContext for dag_id=%s run_id=%s; "
            "sending callback with minimal context",
            self.dag_id,
            self.run_id,
        )
        context_from_server = None
Member

When does this ever happen?

Contributor

Even if it happens, this is bad: running the callback without the full context is worse than the callback failing. Also, the exception caught here is too broad.

"this DAG processor (serving bundles: %s). Skipping.",
getattr(req, "dag_id", "unknown"),
req.bundle_name,
bundle_names,
Contributor

Any callback fetched at this point is bound to have bundle_name in bundle_names. Debugging here?
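
The point above can be illustrated with a short sketch. PendingCallback and fetch_callbacks are hypothetical stand-ins, not the DagFileProcessorManager API; the assumption is only that fetching already filters by the bundles this processor serves. Under that assumption, a second per-request bundle check after the fetch can never trigger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingCallback:
    """Hypothetical stand-in for a stored callback request."""

    dag_id: str
    bundle_name: str


def fetch_callbacks(rows, bundle_names):
    """Return only callbacks whose bundle is served by this processor."""
    served = set(bundle_names)
    return [row for row in rows if row.bundle_name in served]


rows = [
    PendingCallback("dag_a", "bundle_1"),
    PendingCallback("dag_b", "bundle_2"),
]
bundle_names = ["bundle_1"]
fetched = fetch_callbacks(rows, bundle_names)

# A guard placed after the fetch finds nothing to skip, because the
# filter above has already excluded other bundles.
skipped = [cb for cb in fetched if cb.bundle_name not in bundle_names]
```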

@vatsrahul1001
Contributor

@Sathvik-Chowdary-Veerapaneni can you address open comments?

- Removed duplicate bundle-name guard after scoped callback fetch
- Updated scheduler callback request log from info to debug
- Removed broad fallback around DagRunContext creation
- Restored test name and comments for existing bundle filtering behavior
- Removed regression test for dropped minimal-context callback fallback
@Sathvik-Chowdary-Veerapaneni
Author

Thanks for the reminder. I pushed updates addressing the open review comments by narrowing the PR back to the core DAG callback fix:

  • Removed the extra bundle-name guard in DagFileProcessorManager; fetched callbacks are already scoped by the bundle query.
  • Changed the scheduler callback request log from info to debug.
  • Removed the broad fallback around DagRunContext; callbacks now require the full server context instead of being sent with partial/minimal context.
  • Removed the fallback-specific regression test.

I also updated the PR description to remove the stale fallback/logging notes.

Verified locally in a clean worktree:

  • airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_update_state_with_handle_callback_failure
  • airflow-core/tests/unit/dag_processing/test_manager.py::TestDagFileProcessorManager::test_fetch_callbacks_ignores_other_bundles

Both passed.

Labels

  • area:DAG-processing
  • area:Scheduler: including HA (high availability) scheduler
  • backport-to-v3-2-test: Mark PR with this label to backport to v3-2-test branch
  • ready for maintainer review: Set after triaging when all criteria pass.
  • type:bug-fix: Changelog: Bug Fixes

Development

Successfully merging this pull request may close these issues.

DAG-level on_failure_callback never fires

7 participants