Fix backfill marked complete before DagRuns are created by shivaam · Pull Request #62561 · apache/airflow

shivaam · 2026-02-27T09:53:57Z

What

The scheduler's _mark_backfills_complete() prematurely marks a backfill
as completed when it runs during the window between the Backfill row
commit and the DagRun creation in _create_backfill().

closes: #61375

Why

_create_backfill() works in two steps:

1. It commits the Backfill row to the DB.
2. It creates the DagRuns.

The scheduler runs _mark_backfills_complete() every 30 seconds. If it happens to run between step 1 and step 2, it sees a backfill with no running DagRuns (because they don't exist yet) and marks it done. The DagRuns get created after, but the backfill is already completed.

How

Added an EXISTS check on the backfill_dag_run table in the completion query. Now a backfill needs at least one BackfillDagRun row before it can be marked complete. If it has zero, it means the backfill is still being set up, so we skip it.
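A minimal sketch of what the guard changes, using the standard-library sqlite3 module rather than Airflow's ORM. The three-table schema, the state names, and the helper `completable_backfills` are simplifying assumptions for illustration only. They are not Airflow's actual models or the exact query in scheduler_job_runner.py.

```python
import sqlite3

# Hypothetical, heavily simplified schema (not Airflow's real tables).
SCHEMA = """
CREATE TABLE backfill (id INTEGER PRIMARY KEY, dag_id TEXT, completed_at TEXT);
CREATE TABLE dag_run (id INTEGER PRIMARY KEY, backfill_id INTEGER, state TEXT);
CREATE TABLE backfill_dag_run (
    id INTEGER PRIMARY KEY, backfill_id INTEGER, dag_run_id INTEGER
);
"""

# Completion check without the guard: "no unfinished runs" is trivially
# true while no runs exist yet.
UNGUARDED = """
SELECT b.id FROM backfill b
WHERE b.completed_at IS NULL
  AND NOT EXISTS (
      SELECT 1 FROM dag_run dr
      WHERE dr.backfill_id = b.id AND dr.state IN ('queued', 'running')
  )
"""

# Same check plus the EXISTS guard on the association table.
GUARDED = UNGUARDED + """
  AND EXISTS (SELECT 1 FROM backfill_dag_run bdr WHERE bdr.backfill_id = b.id)
"""


def completable_backfills(conn, query):
    """Return ids of backfills the given query would mark complete."""
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Step 1: the Backfill row exists, but no DagRuns have been created yet.
conn.execute("INSERT INTO backfill (id, dag_id) VALUES (1, 'example_dag')")
during_setup_unguarded = completable_backfills(conn, UNGUARDED)
during_setup_guarded = completable_backfills(conn, GUARDED)

# Step 2: a DagRun is created and associated; it is still running.
conn.execute("INSERT INTO dag_run (id, backfill_id, state) VALUES (10, 1, 'running')")
conn.execute("INSERT INTO backfill_dag_run (backfill_id, dag_run_id) VALUES (1, 10)")
while_running = completable_backfills(conn, GUARDED)

# The run finishes, so the backfill becomes eligible for completion.
conn.execute("UPDATE dag_run SET state = 'success' WHERE id = 10")
after_finish = completable_backfills(conn, GUARDED)

print(during_setup_unguarded, during_setup_guarded, while_running, after_finish)
```

In this sketch the unguarded query selects the backfill during setup, while the guarded query skips it until an association row exists and the run has finished.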

Tests

test_mark_backfills_complete_skips_initializing_backfill: verifies that a backfill without any DagRuns is skipped, and that it is completed once its DagRuns finish. The test fails if the fix is removed.

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below): Kiro

The scheduler's _mark_backfills_complete() could mark a backfill as completed during the window between the Backfill row commit and DagRun creation. Add an EXISTS guard on BackfillDagRun so backfills still being initialized are skipped.

Removed commented-out lines for clarity.

eladkal

LGTM
will need a 2nd reviewer as this is scheduler core area

kaxil · 2026-03-03T20:37:39Z

airflow-core/src/airflow/jobs/scheduler_job_runner.py

            Backfill.completed_at.is_(None),
+            # Guard: backfill must have at least one association,
+            # otherwise it is still being set up (see #61375).
+            exists(select(BackfillDagRun.id).where(BackfillDagRun.backfill_id == Backfill.id)),


Should we fix the root cause instead? _create_backfill() does session.commit() (backfill.py L605) to persist the Backfill row, then creates BackfillDagRun/DagRun rows afterwards — that's what opens the race window. Changing that to session.flush() would still assign br.id (needed as FK for BackfillDagRun) without committing. The create_session() context manager already commits on successful exit, so all rows would be committed atomically — eliminating the race window entirely.

If the guard approach is preferred, there's an edge case worth considering: if _create_backfill fails after committing the Backfill row but before creating any BackfillDagRun rows (e.g. RuntimeError("No runs to create...") on L616), this guard means _mark_backfills_complete will never clean it up. Combined with the AlreadyRunningBackfill check, that orphaned backfill would block all future backfills for the same DAG permanently.

Good point about the edge case. I looked into the flush approach but I think it introduces a different race condition.

Before creating a backfill, we check if there are any active backfills for the same DAG and throw an error. Currently, we immediately commit the Backfill row, so a concurrent request sees it and raises AlreadyRunningBackfill. If we batch everything into one transaction with flush(), the Backfill row stays invisible to other sessions until all DagRuns are created and the final commit happens. That opens a window of seconds where a concurrent request sees zero active backfills and can create a duplicate backfill.

Check for existing backfills (airflow/airflow-core/src/airflow/models/backfill.py, lines 577 to 591 in bae2c27):

    num_active = session.scalar(
        select(func.count()).where(
            Backfill.dag_id == dag_id,
            Backfill.completed_at.is_(None),
        )
    )
    if num_active is None:
        raise UnknownActiveBackfills(dag_id)
    if num_active > 0:
        raise AlreadyRunningBackfill(
            f"Another backfill is running for Dag {dag_id}. "
            f"There can be only one running backfill per Dag."
        )

    dag = serdag.dag
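The visibility concern can be illustrated with a small sketch, assuming two independent connections to the same SQLite file stand in for two concurrent requests. The table layout and the helper `count_active` are hypothetical; the point is only that a row written but not yet committed by one connection is not counted by the other, which is what a flush-only approach would look like to a concurrent active-backfill check.

```python
import os
import sqlite3
import tempfile


def count_active(conn, dag_id):
    """Hypothetical stand-in for the active-backfill count."""
    rows = conn.execute(
        "SELECT COUNT(*) FROM backfill WHERE dag_id = ? AND completed_at IS NULL",
        (dag_id,),
    ).fetchall()
    return rows[0][0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    creator = sqlite3.connect(path)  # request that is creating the backfill
    other = sqlite3.connect(path)    # concurrent request for the same DAG

    creator.execute(
        "CREATE TABLE backfill (id INTEGER PRIMARY KEY, dag_id TEXT, completed_at TEXT)"
    )
    creator.commit()

    # Row is written inside an open transaction but not committed yet,
    # similar to flushing without committing.
    creator.execute("INSERT INTO backfill (dag_id) VALUES ('example_dag')")
    seen_before_commit = count_active(other, "example_dag")

    # Once the transaction commits, the other connection sees the row.
    creator.commit()
    seen_after_commit = count_active(other, "example_dag")

    creator.close()
    other.close()

print(seen_before_commit, seen_after_commit)
```

Here the concurrent connection counts zero active backfills until the commit, so a duplicate-backfill check running in that window would pass.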

For the guard approach, I think we can handle the orphan in two different ways:

Add an age-based cleanup in _mark_backfills_complete — backfills with zero BackfillDagRun rows older than 10 minutes get marked complete instead of being stuck forever.

In _create_backfill, if the DagRun creation fails after the Backfill row is already committed, catch the exception and mark the backfill complete immediately so it doesn't permanently block future backfills for that DAG.
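A rough sketch of the second option, assuming a simplified sqlite3 table in place of Airflow's models. The function `create_backfill`, the `create_runs` callback, and the failure message are hypothetical and only illustrate the shape of the idea: commit the row early so concurrent requests see it, then mark it complete if run creation fails.

```python
import sqlite3
from datetime import datetime, timezone


def create_backfill(conn, dag_id, create_runs):
    """Hypothetical stand-in for _create_backfill(); not Airflow's code."""
    cur = conn.execute("INSERT INTO backfill (dag_id) VALUES (?)", (dag_id,))
    backfill_id = cur.lastrowid
    conn.commit()  # the row is now visible to concurrent requests
    try:
        create_runs(conn, backfill_id)
        conn.commit()
    except Exception:
        # Discard any partial run creation, then close out the backfill
        # so it no longer counts as active for this DAG.
        conn.rollback()
        conn.execute(
            "UPDATE backfill SET completed_at = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), backfill_id),
        )
        conn.commit()
        raise
    return backfill_id


def failing_runs(conn, backfill_id):
    """Simulates a failure after the Backfill row was committed."""
    raise RuntimeError("simulated failure while creating runs")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backfill (id INTEGER PRIMARY KEY, dag_id TEXT, completed_at TEXT)"
)

try:
    create_backfill(conn, "example_dag", failing_runs)
except RuntimeError:
    pass

active = conn.execute(
    "SELECT COUNT(*) FROM backfill WHERE dag_id = 'example_dag' AND completed_at IS NULL"
).fetchone()[0]
print(active)
```

This only covers failures that raise inside the process. A hard crash between the commit and the handler would still leave an orphan, which is the case the age-based cleanup is meant for.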

I can add the age-based cleanup in this PR. Happy to also add the exception handling if you think both are worth having. Let me know what you'd prefer.

@kaxil let me know what you think.

shivaam · 2026-03-07T16:48:40Z

Is it possible to also add the Copilot reviewer to this PR?

Copilot

Pull request overview

Fixes a scheduler race where _mark_backfills_complete() could mark a Backfill as complete in the gap between the Backfill row being committed and its DagRuns being created.

Changes:

Add a scheduler-side guard requiring a Backfill to have at least one BackfillDagRun association before it can be marked complete.
Add a unit test covering the “initializing backfill” window (no associated DagRuns/BackfillDagRun rows yet), ensuring it is not prematurely completed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `airflow-core/src/airflow/jobs/scheduler_job_runner.py` | Updates the backfill completion query to require existence of at least one `BackfillDagRun` row before completing. |
| `airflow-core/tests/unit/jobs/test_scheduler_job.py` | Adds a regression test that reproduces the initialization window and asserts the scheduler skips completion until associations exist. |

Copilot · 2026-03-12T19:08:36Z

airflow-core/src/airflow/jobs/scheduler_job_runner.py

        # todo: AIP-78 simplify this function to an update statement
        query = select(Backfill).where(
            Backfill.completed_at.is_(None),
+            # Guard: backfill must have at least one association,
+            # otherwise it is still being set up (see #61375).
+            exists(select(BackfillDagRun.id).where(BackfillDagRun.backfill_id == Backfill.id)),


The new EXISTS(backfill_dag_run) guard means a Backfill that gets committed but never manages to create any BackfillDagRun rows (e.g. if _create_backfill() errors/crashes after the session.commit() at airflow/models/backfill.py:605) will never be auto-completed by the scheduler. Since _create_backfill() blocks new backfills by counting Backfill.completed_at IS NULL (airflow/models/backfill.py:577-590), this can leave a DAG permanently unable to start new backfills without manual DB cleanup. Consider adding a bounded “initializing” window (e.g., only require the association for very recent backfills) or introducing an explicit backfill state/failed marker so initialization failures don’t create stuck active backfills.

Suggested change

-        # todo: AIP-78 simplify this function to an update statement
-        query = select(Backfill).where(
-            Backfill.completed_at.is_(None),
-            # Guard: backfill must have at least one association,
-            # otherwise it is still being set up (see #61375).
-            exists(select(BackfillDagRun.id).where(BackfillDagRun.backfill_id == Backfill.id)),
+        # Treat very recent backfills with no associations as "initializing",
+        # but allow older ones without BackfillDagRun rows to be auto-completed
+        # so they don't block new backfills if initialization failed.
+        initializing_cutoff = now - timedelta(minutes=5)
+        # todo: AIP-78 simplify this function to an update statement
+        query = select(Backfill).where(
+            Backfill.completed_at.is_(None),
+            or_(
+                # Backfill has at least one association and is fully initialized.
+                exists(select(BackfillDagRun.id).where(BackfillDagRun.backfill_id == Backfill.id)),
+                # Or it is older than the initializing window; treat it as no longer initializing
+                # even if it has no BackfillDagRun rows (e.g. initialization crashed).
+                Backfill.created_at < initializing_cutoff,
+            ),
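To make the effect of the bounded window concrete, here is a small sketch in plain SQL via sqlite3. The two-table schema, the fixed `now` value, and the ISO-string timestamps are assumptions for illustration. The DagRun state condition of the real completion query is left out so only the initializing-window logic is shown.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical simplified schema; timestamps are stored as ISO-8601 strings
# in a single format so that string comparison matches time order.
SCHEMA = """
CREATE TABLE backfill (id INTEGER PRIMARY KEY, created_at TEXT, completed_at TEXT);
CREATE TABLE backfill_dag_run (id INTEGER PRIMARY KEY, backfill_id INTEGER);
"""

QUERY = """
SELECT b.id FROM backfill b
WHERE b.completed_at IS NULL
  AND (
      EXISTS (SELECT 1 FROM backfill_dag_run bdr WHERE bdr.backfill_id = b.id)
      OR b.created_at < :cutoff
  )
ORDER BY b.id
"""

now = datetime(2026, 3, 12, 12, 0, tzinfo=timezone.utc)  # fixed for repeatability
cutoff = now - timedelta(minutes=5)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO backfill (id, created_at) VALUES (?, ?)",
    [
        (1, (now - timedelta(seconds=10)).isoformat()),  # fresh, no runs: initializing
        (2, (now - timedelta(minutes=30)).isoformat()),  # old, no runs: orphaned
        (3, (now - timedelta(seconds=10)).isoformat()),  # fresh, has an association
    ],
)
conn.execute("INSERT INTO backfill_dag_run (backfill_id) VALUES (3)")

eligible = [row[0] for row in conn.execute(QUERY, {"cutoff": cutoff.isoformat()})]
print(eligible)
```

The fresh backfill without associations is skipped, while the old orphan and the associated backfill both pass the window condition.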

shivaam force-pushed the fix/backfill-race-61375 branch from dcaf372 to 3372139 Compare February 27, 2026 10:01

shivaam marked this pull request as ready for review February 27, 2026 10:11

shivaam requested review from XD-DENG and ashb as code owners February 27, 2026 10:11

eladkal added this to the Airflow 3.1.8 milestone Feb 28, 2026

eladkal added the type:bug-fix (Changelog: Bug Fixes) and backport-to-v3-1-test (Mark PR with this label to backport to v3-1-test branch) labels Feb 28, 2026

eladkal requested a review from dstandish February 28, 2026 04:55

shivaam added 2 commits February 28, 2026 13:08

Merge branch 'main' into fix/backfill-race-61375

3ef4326

Clean up comments in scheduler_job_runner.py

f450cd7

Removed commented-out lines for clarity.

vatsrahul1001 modified the milestones: Airflow 3.1.8, Airflow 3.1.9 Mar 3, 2026

eladkal approved these changes Mar 3, 2026

eladkal requested review from Lee-W, kaxil and uranusjr March 3, 2026 13:37

kaxil reviewed Mar 3, 2026

kaxil requested a review from Copilot March 12, 2026 19:03

Copilot started reviewing on behalf of kaxil March 12, 2026 19:04 View session

kaxil changed the title ~~Fix backfill marked complete before DagRuns are created (#61375)~~ Fix backfill marked complete before DagRuns are created Mar 12, 2026

Copilot AI reviewed Mar 12, 2026

shivaam wants to merge 3 commits into apache:main from shivaam:fix/backfill-race-61375

5 participants
