Freeze next_dagrun_* for paused Dags to stop misleading API drift #66914

Open
1fanwang wants to merge 3 commits into apache:main from 1fanwang:fix/skip-calculate-dagrun-fields-for-paused-dags

@1fanwang 1fanwang commented May 14, 2026

Picking up where #66552 left off. That PR hid the drifting Next Run timestamp in three UI surfaces; #66907 (filed off the back of it) showed the same drift is still served verbatim by the REST API and is being recomputed every parse cycle on the scheduler side. This PR closes both surfaces by stopping the drift at the source.

What changes

DagModel.calculate_dagrun_date_fields now short-circuits when self.is_paused is True. The scheduler still calls it every parse cycle for every Dag (no caller-side change needed), but on a paused Dag it returns immediately without touching any field. The values therefore stay frozen at whatever they were when the Dag was paused.
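
A minimal sketch of the guard, for readers who want the shape without opening the diff. ToyDagModel and its two-argument method are hypothetical stand-ins, not the Airflow class or its real signature; only the early return on is_paused mirrors what this PR describes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ToyDagModel:
    """Hypothetical stand-in for DagModel; not the Airflow class."""

    is_paused: bool = False
    next_dagrun_logical_date: Optional[datetime] = None
    next_dagrun_run_after: Optional[datetime] = None

    def calculate_dagrun_date_fields(self, logical_date: datetime, interval: timedelta) -> None:
        # Guard described in this PR: a paused Dag keeps its existing values.
        if self.is_paused:
            return
        self.next_dagrun_logical_date = logical_date
        self.next_dagrun_run_after = logical_date + interval


day = timedelta(days=1)
dag = ToyDagModel()

# Parse cycle while unpaused populates the fields.
dag.calculate_dagrun_date_fields(datetime(2026, 1, 2, 1, tzinfo=timezone.utc), day)

# Pause, then run another parse cycle: the call returns without writing.
dag.is_paused = True
dag.calculate_dagrun_date_fields(datetime(2026, 1, 3, 1, tzinfo=timezone.utc), day)
frozen_logical_date = dag.next_dagrun_logical_date
```

Because the guard lives inside the method, callers need no knowledge of the paused state.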

The previous "fire the missed interval immediately on unpause" semantics relied on the recompute running every cycle: unpausing flipped is_paused to False, and the next parse cycle supplied a fresh value. With the drift gone, the unpause path needs an explicit nudge. The new helper DagModel.recompute_next_dagrun_fields_after_unpause(session=...) does one fresh recompute: it looks up the latest SerializedDagModel and the most recent non-manual DagRun, then delegates back to calculate_dagrun_date_fields. It is wired into the three unpause sites:

  • PATCH /api/v2/dags/{dag_id} — single-Dag unpause path
  • PATCH /api/v2/dags — bulk-unpause path (per-row, only the rows that actually transitioned)
  • airflow dags unpause CLI — _update_is_paused helper

The helper is a no-op if the Dag is still paused and a no-op if no serialized Dag exists yet (the next parse cycle will populate it).
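
The two no-op conditions above can be sketched as follows. This is an illustration under assumptions, not the helper from the diff: ToyDag, recompute_after_unpause, and the recompute callback are hypothetical names, and the callback stands in for the delegation back to calculate_dagrun_date_fields.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyDag:
    """Hypothetical stand-in for a DagModel row."""

    dag_id: str
    is_paused: bool


def recompute_after_unpause(
    dag: ToyDag,
    serialized_dag: Optional[object],
    recompute: Callable[[ToyDag, object], None],
) -> bool:
    """Run one recompute after an unpause; return True if it actually ran."""
    if dag.is_paused:
        # Still paused: leave the frozen values alone.
        return False
    if serialized_dag is None:
        # Nothing serialized yet: the next parse cycle will populate the fields.
        return False
    recompute(dag, serialized_dag)
    return True


calls: List[str] = []


def record(dag: ToyDag, serialized_dag: object) -> None:
    # Records which Dags were recomputed instead of touching real fields.
    calls.append(dag.dag_id)


still_paused = recompute_after_unpause(ToyDag("a", is_paused=True), object(), record)
not_serialized = recompute_after_unpause(ToyDag("b", is_paused=False), None, record)
recomputed = recompute_after_unpause(ToyDag("c", is_paused=False), object(), record)
```

Only the third call reaches the recompute, which is why the bulk path can invoke the helper per row without extra checks.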

End-to-end before/after evidence

/tmp/66914_realistic_drift_repro.py (also embedded in #66907) drives the real airflow.api_fastapi.app.create_app() via TestClient against a real SQLite metadata DB. The flow mirrors a production lifecycle:

  1. Dag created unpaused, first parse cycle populates next_dagrun_*.
  2. User pauses the Dag.
  3. Many parse cycles fire over the next 30 days while the Dag stays paused.
  4. User unpauses via PATCH /api/v2/dags/{id}?update_mask=is_paused.
  5. Next parse cycle after unpause.

Both runs use the same wall clocks, the same last_automated_run, and the same DAG. The only difference is whether the fix is applied. Every block below is an actual HTTP response from the real REST endpoint.
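
For anyone reproducing Step 4 by hand, the request can be built roughly like this. The base URL and Dag id are placeholders, authentication is omitted, and the {"is_paused": false} body is my assumption about the payload shape; the sketch only constructs the request and does not send it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_unpause_request(base_url: str, dag_id: str) -> Request:
    """Build (but do not send) a PATCH that unpauses a single Dag."""
    query = urlencode({"update_mask": "is_paused"})
    url = f"{base_url}/api/v2/dags/{quote(dag_id, safe='')}?{query}"
    body = json.dumps({"is_paused": False}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and Dag id; add an auth header before sending for real.
req = build_unpause_request("http://localhost:8080", "example_dag")
```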

Before (on main)

--- Step 1: first parse while unpaused  (wall clock: 2026-01-02T00:01:00+00:00) ---
{
  "is_paused": false,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 2: user pauses (no parse cycle yet)  (wall clock: 2026-01-02T00:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 3.2: parse cycle after 2d paused  (wall clock: 2026-01-04T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-03T01:00:00Z",   ← drifted +1d
  "next_dagrun_run_after":    "2026-01-04T01:00:00Z"
}

--- Step 3.5: parse cycle after 5d paused  (wall clock: 2026-01-07T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-06T01:00:00Z",   ← drifted +4d
  "next_dagrun_run_after":    "2026-01-07T01:00:00Z"
}

--- Step 3.10: parse cycle after 10d paused  (wall clock: 2026-01-12T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-11T01:00:00Z",   ← drifted +9d
  "next_dagrun_run_after":    "2026-01-12T01:00:00Z"
}

--- Step 3.30: parse cycle after 30d paused  (wall clock: 2026-02-01T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-31T01:00:00Z",   ← drifted +29d, still in past
  "next_dagrun_run_after":    "2026-02-01T01:00:00Z"
}

--- Step 4: user unpauses via REST API  (wall clock: 2026-02-01T23:01:00+00:00) ---
{
  "is_paused": false,
  "next_dagrun_logical_date": "2026-01-31T01:00:00Z",   ← drifted value persists
  "next_dagrun_run_after":    "2026-02-01T01:00:00Z"
}

--- Step 5: first parse after unpause  (wall clock: 2026-02-01T23:02:00+00:00) ---
{
  "is_paused": false,
  "next_dagrun_logical_date": "2026-01-31T01:00:00Z",
  "next_dagrun_run_after":    "2026-02-01T01:00:00Z"
}

Steps 3.2 → 3.30 show the bug: each parse cycle while paused rewrites next_dagrun_* one cron period further forward, but always strictly behind "now". Step 1's value (2026-01-02T01:00) is overwritten the very first time a parse runs against the paused Dag.
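
The drifted values in Steps 3.2 through 3.30 are consistent with "the newest daily interval that ended at or before the wall clock". The sketch below reproduces that arithmetic under the assumption of a daily 01:00 UTC schedule with catchup=False; it is not the timetable code, and it does not model Step 1, where the Dag's start presumably bounds the result.

```python
from datetime import datetime, timedelta, timezone


def latest_completed_daily_interval(now: datetime, hour: int = 1) -> tuple[datetime, datetime]:
    """Return (logical_date, run_after) for the newest daily interval ending at or before `now`."""
    run_after = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if run_after > now:
        # Today's tick has not happened yet, so fall back to yesterday's.
        run_after -= timedelta(days=1)
    return run_after - timedelta(days=1), run_after


# Wall clock from Step 3.2 in the output above.
wall_clock = datetime(2026, 1, 4, 22, 1, tzinfo=timezone.utc)
logical_date, run_after = latest_completed_daily_interval(wall_clock)

# Wall clock from Step 3.30.
late_logical_date, late_run_after = latest_completed_daily_interval(
    datetime(2026, 2, 1, 22, 1, tzinfo=timezone.utc)
)
```

Both results land before the wall clock, which is why the served value never points at a run that is still to come.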

After (this PR)

--- Step 1: first parse while unpaused  (wall clock: 2026-01-02T00:01:00+00:00) ---
{
  "is_paused": false,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 2: user pauses (no parse cycle yet)  (wall clock: 2026-01-02T00:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 3.2: parse cycle after 2d paused  (wall clock: 2026-01-04T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",   ← frozen at Step 1
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 3.5: parse cycle after 5d paused  (wall clock: 2026-01-07T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",   ← frozen
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 3.10: parse cycle after 10d paused  (wall clock: 2026-01-12T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",   ← frozen
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 3.30: parse cycle after 30d paused  (wall clock: 2026-02-01T22:01:00+00:00) ---
{
  "is_paused": true,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",   ← still frozen at Step 1
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 4: user unpauses via REST API  (wall clock: 2026-02-01T23:01:00+00:00) ---
{
  "is_paused": false,
  "next_dagrun_logical_date": "2026-01-02T01:00:00Z",
  "next_dagrun_run_after":    "2026-01-03T01:00:00Z"
}

--- Step 5: first parse after unpause  (wall clock: 2026-02-01T23:02:00+00:00) ---
{
  "is_paused": false,
  "next_dagrun_logical_date": "2026-01-31T01:00:00Z",   ← recomputed fresh
  "next_dagrun_run_after":    "2026-02-01T01:00:00Z"
}

Steps 3.2 → 3.30 all return the same frozen value the Dag had at Step 1. The drift is gone. Step 5 demonstrates the recompute path — the first parse after unpause refreshes the fields to the current wall-clock view, matching what main would have computed anyway in that same Step 5. The net behaviour from the scheduler's POV is unchanged at run-creation time; only the user-visible "Next Run" stays honest while paused.

Tests

Three new unit tests in airflow-core/tests/unit/models/test_dag.py:

  • test_calculate_dagrun_date_fields_short_circuits_when_paused — baseline while unpaused, flip is_paused=True, time-machine forward several years, assert the fields didn't move.
  • test_recompute_next_dagrun_fields_after_unpause — clear fields while paused, flip to unpaused, call the helper, assert the fields are populated.
  • test_recompute_next_dagrun_fields_after_unpause_noop_when_still_paused — call the helper on a still-paused Dag, assert no fields are touched.

The existing parametrized test_calculate_dagrun_date_fields continues to pass — is_paused defaults to False so the new short-circuit doesn't engage on the unpaused path.

3 passed, 183 deselected, 1 warning in 2.91s

Risk

Backwards-incompatible for any external consumer that today reads next_dagrun_logical_date / next_dagrun_run_after on a paused Dag and relies on it advancing each parse cycle. That advancing value is exactly the drift this PR targets: anyone treating it as a prediction of a real future run is already misled (the Dag is paused; nothing will fire). The snapshot frozen at pause time is the more honest contract: it is the run that would have fired next had the Dag not been paused.

The scheduler-side run-creation query already filters by is_paused=False, so no run will be materialized off a stale frozen value either way.
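
To illustrate why a stale frozen value is harmless here, the toy query below filters on is_paused the way the paragraph above describes. The table layout is a simplified assumption for illustration (stdlib sqlite3, ISO-8601 strings), not the Airflow schema or the scheduler's actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table; the real DagModel has many more columns.
conn.execute(
    "CREATE TABLE dag ("
    "dag_id TEXT PRIMARY KEY, "
    "is_paused INTEGER NOT NULL, "
    "next_dagrun_run_after TEXT)"
)
conn.executemany(
    "INSERT INTO dag VALUES (?, ?, ?)",
    [
        # Paused Dag carrying a frozen value from a month earlier.
        ("paused_dag", 1, "2026-01-03T01:00:00+00:00"),
        # Unpaused Dag whose run is due.
        ("active_dag", 0, "2026-02-01T01:00:00+00:00"),
    ],
)

now = "2026-02-01T22:01:00+00:00"
# ISO-8601 strings with the same offset compare correctly as text.
due = [
    row[0]
    for row in conn.execute(
        "SELECT dag_id FROM dag "
        "WHERE is_paused = 0 AND next_dagrun_run_after <= ? "
        "ORDER BY dag_id",
        (now,),
    )
]
```

The paused row is excluded even though its frozen timestamp is long past due.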

Closes #66907.

1fanwang added 3 commits May 14, 2026 10:20
calculate_dagrun_date_fields runs every parse cycle for every Dag,
including paused ones. For catchup=False timetables that means
next_dagrun_logical_date and next_dagrun_run_after advance one cron
period per cycle while staying strictly before now — visible to
external REST API consumers (CLIs, dashboards, Terraform providers)
even after UI apache#66552 hid the same value in the web view.

Short-circuit calculate_dagrun_date_fields when self.is_paused is True
so the fields stop drifting. The REST PATCH /dags/{id} (single + bulk)
and the CLI dags unpause path each call a new helper,
recompute_next_dagrun_fields_after_unpause, that re-runs the normal
recompute once when is_paused flips False — preserving the existing
fire-the-missed-interval-immediately semantics without the per-cycle
drift while paused.

Closes apache#66907

Signed-off-by: 1fanwang <1fannnw@gmail.com>
@1fanwang 1fanwang force-pushed the fix/skip-calculate-dagrun-fields-for-paused-dags branch from f63887f to ef88b7b Compare May 14, 2026 17:20
@1fanwang 1fanwang marked this pull request as ready for review May 14, 2026 17:20

Labels

area:API (Airflow's REST/HTTP API), area:CLI

Development

Successfully merging this pull request may close these issues.

Paused Dags' next_dagrun_* fields drift forward each parse cycle — affects REST API and wastes scheduler CPU
