Description
Apache Airflow version
3.1.7
If "Other Airflow 3 version" selected, which one?
No response
What happened?
There appears to be a bug in Airflow 3 where `TriggerDagRunOperator` does not retry when run via the Kubernetes executor. When the triggered (child) DAG times out, the parent `TriggerDagRunOperator` task fails once and never retries. This only occurs with the Kubernetes executor (the issue is not observed with the Celery executor).
What you think should happen instead?
The parent task should retry, up to its configured `retries`, when the triggered DAG times out.
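For reference, a minimal sketch of the expected retry accounting (illustrative only, not Airflow internals): a task configured with `retries=2` gets `max_tries=2` and should be attempted up to three times before being finally failed.

```python
# Illustrative sketch, not Airflow code: with retries=2, a task that always
# fails should be attempted retries + 1 = 3 times before giving up.
def run_with_retries(task, retries):
    """Call task() up to retries + 1 times; return (succeeded, attempts)."""
    for attempt in range(1, retries + 2):
        try:
            task()
            return True, attempt
        except Exception:
            if attempt == retries + 1:
                return False, attempt


def always_times_out():
    raise TimeoutError("child DAG exceeded dagrun_timeout")


ok, attempts = run_with_retries(always_times_out, retries=2)
print(ok, attempts)  # expected: False 3 — the bug report shows only 1 attempt
```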
How to reproduce
This is reproducible with a DAG like the following, launching Airflow in kind (Kubernetes in Docker):

```python
"""
Repro for TriggerDagRunOperator retry bug in Airflow 3.

Child DAG: sleeps 3 minutes but has dagrun_timeout of 1 minute → guaranteed timeout.
Parent DAG: triggers child with wait_for_completion=True, retries=2.

Expected: parent task fails and retries 2 times.
Observed (bug): parent task fails once and never retries.

To test: trigger `test_trigger_parent` manually and watch the parent DAG's
`trigger_child_dag` task. It should retry twice after the child times out.
"""
from datetime import datetime, timedelta

from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sdk import DAG

# --- Child DAG: guaranteed to time out ---
test_timeout_child = DAG(
    dag_id="test_timeout_child",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=1),
)

sleep_too_long = BashOperator(
    task_id="sleep_too_long",
    bash_command="sleep 180",
    dag=test_timeout_child,
)

# --- Parent DAG: triggers child and should retry on failure ---
test_trigger_parent = DAG(
    dag_id="test_trigger_parent",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=1,
)

TriggerDagRunOperator(
    task_id="trigger_child_dag",
    trigger_dag_id="test_timeout_child",
    wait_for_completion=True,
    poke_interval=10,
    retries=2,
    retry_delay=timedelta(seconds=30),
    dag=test_trigger_parent,
)
```
This was tested by launching Airflow 3 in kind with the Kubernetes executor and triggering the parent DAG. The parent task does not retry at all.
The expected behavior (the parent task retrying after the child DAG times out) is observed when running Airflow locally with the CeleryExecutor.
We can confirm what is going on in the logs:
```
2026-03-05T07:21:28.241156Z [info ] Received executor event with state skipped for task instance TaskInstanceKey(dag_id='test_timeout_child', task_id='sleep_too_long', run_id='manual__2026-03-05T07:20:24.039198+00:00', try_number=1, map_index=-1) [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=scheduler_job_runner.py:822
```
i.e. the child DAG timed out and its task was marked as skipped.
The parent run is then marked as failed:
```
2026-03-05T07:21:34.665690Z [info ] Marking run <DagRun test_trigger_parent @ 2026-03-05 07:20:16+00:00: manual__2026-03-05T07:20:17.339068+00:00, state:running, queued_at: 2026-03-05 07:20:17.346119+00:00. run_type: manual> failed [airflow.models.dagrun.DagRun] loc=dagrun.py:1171
```
Then it looks like we finally get a report back from the parent's trigger task:
```
2026-03-05T07:21:36.245789Z [info ] Received executor event with state failed for task instance TaskInstanceKey(dag_id='test_trigger_parent', task_id='trigger_child_dag', run_id='manual__2026-03-05T07:20:17.339068+00:00', try_number=1, map_index=-1) [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=scheduler_job_runner.py:822
2026-03-05T07:21:36.248753Z [info ] TaskInstance Finished: dag_id=test_trigger_parent, task_id=trigger_child_dag, run_id=manual__2026-03-05T07:20:17.339068+00:00, map_index=-1, run_start_date=2026-03-05 07:20:23.231694+00:00, run_end_date=2026-03-05 07:21:34.289648+00:00, run_duration=71.057954, state=failed, executor=KubernetesExecutor(parallelism=32), executor_state=failed, try_number=1, max_tries=2, pool=default_pool, queue=default, priority_weight=1, operator=TriggerDagRunOperator, queued_dttm=2026-03-05 07:20:17.775473+00:00, scheduled_dttm=2026-03-05 07:20:17.767814+00:00,queued_by_job_id=4, pid=18 [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=scheduler_job_runner.py:868
```
(Note `try_number=1`, `max_tries=2`.) But it is too late: the parent run was already marked as failed, so the task does not retry.
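A minimal model of the suspected race (purely a sketch of the hypothesis above; the names and retry check are illustrative, not Airflow internals): the scheduler finalizes the parent DagRun as failed before the executor's failed event for `trigger_child_dag` is processed, and a task is only eligible for retry while its run is still active.

```python
# Hypothetical sketch of the suspected race; names are illustrative,
# not Airflow internals.
RUNNING, FAILED, UP_FOR_RETRY = "running", "failed", "up_for_retry"


class DagRun:
    def __init__(self):
        self.state = RUNNING


class TaskInstance:
    def __init__(self, dag_run, max_tries):
        self.dag_run = dag_run
        self.try_number = 1
        self.max_tries = max_tries
        self.state = RUNNING


def handle_executor_failure(ti):
    """Retry only if tries remain AND the run is still active (assumption)."""
    if ti.try_number <= ti.max_tries and ti.dag_run.state == RUNNING:
        ti.state = UP_FOR_RETRY
        ti.try_number += 1
    else:
        ti.state = FAILED


run = DagRun()
ti = TaskInstance(run, max_tries=2)

# Observed ordering with the Kubernetes executor: the run is failed first...
run.state = FAILED
# ...and only then does the executor's "failed" event for the task arrive.
handle_executor_failure(ti)
print(ti.state)  # failed — no retry, matching the reported behavior
```

With the orderings swapped (event processed while the run is still running), the same check would put the task in `up_for_retry`, which matches what is seen under the CeleryExecutor.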
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct