TriggerDagRunOperator does not retry when triggered DAGs time out with Kubernetes Executor #62970

@Baisang

Description

Apache Airflow version

3.1.7

If "Other Airflow 3 version" selected, which one?

No response

What happened?

There appears to be a bug in Airflow 3 where TriggerDagRunOperator does not retry for tasks run via the KubernetesExecutor. When the triggered (child) DAG times out, the TriggerDagRunOperator task fails but never retries. This only occurs with the KubernetesExecutor (i.e. the issue is not observed with the CeleryExecutor).

What you think should happen instead?

The parent task should retry when the triggered DAG times out.

How to reproduce

This is reproducible with the DAG file below, launching Airflow in kind (Kubernetes in Docker):

"""
Repro for TriggerDagRunOperator retry bug in Airflow 3.

Child DAG: sleeps 3 minutes but has dagrun_timeout of 1 minute → guaranteed timeout.
Parent DAG: triggers child with wait_for_completion=True, retries=2.

Expected: parent task fails and retries 2 times.
Observed (bug): parent task fails once and never retries.

To test: trigger `test_trigger_parent` manually and watch its `trigger_child_dag`
task. It should retry twice after the child times out.
"""

from datetime import datetime, timedelta

from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sdk import DAG

# --- Child DAG: guaranteed to time out ---

test_timeout_child = DAG(
    dag_id="test_timeout_child",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=1),
)

sleep_too_long = BashOperator(
    task_id="sleep_too_long",
    bash_command="sleep 180",
    dag=test_timeout_child,
)

# --- Parent DAG: triggers child and should retry on failure ---

test_trigger_parent = DAG(
    dag_id="test_trigger_parent",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=1,
)

TriggerDagRunOperator(
    task_id="trigger_child_dag",
    trigger_dag_id="test_timeout_child",
    wait_for_completion=True,
    poke_interval=10,
    retries=2,
    retry_delay=timedelta(seconds=30),
    dag=test_trigger_parent,
)
(Screenshot: the parent DAG run in the UI, showing the failed `trigger_child_dag` task with no retries.)

This is from a test launching Airflow 3 in kind with the KubernetesExecutor and triggering the parent DAG. Notice that the parent task has not retried at all.
The expected behavior (the parent task retrying after the child DAG fails) is observed when running Airflow locally with the CeleryExecutor.

We can confirm what is going on in the scheduler logs:

2026-03-05T07:21:28.241156Z [info     ] Received executor event with state skipped for task instance TaskInstanceKey(dag_id='test_timeout_child', task_id='sleep_too_long', run_id='manual__2026-03-05T07:20:24.039198+00:00', try_number=1, map_index=-1) [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=scheduler_job_runner.py:822

i.e. the child DAG timed out and its task was marked as skipped.

The parent run is then marked as failed:

2026-03-05T07:21:34.665690Z [info     ] Marking run <DagRun test_trigger_parent @ 2026-03-05 07:20:16+00:00: manual__2026-03-05T07:20:17.339068+00:00, state:running, queued_at: 2026-03-05 07:20:17.346119+00:00. run_type: manual> failed [airflow.models.dagrun.DagRun] loc=dagrun.py:1171

Then it looks like we finally get the executor event back for the parent task:

2026-03-05T07:21:36.245789Z [info     ] Received executor event with state failed for task instance TaskInstanceKey(dag_id='test_trigger_parent', task_id='trigger_child_dag', run_id='manual__2026-03-05T07:20:17.339068+00:00', try_number=1, map_index=-1) [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=scheduler_job_runner.py:822
2026-03-05T07:21:36.248753Z [info     ] TaskInstance Finished: dag_id=test_trigger_parent, task_id=trigger_child_dag, run_id=manual__2026-03-05T07:20:17.339068+00:00, map_index=-1, run_start_date=2026-03-05 07:20:23.231694+00:00, run_end_date=2026-03-05 07:21:34.289648+00:00, run_duration=71.057954, state=failed, executor=KubernetesExecutor(parallelism=32), executor_state=failed, try_number=1, max_tries=2, pool=default_pool, queue=default, priority_weight=1, operator=TriggerDagRunOperator, queued_dttm=2026-03-05 07:20:17.775473+00:00, scheduled_dttm=2026-03-05 07:20:17.767814+00:00,queued_by_job_id=4, pid=18 [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=scheduler_job_runner.py:868

(Note try_number=1, max_tries=2.) But it's too late: the parent run was already marked as failed, so it won't retry?
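
If that reading is right, the failure mode can be sketched with a toy model. This is hypothetical, simplified logic, not Airflow's actual scheduler code: it just illustrates that if a task-failure event only schedules a retry while the surrounding DagRun is still running, then the event ordering observed under the KubernetesExecutor makes the retry a no-op:

```python
# Hypothetical, simplified model of the suspected race -- NOT Airflow's real
# code. The KubernetesExecutor ordering delivers the task-failure event only
# after the run is already failed, so no retry is ever scheduled.
from dataclasses import dataclass


@dataclass
class TaskInstance:
    try_number: int = 1
    max_tries: int = 2
    state: str = "running"


@dataclass
class DagRun:
    state: str = "running"


def handle_task_failure(run: DagRun, ti: TaskInstance) -> None:
    """Process an executor 'failed' event for a task instance."""
    ti.state = "failed"
    # Retry is only scheduled if the surrounding run is still live
    # and tries remain (assumption about the retry condition).
    if run.state == "running" and ti.try_number <= ti.max_tries:
        ti.try_number += 1
        ti.state = "up_for_retry"


def timeline(run_failed_first: bool) -> TaskInstance:
    run, ti = DagRun(), TaskInstance()
    if run_failed_first:
        # KubernetesExecutor ordering (observed): run marked failed first,
        # executor event arrives late -> retry condition never fires.
        run.state = "failed"
        handle_task_failure(run, ti)
    else:
        # CeleryExecutor ordering (expected): event processed while the
        # run is still running -> retry is scheduled.
        handle_task_failure(run, ti)
        run.state = "failed"
    return ti


print(timeline(run_failed_first=True).state)   # → failed (no retry)
print(timeline(run_failed_first=False).state)  # → up_for_retry
```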

Operating System

Linux

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

    Labels

    area:core · kind:bug (This is clearly a bug) · needs-triage (label for new issues that we didn't triage yet)
