# AirFly: Auto Generate Airflow's `dag.py` On The Fly
Effective data pipeline management is essential for a company's data operations. Many engineering teams rely on tools like Airflow to organize various batch processing tasks, such as ETL/ELT workflows, data reporting pipelines, machine learning projects, and so on.

Airflow is a powerful tool for task scheduling and orchestration, allowing users to define workflows as "DAGs". A typical DAG represents a data pipeline and includes a series of tasks along with their dependencies.

As a pipeline grows in complexity, more tasks and interdependencies are added, which often leads to confusion, disrupts the structure of the DAG, and lowers code quality, making the pipeline harder to maintain and update, especially in collaborative environments.

`airfly` aims to alleviate such challenges and streamline the development lifecycle. It assumes that tasks are encapsulated in a specific structure, with dependencies defined as part of the task attributes. During deployment, `airfly` recursively collects all tasks, resolves the dependency tree, and automatically generates the DAG.
* `dag.py` Automation: focus on your tasks and let airfly handle the rest.
* No Airflow Installation Required: keep your environment lean without the need for Airflow.
* Task Group Support: a nice feature from Airflow 2.0+.
* Duck Typing Support: flexible class inheritance for greater customization.
## Install
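The installation instructions are abridged in this excerpt. Assuming the package is published on PyPI under the name `airfly` (an assumption, not confirmed by this excerpt), installation would be:

```sh
pip install airfly
```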
## How It Works
`airfly` assumes that tasks are defined within a Python module (or package, such as `main_dag` in the example below). Each task holds attributes corresponding to an Airflow operator, and the dependencies are declared by assigning `upstream` or `downstream`. As `airfly` walks through the module, it discovers and collects all tasks, resolves the dependency tree, and generates the `DAG` in Python code, which can then be saved as `dag.py`.
```sh
main_dag
...
```
### Define your task with `AirFly`
Declare a task as follows (see [demo](https://github.com/ryanchao2012/airfly/blob/main/examples/tutorial/demo.py)):
```python
# in demo.py
from airfly.model import AirFly


class print_date(AirFly):
    op_class = "BashOperator"
    op_params = dict(bash_command="date")


# During DAG generation,
# this class will be auto-converted to the following code:
```
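The emitted code itself is not shown in this excerpt; conceptually, the class above maps to a single operator call, roughly like the sketch below (illustrative only, not the exact generated output):

```python
# Sketch: the class name becomes the default task_id and op_params
# are forwarded to the operator as keyword arguments.
print_date = BashOperator(task_id="print_date", bash_command="date")
```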
* `op_class (str)`: specifies the Airflow operator for this task.
* `op_params`: keyword arguments passed to the operator (`op_class`); a parameter (i.e., a value in the dictionary) can be one of the [primitive types](https://docs.python.org/3/library/stdtypes.html), a function, or a class.
You can also define the attributes using `property`:
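The original example is omitted from this excerpt; a minimal sketch of the idea, using a hypothetical `print_uptime` task, could look like this:

```python
from airfly.model import AirFly


class print_uptime(AirFly):
    op_class = "BashOperator"

    @property
    def op_params(self):
        # Computed at collection time instead of a plain class attribute
        return dict(bash_command="uptime")
```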
By default, the class name (`print_date`) is used as the `task_id` for the applied operator after DAG generation. You can change this behavior by overriding `_get_taskid` as a class method. Make sure that the `task_id` is globally unique:
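The override example is also omitted here; a hedged sketch of what it might look like:

```python
class print_date(AirFly):
    op_class = "BashOperator"
    op_params = dict(bash_command="date")

    @classmethod
    def _get_taskid(cls):
        # e.g., prefix the class name with its module to keep ids unique
        return f"{cls.__module__}.{cls.__name__}"
```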
An abridged view of the generated `dag.py`:

```python
with DAG("demo_dag", **dag_kwargs) as dag:
    ...
```
As you can see, `airfly` wraps the required information, including variables and import dependencies, into the output code and passes the specified values to the `DAG` object.
## Exclude tasks from codegen
Pass `--exclude-pattern` to match any unwanted objects by their `__qualname__` and filter them out.
The regenerated output (abridged):

```python
with DAG("demo_dag") as dag:
    ...
```
The `templated` task is gone.
### Operators Support
#### Built-in Operators
Operators defined in the official Airflow package, such as `BashOperator`, `PythonOperator`, and `KubernetesPodOperator`, are considered built-in, including those contributed by the community through various providers (e.g., Google, Facebook, OpenAI).
To use a built-in operator, assign `op_class` to its name and specify corresponding parameters using `op_params`:
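The original snippet is omitted from this excerpt; the sketch below (task names, parameters, and the dependency form are hypothetical) follows the convention described above:

```python
from airfly.model import AirFly


def say_hello():
    print("hello from airfly")


class Task1(AirFly):
    # Refer to the built-in operator by its class name
    op_class = "PythonOperator"
    op_params = dict(python_callable=say_hello)


class Task2(AirFly):
    op_class = "BashOperator"
    op_params = dict(bash_command="date")
    # Dependencies are declared on the task itself
    upstream = Task1
```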
Sometimes, operators may have a naming ambiguity. For instance, `EmailOperator` could refer to either [`airflow.operators.email.EmailOperator`](https://github.com/apache/airflow/blob/2.10.4/airflow/operators/email.py#L29) or [`airflow.providers.smtp.operators.smtp.EmailOperator`](https://github.com/apache/airflow/blob/2.10.4/airflow/providers/smtp/operators/smtp.py#L29). To resolve such ambiguities, specify the correct module using `op_module`:
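For example, a sketch consistent with the description (the exact snippet is omitted from this excerpt; the parameters are hypothetical):

```python
from airfly.model import AirFly


class Task3(AirFly):
    op_class = "EmailOperator"
    # Point to the intended module explicitly to resolve the ambiguity
    op_module = "airflow.providers.smtp.operators.smtp"
    op_params = dict(
        to="team@example.com",
        subject="Pipeline finished",
        html_content="All tasks succeeded.",
    )
```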
This approach ensures that `Task3` explicitly references the `EmailOperator` from the `airflow.providers.smtp.operators.smtp` module, avoiding conflicts with similarly named operators.
#### Private Operators
Operators not included in the official Airflow package are considered private. Developers often create custom operators by extending existing built-in ones to meet their use cases. Since these custom operators are not registered within Airflow, `airfly` cannot automatically infer them by name.
To use a private operator, provide its class definition directly in `op_class`:
```python
# in my_package/operators.py
from airflow.operators.bash import BashOperator


class EchoOperator(BashOperator):

    def __init__(self, text: str, **kwargs):
        cmd = f"echo {text}"
        super().__init__(bash_command=cmd, **kwargs)


# in my_package/tasks.py
from airfly.model import AirFly

from my_package.operators import EchoOperator


class Task4(AirFly):
    op_class = EchoOperator
    op_params = dict(text="Hello World")
```
This approach enables seamless integration of private, custom-built operators with `airfly`.
### Task Group
`airfly` defines `TaskGroup` in the DAG context and assigns `task_group` to each operator for you.
It maps the module hierarchy to the nested group structure, so tasks defined in the same Python module are grouped together.
If you don't like this feature, pass `--task-group`/`-g` with `0` to disable it.
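As an illustration (the module names and the generated code below are hypothetical, not the tool's verbatim output), a package containing `main_dag/ingest.py` and `main_dag/report.py` would yield groups along these lines:

```python
# Sketch using the Airflow 2.x TaskGroup API
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG("demo_dag") as dag:
    ingest_group = TaskGroup(group_id="ingest")    # from main_dag/ingest.py
    report_group = TaskGroup(group_id="report")    # from main_dag/report.py

    # Each operator is created with task_group set to its module's group
    pull = BashOperator(task_id="pull", bash_command="echo pull", task_group=ingest_group)
    publish = BashOperator(task_id="publish", bash_command="echo publish", task_group=report_group)
```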