Support concurrent execution of the same DAG #786
Out of curiosity, what is the reason? Do we have different params? |
There are two reasons: […] |
One more query: how do you plan to save history for such parallel DAG runs? |
I don't think we need to make changes to the persistence layer or history data. Currently, it uses a […]. The biggest changes required are in the agent process and the Web UI. Currently, all functionality relies on the fact that only one agent process is running for a particular DAG at a time (e.g., interacting with a running process to get real-time updates). I've been pondering how to manage multiple agent processes to enable parallel execution of a DAG. At the moment, I plan to create a new parent process (a service) that manages a group of agent processes. |
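For illustration only, here is a minimal sketch of that parent-service idea: a supervisor that tracks agent processes keyed by (DAG name, run ID) instead of DAG name alone. The class, the run-ID scheme, and the CLI invocation are assumptions for the sketch, not dagu's actual internals.

```python
import subprocess
import uuid


class AgentSupervisor:
    """Hypothetical parent service: tracks many agent processes so that
    status queries address a specific run, not just a DAG name."""

    def __init__(self):
        self.agents = {}  # (dag_name, run_id) -> subprocess.Popen

    def start(self, dag_name: str, params: str) -> str:
        run_id = str(uuid.uuid4())
        # Illustrative spawn via the CLI; a real implementation would
        # launch the agent in-process or with its own entry point.
        proc = subprocess.Popen(["dagu", "start", f"--params={params}", dag_name])
        self.agents[(dag_name, run_id)] = proc
        return run_id

    def is_running(self, dag_name: str, run_id: str) -> bool:
        proc = self.agents[(dag_name, run_id)]
        return proc.poll() is None
```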
You may need to append some UUID-like suffix to the DAG name to uniquely identify it, and that may become your new ID on the agent-process side. On the client side, you may need to identify it using params, because I don't think we will be running the same DAG with exactly the same params. |
Currently, the DAG execution history is managed using composite keys consisting of the […]. And yes, parameters can be a way to differentiate executions for users on both the Web UI and CLI. |
One point may be to disallow users from running the same DAG with the same set of params. |
I'm not sure if it's necessary, as some common utility DAGs might get the same parameters — for example, backup scripts. |
Logically, a program or script should generate the same output for the same input. The backup scripts you mentioned run in cron, but not concurrently. And if they are running concurrently, then the source and destination given as command-line args (aka params in dagu) should be different; otherwise they may corrupt the output files being backed up. For map-reduce tasks and "for" loops, the data should be divided into chunks, with the same DAG running on different parts of the same data. |
Ah, I see. You're absolutely right. Yes, normally users should not run the same DAG with the same set of parameters. That's an excellent point. I'll try to implement a check to prevent it. Thanks for bringing it up! |
What we are trying to implement actually already exists if we invoke the common DAG as a sub-DAG with different params. |
I might be misunderstanding something, but here's the scenario I'm thinking of. Let's say we have a common DAG called `backup`, and two DAGs that call it:

```yaml
# backup_1.yaml
steps:
  - name: call_backup
    run: backup
    params: DIR=/data1
```

```yaml
# backup_2.yaml
steps:
  - name: call_backup
    run: backup
    params: DIR=/data2
```

If you try to run these two DAGs simultaneously, one of them will fail with the current implementation, even though the params are different. |
Oh, that should be implemented. Do you mean that I can't have two DAGs where the same sub-DAG is called with different arguments? That feature is very much desirable. |
Yes, that's correct. I'll work on implementing this functionality as soon as possible. |
One possible way could be to suffix a UUID generated from the params onto the DAG ID, and all the Unix processes we run should take that ID. Logically, we will be treating it as a different DAG. |
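For illustration, that params-derived suffix could be as simple as a deterministic UUID; here is a short sketch (the naming scheme is an assumption). Because the same (DAG, params) pair always maps to the same ID, this also gives a natural hook for disallowing duplicate concurrent runs with identical params, as discussed above.

```python
import uuid


def unique_dag_id(dag_name: str, params: str) -> str:
    # uuid5 is deterministic: same (dag_name, params) -> same suffix,
    # so identical concurrent runs can be detected and rejected.
    suffix = uuid.uuid5(uuid.NAMESPACE_URL, f"{dag_name}:{params}")
    return f"{dag_name}-{suffix}"


print(unique_dag_id("backup", "DIR=/data1"))
print(unique_dag_id("backup", "DIR=/data2"))  # different ID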
Oh, that sounds like a great idea for implementing it quickly and easily. I'll give it a try. |
We can't ignore the case where multiple sub-DAGs are called from a single master DAG. I think this case will also be handled by the above-mentioned approach. We can call these DAGs composable DAGs. Is there a way we can validate the set of params specified before running a DAG? That would help bring uniformity across multiple runs of the same DAG with different sets of params. I am not sure, though. |
Could you elaborate on what kind of validation you have in mind? For example, are you thinking about enforcing specific rules for allowed parameter values or ensuring that no two runs of the same DAG use conflicting parameters? |
I think when we are using common dag, it is enforced by design. So it is not required. You can ignore this. |
I can give a good example of this that I am currently using in Airflow (and just asked about this in the Discord). I have a scientific imaging application which gets new telescope data every day. I have a DAG that sweeps the incoming directory for new files and then, for each file, launches a sub-DAG composed of tasks for description, calibration, and photometry before inserting those results into a database. Run sequentially, without concurrent/parallel executions of the sub-DAG, this would take days for each day's input. If I'm running 128 of those DAGs at the same time (and I'm pretty sure I could run more in Go than I am in Airflow), this takes about 2 hours each day. Keen to see this feature implemented, since stable concurrent execution was one of the things that prevented me from using Prefect over Airflow. And Airflow has been cantankerous (whereas I'm thinking that with a little work I could integrate Dagu directly into the docker-compose of our main Go API app and massively simplify our architecture). |
There is a much simpler way to do this. This is exactly what we are doing. We create a template yaml and keep monitoring the directory, waiting for new files. Whenever new data arrives, we create a yaml from the template yaml. The yaml has a processing sequence where each process expects a uniqueID (e.g., time of acquisition); we set that uniqueID as a variable within the yaml. While creating the yaml in the dag_path from the template yaml, we set the uniqueID, and then simply invoke a POST request to start the DAG. I am pretty sure you can use this for your use case; I can help you set it up. The downside is that the processing sequence gets replicated, but that will be fixed the moment the current issue is fixed and we are able to run a sub-DAG concurrently with different uniqueIDs as arguments. We are getting around 72 to 360 datasets per day and have a processing sequence of around 70+ processes, and we have been running this for the last 8 months without fail. We are able to concurrently run multiple DAGs, all thanks to the dagu architecture. An advantage of the above-mentioned approach (creating a separate yaml for each acquisition time) is that we can track the processing status of each dataset. |
@wakatara how many DAGs are you running per day, and how compute-intensive are they? What data volume are you processing? What kind of data is the telescope providing, and what kind of corrections are you applying? If possible, can you share sample data and the correction details you are applying? |
@ghansham As mentioned, the key issue is running these files in parallel. The tasks themselves are merely API calls, though they can take quite a bit of time to calculate (I hit a number of science backend endpoints, though those are calculated completely separately and are not, at least currently, part of the application). So this is more calling APIs, getting results, and then passing those results to the next call, which gets more information. Without boring you too much on the way the sausages get made, the important pipeline function of the code looks like this (in Python tasks for Airflow... I've written them so they should be easily transferrable as Python tasks to trigger via Dagu, or Prefect, the other contender right now):

```python
import os
import random
import time

import tasks as t  # tasks.py: a separate module holding the actual task functions


def sci_backend_processing(**kwargs):
    # file = kwargs['dag_run'].conf['unprocessed_file']
    file = kwargs["file"]
    print(f"Processing file: {file}")
    scratch = t.copy_to_scratch(file)
    description = t.describe_fits(file, scratch)
    description = t.identify(file, description)
    orbit_job_id = t.object_orbit_submit(description["OBJECT"])
    time.sleep(random.uniform(25, 35))
    orbit = t.object_orbit(orbit_job_id, description)
    description = t.flight_checks(description)
    description = t.get_database_ids(description)
    description["SOURCE-FILEPATH"] = description["FITS-FILE"]
    filename = os.path.basename(description["SOURCE-FILEPATH"])
    path = f"/data/staging/datalake/{description['PDS4-LID']}/{description['ISO-DATE-LAKE']}/{description['INSTRUMENT']}/{filename}"
    description["LAKE-FILEPATH"] = path
    print(f"Description hash: {description}")

    # Calibration and ATLAS pre-calibrated override
    filepath = os.path.normpath(file).split(os.path.sep)
    if filepath[-4] == "atlas":
        calibration = t.calibrate_fits_atlas(scratch, file)
    else:
        calibration_job_id = t.calibrate_fits_submit(scratch, file)
        time.sleep(random.uniform(60, 90))
        calibration = t.calibrate_fits(calibration_job_id, description)

    # Photometry submit and retrieval
    photom_job_id = t.photometry_fits_submit(
        scratch, file, description["OBJECT"], "APERTURE"
    )
    filepath = os.path.normpath(file).split(os.path.sep)
    if filepath[-4] == "atlas":
        time.sleep(random.uniform(20, 30))
    if filepath[-5] == "gemini":
        print(
            "Gemini file: Taking longer on photometry wait due to Gemini file size photometry."
        )
        time.sleep(random.uniform(60, 90))
    photometry = t.photometry_fits(photom_job_id, description)

    ephem_job_id = t.object_ephemerides_submit(description, orbit)
    time.sleep(random.uniform(35, 45))
    ephemerides = t.object_ephemerides(ephem_job_id, description)
    orbit_coords_id = t.record_orbit_submit(description["OBJECT"], orbit)
    time.sleep(random.uniform(20, 30))
    orbit_coords = t.record_orbit(orbit_coords_id, description)
    t.database_inserts(description, calibration, photometry, ephemerides, orbit_coords)
    t.move_to_datalake(scratch, description)
```
So, does that provide a good enough overview? My key issue is that we will need to ramp this up somewhat as we get more data sources coming online and ingesting. As well, things get really interesting when we get to larger projects coming online, which will probably require streaming and such... Hope that helps. Airflow has been useful, but a real problem child, so I would love to replace it with something simpler and, as you say in your README, more dev friendly. Thanks! |
@ghansham The simpler way you allude to — I'd love to see it if you can put the code or gists up somewhere I could take a peek. I spent a number of weeks trying to get Prefect working (which is vastly simpler, but had many issues with concurrent execution in its 2.x version, mostly due to memory leaks). Keen to also get this working in Go, since I trust its concurrency model much more and, well... it's been rock solid with our APIs and task queuing to date. Most of my issues right now are Airflow. |
Can you share an equivalent sample dagu yaml first? |
I could not find the definition of the variable 't' in the above code snippet. |
@ghansham The above is the function that does the processing, to give you an idea; the actual tasks.py (which is the t in the above) is a separate Python file with the functions called for that. I can, of course, post these, but it's quite a bit of code. Nothing complex or crazy, but it's not of trivial length... =] The missing bit is the tasks.py file — lemme know if drawing a diagram or such might help if that's not clearer... |
I understand now. |
If you can identify the files you are receiving with some uniqueID (say, date/time), start creating yaml files named like this: hubble1_20250104_050000.yaml. And create the content of this yaml along these lines:
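A minimal sketch of what such a generated file might contain; the step names, commands, and `$UNIQUE_ID` plumbing below are hypothetical illustrations, not the poster's actual pipeline:

```yaml
# hubble1_20250104_050000.yaml (generated from the template)
params: UNIQUE_ID=hubble1_20250104_050000
steps:
  - name: describe
    command: python describe.py --id=$UNIQUE_ID
  - name: calibrate
    command: python calibrate.py --id=$UNIQUE_ID
    depends:
      - describe
  - name: photometry
    command: python photometry.py --id=$UNIQUE_ID
    depends:
      - calibrate
```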
You can include more steps based on your processing pipeline (the one your code wraps up as sci_backend_processing).
Next, write an ingestor program that runs as a daemon and keeps watching the input directory, waiting for new files to arrive. For every new file, it creates such a yaml from a template yaml stored in a fixed location, changing the params in the yaml based on the file you want to process, and copies it to the dag_path. Then the ingestor program can issue a POST request to start processing for that yaml. For creating POST requests that initiate dagu processing, refer to: https://dagu.readthedocs.io/en/latest/rest.html#submit-dag-action-post-api-v1-dags-name For example, using curl:
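A hedged example of such a request; the host/port and the JSON body shape are assumptions here — check the linked API doc for the exact fields your dagu version expects:

```bash
# Start the DAG defined in hubble1_20250104_050000.yaml via dagu's REST API.
curl -X POST http://localhost:8080/api/v1/dags/hubble1_20250104_050000 \
  -H "Content-Type: application/json" \
  -d '{"action": "start"}'
```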
Depending on whether you are running it as http or https, you can change the dagu URL accordingly. Just one query: how are you monitoring the input files for arrival? |
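To tie the pieces together, here is a minimal sketch of the ingestor daemon described above, using simple polling; the paths, the `{{UNIQUE_ID}}` template placeholder, and the `*.fits` file pattern are all assumptions for illustration:

```python
import time
from pathlib import Path

import requests  # third-party: pip install requests

INCOMING = Path("/data/incoming")            # where new telescope files land (assumed)
TEMPLATE = Path("/etc/dagu/template.yaml")   # template yaml with a {{UNIQUE_ID}} placeholder
DAG_PATH = Path.home() / ".dagu" / "dags"    # adjust to your configured dag_path
DAGU_URL = "http://localhost:8080/api/v1/dags"

seen: set[str] = set()
while True:
    for incoming in INCOMING.glob("*.fits"):
        if incoming.name in seen:
            continue
        seen.add(incoming.name)
        unique_id = incoming.stem  # e.g. hubble1_20250104_050000
        # Render a per-file DAG from the template and drop it into dag_path.
        dag_file = DAG_PATH / f"{unique_id}.yaml"
        dag_file.write_text(TEMPLATE.read_text().replace("{{UNIQUE_ID}}", unique_id))
        # Kick off the run via the REST API (body shape assumed, as above).
        requests.post(f"{DAGU_URL}/{unique_id}", json={"action": "start"})
    time.sleep(10)
```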