Vine: Prune Files in Dask Graph during Evaluation #3851
Comments
@BarrySlyDelgado I think you mentioned that the distributed storage plot may come in handy here; is that present in our repo so that Connor can try it out?
Once the logs are generated we should see what tasks are failing and what workers are failing (if any). I currently suspect that the accumulation of intermediate results may cause workers to fail. This is somewhat of a knock-on effect of TaskVine scheduling tasks to the workers that have the most bytes of data dependencies without checking whether data transfers will cause that worker to exceed its allocation.
Apologies for taking so long to get logs -- for the past few days, I haven't been able to get more than ~20 workers at a time, probably a combination of my low priority and heavy use by other users. I will keep trying to make a faithful recreation of the issue for the purposes of logs.
Hmm, does the crash occur when using the large dataset with a small number of workers? (Our first rough hypothesis would suggest that a crash is more likely with fewer workers...)
I haven't let it go long enough to get to a crash, but I can tell you the behavior where essentially countless recovery tasks are produced occurs almost immediately.
Hi @BarrySlyDelgado, it looks like I'm going to have trouble getting a faithful recreation; even backing off on the memory and disk I request, I can't get any more than 70 workers. That said, I did let a run go to completion after discussing with Doug above. There were quite a few task failures, and many recovery tasks, but this run actually managed to complete, which greatly shocked me. I've never been able to get this program to run all the way through when I was requesting 150+ workers -- could there be such a thing as too many workers? Either way, I've attached the logs from this run here in case they would be helpful.
At the time of writing, there seems to be a limited number of machines in the Condor pool, which would explain why you were not getting your requested number of workers. I'll take a look at the logs to see what I can get from them. Also, do you have the code that writes the parquet files?
Sure, here's the code that does the skimming: https://github.com/cmoore24-24/hgg/blob/main/notebooks/skims/skimmer/skimmer.py
I have some potential insights that will probably need to be addressed further. I've plotted the average task runtime (left y-axis, seconds) and the number of tasks in each category (right y-axis). Going forward, I think we should investigate what the long-running red tasks are and why the purple tasks seem to be less distributed across workers. FYI @dthain
Very interesting, the plots really help here! It would not be hard to imagine a case where the scheduler prefers to run jobs on the nodes that already have the most data, and thus prefers the same node over and over again until it is full. The scheduler certainly has access to quite a lot of information about storage utilization, but is not currently making use of it. We could do a better job there.
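To make the idea concrete, here is a toy sketch of the kind of storage-aware guard being suggested: prefer the worker that already holds the most dependency data, but skip any worker whose remaining disk allocation could not absorb the required transfers. This is not the actual TaskVine scheduler; `Worker`, `pick_worker`, and the byte-count fields are all hypothetical names for illustration.

```python
# Hypothetical sketch of a storage-aware placement guard; not TaskVine's real scheduler.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    cached_input_bytes: int   # bytes of the task's dependencies already on this worker
    disk_free_bytes: int      # remaining disk allocation on this worker

def pick_worker(workers, transfer_bytes_needed):
    """Prefer the worker holding the most dependency data, but only if the
    extra transfers would still fit inside its disk allocation."""
    eligible = [w for w in workers if w.disk_free_bytes >= transfer_bytes_needed]
    if not eligible:
        return None  # back off instead of overcommitting a nearly full worker
    return max(eligible, key=lambda w: w.cached_input_bytes)

# Example: worker "a" has the most cached data but not enough free disk,
# so the guard falls back to worker "b".
workers = [Worker("a", 50_000_000_000, 1_000_000_000),
           Worker("b", 20_000_000_000, 80_000_000_000)]
print(pick_worker(workers, transfer_bytes_needed=5_000_000_000).name)  # -> "b"
```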
(Giving Barry a break on this one while he is away for the summer.) @JinZhou5042 is going to start looking into this matter. The key issue as we understand it is that the DaskVine executor is not cleaning up intermediate files as they are consumed. We need two things to make this work:
@JinZhou5042 please set up a time to talk with @btovar to discuss DaskVine and make sure you understand what is going on there. Then, you can put in
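To illustrate the cleanup idea described above, here is a minimal, hypothetical sketch of reference-counting pruning over a Dask-style task graph: each intermediate key tracks how many downstream tasks still need it, and its file is dropped once that count reaches zero. This is not the actual DaskVine implementation; `remove_file` stands in for whatever call the executor would use to delete a file from the cluster, and all dependencies are assumed to be graph keys.

```python
# Hypothetical sketch: prune intermediate results once all consumers have run.
from collections import defaultdict

def consumers_per_key(graph):
    """graph: dask-style dict mapping key -> (func, *dependency_keys)."""
    counts = defaultdict(int)
    for task in graph.values():
        for dep in task[1:]:
            counts[dep] += 1
    return counts

def evaluate_with_pruning(graph, order, remove_file):
    """Evaluate keys in topological 'order', deleting each intermediate file
    as soon as its last consumer has finished."""
    remaining = consumers_per_key(graph)
    results = {}
    for key in order:
        func, *deps = graph[key]
        results[key] = func(*(results[d] for d in deps))
        for dep in deps:
            remaining[dep] -= 1
            if remaining[dep] == 0:   # no task still needs this intermediate
                remove_file(dep)      # e.g. tell the manager to drop the temp file
                del results[dep]
    return results                    # only keys with no consumers remain
```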
Two ways to improve it:

1. Reduce worker disconnections. While the workers are not failing because the disk fills up, several workers disconnect unexpectedly at the beginning of every run. I am assuming that some tasks are running out of worker memory, causing those disconnections. Unfortunately, the tasks involved are usually long tasks: if a task took 3000s to run and the worker disconnected, we have to spend another 3000s on the recovery task, which lengthens the total program execution time. If the worker had a mechanism to detect that a task is about to consume a substantial amount of memory, it could lower its advertised cores so the manager stops scheduling more tasks there, swap process state from memory to disk and recover it later once the memory pressure is relieved, or sacrifice some short tasks in favor of long tasks (since short tasks are cheap to recover). That could further reduce the overall execution time from 5h to 4h or even less (see the watchdog sketch after this comment).
2. Dynamic worker cores. This application has many more parallel tasks at the beginning than at the end. If we could assign more cores to the workers at the start and gradually reduce them as we go, resource waste should be reduced.
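A rough sketch of the watchdog idea in point 1, using `psutil` to watch host memory. This is hypothetical behavior, not something the TaskVine worker does today; `set_advertised_cores` is a stand-in for however the worker would tell the manager to stop sending it work.

```python
# Hypothetical worker-side memory watchdog; not current TaskVine behavior.
import time
import psutil

def memory_watchdog(set_advertised_cores, full_cores, check_interval=5,
                    high_water=0.85, low_water=0.60):
    """Shrink the cores advertised to the manager when host memory is tight,
    and restore them once the pressure is relieved."""
    throttled = False
    while True:
        used_fraction = psutil.virtual_memory().percent / 100.0
        if used_fraction > high_water and not throttled:
            set_advertised_cores(max(1, full_cores // 2))  # back off before the worker dies
            throttled = True
        elif used_fraction < low_water and throttled:
            set_advertised_cores(full_cores)               # pressure relieved, scale back up
            throttled = False
        time.sleep(check_interval)
```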
@cmoore24-24 @dthain FYI |
Also @colinthomas-z80 FYI |
@JinZhou5042 I hope you are keeping all of the logs from these runs handy, because you are putting together a nice body of evidence for a paper on the benefits/costs of using in-cluster storage. And I also like how you have expressed that we have several different issues, each of which can be addressed separately. Please go ahead and propose separate PRs for each of those little issues.
Now, I'm not sure about your comment above about workers crashing because of memory usage. What should happen is that the worker monitors the memory usage of the task as it runs (assuming the resource monitor is turned on). If the task exceeds its allocated memory, the worker should kill it and return it to the manager, without the worker itself crashing. But if the worker itself is crashing, then something else is going on. Are you able to obtain the logs of crashed workers? (This should be possible if using the --debug-workers option to the factory.)
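For reference, a minimal manager-side sketch of the behavior described above: with monitoring enabled and an explicit memory cap, an over-limit task should be killed and reported rather than taking the worker down. Method names follow the ndcctools.taskvine Python API as I understand it and may differ between cctools versions, and the command is made up, so treat this as an assumption to verify against the docs.

```python
# Sketch: enable the resource monitor and give each task an explicit memory cap,
# so a task that exceeds its allocation is killed and reported, not the worker.
# API names assumed from ndcctools.taskvine; verify against your cctools version.
import ndcctools.taskvine as vine

m = vine.Manager(9123)
m.enable_monitoring()                       # turn on per-task resource measurement/enforcement

t = vine.Task("python skim_one_chunk.py")   # hypothetical command
t.set_cores(1)
t.set_memory(4000)                          # MB; exceeding this should fail the task, not the worker
t.set_disk(8000)                            # MB of scratch space
m.submit(t)

while not m.empty():
    done = m.wait(5)
    if done:
        print(done.id, "exit code:", done.exit_code)
```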
I am not sure if I understand correctly, but there has to be a reason that several workers were disconnecting at around the same time. Those workers are clearly not filling up their disks; there might be other reasons, but the only one I can think of is that the worker is running out of memory, probably caused by some memory-intensive tasks.
I'll try to use the
Though the issue with the infinite loop of recovery-task submissions has been resolved, and we are now confident that the application can complete its execution, there remains a critical problem affecting the overall completion time: it introduces significant uncertainty into the total completion time of the entire application. In the scatter plot of per-task execution time (x axis: task id, y axis: execution time), the distribution is highly imbalanced; the majority of tasks finish in a relatively short time compared to the rest. Looking at the CDF of task execution time, 90% of tasks finish within 500s and 95% within 1300s (a small sketch of how such percentiles can be computed from per-task runtimes follows this comment). In a computing cluster, the total runtime of this application actually depends on the stability of the worker connections, which from my point of view can be roughly categorized into the following scenarios:
If we are lucky enough to have reliable workers, we may see something like the first view, where blue marks regular tasks, red marks failed tasks, and pink marks recovery tasks. It looks good because we only lose a couple of workers, we lose them in the early stages, and the failing tasks had not been running for long, so they don't extend the total runtime much. However, if we are frequently losing workers, we need notable extra time to run recovery tasks, especially the long ones (longer than 5000s). Therefore, at this stage, I think the key to improving overall performance is to reduce the failure probability of long tasks or to decrease the cost of recovering them. I think there are 2 ways to achieve this:
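(Aside: the 90%/95% numbers quoted above are just empirical quantiles of the per-task runtimes. Assuming the runtimes have already been pulled out of the logs into a file, one runtime per line, a quick way to reproduce that kind of CDF summary might look like this; the file name is hypothetical.)

```python
# Sketch: summarize the task-runtime distribution extracted from the logs.
import numpy as np

runtimes = np.loadtxt("task_runtimes.txt")   # hypothetical file: one runtime (seconds) per task

for q in (50, 90, 95, 99, 100):
    print(f"{q:3d}% of tasks finish within {np.percentile(runtimes, q):.0f}s")
```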
There are 183 distinct categories in this application (183 different kinds of functions). I plotted the task count and the average/max execution time of each category. It shows that:
This indicates that we can infer whether the task execution times across the entire graph are evenly or unevenly distributed based on the current running conditions, even without running all the tasks. If the distribution is imbalanced, we can identify which categories might contain long tasks based on the current performance of different task types (a rough sketch of that per-category analysis follows this comment), and then apply the following optimizations to those categories:
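The per-category flagging mentioned above could look something like the sketch below, assuming the (category, runtime) pairs have already been extracted from the logs into a two-column CSV; the file name and the thresholds are hypothetical. The point is only to flag categories whose worst-case runtime is far above their average, since those are where expensive recoveries are most likely.

```python
# Sketch: flag task categories likely to contain long-running (expensive-to-recover) tasks.
import csv
from collections import defaultdict

runtimes = defaultdict(list)
with open("category_runtimes.csv") as f:         # hypothetical file: category,runtime_seconds
    for category, runtime in csv.reader(f):
        runtimes[category].append(float(runtime))

for category, times in sorted(runtimes.items()):
    avg, worst = sum(times) / len(times), max(times)
    if worst > 5 * avg or worst > 5000:          # arbitrary thresholds for a "long tail" category
        print(f"{category}: count={len(times)} avg={avg:.0f}s max={worst:.0f}s  <- watch this one")
```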
The evidence that the total execution time depends on the recovery cost of long tasks, rather than on losing workers or filling up the disk: for the first run I had 100 workers and the total runtime was 33231s. For the second run I had at most 84 workers, and the total time was half of the first (14652s). The first run took less time to finish the first 90% of tasks but considerably longer to finish the tail, while the second took longer to finish the first 90% of tasks but did not take long to complete all subsequent tasks. Both of them suffered worker crashes in the earlier phases; however, the second one had more reliable worker connections later and not too many long tasks had to be recovered. I believe that is why it was twice as fast as the first one.
Issue: I am currently running into an issue when trying to run a process with Task/Dask Vine, wherein the process inevitably fails when the input dataset is too large. The symptoms are:
Use Case: Using Dask Vine to perform large calculations on QCD sample root files, and output the results into parquet files. The total size of the dataset(s) that fail is ~170 GB. I have independently worked around this issue by manually cutting the datasets into batches; doing one-quarter of the dataset at a time has been a successful workaround thus far.
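For what it's worth, the manual batching workaround described above can be scripted generically. A sketch that splits the input file list into quarters and processes one batch at a time; the directory layout and the `process_batch` call are hypothetical stand-ins for the actual skimming step.

```python
# Sketch of the batching workaround: process the ~170 GB dataset a quarter at a time.
import glob

def batches(items, n_batches):
    """Split a list into n_batches roughly equal chunks."""
    size = -(-len(items) // n_batches)            # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

root_files = sorted(glob.glob("qcd_samples/*.root"))   # hypothetical input layout
for i, batch in enumerate(batches(root_files, 4)):
    print(f"batch {i}: {len(batch)} files")
    # process_batch(batch)   # run the skimmer + parquet output on just this subset
```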
Manager Options:
I will work on recreating this scenario on my end so I can provide the vine-run-info logs.