Profiling MPI applications with Tracy can cause long spikes in non-blocking send/receive operations #966
You are using the async / fiber functionality, and the current implementation switches everything to be fully serialized in that case. Maybe this is the reason why you see this behavior?
Yes, Celerity uses the fibers API to render concurrent tasks in our runtime. However, the reproducer code does not; it only uses a single
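For context, "using the fibers API" here refers to Tracy's fiber markup. A minimal sketch of what that usage looks like, assuming a build with TRACY_FIBERS and the standard TracyFiberEnter / TracyFiberLeave macros (the fiber and zone names are made up; this is not Celerity code):

```cpp
// Minimal sketch of Tracy's fiber markup, assuming a build with TRACY_FIBERS.
// Fiber and zone names are made up for illustration; this is not Celerity code.
#include <tracy/Tracy.hpp>

void run_task_on_fiber(const char* fiber_name) {
    TracyFiberEnter(fiber_name);   // subsequent zones are attributed to this named fiber
    {
        ZoneScopedN("task_body");  // shows up on the fiber track instead of the OS thread
        // ... task work ...
    }
    TracyFiberLeave;               // switch attribution back to the real thread
}
```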
@psalz, depending on my luck, I sometimes can reproduce your numbers and sometimes cannot, both for builds with Tracy and without. I used the Komondor HPC cluster with the following submit script:
Here is the output with Tracy:
Lucky run without Tracy:
Unlucky run without Tracy:
Note that when I get huge numbers, the mean is also huge, but in your numbers it is not. So,
P.S. Cray MPI works a bit differently and even reproduces your numbers sometimes (even on a single node), but anyway it looks like you caught some warm-up iterations.
Thanks for trying to reproduce the issue! I was already pretty confident that there was a causal relationship between enabling Tracy and the spikes I am seeing, but I went ahead and confirmed this again in a larger experiment: these plots are based on the individual times for each of the 10,000 iterations on each rank, aggregated on the master node and then written out to CSV. I used 32 ranks on 32 nodes, with each rank having all 32 CPUs and 300 GB of memory exclusively allocated, and did 10 runs for each configuration, alternating between with and without Tracy. All runs were done in a single sbatch script, so the set of nodes does not change, and we can rule out any topology-related effects.
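For what it's worth, a minimal sketch of how such per-iteration timings could be gathered on the master rank and written to CSV (function, file, and column names are assumptions, not the actual benchmark code):

```cpp
// Sketch: collect each rank's per-iteration times on rank 0 and dump them to CSV.
// Names and file paths are assumptions; this is not the actual benchmark code.
#include <mpi.h>
#include <cstddef>
#include <fstream>
#include <vector>

void dump_iteration_times(const std::vector<double>& local_times_us) {
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = static_cast<int>(local_times_us.size());
    std::vector<double> all_times;
    if (rank == 0) all_times.resize(static_cast<std::size_t>(n) * size);

    // Gather every rank's per-iteration times on rank 0 (recv buffer ignored elsewhere).
    MPI_Gather(local_times_us.data(), n, MPI_DOUBLE,
               all_times.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::ofstream csv("iteration_times.csv");
        csv << "rank,iteration,time_us\n";
        for (int r = 0; r < size; ++r)
            for (int i = 0; i < n; ++i)
                csv << r << ',' << i << ','
                    << all_times[static_cast<std::size_t>(r) * n + i] << '\n';
    }
}
```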
Hmm... @psalz, what is your Tracy setup? I used this one (note that it uses code for #967):
for your example code with extra
After some more discussion with a colleague, we were wondering if the behavior could be explained by memory allocations happening inside Tracy. I just added time measurements around only the
Could it be that interactions with the Tracy API in one thread cause a memory (re-)allocation that then blocks interactions in other threads? If so, is there maybe a way I could tell Tracy to allocate a large chunk of memory in advance, instead of on demand?
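A minimal sketch of the kind of measurement described above, timing just the zone macro with std::chrono (the zone name and the reporting threshold are assumptions):

```cpp
// Sketch: measure only the cost of the Tracy zone macro itself.
// The inner scope ensures both the zone-begin and zone-end events are timed.
#include <tracy/Tracy.hpp>
#include <chrono>
#include <cstdio>

void timed_zone_probe() {
    const auto before = std::chrono::steady_clock::now();
    {
        ZoneScopedN("probe");  // construction + destruction is all that is measured
    }
    const auto after = std::chrono::steady_clock::now();
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(after - before).count();
    if (us > 100)  // arbitrary threshold for flagging a spike
        std::printf("zone spike: %lld us\n", static_cast<long long>(us));
}
```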
As far as I can see, Tracy uses rpmalloc for non-Emscripten builds, which should cache memory allocations: tracy/public/common/TracyAlloc.hpp Lines 6 to 11 in da60684
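The gist of the allocator selection referenced above amounts to roughly the following (a paraphrase for illustration, not a verbatim copy of TracyAlloc.hpp; the exact macro and path names are assumptions):

```cpp
// Paraphrase (not verbatim): on non-Emscripten builds Tracy routes its internal
// allocations through a bundled rpmalloc, which caches allocations per thread.
#if defined(TRACY_ENABLE) && !defined(__EMSCRIPTEN__)
#  include "../client/tracy_rpmalloc.hpp"
#  define TRACY_USE_RPMALLOC
#endif
```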
Welp, I'm an idiot. When I originally tried to reproduce this issue I ported the fiber mechanism from Celerity, and then upon removing it forgot to also remove the
So your initial hunch was right, @wolfpld! It's also clear now that the spikes don't have anything to do with MPI per se, but are only being "amplified" by it, because a spike on one rank causes the receive on another rank to take longer.
I ran a couple more experiments, this time also measuring the duration of
So clearly (a) the spikes are because of
The reason we are seeing more spikes in the "per-iteration" times compared to the "ZoneScopedN" times is most likely that the former also measure spikes that happen on neighboring MPI ranks.
So the remaining question is what is happening in Tracy, and whether there is anything we can do to mitigate it..?
@psalz, you need to cache the system call which gets the thread ID. Unfortunately, it happens on each message commit. Starting from here: tracy/public/client/TracyProfiler.hpp Lines 137 to 143 in da60684
Then you have two implementations of
tracy/public/client/TracyProfiler.cpp Line 1385 in da60684
tracy/public/client/TracyProfiler.cpp Line 1325 in da60684
In one case Tracy has caching, in the other Tracy makes an extra call: tracy/public/common/TracySystem.cpp Lines 61 to 99 in da60684
where the system calls happen. This could explain why you see such spikes. Hope it helps :-)
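A minimal sketch of the suggested mitigation, caching the thread ID in a thread_local variable (this is illustrative only, not Tracy's actual implementation, and uses the Linux-specific gettid()):

```cpp
// Sketch: pay for the thread-ID system call once per thread and cache the result.
// Illustrative only; Tracy's real code differs. gettid() needs glibc >= 2.30
// (older systems can use syscall(SYS_gettid) instead).
#include <unistd.h>
#include <cstdint>

inline uint32_t CachedThreadId()
{
    // The syscall runs once per thread; subsequent calls return the cached value.
    static thread_local uint32_t id = static_cast<uint32_t>(gettid());
    return id;
}
```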
We are using Tracy to profile our distributed GPU runtime system Celerity, and it's mostly working great. However, during some recent benchmarking runs on the Leonardo supercomputer we've noticed that traces often contain very long spikes for MPI non-blocking send / receive operations, with some transfers taking several thousand times longer than they should (e.g. 30ms instead of 10us, sometimes even > 100ms).
Here's an example trace for a run with 32 ranks. Notice how there are several small gaps throughout the run and a few very long ones towards the end, caused by long transfers (in the "p2p" fibers at the very bottom). The application is a simple stencil code executed over 10000 iterations, with each iteration performing exactly the same set of operations (point to point transfers between ranks, some copies as well as GPU kernel executions).
Long story short, it turns out that those spikes only happen while profiling with Tracy, and therefore seem to be due to some unfortunate interaction between the Tracy client and MPI.
What is very curious is that the gaps happen at seemingly predictable phases of the program's execution.
Here is another trace of the same application / configuration. Notice how the pattern of gaps looks very similar, although in this case the long gap towards the end is quite a bit shorter.
I've managed to create a small-ish reproducer program, in case anyone is interested:
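The reproducer itself isn't included above, but based on the description (a non-blocking send/receive per iteration, busy-waiting on MPI_Test, with a Tracy zone opened in every poll iteration) a sketch of its general shape might look like this; message size, iteration count, the zone toggle, and all names are assumptions, not the original code:

```cpp
// Sketch of the reproducer's general shape (not the original code). Each rank
// exchanges a message with a neighbour and busy-waits on MPI_Test, creating a
// Tracy zone per poll iteration. Assumes an even number of ranks.
#include <mpi.h>
#include <tracy/Tracy.hpp>
#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;  // pairwise exchange
    std::vector<char> send_buf(64 * 1024), recv_buf(64 * 1024);

    for (int i = 0; i < 10000; ++i) {
        const auto start = std::chrono::steady_clock::now();

        MPI_Request reqs[2];
        MPI_Isend(send_buf.data(), (int)send_buf.size(), MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recv_buf.data(), (int)recv_buf.size(), MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        int done_send = 0, done_recv = 0;
        while (!(done_send && done_recv)) {
#ifdef ZONE_IN_LOOP  // made-up toggle for the "with / without ZoneScopedN" runs
            ZoneScopedN("mpi_poll");  // zone created on every poll iteration
#endif
            if (!done_send) MPI_Test(&reqs[0], &done_send, MPI_STATUS_IGNORE);
            if (!done_recv) MPI_Test(&reqs[1], &done_recv, MPI_STATUS_IGNORE);
        }

        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
        if (us > 1000)
            std::printf("rank %d iteration %d took %lld us\n", rank, i, (long long)us);
    }

    MPI_Finalize();
    return 0;
}
```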
Obviously creating zones in a busy loop is not ideal, but this was the only way I could reproduce the effect in this small example. In our real application zones are submitted by different threads, including the thread that calls MPI_Test, but not for each iteration as is done here.

Here's the output when running on 32 ranks on Leonardo, with the ZoneScopedN enabled:

And here's the output without the ZoneScopedN:

I realize that this is a rather difficult issue to reproduce; I'm mainly opening it to see if anybody has any ideas as to what might be causing these spikes, or any suggestions for how to investigate this further.
One hypothesis we had was that somewhere inside MPI an OS / hardware interaction sometimes causes a thread to be scheduled out, and Tracy would get scheduled in (which could result in delays on the order of milliseconds). However, it is unlikely that this would result in a consistent gap pattern. Furthermore, we've tried explicitly setting the thread affinity for Tracy and all other application threads to ensure no overlap, but this does not seem to change anything (or at least not consistently; we've seen a couple of instances where it seemed to eliminate the gaps, but then this wasn't reproducible).
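A minimal sketch of the kind of explicit pinning used in that experiment (Linux-specific; the core numbers are assumptions, and how the Tracy-internal threads were pinned isn't specified above):

```cpp
// Sketch: pin the calling thread to a single core with pthread_setaffinity_np.
// Linux-specific; core choice is an assumption for illustration.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

static int pin_current_thread_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Returns 0 on success, an errno-style code otherwise.
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```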
Here are some additional things we've determined:
- MPI_Test et al. appear to be involved: pre-loading a dummy MPI library that replaces MPI_Isend / MPI_Irecv / MPI_Test with no-ops eliminates the gaps (see the shim sketch after this list).
- The gaps appear regardless of whether a capture client (tracy-capture) is attached or not.
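As referenced in the first item, a sketch of what such an LD_PRELOAD no-op shim could look like (an illustration, not the library actually used; the function signatures must match the installed mpi.h, and no data is actually transferred):

```cpp
// Sketch: a dummy "no-op MPI" shim injected with LD_PRELOAD so that
// MPI_Isend / MPI_Irecv / MPI_Test do nothing. Illustration only; signatures
// must match the installed mpi.h (older MPIs use non-const send buffers).
//
//   g++ -shared -fPIC -o libmpi_noop.so mpi_noop.cpp
//   LD_PRELOAD=./libmpi_noop.so mpirun -n 32 ./reproducer
#include <mpi.h>

extern "C" {

int MPI_Isend(const void*, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request* request) {
    *request = MPI_REQUEST_NULL;   // hand back a trivially completable request
    return MPI_SUCCESS;
}

int MPI_Irecv(void*, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request* request) {
    *request = MPI_REQUEST_NULL;
    return MPI_SUCCESS;
}

int MPI_Test(MPI_Request* request, int* flag, MPI_Status*) {
    *request = MPI_REQUEST_NULL;
    *flag = 1;                     // always report completion immediately
    return MPI_SUCCESS;
}

} // extern "C"
```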