Profiling MPI applications with Tracy can cause long spikes in non-blocking send/receive operations #966

Open
psalz opened this issue Jan 7, 2025 · 9 comments

@psalz

psalz commented Jan 7, 2025

We are using Tracy to profile our distributed GPU runtime system Celerity, and it's mostly working great. However, during some recent benchmarking runs on the Leonardo supercomputer we've noticed that traces often contain very long spikes for MPI non-blocking send / receive operations, with some transfers taking several thousand times longer than they should (e.g. 30ms instead of 10us, sometimes even > 100ms).

Here's an example trace for a run with 32 ranks. Notice how there are several small gaps throughout the run and a few very long ones towards the end, caused by long transfers (in the "p2p" fibers at the very bottom). The application is a simple stencil code executed over 10000 iterations, with each iteration performing exactly the same set of operations (point-to-point transfers between ranks, some copies, as well as GPU kernel executions).

[Trace screenshot: 32-rank run with small gaps throughout and a few long ones towards the end]

Long story short, it turns out that those spikes only happen while profiling with Tracy, and therefore seem to be due to some unfortunate interaction between the Tracy client and MPI.

What is very curious is that the gaps happen at seemingly predictable phases of the program's execution.

Here is another trace of the same application / configuration. Notice how the pattern of gaps looks very similar, although in this case the long gap towards the end is quite a bit shorter.

[Trace screenshot: second run of the same configuration, showing a similar gap pattern]

I've managed to create a small-ish reproducer program, in case anyone is interested:

#include <numeric>
#include <optional>
#include <functional>
#include <algorithm>
#include <array>
#include <vector>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <deque>

#include <mpi.h>
#include <tracy/Tracy.hpp>
#include <tracy/TracyC.h>

using clk = std::chrono::steady_clock;
using namespace std::chrono_literals;

int main(int argc, char* argv[]) {
	const size_t transfer_bytes = argc > 1 ? std::atol(argv[1]) : 16384 * 4;
	const size_t iterations = 10000;
	const size_t warmup = 100;

	int provided = -1;
	MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);

	int rank = -1;
	int size = -1;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	if(rank == 0) {
		printf("Transferring %zu bytes %zu times, %zu warmup iterations\n", transfer_bytes, iterations, warmup);
		fflush(stdout);
	}
	MPI_Barrier(MPI_COMM_WORLD);

	std::vector<unsigned char> sendbuf_up(transfer_bytes);
	std::vector<unsigned char> recvbuf_up(transfer_bytes);
	std::vector<unsigned char> sendbuf_down(transfer_bytes);
	std::vector<unsigned char> recvbuf_down(transfer_bytes);

	std::vector<clk::duration> times;
	times.reserve(iterations);

	const int up = rank > 0 ? rank - 1 : size - 1;
	const int down = rank < size - 1 ? rank + 1 : 0;
	const int tag_up = 0;
	const int tag_down = 1;
	for(size_t i = 0; i < iterations + warmup; ++i) {
		const auto before = clk::now();
		MPI_Request reqs[4];

		MPI_Irecv(recvbuf_up.data(), transfer_bytes, MPI_BYTE, up, tag_down + i, MPI_COMM_WORLD, &reqs[0]);
		MPI_Irecv(recvbuf_down.data(), transfer_bytes, MPI_BYTE, down, tag_up + i, MPI_COMM_WORLD, &reqs[1]);
		MPI_Isend(sendbuf_up.data(), transfer_bytes, MPI_BYTE, up, tag_up + i, MPI_COMM_WORLD, &reqs[2]);
		MPI_Isend(sendbuf_down.data(), transfer_bytes, MPI_BYTE, down, tag_down + i, MPI_COMM_WORLD, &reqs[3]);

		bool done[4] = {false, false, false, false};
		bool all_done = false;
		while(!all_done) {
			ZoneScopedN("make things slow"); // <--- comment out this zone to remove spikes

			all_done = true;
			for(size_t j = 0; j < 4; ++j) {
				if(done[j]) continue;
				int flag = -1;
				MPI_Test(&reqs[j], &flag, MPI_STATUS_IGNORE);
				done[j] = flag != 0;
				all_done = all_done && done[j];
			}
		}

		const auto after = clk::now();
		if(i >= warmup) {
			times.push_back(after - before);
		}
	}

	MPI_Finalize();

	const auto sum = std::accumulate(times.begin(), times.end(), clk::duration{});
	const auto min = *std::min_element(times.begin(), times.end());
	const auto max = *std::max_element(times.begin(), times.end());

	printf("Rank %2d mean: %4zuus, min: %4zuus, max: %4zuus\n", rank, sum / 1us / iterations, min / 1us, max / 1us);

	return 0;
}

Obviously creating zones in a busy loop is not ideal, but this was the only way I could reproduce the effect in this small example. In our real application zones are submitted by different threads, including the thread that calls MPI_Test, but not for each iteration as is done here.

Here's the output when running on 32 ranks on Leonardo, with the ZoneScopedN enabled:

Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  7 mean:   20us, min:   13us, max: 26901us
Rank  4 mean:   20us, min:   13us, max: 31456us
Rank 17 mean:   20us, min:   12us, max: 25444us
Rank 14 mean:   20us, min:   12us, max: 21204us
Rank 19 mean:   20us, min:   13us, max: 25211us
Rank 31 mean:   20us, min:   13us, max: 31455us
Rank 10 mean:   20us, min:   13us, max: 20106us
Rank  5 mean:   20us, min:   13us, max: 31455us
Rank  1 mean:   20us, min:   12us, max: 31458us
Rank 11 mean:   20us, min:   13us, max: 18880us
Rank 18 mean:   20us, min:   13us, max: 25231us
Rank 28 mean:   20us, min:   13us, max: 31472us
Rank  3 mean:   20us, min:   13us, max: 31455us
Rank  2 mean:   20us, min:   13us, max: 31458us
Rank 23 mean:   20us, min:   12us, max: 25189us
Rank 27 mean:   20us, min:   13us, max: 25188us
Rank  0 mean:   20us, min:   13us, max: 31460us
Rank  6 mean:   20us, min:   12us, max: 26901us
Rank 16 mean:   20us, min:   12us, max: 25554us
Rank 24 mean:   20us, min:   12us, max: 25186us
Rank  8 mean:   20us, min:   12us, max: 24177us
Rank  9 mean:   20us, min:   12us, max: 19973us
Rank 13 mean:   20us, min:   12us, max: 21199us
Rank 20 mean:   20us, min:   12us, max: 25191us
Rank 29 mean:   20us, min:   12us, max: 31467us
Rank 30 mean:   20us, min:   12us, max: 31461us
Rank 25 mean:   20us, min:   12us, max: 25186us
Rank 15 mean:   20us, min:   12us, max: 25550us
Rank 21 mean:   20us, min:   12us, max: 25186us
Rank 26 mean:   20us, min:   12us, max: 25184us
Rank 12 mean:   20us, min:   12us, max: 16532us
Rank 22 mean:   20us, min:   12us, max: 25188us

And here's the output without the ZoneScopedN:

Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  8 mean:   15us, min:   12us, max:  250us
Rank 22 mean:   15us, min:   13us, max:  216us
Rank 20 mean:   15us, min:   12us, max:  216us
Rank 24 mean:   15us, min:   12us, max:  216us
Rank 15 mean:   15us, min:   12us, max:  249us
Rank 23 mean:   15us, min:   12us, max:  216us
Rank  0 mean:   15us, min:   13us, max:  251us
Rank 12 mean:   15us, min:   13us, max:  249us
Rank 25 mean:   15us, min:   12us, max:  217us
Rank 21 mean:   15us, min:   12us, max:  216us
Rank 19 mean:   15us, min:   12us, max:  215us
Rank  5 mean:   15us, min:   13us, max:  250us
Rank 14 mean:   15us, min:   12us, max:  249us
Rank  7 mean:   15us, min:   12us, max:  251us
Rank 29 mean:   15us, min:   12us, max:  217us
Rank 26 mean:   15us, min:   12us, max:  217us
Rank 17 mean:   15us, min:   12us, max:  216us
Rank  9 mean:   15us, min:   12us, max:  250us
Rank 18 mean:   15us, min:   12us, max:  217us
Rank  2 mean:   15us, min:   12us, max:  251us
Rank 11 mean:   15us, min:   12us, max:  249us
Rank 13 mean:   15us, min:   12us, max:  249us
Rank  4 mean:   15us, min:   13us, max:  251us
Rank  6 mean:   15us, min:   13us, max:  251us
Rank 28 mean:   15us, min:   12us, max:  216us
Rank 31 mean:   15us, min:   13us, max:  216us
Rank  3 mean:   15us, min:   13us, max:  251us
Rank  1 mean:   15us, min:   12us, max:  251us
Rank 27 mean:   15us, min:   12us, max:  216us
Rank 16 mean:   15us, min:   13us, max:  250us
Rank 10 mean:   15us, min:   12us, max:  249us
Rank 30 mean:   15us, min:   12us, max:  216us

I realize that this is a rather difficult issue to reproduce; I'm mainly opening it to see if anybody has any ideas as to what might be causing these spikes, or any suggestions for how to investigate this further.

One hypothesis we had was that somewhere inside MPI an OS / hardware interaction sometimes causes a thread to be scheduled out, and Tracy would get scheduled in (which could result in delays on the order of milliseconds). However, it is unlikely that this would result in a consistent gap pattern. Furthermore, we've tried explicitly setting the thread affinity for Tracy and all other application threads to ensure no overlap, but this does not seem to change anything (or at least not consistently; we've seen a couple of instances where it seemed to eliminate the gaps, but it wasn't reproducible).
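
For reference, the pinning boils down to something like the following (a minimal Linux/pthreads sketch for the application threads; Tracy's internal worker thread cannot easily be pinned from the outside, so this only covers our own threads):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single CPU core (Linux-specific sketch).
void pin_current_thread_to_core(int core) {
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}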

Here are some additional things we've determined:

  • It only seems to happen for transfers over the actual network; shared-memory transfers within a single node are not affected.
  • It is reproducible with both OpenMPI and Intel MPI (MPICH).
  • The spikes actually happen somewhere inside calls to MPI_Test et al.; pre-loading a dummy MPI library that replaces MPI_Isend / MPI_Irecv / MPI_Test with no-ops eliminates the gaps (see the sketch after this list).
  • It does not matter whether the trace is actually being consumed (e.g. via tracy-capture) or not.
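
For illustration, such a shim can look roughly like this (a sketch rather than the exact code we used; it assumes MPI-3 signatures and is injected into the ranks via LD_PRELOAD):

// dummy_mpi.cpp -- build e.g. with: mpicxx -shared -fPIC -o libdummympi.so dummy_mpi.cpp
// and inject it into the ranks via LD_PRELOAD.
#include <mpi.h>

extern "C" int MPI_Isend(const void*, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request* request) {
	*request = MPI_REQUEST_NULL; // pretend the send completed immediately
	return MPI_SUCCESS;
}

extern "C" int MPI_Irecv(void*, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request* request) {
	*request = MPI_REQUEST_NULL; // pretend the receive completed immediately
	return MPI_SUCCESS;
}

extern "C" int MPI_Test(MPI_Request* request, int* flag, MPI_Status*) {
	*request = MPI_REQUEST_NULL;
	*flag = 1; // always report completion
	return MPI_SUCCESS;
}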
@wolfpld
Owner

wolfpld commented Jan 7, 2025

You are using the async / fiber functionality, and the current implementation switches everything to be fully serialized in that case. Maybe this is why you see this behavior?

@psalz
Author

psalz commented Jan 7, 2025

You are using the async / fiber functionality, and the current implementation switches everything to be fully serialized in that case. Maybe this is why you see this behavior?

Yes, Celerity uses the fibers API to render concurrent tasks in our runtime. However, the reproducer code does not; it only uses a single ZoneScopedN!
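
(For context, the fiber markup in Celerity boils down to something like the sketch below, with illustrative names and assuming Tracy is built with TRACY_FIBERS; zones emitted between the enter/leave calls are attributed to a named virtual fiber track rather than the calling thread.)

#include <tracy/Tracy.hpp>

static const char* const p2p_fiber = "p2p"; // fiber name must outlive the trace

void run_p2p_task() {
	TracyFiberEnter(p2p_fiber);
	{
		ZoneScopedN("receive payload");
		// ... work belonging to this logical task ...
	}
	TracyFiberLeave;
}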

@foxtran
Contributor

foxtran commented Jan 11, 2025

@psalz, depending on my luck, I sometimes can reproduce your numbers and sometimes cannot, both for builds with Tracy and without.

I used the Komondor HPC cluster with the following submit script:

#!/bin/bash

#SBATCH -A <ACCOUNT>
#SBATCH --partition=cpu
#SBATCH --job-name=TracyMPI
#SBATCH --time=0-00:10:00
#SBATCH --mem=200gb
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --output=slurm/slurm-%J.out
#SBATCH --error=slurm/slurm-%J.err

# LOAD GCC 14.1
# LOAD IntelMPI 2024.1

export PATH=.../tracy.git/capture/build:$PATH

mpirun ./TracyMPI &
tracy-capture -o tracy.${SLURM_JOBID}.${SLURM_LOCALID}.out

Here is the output with Tracy:

Connecting to 127.0.0.1:8086...Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  8 mean:   77us, min:   20us, max:  394us
Rank  0 mean:   77us, min:   20us, max:  519us
Rank  9 mean:   77us, min:   11us, max:  402us
Rank  1 mean:   77us, min:   14us, max:  500us
Rank 14 mean:   77us, min:   13us, max:  487us
Rank  2 mean:   77us, min:   15us, max:  498us
Rank 15 mean:   77us, min:   19us, max:  490us
Rank 11 mean:   77us, min:   11us, max:  408us
Rank 12 mean:   77us, min:   11us, max:  458us
Rank  3 mean:   77us, min:   15us, max:  493us
Rank 13 mean:   77us, min:   11us, max:  463us
Rank  4 mean:   77us, min:   15us, max:  423us
Rank 10 mean:   77us, min:   12us, max:  409us
Rank  5 mean:   77us, min:   17us, max:  424us
Rank  6 mean:   77us, min:   20us, max:  399us
Rank  7 mean:   77us, min:   21us, max:  394us

Queue delay: 12 ns
Timer resolution: 9 ns

Frames: 2
Time span: 1.66 s
Zones: 814,577
Elapsed time: 2.4 s
Saving trace... done!
Trace size 6281.06 KB (14.24% ratio)

Lucky run without Tracy:

Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  3 mean:   65us, min:   38us, max:  133us
Rank  0 mean:   65us, min:   23us, max:  199us
Rank  1 mean:   65us, min:   23us, max:  133us
Rank  2 mean:   65us, min:   38us, max:  133us
Rank  4 mean:   65us, min:   38us, max:  133us
Rank  5 mean:   65us, min:   41us, max:  134us
Rank  6 mean:   65us, min:   38us, max:  102us
Rank  7 mean:   65us, min:   29us, max:  148us
Rank  8 mean:   65us, min:   25us, max:  192us
Rank  9 mean:   65us, min:   20us, max:  170us
Rank 11 mean:   65us, min:   42us, max:  127us
Rank 15 mean:   65us, min:   25us, max:  152us
Rank 12 mean:   65us, min:   41us, max:  126us
Rank 14 mean:   65us, min:   41us, max:  126us
Rank 13 mean:   65us, min:   41us, max:  126us
Rank 10 mean:   65us, min:   23us, max:  127us

Unlucky run without Tracy:

Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  0 mean: 1442us, min:   19us, max: 33296us
Rank  1 mean: 1442us, min:   12us, max: 20987us
Rank  2 mean: 1442us, min:   14us, max: 20816us
Rank  3 mean: 1442us, min:   13us, max: 20654us
Rank  4 mean: 1442us, min:   14us, max: 20488us
Rank  5 mean: 1441us, min:   13us, max: 20338us
Rank  6 mean: 1440us, min:   14us, max: 20209us
Rank  7 mean: 1440us, min:   12us, max: 20091us
Rank  8 mean: 1440us, min:   12us, max: 19976us
Rank  9 mean: 1440us, min:   18us, max: 33705us
Rank 11 mean: 1440us, min:   21us, max: 19896us
Rank 12 mean: 1440us, min:   25us, max: 19965us
Rank 13 mean: 1440us, min:   23us, max: 20088us
Rank 14 mean: 1440us, min:   19us, max: 20204us
Rank 15 mean: 1442us, min:   22us, max: 20338us
Rank 16 mean: 1442us, min:   18us, max: 20531us
Rank 17 mean: 1442us, min:   19us, max: 20680us
Rank 18 mean: 1442us, min:   12us, max: 20860us
Rank 19 mean: 1442us, min:   31us, max: 33240us
Rank 10 mean: 1440us, min:   24us, max: 33659us

Note that when I get huge maximums, the mean is also huge, but in your numbers it is not.

So,

  • Try running your tasks with --cpus-per-task=2, which would give your threads more room on Leonardo (each node has 32 cores; I do not know the HT status), since Tracy has extra threads (as do the tracy-capture utility threads, if you use it). This lets you avoid overcommitting CPU cores (but I think you did that already).
  • Print out the whole times data. It looks like you have only a few spikes, which could be related to threads moving between cores.
  • Check the topology of the job (how far apart are your nodes? Are they in a single rack?).
  • Try running with Tracy and without sequentially within a single job, even multiple times, to make sure that topology and MPI traffic from other users are not the cause.

P.S. Cray MPI behaves a bit differently and sometimes even reproduces your numbers (even on a single node), but it still looks like you caught some warm-up iterations.

@psalz
Author

psalz commented Jan 14, 2025

Thanks for trying to reproduce the issue!

I was already pretty confident that there was a causal relationship between enabling Tracy and the spikes I am seeing, but I went ahead and confirmed this again in a larger experiment:

[Plots: per-iteration times over 10 runs each, with and without Tracy]

These plots are based on the individual times for each of the 10'000 iterations on each rank, aggregated on the master node and then written out to CSV. I used 32 ranks on 32 nodes, with each rank having all 32 CPUs and 300GB of memory exclusively allocated. I did 10 runs for both configurations, alternating between with and without Tracy. All runs were done in a single sbatch script, so the set of nodes does not change, and we can rule out any topology-related effects.
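
The aggregation itself is nothing fancy; it looks roughly like the sketch below (illustrative only, built on top of the reproducer above, not the exact script used for these plots): each rank converts its samples to microseconds, rank 0 gathers them and writes one CSV row per rank, before MPI_Finalize is called.

	// Runs before MPI_Finalize(), reusing `times`, `rank` and `size` from the reproducer.
	std::vector<long long> local_us(times.size());
	std::transform(times.begin(), times.end(), local_us.begin(),
		[](clk::duration d) { return static_cast<long long>(d / 1us); });

	std::vector<long long> all_us;
	if(rank == 0) all_us.resize(static_cast<size_t>(size) * local_us.size());
	MPI_Gather(local_us.data(), static_cast<int>(local_us.size()), MPI_LONG_LONG,
		all_us.data(), static_cast<int>(local_us.size()), MPI_LONG_LONG, 0, MPI_COMM_WORLD);

	if(rank == 0) {
		FILE* csv = fopen("times_us.csv", "w");
		for(int r = 0; r < size; ++r) {
			for(size_t j = 0; j < local_us.size(); ++j) {
				fprintf(csv, j == 0 ? "%lld" : ",%lld", all_us[static_cast<size_t>(r) * local_us.size() + j]);
			}
			fprintf(csv, "\n");
		}
		fclose(csv);
	}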

@foxtran
Contributor

foxtran commented Jan 14, 2025

Hmm... @psalz, what is your Tracy setup?

I used this one (note that it uses the code from #967):

if(ENABLE_TRACY)
  set(TRACY_LTO ON CACHE INTERNAL "Enable Link-Time optimization")
  include(FetchContent)

  FetchContent_Declare(tracy
    GIT_REPOSITORY https://github.com/foxtran/tracy.git
    GIT_TAG feature/lto
    GIT_SHALLOW TRUE
    GIT_PROGRESS TRUE
  )

  FetchContent_MakeAvailable(tracy)
endif()

for your example code, with an extra #include <random> to fix compilation with GCC 14.1.

@psalz
Author

psalz commented Jan 14, 2025

After some more discussion with a colleague, we were wondering if the behavior could be explained by memory allocations happening inside Tracy. I just added time measurements around only the ZoneScopedN itself, and lo and behold, it sometimes spikes to tens of milliseconds -- it could be that I was simply measuring this all the time. As I said, in our distributed runtime we don't create a separate zone for each call to MPI_Test, and I am not 100% confident that the reproducer actually models the same issue that we are seeing there.
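
Concretely, the extra measurement is roughly the following (an assumption about the exact change, shown inside the reproducer's busy loop): the zone is confined to its own scope so that only its construction and destruction are timed.

	// Inside the MPI_Test busy loop, timing only the zone itself
	// (a hypothetical `zone_times` vector collects the samples):
	const auto zone_before = clk::now();
	{
		ZoneScopedN("make things slow");
	}
	const auto zone_after = clk::now();
	zone_times.push_back(zone_after - zone_before);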

You are using the async / fiber functionality, and the current implementation switches everything to be fully serialized in that case. Maybe this is why you see this behavior?

Could it be that interactions with the Tracy API in one thread cause a memory (re-)allocation that then blocks interactions in other threads? If so, is there maybe a way I could tell Tracy to allocate a large chunk of memory in advance, instead of on-demand?

@foxtran
Contributor

foxtran commented Jan 14, 2025

As far as I can see, Tracy uses rpmalloc for non-Emscripten builds, which should cache memory allocations:

#if defined TRACY_ENABLE && !defined __EMSCRIPTEN__
# include "TracyApi.h"
# include "TracyForceInline.hpp"
# include "../client/tracy_rpmalloc.hpp"
# define TRACY_USE_RPMALLOC
#endif

@psalz
Author

psalz commented Jan 15, 2025

Welp, I'm an idiot. When I originally tried to reproduce this issue I ported the fiber mechanism from Celerity, and when I later removed it I forgot to also remove the set(TRACY_FIBERS ON) from my CMakeLists.txt. Thanks @foxtran for asking what my setup was, because otherwise I wouldn't have gone back and looked.

So your initial hunch was right @wolfpld!

It's also clear now that the spikes don't have anything to do with MPI per se, but are only being "amplified" by it, because a spike on one rank causes the receive on another rank to take longer.

I ran a couple more experiments, this time also measuring the duration of ZoneScopedN. The second run disables rpmalloc to see whether this has any effect.

[Plots: per-iteration times and ZoneScopedN durations for the additional runs, with and without rpmalloc]

So clearly (a) the spikes are caused by TRACY_FIBERS=ON, (b) they have something to do with memory allocations, and (c) rpmalloc is already doing a good job of mitigating a lot of them.

The reason we are seeing more spikes in the "per-iteration" times than in the "ZoneScopedN" times is most likely that the former also capture spikes that happen on neighboring MPI ranks.

So the remaining questions are: what exactly is happening inside Tracy, and is there anything we can do to mitigate it?

@foxtran
Contributor

foxtran commented Jan 15, 2025

@psalz, you need to cache the system call that gets the thread ID. Unfortunately, it happens on every message commit.

Starting from here:

#ifdef TRACY_FIBERS
# define TracyQueuePrepare( _type ) \
auto item = tracy::Profiler::QueueSerial(); \
tracy::MemWrite( &item->hdr.type, _type );
# define TracyQueueCommit( _name ) \
tracy::MemWrite( &item->_name.thread, tracy::GetThreadHandle() ); \
tracy::Profiler::QueueSerialFinish();

Then there are two implementations of GetThreadHandle (TracyProfiler.cpp is too big for me to clearly tell which #ifdef branch applies):

TRACY_API uint32_t GetThreadHandle() { return s_threadHandle.val; }

TRACY_API uint32_t GetThreadHandle() { return detail::GetThreadHandleImpl(); }

In one case Tracy caches the handle; in the other it makes an extra call:

TRACY_API uint32_t GetThreadHandleImpl()
{
#if defined _WIN32
static_assert( sizeof( decltype( GetCurrentThreadId() ) ) <= sizeof( uint32_t ), "Thread handle too big to fit in protocol" );
return uint32_t( GetCurrentThreadId() );
#elif defined __APPLE__
uint64_t id;
pthread_threadid_np( pthread_self(), &id );
return uint32_t( id );
#elif defined __ANDROID__
return (uint32_t)gettid();
#elif defined __linux__
return (uint32_t)syscall( SYS_gettid );
#elif defined __FreeBSD__
long id;
thr_self( &id );
return id;
#elif defined __NetBSD__
return _lwp_self();
#elif defined __DragonFly__
return lwp_gettid();
#elif defined __OpenBSD__
return getthrid();
#elif defined __QNX__
return (uint32_t) gettid();
#elif defined __EMSCRIPTEN__
// Not supported, but let it compile.
return 0;
#else
// To add support for a platform, retrieve and return the kernel thread identifier here.
//
// Note that pthread_t (as for example returned by pthread_self()) is *not* a kernel
// thread identifier. It is a pointer to a library-allocated data structure instead.
// Such pointers will be reused heavily, making the pthread_t non-unique. Additionally
// a 64-bit pointer cannot be reliably truncated to 32 bits.
#error "Unsupported platform!"
#endif
}

where the system calls happen. That could explain why you see such spikes.
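
A minimal sketch of the kind of caching this suggests (Linux-only, a hypothetical helper rather than Tracy's actual code): do the gettid syscall once per thread and reuse the result on every subsequent commit.

#include <cstdint>
#include <sys/syscall.h>
#include <unistd.h>

static inline uint32_t GetCachedThreadHandle()
{
	// thread_local: the syscall runs only on the first call from each thread.
	static thread_local const uint32_t handle = (uint32_t)syscall( SYS_gettid );
	return handle;
}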

Hope it helps :-)
