Conversation

tharittk

Rebuilding the mpi4py package is required to run pylops-mpi with CUDA-aware MPI.
In my case, on NCSA Delta, I create a new conda environment and do
module load openmpi/5.0.5+cuda
then
MPICC=/path/to/mpicc pip install --no-cache-dir --force-reinstall mpi4py
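
To double-check that the rebuilt mpi4py actually picked up the CUDA-aware Open MPI, a quick sanity check along the following lines can be run on a GPU node (a minimal sketch, not part of this PR; it assumes CuPy and two ranks with GPUs):

```python
# sanity-check sketch (not part of this PR): exercises buffered MPI comms on
# GPU buffers; it will typically fail if mpi4py was not built against a
# CUDA-aware MPI. Run with: mpirun -n 2 python check_cuda_aware.py
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # confirm which MPI library mpi4py was compiled against
    print(MPI.Get_library_version())

sendbuf = cp.full(10, rank, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.runtime.deviceSynchronize()  # make sure the GPU buffers are ready

# buffered (uppercase) send/recv straight from GPU memory
comm.Sendrecv(sendbuf, dest=(rank + 1) % size,
              recvbuf=recvbuf, source=(rank - 1) % size)
print(f"rank {rank} received {recvbuf}")
```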
And to run the tests (assuming you are already on a compute node):

module load openmpi/5.0.5+cuda
export PYLOPS_MPI_CUDA_AWARE=1
echo "TESTING **WITH** CUDA_AWARE"

echo "TEST NUMPY MPI"
export TEST_CUPY_PYLOPS=0
mpirun -n 2 pytest tests/ --with-mpi

echo "TEST CUPY MPI"
export TEST_CUPY_PYLOPS=1
mpirun -n 2 pytest tests/ --with-mpi

echo " TEST NCCL "
mpirun -n 2 pytest tests_nccl/ --with-mpi

Note:

  • allgather has not been implemented with the buffered version.
  • The NumPy + MPI case can use the buffered version regardless of whether mpi4py is built against CUDA-aware MPI or not.
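
For context on the buffered vs. non-buffered distinction in these notes: the buffered version maps to mpi4py's uppercase, buffer-based methods, while the fallback uses the lowercase, pickle-based ones. A minimal illustrative sketch of the two styles, using a NumPy array:

```python
# illustrative sketch of the two mpi4py communication styles referred to above
# run with: mpirun -n 2 python allgather_styles.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.arange(5, dtype=np.float64) + comm.Get_rank()

# buffered version: uppercase method, operates directly on the memory buffer
# (always fine for NumPy; for CuPy it requires CUDA-aware MPI)
recvbuf = np.empty(5 * comm.Get_size(), dtype=np.float64)
comm.Allgather(local, recvbuf)

# object version: lowercase method, pickles the array (host round-trip),
# which is why it still works for CuPy on non-CUDA-aware MPI builds
gathered = comm.allgather(local)

if comm.Get_rank() == 0:
    print("buffered:", recvbuf)
    print("object  :", gathered)
```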

tharittk and others added 5 commits August 17, 2025 04:07
A new DistributedMixIn class is created with the aim of simplifying and unifying
all comm. calls in both DistributedArray and operators (further hiding
away all implementation details).
mrava87 commented Sep 7, 2025

@tharittk great start!

Regarding the setup, I completely agree with the need to change the installation process for CUDA-Aware MPI. Personally, I have so far mostly relied on conda to install MPI as part of the installation of mpi4py, but it seems like this cannot be done to get CUDA-Aware MPI (see https://chatgpt.com/share/68bdf141-0658-800d-9c6c-e85aa4ab6d87); so whilst the module load ... part would change from system to system (one may be as lucky as you and find a pre-installed MPI with CUDA support, or may need to install it themselves), the second part should be universal, so we may want to add some Makefile targets for this setup 😄

Regarding the code, as I briefly mentioned offline, whilst I think this is the right way to go:

  • buffer comms for NumPy
  • have the PYLOPS_MPI_CUDA_AWARE env variable for CuPy to allow using object comms for non CUDA-Aware MPI + CuPy

I am starting to feel that the number of branches in the code is growing, and it is about time to put it all in one place... What I am mostly concerned about is that these kinds of branches will not only be present in DistributedArray but will start to permeate into operators. I had a first go at it, only with the allgather method, to give you an idea and discuss together whether you think this is a good approach before we implement it for all the other comm methods. The approach I took is two-fold:

  • create a _mpi subpackage (similar to _nccl) where all MPI methods are implemented with the various branches (what you so far had in the else branch of the _allreduce method in DistributedArray)
  • create a mixin class DistributedMixIn (in a Distributed file) where we can basically move all comm methods that are currently in DistributedArray. By doing so, operators can also inherit this class and access those methods; I used VStack as an example (a rough sketch of this layout follows below).
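
To make the two bullets above a bit more concrete, here is a rough, self-contained sketch of the kind of layout being proposed (hypothetical names and simplified logic, not the actual code in this PR):

```python
# rough sketch of the proposed split (hypothetical names, simplified logic;
# not the actual code in this PR). Run with: mpirun -n 2 python sketch.py
import os
import numpy as np
from mpi4py import MPI


# --- would live in a small _mpi subpackage (mirroring the _nccl one) -------
def mpi_allgather(base_comm, local_array, engine="numpy"):
    """Allgather choosing buffered vs object comms per backend/env variable."""
    cuda_aware = int(os.environ.get("PYLOPS_MPI_CUDA_AWARE", "0")) == 1
    if engine == "numpy" or cuda_aware:
        # buffered (uppercase) path: NumPy always, CuPy only with CUDA-aware MPI
        ncp = np
        if engine == "cupy":
            import cupy as ncp  # allocate the receive buffer on the GPU
        recv = ncp.empty(local_array.size * base_comm.Get_size(),
                         dtype=local_array.dtype)
        base_comm.Allgather(local_array, recv)
        return recv
    # object (lowercase) path: CuPy + non-CUDA-aware MPI (returns a list)
    return base_comm.allgather(local_array)


# --- would live in a Distributed module as a mixin -------------------------
class DistributedMixIn:
    """Comm methods shared by DistributedArray and operators (e.g. VStack)."""

    def _allgather(self, local_array):
        # an NCCL branch would dispatch to the _nccl helpers here
        return mpi_allgather(self.base_comm, local_array, engine=self.engine)


class ToyDistributedArray(DistributedMixIn):
    def __init__(self, local_array, engine="numpy"):
        self.base_comm = MPI.COMM_WORLD
        self.engine = engine
        self.local_array = local_array


if __name__ == "__main__":
    rank = MPI.COMM_WORLD.Get_rank()
    x = ToyDistributedArray(np.full(3, float(rank)))
    print(rank, x._allgather(x.local_array))
```

The point being that, once the branching lives in one helper, an operator only needs to inherit the mixin and call self._allgather(...) without knowing which backend is in use.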

@astroC86 we have also talked a bit about this in the context of your MatrixMult operator. Pinging you so you can follow this space; hopefully, once this PR is merged, the bar for implementing operators that support all backends (NumPy+MPI, CuPy+MPI, CuPy+NCCL) will be lowered, as one would just need to know which communication pattern they want to use and call the one from the mixin class, without worrying about the subtleties of the different backends.

mrava87 mentioned this pull request Sep 9, 2025