These are standalone HIP examples that use ROCm-aware Open MPI and UCX to implement collective communication operations. They are written to provide an in-depth understanding of current communication libraries and to minimize layering over existing libraries when invoking collective operations intra-node and inter-node. With these examples, one can inspect the underlying software-stack invocations, such as HIP API usage, as well as how execution progresses from software to hardware (GPUs, network interconnects, etc.) while the collective operations run. The goal is to integrate this functionality, or take learnings from these examples, and implement efficient intra-node/inter-node support in the IREE frontend/compiler/runtime stages.
The goal of these examples is to overlap communication with computation and to perform GPU-initiated communication without involving the host or explicit device-buffer transfers. They achieve this by setting up a symmetric heap, mapping HIP device buffers through that heap, passing pointers across devices, and referencing cross-device buffers directly from kernels for get/put operations. The examples cover GPU-kernel-initiated get/put as well as collective operations such as all-reduce for intra-node GPUs.
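As a rough sketch of this idea (illustrative only: it uses HIP IPC handle exchange rather than the examples' symmetric heap, assumes one GPU per MPI rank and at least two ranks, and all names are hypothetical), a kernel on one rank can write directly into another rank's device buffer:

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>
#include <vector>

// GPU-initiated "put": each thread writes a value straight into the
// peer rank's device buffer, with no host involvement.
__global__ void put_kernel(float* peer_dst, float val, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) peer_dst[i] = val;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  hipSetDevice(rank);  // assumes one visible GPU per rank

  const int n = 256;
  float* buf = nullptr;
  hipMalloc(&buf, n * sizeof(float));

  // Export this rank's buffer and gather every rank's IPC handle.
  hipIpcMemHandle_t mine;
  hipIpcGetMemHandle(&mine, buf);
  std::vector<hipIpcMemHandle_t> handles(size);
  MPI_Allgather(&mine, sizeof(mine), MPI_BYTE,
                handles.data(), sizeof(mine), MPI_BYTE, MPI_COMM_WORLD);

  // Map the next rank's buffer into this process's address space.
  int peer = (rank + 1) % size;
  float* peer_buf = nullptr;
  hipIpcOpenMemHandle(reinterpret_cast<void**>(&peer_buf), handles[peer],
                      hipIpcMemLazyEnablePeerAccess);

  // The kernel dereferences the mapped pointer directly: a device-side put.
  hipLaunchKernelGGL(put_kernel, dim3(1), dim3(n), 0, 0,
                     peer_buf, static_cast<float>(rank), n);
  hipDeviceSynchronize();
  MPI_Barrier(MPI_COMM_WORLD);  // ensure all puts landed before teardown

  hipIpcCloseMemHandle(peer_buf);
  hipFree(buf);
  MPI_Finalize();
  return 0;
}
```

The key point is the kernel launch: the peer's device pointer is dereferenced on the GPU itself, so the put needs no host round-trip or explicit peer-to-peer copy call.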
- ROCm v6.2.0 onwards
- AMD GPU
  - MI300X (tested)
- ROCm-aware Open MPI and UCX
  - UCX is mainly used for inter-node communication over networks such as InfiniBand, RoCE, and Ethernet, but it can also be used between GPUs on the same node (intra-node) via shared memory. The current examples use shared memory for intra-node communication; inter-node examples will follow later.
Build and configure ROCm-aware Open MPI and UCX.
```bash
export INSTALL_DIR=$HOME/ompi_for_gpu
export ROCM_PATH=<rocm-path>
export UCX_DIR=$INSTALL_DIR/ucx
export OMPI_DIR=$INSTALL_DIR/ompi
```
Build UCX
```bash
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
    --with-rocm=$ROCM_PATH
make -j 8
make -j 8 install
```
Build Open MPI
```bash
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
    --with-rocm=$ROCM_PATH
make -j 8
make -j 8 install
```
Update the environment variables to use the correct versions of Open MPI and UCX:
```bash
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:$ROCM_PATH/lib:$LD_LIBRARY_PATH
export PATH=$OMPI_DIR/bin:$PATH
```
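As a quick sanity check of the build, Open MPI 5.x exposes a runtime ROCm-awareness query through its extensions header; the sketch below assumes the MPIX_ROCM_AWARE_SUPPORT macro and MPIX_Query_rocm_support() extension as documented in the ROCm GPU-enabled MPI guide linked at the end of this section:

```cpp
#include <mpi.h>
#include <mpi-ext.h>  // Open MPI extensions (MPIX_* queries)
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
#if defined(MPIX_ROCM_AWARE_SUPPORT) && MPIX_ROCM_AWARE_SUPPORT
  // Runtime answer from the library actually loaded at execution time.
  printf("ROCm-aware support: %s\n",
         MPIX_Query_rocm_support() ? "yes" : "no");
#else
  printf("This Open MPI build was compiled without ROCm support.\n");
#endif
  MPI_Finalize();
  return 0;
}
```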
CMakeLists.txt will be updated later as the examples grow in complexity; it is currently untested.
Compile example gpu_mpi
```bash
hipcc -o gpu_mpi gpu_mpi.cpp -I$ROCM_PATH/include -I$OMPI_DIR/include -L$OMPI_DIR/lib -lmpi -L$ROCM_PATH/lib -lamdhip64
```
Run example
```bash
HIP_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 ./gpu_mpi
```
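The pattern gpu_mpi.cpp exercises is the usual ROCm-aware MPI one: HIP device pointers are handed directly to MPI calls, and UCX moves the data GPU-to-GPU. The following is a minimal sketch of that shape (illustrative, not the repository's actual source):

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  hipSetDevice(rank);  // matches HIP_VISIBLE_DEVICES=0,1,2,3 with -np 4

  const int n = 1 << 20;
  float* dbuf = nullptr;
  hipMalloc(&dbuf, n * sizeof(float));
  std::vector<float> host(n, static_cast<float>(rank + 1));
  hipMemcpy(dbuf, host.data(), n * sizeof(float), hipMemcpyHostToDevice);

  // The device pointer goes straight into MPI; no host staging copy.
  MPI_Allreduce(MPI_IN_PLACE, dbuf, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

  hipMemcpy(host.data(), dbuf, n * sizeof(float), hipMemcpyDeviceToHost);
  if (rank == 0)
    printf("allreduce[0] = %.1f (expected %d)\n", host[0],
           size * (size + 1) / 2);

  hipFree(dbuf);
  MPI_Finalize();
  return 0;
}
```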
For in-depth details on GPU-aware MPI, please see:
- https://rocm.docs.amd.com/en/latest/how-to/gpu-enabled-mpi.html
- https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-gpu-aware-mpi-readme/