
Commit 2a2a924

luraess, simonbyrne, and vchuravy authored
Add CUDA-aware MPI hints to known issues documentation. (#537)
Co-authored-by: Simon Byrne <[email protected]>
Co-authored-by: Valentin Churavy <[email protected]>
1 parent 0e6284d commit 2a2a924

1 file changed: docs/src/knownissues.md (44 additions, 1 deletion)
@@ -65,8 +65,51 @@ ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"
```
at `__init__`. If set externally, it should be modified to exclude `SIGSEGV` from the list.

## CUDA-aware MPI

### Memory pool

Using CUDA-aware MPI on multi-GPU nodes with recent CUDA.jl may trigger (see [here](https://github.com/JuliaGPU/CUDA.jl/issues/1053#issue-946826096))
```
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
cuIpcGetMemHandle return value: 1
```
in the MPI layer, or fail with a segmentation fault (see [here](https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060)) reporting
```
[1642930332.032032] [gcn19:4087661:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22
```
This is due to the MPI implementation using legacy `cuIpc*` APIs, which are incompatible with the stream-ordered allocator that is now the default in CUDA.jl; see [UCX issue #7110](https://github.com/openucx/ucx/issues/7110).

To circumvent this, one has to ensure that the CUDA memory pool is set to `none`:
```
export JULIA_CUDA_MEMORY_POOL=none
```
_More about CUDA.jl [memory environment variables](https://juliagpu.gitlab.io/CUDA.jl/usage/memory/#Environment-variables)._
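
If it is more convenient to set the variable from within Julia, a minimal sketch (assuming the variable is read when CUDA.jl initializes, so it must be set before CUDA.jl is loaded, e.g. at the top of the script or in `startup.jl`):
```
# In-script alternative to `export JULIA_CUDA_MEMORY_POOL=none`;
# the variable must be set before CUDA.jl is loaded/initialized.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA
```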

### Hints to ensure CUDA-aware MPI is functional

Make sure to:
- Have the MPI and CUDA versions that were used to build the CUDA-aware MPI on your path (or load the corresponding modules)
- Set the following environment variables:
  ```
  export JULIA_CUDA_MEMORY_POOL=none
  export JULIA_MPI_BINARY=system
  export JULIA_CUDA_USE_BINARYBUILDER=false
  ```
- Add the CUDA and MPI packages in Julia. Build MPI.jl in verbose mode to check whether the correct versions are built/used:
  ```
  julia -e 'using Pkg; pkg"add CUDA"; pkg"add MPI"; Pkg.build("MPI"; verbose=true)'
  ```
- Then in Julia, upon loading the MPI and CUDA modules, you can check (see the combined sketch after this list):
  - the CUDA version: `CUDA.versioninfo()`
  - whether MPI has CUDA: `MPI.has_cuda()`
  - whether you are using the correct MPI implementation: `MPI.identify_implementation()`
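
These checks, together with a minimal round trip of a GPU buffer, can be combined into a short test script. The sketch below is only illustrative: the file name `check_cuda_mpi.jl` is hypothetical, and the positional `MPI.Sendrecv!` signature corresponds to MPI.jl at the time of writing and may differ in later releases.
```
# check_cuda_mpi.jl (hypothetical name) -- run under the system MPI launcher
using MPI
using CUDA

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

if rank == 0
    CUDA.versioninfo()                                            # CUDA version
    println("MPI has CUDA: ", MPI.has_cuda())                     # CUDA-aware build?
    println("MPI implementation: ", MPI.identify_implementation())
end

# Ring exchange of device buffers: only succeeds if MPI is truly CUDA-aware.
send_mesg = CUDA.fill(Float64(rank), 4)
recv_mesg = CUDA.zeros(Float64, 4)
dst = mod(rank + 1, nprocs)
src = mod(rank - 1, nprocs)
CUDA.synchronize()
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("rank $rank received $(Array(recv_mesg))")
```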

After that, it may be preferable to run the Julia MPI script (as suggested [here](https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060/11)) by launching it from a shell script (as suggested [here](https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060/4)).
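
A minimal launcher sketch along these lines (the module names, process count, and script name are placeholders to adapt to your system, and launcher options differ between MPI implementations):
```
#!/bin/bash
# Load the same CUDA/MPI stack that the CUDA-aware MPI was built against
# (placeholder module names):
# module load cuda openmpi

export JULIA_CUDA_MEMORY_POOL=none
export JULIA_MPI_BINARY=system
export JULIA_CUDA_USE_BINARYBUILDER=false

mpirun -np 4 julia --project check_cuda_mpi.jl
```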

## Microsoft MPI

### Custom operators on 32-bit Windows

It is not possible to use [custom operators with 32-bit Microsoft MPI](https://github.com/JuliaParallel/MPI.jl/issues/246), as it uses the `stdcall` calling convention, which is not supported by [Julia's C-compatible function pointers](https://docs.julialang.org/en/v1/manual/calling-c-and-fortran-code/index.html#Creating-C-Compatible-Julia-Function-Pointers-1).
