openmpi/4.1.4 under ncarenv/22.10 breaks with MPI_Ssend #32

Open
benkirk opened this issue Nov 30, 2022 · 1 comment
Labels
bug Something isn't working

Comments

benkirk commented Nov 30, 2022

It seems our openmpi/4.1.4 in ncarenv/22.10 fails with synchronous send modes.

To reproduce:

module reset && module load gcc openmpi
[ -f hello_world_mpi_ssend_recv.C ] || wget https://gist.githubusercontent.com/benkirk/15aea836fa7feb9636bc7e799e714c15/raw/df0db410535ff14d60c4ca37b336b6d1adc28c4d/hello_world_mpi_ssend_recv.C
mpicxx -o hello_world_mpi_ssend_recv hello_world_mpi_ssend_recv.C
qcmd -q main -l select=1:ncpus=2:mpiprocs=2 -l walltime=00:30:00 -A SCSG0001 -- mpiexec -n 2 --mca opal_warn_on_missing_libcuda 0 ./hello_world_mpi_ssend_recv

Output:

Waiting on job launch; 6122.gusched01 with qsub arguments:
    qsub  -l select=1:ncpus=2:mpiprocs=2 -A SCSG0001 -q main@gusched01 -l walltime=00:30:00

--------------------------------------------------------------------------
The library attempted to open the following supporting CUDA libraries,
but each of them failed.  CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
libcuda.dylib: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with
--mca opal_warn_on_missing_libcuda 0 to suppress this message.  If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
--------------------------------------------------------------------------
Hello from 0 / gu0013, running ./hello_world_mpi_ssend_recv on 2 ranks
Hello from 1 / gu0013, running ./hello_world_mpi_ssend_recv on 2 ranks
calling MPI_Send...done
calling MPI_Isend...done
[gu0013:207035] *** An error occurred in MPI_Recv
[gu0013:207035] *** reported by process [354680833,1]
[gu0013:207035] *** on communicator MPI_COMM_WORLD
[gu0013:207035] *** MPI_ERR_OTHER: known error not in list
[gu0013:207035] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gu0013:207035] ***    and potentially your MPI job)
[gu0013:207030] 1 more process has sent help message help-mpi-common-cuda.txt / dlopen failed
[gu0013:207030] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
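For anyone re-testing after a module rebuild, the check can be made pass/fail by scanning the captured job output for the abort banner shown above. This is only a minimal sketch: the function name `mpi_log_ok` and the log file name are hypothetical, and it assumes the job output has already been saved to a file.

```shell
# Hypothetical helper (not part of any NCAR or Open MPI tooling):
# succeed only if the captured output contains no Open MPI abort banner.
mpi_log_ok() {
    log_file="$1"
    # Fixed-string match on the "*** An error occurred in MPI_..." banner.
    if grep -q -F '*** An error occurred in MPI_' "$log_file"; then
        # Report the first MPI_ERR_* class found in the log, if any.
        echo "FAIL: $(grep -m1 -o 'MPI_ERR_[A-Z_]*' "$log_file")"
        return 1
    fi
    echo "PASS"
}
```

Usage, assuming the output above was saved as `ssend_test.log` (an assumed name): `mpi_log_ok ssend_test.log` prints the failing error class and returns non-zero, so it can gate a scripted check.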

@vanderwb vanderwb added the bug Something isn't working label Dec 1, 2022
benkirk commented May 3, 2023

Open MPI upstream is aware of this too, and has been for a year now:
open-mpi/ompi#10210
