Unknown MPI ABI when using Fujitsu MPI on Fugaku, and segmentation faults when running the tests with this implementation #539

Closed
giordano opened this issue Feb 14, 2022 · 19 comments · Fixed by #542 or #541

@giordano
Member

$ julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; using Pkg; Pkg.build("MPI"; verbose=true)'
    Building MPI → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/d56a80d8cf8b9dc3050116346b3d83432b1912c0/build.log`
[ Info: using system MPI
┌ Info: Using implementation
│   libmpi = "libmpi"
│   mpiexec_cmd = `mpiexec`
└   MPI_LIBRARY_VERSION_STRING = "FUJITSU MPI Library 4.0.0 (4.0.1fj4.0.0)\0"
┌ Info: MPI implementation detected
│   impl = UnknownMPI::MPIImpl = 0
│   version = v"0.0.0"
└   abi = "unknown"
[ Info: Unknown MPI ABI: building constants file
ERROR: LoadError: IOError: could not spawn `mpicc gen_consts.c -o gen_consts -lmpi -L/opt/FJSVxtclanga/tcsds-1.2.34/lib64`: no such file or directory (ENOENT)
Stacktrace:
  [1] _spawn_primitive(file::String, cmd::Cmd, stdio::Vector{Any})
    @ Base ./process.jl:100
  [2] #690
    @ ./process.jl:113 [inlined]
  [3] setup_stdios(f::Base.var"#690#691"{Cmd}, stdios::Vector{Any})
    @ Base ./process.jl:197
  [4] _spawn
    @ ./process.jl:112 [inlined]
  [5] run(::Cmd; wait::Bool)
    @ Base ./process.jl:445
  [6] run(::Cmd)
    @ Base ./process.jl:444
  [7] top-level scope
    @ ~/.julia/packages/MPI/08SPr/deps/gen_consts.jl:231
  [8] include(fname::String)
    @ Base.MainInclude ./client.jl:451
  [9] top-level scope
    @ ~/.julia/packages/MPI/08SPr/deps/build.jl:114
 [10] include(fname::String)
    @ Base.MainInclude ./client.jl:451
 [11] top-level scope
    @ none:5
in expression starting at /home/ra000019/a04463/.julia/packages/MPI/08SPr/deps/gen_consts.jl:231
in expression starting at /home/ra000019/a04463/.julia/packages/MPI/08SPr/deps/build.jl:64
ERROR: Error building `MPI`: 

Stacktrace:
  [1] pkgerror(msg::String)
    @ Pkg.Types /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Types.jl:68
  [2] (::Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec})()
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:962
  [3] withenv(::Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, ::Pair{String, String}, ::Vararg{Pair{String}})
    @ Base ./env.jl:172
  [4] (::Pkg.Operations.var"#99#103"{String, Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.PackageSpec})()
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1506
  [5] with_temp_env(fn::Pkg.Operations.var"#99#103"{String, Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.PackageSpec}, temp_env::String)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1390
  [6] (::Pkg.Operations.var"#98#102"{Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, Pkg.Types.Project, String})(tmp::String)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1469
  [7] mktempdir(fn::Pkg.Operations.var"#98#102"{Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, Pkg.Types.Project, String}, parent::String; prefix::String)
    @ Base.Filesystem ./file.jl:750
  [8] mktempdir(fn::Function, parent::String) (repeats 2 times)
    @ Base.Filesystem ./file.jl:748
  [9] sandbox(fn::Function, ctx::Pkg.Types.Context, target::Pkg.Types.PackageSpec, target_path::String, sandbox_path::String, sandbox_project_override::Pkg.Types.Project; force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1435
 [10] sandbox(fn::Function, ctx::Pkg.Types.Context, target::Pkg.Types.PackageSpec, target_path::String, sandbox_path::String, sandbox_project_override::Pkg.Types.Project)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1432
 [11] build_versions(ctx::Pkg.Types.Context, uuids::Set{Base.UUID}; verbose::Bool)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:943
 [12] build(ctx::Pkg.Types.Context, uuids::Set{Base.UUID}, verbose::Bool)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:822
 [13] build(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; verbose::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
    @ Pkg.API /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:992
 [14] build(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:verbose,), Tuple{Bool}}})
    @ Pkg.API /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:149
 [15] #build#99
    @ /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:142 [inlined]
 [16] #build#98
    @ /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:141 [inlined]
 [17] top-level scope
    @ none:1
@giordano
Member Author

For future reference: by setting ENV["JULIA_MPICC"] = "mpifcc" I was able to compile the constants. I'm now trying to run the tests with this MPI implementation, although it would be nice to detect the ABI automatically. My understanding is that Fujitsu MPI is based on Open MPI.
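
Concretely, the whole build can be driven from the shell like this (a sketch: it assumes the Fujitsu compiler wrapper mpifcc is in PATH, e.g. with the site's Fujitsu toolchain loaded):

$ JULIA_MPI_BINARY="system" JULIA_MPICC="mpifcc" \
  julia --project -e 'using Pkg; Pkg.build("MPI"; verbose=true)'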

@giordano
Member Author

And I get a segmentation fault when running the error handler tests with Fujitsu MPI (all four ranks crash with the same backtrace):

signal (11): Segmentation fault
in expression starting at /home/ra000019/a04463/.julia/packages/MPI/08SPr/test/test_errorhandler.jl:6
mca_pml_ob1_send at /opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so (unknown line)
MPI_Send at /opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so (unknown line)
Send at /home/ra000019/a04463/.julia/packages/MPI/08SPr/src/pointtopoint.jl:181 [inlined]
Send at /home/ra000019/a04463/.julia/packages/MPI/08SPr/src/pointtopoint.jl:186
unknown function (ip: 0x400021fc7d33)
_jl_invoke at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linuxaarch64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linuxaarch64/build/src/interpreter.c:126
eval_value at /buildworker/worker/package_linuxaarch64/build/src/interpreter.c:215
eval_stmt_value at /buildworker/worker/package_linuxaarch64/build/src/interpreter.c:166 [inlined]
eval_body at /buildworker/worker/package_linuxaarch64/build/src/interpreter.c:601
eval_body at /buildworker/worker/package_linuxaarch64/build/src/interpreter.c:516
jl_interpret_toplevel_thunk at /buildworker/worker/package_linuxaarch64/build/src/interpreter.c:731
jl_toplevel_eval_flex at /buildworker/worker/package_linuxaarch64/build/src/toplevel.c:885
jl_toplevel_eval_flex at /buildworker/worker/package_linuxaarch64/build/src/toplevel.c:830
jl_toplevel_eval_in at /buildworker/worker/package_linuxaarch64/build/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
_jl_invoke at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2429
_include at ./loading.jl:1253
include at ./Base.jl:418
_jl_invoke at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2429
exec_options at ./client.jl:292
_start at ./client.jl:495
jfptr__start_29189 at /vol0003/ra000019/a04463/julia-1.7.2-aarch64/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linuxaarch64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linuxaarch64/build/src/julia.h:1788 [inlined]
true_main at /buildworker/worker/package_linuxaarch64/build/src/jlapi.c:559
jl_repl_entrypoint at /buildworker/worker/package_linuxaarch64/build/src/jlapi.c:701
main at /buildworker/worker/package_linuxaarch64/build/cli/loader_exe.c:42
Allocations: 830759 (Pool: 830203; Big: 556); GC: 1
[WARN] PLE 0610 plexec The process terminated with the signal.(rank=3)(nid=0x27c60008)(sig=11)
test_errorhandler.jl: Error During Test at /home/ra000019/a04463/.julia/packages/MPI/08SPr/test/runtests.jl:26
  Got exception outside of a @test
  failed process: Process(`mpiexec -n 4 /vol0003/ra000019/a04463/julia-1.7.2-aarch64/bin/julia -Cnative -J/vol0003/ra000019/a04463/julia-1.7.2-aarch64/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --color=yes --startup-file=no /home/ra000019/a04463/.julia/packages/MPI/08SPr/test/test_errorhandler.jl`, ProcessExited(139)) [139]
  
  Stacktrace:
    [1] pipeline_error
      @ ./process.jl:531 [inlined]
    [2] run(::Cmd; wait::Bool)
      @ Base ./process.jl:446
    [3] run
      @ ./process.jl:444 [inlined]
    [4] (::var"#13#15"{String})(cmd::Cmd)
      @ Main ~/.julia/packages/MPI/08SPr/test/runtests.jl:38
    [5] (::MPI.var"#28#29"{var"#13#15"{String}})(cmd::Cmd)
      @ MPI ~/.julia/packages/MPI/08SPr/src/environment.jl:25
    [6] _mpiexec
      @ ~/.julia/packages/MPI/08SPr/deps/deps.jl:6 [inlined]
    [7] mpiexec(fn::var"#13#15"{String})
      @ MPI ~/.julia/packages/MPI/08SPr/src/environment.jl:25
    [8] macro expansion
      @ ~/.julia/packages/MPI/08SPr/test/runtests.jl:27 [inlined]
    [9] top-level scope
      @ /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Test/src/Test.jl:1359
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:451
   [11] top-level scope
      @ none:6
   [12] eval
      @ ./boot.jl:373 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:268
   [14] _start()
      @ Base ./client.jl:495
Test Summary:        | Error  Total
test_errorhandler.jl |     1      1
ERROR: LoadError: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/ra000019/a04463/.julia/packages/MPI/08SPr/test/runtests.jl:26

@giordano changed the title from "Unknown MPI ABI on Fugaku when using system MPI" to "Unknown MPI ABI when using Fujitsu MPI on Fugaku, and segmentation faults when running the tests with this implementation" on Feb 14, 2022
@simonbyrne
Member

What does MPI.Get_library_version() give?

@giordano
Member Author

julia> MPI.Get_library_version()
"FUJITSU MPI Library 4.0.0 (4.0.1fj4.0.0)\0"

@giordano
Member Author

Uhm, I'm able to build MPI.jl with

JULIA_MPICC="mpifcc" JULIA_MPI_BINARY="system"

on v0.19.2 but not on master:

┌ Info: using implementation
│   libmpi = "libmpi"
│   mpiexec_cmd = `mpiexec`
└   MPI_LIBRARY_VERSION_STRING = "FUJITSU MPI Library 4.0.0 (4.0.1fj4.0.0)\0"
Command-line warning: "/opt/FJSVxtclanga/tcsds-1.2.34/lib64/../include" was specified as both a system and non-system include directory -- the non-system entry will be ignored

--------------------------------------------------------------------------
[mpi::orte-runtime::orte_init:startup:internal-failure]
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  pmix init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[mpi::orte-runtime::orte_init:startup:internal-failure]
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[mpi::mpi-runtime::mpi_init:startup:internal-failure]
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[h30-3006c:01170] [0] func:/opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so.0(opal_backtrace_buffer+0x28) [0x4000003f1e68]
[h30-3006c:01170] [1] func:/opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so.0(ompi_mpi_abort+0xb8) [0x40000018f3c8]
[h30-3006c:01170] [2] func:/opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so.0(ompi_mpi_errors_are_fatal_comm_handler+0x108) [0x40000017cbc0]
[h30-3006c:01170] [3] func:/opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so.0(ompi_errhandler_invoke+0x60) [0x40000017c7a8]
[h30-3006c:01170] [4] func:/opt/FJSVxtclanga/tcsds-1.2.34/lib64/libmpi.so.0(MPI_Init+0x17c) [0x4000001b91f4]
[h30-3006c:01170] [5] func:./generate_compile_time_mpi_constants() [0x400dc0]
[h30-3006c:01170] [6] func:/lib64/libc.so.6(__libc_start_main+0xe4) [0x400001010de4]
[h30-3006c:01170] [7] func:./generate_compile_time_mpi_constants() [0x400cac]
[h30-3006c:01170] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
ERROR: LoadError: failed process: Process(`./generate_compile_time_mpi_constants`, ProcessExited(1)) [1]

Stacktrace:
 [1] pipeline_error
   @ ./process.jl:531 [inlined]
 [2] run(::Cmd; wait::Bool)
   @ Base ./process.jl:446
 [3] run(::Cmd)
   @ Base ./process.jl:444
 [4] top-level scope
   @ ~/.julia/dev/MPI/deps/prepare_mpi_constants.jl:25
 [5] include(fname::String)
   @ Base.MainInclude ./client.jl:451
 [6] top-level scope
   @ ~/.julia/dev/MPI/deps/build.jl:106
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:451
 [8] top-level scope
   @ none:5
in expression starting at /home/ra000019/a04463/.julia/dev/MPI/deps/prepare_mpi_constants.jl:1
in expression starting at /home/ra000019/a04463/.julia/dev/MPI/deps/build.jl:83
ERROR: Error building `MPI`: 

Stacktrace:
  [1] pkgerror(msg::String)
    @ Pkg.Types /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Types.jl:68
  [2] (::Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec})()
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:962
  [3] withenv(::Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, ::Pair{String, String}, ::Vararg{Pair{String}})
    @ Base ./env.jl:172
  [4] (::Pkg.Operations.var"#99#103"{String, Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.PackageSpec})()
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1506
  [5] with_temp_env(fn::Pkg.Operations.var"#99#103"{String, Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.PackageSpec}, temp_env::String)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1390
  [6] (::Pkg.Operations.var"#98#102"{Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, Pkg.Types.Project, String})(tmp::String)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1469
  [7] mktempdir(fn::Pkg.Operations.var"#98#102"{Bool, Bool, Bool, Pkg.Operations.var"#62#67"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, Pkg.Types.Project, String}, parent::String; prefix::String)
    @ Base.Filesystem ./file.jl:750
  [8] mktempdir(fn::Function, parent::String) (repeats 2 times)
    @ Base.Filesystem ./file.jl:748
  [9] sandbox(fn::Function, ctx::Pkg.Types.Context, target::Pkg.Types.PackageSpec, target_path::String, sandbox_path::String, sandbox_project_override::Pkg.Types.Project; force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1435
 [10] sandbox(fn::Function, ctx::Pkg.Types.Context, target::Pkg.Types.PackageSpec, target_path::String, sandbox_path::String, sandbox_project_override::Pkg.Types.Project)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1432
 [11] build_versions(ctx::Pkg.Types.Context, uuids::Set{Base.UUID}; verbose::Bool)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:943
 [12] build(ctx::Pkg.Types.Context, uuids::Set{Base.UUID}, verbose::Bool)
    @ Pkg.Operations /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:822
 [13] build(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; verbose::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
    @ Pkg.API /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:992
 [14] build(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:verbose,), Tuple{Bool}}})
    @ Pkg.API /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:149
 [15] #build#99
    @ /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:142 [inlined]
 [16] #build#98
    @ /vol0003/ra000019/a04463/julia-1.7.2-aarch64/share/julia/stdlib/v1.7/Pkg/src/API.jl:141 [inlined]
 [17] top-level scope
    @ none:1

@simonbyrne
Member

Do you know if it's based on MPICH or Open MPI?

@simonbyrne
Member

Can you dump the symbol names?
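
For example, a quick check from Julia instead of nm (a sketch: it assumes libmpi can be dlopened by that name, and it probes ompi_mpi_abort only because that is an Open MPI-internal symbol which also appears in the backtrace above):

julia> using Libdl

julia> lib = dlopen("libmpi")           # same library name MPI.jl detected

julia> dlsym_e(lib, :ompi_mpi_abort)    # non-NULL ⇒ Open MPI internals are present; C_NULL would point away from Open MPI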

@giordano
Member Author

I mentioned above that it should be based on Open MPI:

$ mpiexec --version
mpiexec (Open MPI)
FUJITSU MPI Library 4.0.1 a549543c34

@simonbyrne
Member

If you're using the release version (not master), you should be able to set JULIA_MPI_ABI=OpenMPI.
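
For example (a sketch that mirrors the build command at the top of the issue; whether JULIA_MPICC is still needed alongside it on Fugaku is an assumption):

$ JULIA_MPI_BINARY="system" JULIA_MPICC="mpifcc" JULIA_MPI_ABI="OpenMPI" \
  julia --project -e 'using Pkg; Pkg.build("MPI"; verbose=true)'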

@giordano
Member Author

The test_errorhandler.jl tests also crash with JULIA_MPI_ABI=OpenMPI 😢 The output of nm libmpi.so is in https://gist.github.com/giordano/7d006a1824e5c7edc3a5bbdada8455c1

@giordano
Member Author

giordano commented Feb 15, 2022

Side note: I mentioned on Slack that the tests with the default JLLs pass on Fugaku with v0.19.2 (and I had already tried them in the past on the A64FX nodes of Isambard 2 in Bristol) 🙂 Hopefully this is just a matter of getting the ABI right.

@giordano
Member Author

OK, the good news is that test_errorhandler.jl is the only test set failing with Fujitsu MPI.
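
A single test file can be run on its own under mpiexec, the same way runtests.jl launches it, which makes it easy to check which test sets fail (a sketch using MPI.mpiexec; the path to the test file is hypothetical):

julia> using MPI

julia> MPI.mpiexec() do cmd
           # mirror the failing command above, on 4 ranks
           run(`$cmd -n 4 $(Base.julia_cmd()) /path/to/MPI/test/test_errorhandler.jl`)
       end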

@simonbyrne
Member

Ah, OK. It could be that Fujitsu MPI simply doesn't support custom error handlers; I think there have been similar issues with other "optimized" MPI implementations.

@giordano
Member Author

Another data point: ENV["JULIA_MPI_BINARY"] = "OpenMPI_jll" also works fine. Only Fujitsu MPI segfaults, and only in the custom error handler test, but that may be somewhat expected given what you said.

@vchuravy
Member

@giordano could you verify current master with Fujitsu MPI? I would love to release 0.20 "soon".

@giordano
Member Author

Unfortunately I don't think I can test anymore 😞

@sloede
Member

sloede commented Apr 17, 2022

Maybe once more after October...

@giordano
Member Author

giordano commented Apr 17, 2022

Never mind, I can use it again.

Identification of the system MPI works out of the box:

julia> MPIPreferences.use_system_binary()
┌ Info: MPI implementation
│   libmpi = "libmpi"
│   version_string = "FUJITSU MPI Library 4.0.0 (4.0.1fj4.0.0)\0"
│   impl = "FujitsuMPI"
│   version = v"4.0.0"
└   abi = "OpenMPI"
┌ Warning: The underlying MPI implementation has changed. You will need to restart Julia for this change to take effect
│   libmpi = "libmpi"
│   abi = "OpenMPI"
│   mpiexec = "mpiexec"
└ @ MPIPreferences /data/ra000019/a04463/julia-depot/packages/MPIPreferences/uArzO/src/MPIPreferences.jl:119

I'll run the tests once I understand how to make the preferences "stick" when running MPI.jl's test suite (not tonight) 🙂 Edit: I'm running into #561.
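
For reference, MPIPreferences.use_system_binary() records its settings in the LocalPreferences.toml of the currently active project, so they are only visible to environments that include that file. A minimal sketch of that basic mechanism (the environment path is hypothetical; this is not a fix for the preference-propagation problem mentioned above):

julia> using Pkg

julia> Pkg.activate("/path/to/test-env")   # the environment that will load MPI

julia> Pkg.add(["MPIPreferences", "MPI"])

julia> using MPIPreferences

julia> MPIPreferences.use_system_binary()  # writes LocalPreferences.toml next to that Project.toml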

@giordano
Member Author

With the latest version of #542 and the hack described in #564 (comment) to run the tests with the system MPI, the test suite of this package passes for me on Fugaku with Fujitsu MPI 🥳
