
Suboptimal dgemm OpenBLAS performance and dual-socket scaling #936

@carstenbauer


From carstenbauer/julia-dgemm-noctua#2. (cc @ViralBShah)

I benchmarked a simple dgemm call (i.e. mul!(C, A, B)) on Noctua 1 (dual-socket nodes with Intel Xeon Gold 6148 "Skylake" 20-core CPUs) for multiple BLAS libraries, called from Julia via libblastrampoline (LBT):

  • 20 cores -> single-socket
  • 40 cores -> dual-socket (full node)
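For reference, here is a minimal sketch of the kind of benchmark behind these numbers (the actual scripts are in the linked repository; the thread count, warm-up call, and timing below are illustrative assumptions):

    using LinearAlgebra

    BLAS.set_num_threads(40)      # 20 -> single socket, 40 -> full dual-socket node

    N = 10240
    A = rand(N, N); B = rand(N, N); C = zeros(N, N)

    mul!(C, A, B)                 # warm-up, so compilation/thread startup isn't timed
    t = @elapsed mul!(C, A, B)
    gflops = 2 * N^3 / t / 1e9    # dgemm performs roughly 2*N^3 floating-point operations
    println(round(gflops; digits=1), " GFLOPS")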

Here are the benchmark results:

BLAS                                 # cores   matrix size   GFLOPS
-----------------------------------  --------  ------------  ------
Intel MKL v2022.0.0 (JLL)            40        10240         2081
Intel MKL v2022.0.0 (JLL)            20        10240         1054
BLIS 0.9.0 (JLL)                     40        10240         1890
BLIS 0.9.0 (JLL)                     20        10240          990
Octavian 0.3.15                      40        10240         1053
Octavian 0.3.15                      20        10240         1016
OpenBLAS (shipped with Julia 1.8)    40        10240         1092
OpenBLAS (shipped with Julia 1.8)    20        10240         1063
-----------------------------------  --------  ------------  ------
OpenBLAS 0.3.17 (custom)             40        10240         1908
OpenBLAS 0.3.17 (custom)             20        10240         1439
OpenBLAS 0.3.20 (custom)             40        10240         1897
OpenBLAS 0.3.20 (custom)             20        10240         1444
-----------------------------------  --------  ------------  ------
OpenBLAS 0.3.17 (JLL)                40        10240         1437
OpenBLAS 0.3.17 (JLL)                20        10240         1124
OpenBLAS 0.3.20 (JLL)                40        10240         1535
OpenBLAS 0.3.20 (JLL)                20        10240         1185

The custom OpenBLAS builds were compiled with:

make INTERFACE64=1 USE_THREAD=1 NO_AFFINITY=1 GEMM_MULTITHREADING_THRESHOLD=50 NO_STATIC=1 BINARY=64
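To test such a custom build from Julia, one can point LBT at the resulting shared library; a minimal sketch (the library path is a placeholder, and whether the symbols carry the 64_ suffix depends on how the build was configured):

    using LinearAlgebra

    # Placeholder path to the custom build's shared library.
    libpath = "/path/to/custom/openblas/lib/libopenblas.so"

    BLAS.lbt_forward(libpath; clear=true)   # replace the currently forwarded BLAS/LAPACK
    BLAS.get_config()                       # check which backend LBT now points to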

Primary observations/conclusions:

  • MKL and BLIS (through MKL.jl and BLISBLAS.jl) scale reasonably well from single to dual socket, but the OpenBLAS shipped with Julia 1.8 doesn't scale at all. (Octavian doesn't scale either; see JuliaLinearAlgebra/Octavian.jl#151.)
  • A custom build of OpenBLAS shows the best overall single-socket performance and scales reasonably well, so the issue is not simply that OpenBLAS is inferior to MKL/BLIS. Perhaps we are using suboptimal build options?
  • What is particularly curious is that manually running using OpenBLAS_jll (0.3.17 and 0.3.20) leads to strictly better performance, both in absolute numbers and in scaling, than the default/shipped OpenBLAS. How does the default integration of OpenBLAS_jll differ from manually doing using OpenBLAS_jll and BLAS.lbt_forward(...; clear=true), as sketched below? (It is still worse than a custom OpenBLAS build, though.)
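For completeness, the manual switch to OpenBLAS_jll mentioned above looks roughly like this (assuming the JLL's exported libopenblas_path):

    using LinearAlgebra, OpenBLAS_jll

    # Forward all BLAS/LAPACK calls to the OpenBLAS_jll library instead of the
    # copy loaded by default; clear=true drops the previously forwarded symbols.
    BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear=true)
    BLAS.get_config()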

I hope we can improve the default OpenBLAS performance and scaling.
