From carstenbauer/julia-dgemm-noctua#2. (cc @ViralBShah)
I benchmarked a simple dgemm call (i.e. `mul!(C, A, B)`) on Noctua 1 (single- and dual-socket Intel Xeon Gold "Skylake" 6148 20-core CPUs) for multiple BLAS libraries, all called from Julia via LBT:
- 20 cores -> single-socket
- 40 cores -> dual-socket (full node)
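
For context, a minimal sketch of this kind of GFLOPS measurement (illustrative only; the function name is made up and the actual script lives in carstenbauer/julia-dgemm-noctua):

```julia
using LinearAlgebra, BenchmarkTools

# Time a single dgemm call C = A*B and convert to GFLOPS.
# Illustrative sketch, not the exact benchmark script.
function gemm_gflops(N)
    A, B, C = rand(N, N), rand(N, N), zeros(N, N)
    t = @belapsed mul!($C, $A, $B)   # minimum elapsed time in seconds
    return 2 * N^3 / t / 1e9         # dgemm performs ~2N^3 flops
end

@show BLAS.get_num_threads()
@show gemm_gflops(10240)
```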
Here are the benchmark results:
| BLAS | # cores | size | GFLOPS |
|---|---|---|---|
| Intel MKL v2022.0.0 (JLL) | 40 | 10240 | 2081 |
| Intel MKL v2022.0.0 (JLL) | 20 | 10240 | 1054 |
| BLIS 0.9.0 (JLL) | 40 | 10240 | 1890 |
| BLIS 0.9.0 (JLL) | 20 | 10240 | 990 |
| Octavian 0.3.15 | 40 | 10240 | 1053 |
| Octavian 0.3.15 | 20 | 10240 | 1016 |
| OpenBLAS (shipped with Julia 1.8) | 40 | 10240 | 1092 |
| OpenBLAS (shipped with Julia 1.8) | 20 | 10240 | 1063 |
| --- | --- | --- | --- |
| OpenBLAS 0.3.17 (custom) | 40 | 10240 | 1908 |
| OpenBLAS 0.3.17 (custom) | 20 | 10240 | 1439 |
| OpenBLAS 0.3.20 (custom) | 40 | 10240 | 1897 |
| OpenBLAS 0.3.20 (custom) | 20 | 10240 | 1444 |
| --- | --- | --- | --- |
| OpenBLAS 0.3.17 (JLL) | 40 | 10240 | 1437 |
| OpenBLAS 0.3.17 (JLL) | 20 | 10240 | 1124 |
| OpenBLAS 0.3.20 (JLL) | 40 | 10240 | 1535 |
| OpenBLAS 0.3.20 (JLL) | 20 | 10240 | 1185 |
The custom OpenBLAS builds were compiled with
`make INTERFACE64=1 USE_THREAD=1 NO_AFFINITY=1 GEMM_MULTITHREADING_THRESHOLD=50 NO_STATIC=1 BINARY=64`
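
A minimal sketch of how such a custom build can be loaded from Julia via LBT (the path below is a placeholder):

```julia
using LinearAlgebra

# Forward BLAS/LAPACK calls to a locally built OpenBLAS instead of the
# library shipped with Julia. The path is a placeholder.
BLAS.lbt_forward("/path/to/custom/openblas/libopenblas64_.so"; clear = true)
BLAS.get_config()   # check which backend LBT is now forwarding to
```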
Primary observations/conclusions:
- MKL and BLIS (through MKL.jl and BLISBLAS.jl) scale reasonably well from single to dual socket, but the OpenBLAS shipped with Julia 1.8 doesn't scale at all. (Octavian also doesn't scale; see JuliaLinearAlgebra/Octavian.jl#151, "Dual-socket support".)
- A custom build of OpenBLAS shows the best overall single-socket performance and scales reasonably well, so it is not simply that OpenBLAS is inferior to MKL/BLIS. Perhaps we use suboptimal build options?
- What is particularly curious is that manually doing `using OpenBLAS_jll` plus `BLAS.lbt_forward(...; clear=true)` (for 0.3.17 and 0.3.20) leads to strictly better performance, both in absolute numbers and in scaling, than the default/shipped OpenBLAS. How does the default integration of OpenBLAS_jll differ from forwarding it manually (see the sketch below)? (It's still worse than a custom build of OpenBLAS, though.)
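
For reference, the manual forwarding mentioned above boils down to something like the following sketch (`clear = true` drops the previously loaded backend before forwarding):

```julia
using LinearAlgebra
using OpenBLAS_jll

# Manually forward to the OpenBLAS_jll-provided library via LBT,
# replacing whatever backend was loaded before.
BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear = true)
BLAS.get_config()
```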
I hope we can improve the default OpenBLAS performance and scaling.