
Suboptimal dgemm OpenBLAS performance and dual-socket scaling #936

@carstenbauer


From carstenbauer/julia-dgemm-noctua#2. (cc @ViralBShah)

I benchmarked a simple dgemm call (i.e. mul!(C, A, B)) on Noctua 1 (dual-socket nodes with Intel Xeon Gold 6148 "Skylake" 20-core CPUs) for multiple BLAS libraries, called from Julia via libblastrampoline (LBT):

  • 20 cores -> single-socket
  • 40 cores -> dual-socket (full node)
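For reference, here is a minimal sketch of the kind of benchmark behind these numbers (the actual scripts are in the linked repository; the thread count, warm-up call, and timing below are illustrative assumptions):

    using LinearAlgebra

    BLAS.set_num_threads(40)      # 20 -> single socket, 40 -> full dual-socket node

    N = 10240
    A = rand(N, N); B = rand(N, N); C = zeros(N, N)

    mul!(C, A, B)                 # warm-up, so compilation/thread startup isn't timed
    t = @elapsed mul!(C, A, B)
    gflops = 2 * N^3 / t / 1e9    # dgemm performs roughly 2*N^3 floating-point operations
    println(round(gflops; digits=1), " GFLOPS")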

Here are the benchmark results:

BLAS                                 # cores   matrix size   GFLOPS
-----------------------------------  --------  ------------  ------
Intel MKL v2022.0.0 (JLL)            40        10240         2081
Intel MKL v2022.0.0 (JLL)            20        10240         1054
BLIS 0.9.0 (JLL)                     40        10240         1890
BLIS 0.9.0 (JLL)                     20        10240          990
Octavian 0.3.15                      40        10240         1053
Octavian 0.3.15                      20        10240         1016
OpenBLAS (shipped with Julia 1.8)    40        10240         1092
OpenBLAS (shipped with Julia 1.8)    20        10240         1063
-----------------------------------  --------  ------------  ------
OpenBLAS 0.3.17 (custom)             40        10240         1908
OpenBLAS 0.3.17 (custom)             20        10240         1439
OpenBLAS 0.3.20 (custom)             40        10240         1897
OpenBLAS 0.3.20 (custom)             20        10240         1444
-----------------------------------  --------  ------------  ------
OpenBLAS 0.3.17 (JLL)                40        10240         1437
OpenBLAS 0.3.17 (JLL)                20        10240         1124
OpenBLAS 0.3.20 (JLL)                40        10240         1535
OpenBLAS 0.3.20 (JLL)                20        10240         1185

The custom OpenBLAS builds were compiled with:

make INTERFACE64=1 USE_THREAD=1 NO_AFFINITY=1 GEMM_MULTITHREADING_THRESHOLD=50 NO_STATIC=1 BINARY=64
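To test such a custom build from Julia, one can point LBT at the resulting shared library; a minimal sketch (the library path is a placeholder, and whether the symbols carry the 64_ suffix depends on how the build was configured):

    using LinearAlgebra

    # Placeholder path to the custom build's shared library.
    libpath = "/path/to/custom/openblas/lib/libopenblas.so"

    BLAS.lbt_forward(libpath; clear=true)   # replace the currently forwarded BLAS/LAPACK
    BLAS.get_config()                       # check which backend LBT now points to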

Primary observations/conclusions:

  • MKL and BLIS (through MKL.jl and BLISBLAS.jl) scale reasonably well from single to dual socket, but the OpenBLAS shipped with Julia 1.8 doesn't scale at all. (Octavian doesn't scale either; see JuliaLinearAlgebra/Octavian.jl#151.)
  • A custom build of OpenBLAS shows the best overall single-socket performance and scales reasonably well, so the issue is not simply that OpenBLAS is inferior to MKL/BLIS. Perhaps we are using suboptimal build options?
  • What is particularly curious is that manually running using OpenBLAS_jll (0.3.17 and 0.3.20) leads to strictly better performance, both in absolute numbers and in scaling, than the default/shipped OpenBLAS. How does the default integration of OpenBLAS_jll differ from manually doing using OpenBLAS_jll and BLAS.lbt_forward(...; clear=true), as sketched below? (It is still worse than a custom OpenBLAS build, though.)
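For completeness, the manual switch to OpenBLAS_jll mentioned above looks roughly like this (assuming the JLL's exported libopenblas_path):

    using LinearAlgebra, OpenBLAS_jll

    # Forward all BLAS/LAPACK calls to the OpenBLAS_jll library instead of the
    # copy loaded by default; clear=true drops the previously forwarded symbols.
    BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear=true)
    BLAS.get_config()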

I hope we can improve the default OpenBLAS performance and scaling.
