Cholesky-Bench

Cholesky-Bench benchmarks right-looking tiled Cholesky factorization from fork-join to asynchronous tasks across several parallelism models, currently comparing OpenMP and HPX implementations side by side. A non-tiled parallel reference is also included as a baseline.

Variants

OpenMP (`openmp/`)

Mode	Description
`for_naive`	Naive parallel for loop with dynamic schedule for trailing-update
`for_collapse`	Collapsed parallel for loop (optional dynamic schedule for trailing-update with LLVM)
`task_naive`	OpenMP tasking without dependencies (OpenMP 3.0)
`task_depend`	OpenMP tasking with `depend` clauses (OpenMP 4.0)
`task_prio`	OpenMP tasking with `priority` clauses (OpenMP 4.5) (requires `OMP_MAX_TASK_PRIORITY` to be set at runtime)

HPX (`hpx/`)

Mode	Description
`async_future`	Fully asynchronous tasking with dataflow using `hpx::shared_future<vector>`
`sync_future`	Manually synchronized tasking with dataflow using `hpx::shared_future<vector>`
`loop_one`	Naive fork-join with dynamic schedule for trailing-update
`loop_two`	Collapsed fork-join with dynamic schedule for trailing-update
`async_void`	Fully asynchronous tasking with dataflow using `hpx::shared_future<void>`

Reference (`reference/`)

Mode	Description
`lapacke`	Single threaded `LAPACKE_dpotrf` call on the full matrix; no tiling. Parallelism is delegated entirely to a threaded BLAS (OpenBLAS built with `threads=openmp`, or threaded Intel oneMKL via `ENABLE_MKL=ON`). Enabled by default; disable with `ENABLE_LAPACKE=OFF`.
`plasma`	Single `plasma_dpotrf` call on the full matrix (PLASMA's high-level synchronous API). PLASMA does its own tiled, OpenMP-task-based parallel Cholesky internally; tile size is left at PLASMA's built-in default. Built only when `ENABLE_PLASMA=ON`.

This directory is the natural baseline for the OpenMP and HPX tiled implementations: the lapacke mode isolates the contribution of vendor-provided dense-LA parallelism, and the plasma mode adds a tiled-parallel competitor that uses the same OpenMP runtime as the in-house variants.

PLASMA descriptor int32 overflow

PLASMA 24.8.7's plasma_desc_*_create() routines compute their tile-storage size as int * int before casting to size_t, which silently overflows once the padded triangular tile-area exceeds INT32_MAX. With the default nb=256, the boundary is at N=65280 (mt=255).

The benchmark handles this transparently:

For sweep sizes N in (65280, 65536], only plasma is silently clamped down to 65280 for that iteration; lapacke runs at the full N. The problem_size column reports the original N, so plasma's timing in this range corresponds to the 65280 compute even though the row is labelled with the input size.
For N > 65536 plasma records nan. lapacke is unaffected by the int32 ceiling and continues normally.

Dependencies

All three implementations are built with CMake (≥ 3.23) and C++20. The OpenMP and HPX directories link against a sequential BLAS (parallelism is at the tile level); the reference/ directory links against a threaded BLAS instead.

Dependency	OpenMP	HPX	Reference
OpenBLAS 0.3.28 (sequential)	✓ (default)	✓ (default)	—
OpenBLAS 0.3.28 (`threads=openmp`)	—	—	✓ (default)
Intel oneMKL (sequential)	optional (`ENABLE_MKL=ON`)	optional (`ENABLE_MKL=ON`)	—
Intel oneMKL (`intel_thread`)	—	—	optional (`ENABLE_MKL=ON`)
PLASMA 24.8.7	—	—	optional (`ENABLE_PLASMA=ON`)
HPX 1.11.0 + jemalloc	—	✓	—
GCC 14.2.0	✓	✓	✓
LLVM/Clang 22.1.2	optional	—	—

Dependencies are managed via Spack.

Build

From within the openmp/, hpx/, or reference/ directory, run:

./compile.sh [gcc|llvm]   # OpenMP:    gcc (default) or llvm
./compile.sh              # HPX:       always gcc
./compile.sh              # Reference: always gcc

The script clears and recreates the build/ directory, then runs CMake in Release mode followed by a parallel make.

CMake options

These can be set as environment variables before calling compile.sh:

Option	Default	Description
`ENABLE_VALIDATION`	`OFF`	After each factorization, compute the relative residual ‖A − LL^T‖_F / ‖A‖_F and warn if it exceeds 1e-10. In `openmp/` and `hpx/`, mutually exclusive with `DISABLE_COMPUTATION`.
`DISABLE_COMPUTATION`	`OFF`	(`openmp/` and `hpx/` only) Replace all BLAS/tile-generation calls with no-ops. The task graph and loops remain intact, so scheduling overhead can be measured in isolation.
`ENABLE_DYNAMIC_SCHEDULE`	`OFF`	(`openmp/` only) Use `schedule(dynamic,1)` on the trailing-update worksharing loops in `for_collapse`. Requires the LLVM toolchain; rejected at compile time with GCC.
`ENABLE_MKL`	`OFF`	Link against Intel oneMKL instead of OpenBLAS. In `openmp/` and `hpx/` this is the sequential MKL; in `reference/` it is the threaded MKL.
`ENABLE_PLASMA`	`OFF`	(`reference/` only) Also build the PLASMA `plasma_dpotrf` variant. Adds a `plasma` column alongside `lapacke` in the runtime output.
`ENABLE_LAPACKE`	`ON`	(`reference/` only) Run the `lapacke` mode at runtime. Set `OFF` to skip it (e.g. when only `plasma` is wanted). Linking is unchanged either way — PLASMA and validation still need cblas/lapacke symbols.

Examples:

# OpenMP: GCC build with validation enabled
ENABLE_VALIDATION=ON ./compile.sh gcc

# OpenMP: LLVM build with dynamic scheduling
ENABLE_DYNAMIC_SCHEDULE=ON ./compile.sh llvm

# HPX: measure pure scheduling overhead
DISABLE_COMPUTATION=ON ./compile.sh

# Reference: threaded MKL baseline
ENABLE_MKL=ON ./compile.sh

# Reference: also build the PLASMA tiled-Cholesky variant
ENABLE_PLASMA=ON ./compile.sh

# Reference: PLASMA only, skip the LAPACKE_dpotrf column at runtime
ENABLE_LAPACKE=OFF ENABLE_PLASMA=ON ./compile.sh

Run

Directly

# OpenMP
OMP_NUM_THREADS=128 OMP_PROC_BIND=close OMP_PLACES=cores \
  ./build/cholesky_openmp \
  --loop 1 --size_start 1024 --size_stop 65536 \
  --tiles_start 64 --tiles_stop 64

# HPX
./build/cholesky_hpx \
  --hpx:threads=128 \
  --loop=1 --size_start=1024 --size_stop=65536 \
  --tiles_start=64 --tiles_stop=64

# Reference (parallel BLAS, no tiling)
OMP_NUM_THREADS=128 OMP_PROC_BIND=close OMP_PLACES=cores \
  ./build/cholesky_reference \
  --loop 1 --size_start 1024 --size_stop 65536

Via SLURM

All three directories contain a run.sh that is a ready-to-submit SLURM batch script (128 CPUs, exclusive node, 144-hour wall time):

sbatch openmp/run.sh             # gcc runtime (default)
sbatch openmp/run.sh llvm        # llvm runtime
sbatch hpx/run.sh
sbatch reference/run.sh          # gcc runtime; defaults to N=65280 (see PLASMA boundary note)

Command-line arguments

Argument	Default	Description
`--loop` / `--loop=`	1	Number of timed repetitions per configuration
`--size_start` / `--size_stop`	32 / 128	Problem size range (doubled each step)
`--tiles_start` / `--tiles_stop`	16 / 32	Tile count range (doubled each step). Accepted but ignored by the `reference/` binary, which has no tiling axis.

Output

Results are appended to a text file in the working directory:

runtimes_openmp_cholesky_<suffix>.txt
runtimes_hpx_cholesky_<suffix>.txt
runtimes_reference_cholesky_<suffix>.txt

The suffix encodes which dimension is swept: tile_ if tiles vary, size_ if size varies, followed by the loop count. The file uses ;-separated columns:

The reference/ binary reports a lapacke column (suppressed by ENABLE_LAPACKE=OFF) plus a plasma column when built with ENABLE_PLASMA=ON, with tile_size = problem_size and n_tiles = 1, so its runtime files merge cleanly with the tiled benchmarks on the problem_size key:

Repository structure

.
├── .clang-format           # repo-wide style; governs all three subtrees
├── CMakeLists.txt          # top-level coordinator (formatting only; LANGUAGES NONE)
├── openmp/
│   ├── CMakeLists.txt
│   ├── compile.sh          # build script (gcc or llvm)
│   ├── run.sh              # SLURM job script
│   ├── main.cpp
│   └── core/
│       ├── include/
│       │   ├── cholesky_factor.hpp
│       │   ├── functions.hpp
│       │   ├── tile_generation.hpp
│       │   ├── validate.hpp
│       │   └── adapter_cblas_fp64.hpp
│       └── src/
│           ├── cholesky_factor.cpp
│           ├── functions.cpp
│           ├── tile_generation.cpp
│           ├── validate.cpp
│           └── adapter_cblas_fp64.cpp
├── hpx/
│   ├── CMakeLists.txt
│   ├── compile.sh          # build script (gcc only)
│   ├── run.sh              # SLURM job script
│   ├── main.cpp
│   └── core/
│       ├── include/
│       │   ├── cholesky_factor.hpp
│       │   ├── functions.hpp
│       │   ├── tile_generation.hpp
│       │   ├── validate.hpp
│       │   └── adapter_cblas_fp64.hpp
│       └── src/
│           ├── cholesky_factor.cpp
│           ├── functions.cpp
│           ├── tile_generation.cpp
│           ├── validate.cpp
│           └── adapter_cblas_fp64.cpp
└── reference/
    ├── CMakeLists.txt
    ├── compile.sh          # build script (gcc only)
    ├── run.sh              # SLURM job script
    ├── main.cpp
    └── core/
        ├── include/
        │   ├── cholesky_factor.hpp
        │   ├── functions.hpp
        │   ├── matrix_generation.hpp
        │   ├── adapter_plasma_fp64.hpp  # only used when ENABLE_PLASMA=ON
        │   ├── validate.hpp
        │   └── adapter_cblas_fp64.hpp
        └── src/
            ├── cholesky_factor.cpp
            ├── functions.cpp
            ├── matrix_generation.cpp
            ├── adapter_plasma_fp64.cpp  # only built when ENABLE_PLASMA=ON
            ├── validate.cpp
            └── adapter_cblas_fp64.cpp

When ENABLE_LAPACKE=OFF, adapter_cblas_fp64.cpp and validate.cpp are still compiled and linked (they share cblas/lapacke symbols with PLASMA's BLAS dependency); only the runtime dispatch of the lapacke mode is skipped.

Formatting

A repository-wide .clang-format governs all subtrees. The top-level CMakeLists.txt wires up clang-format and cmake-format targets via Format.cmake; configure once from the repo root and use the targets:

cmake -B build-fmt
cmake --build build-fmt --target check-clang-format   # CI-style check
cmake --build build-fmt --target fix-clang-format     # apply formatting

Each subproject (openmp/, hpx/, reference/) is its own standalone CMake project with its own dependencies, so the top-level CMakeLists.txt only handles formatting. The actual builds still happen from inside each subdirectory via its compile.sh.

Contributing

We would be happy to expand Cholesky-Bench to additional asynchronous many-task (AMT) runtimes. If you would like to add an implementation, feel free to open a pull request.

How to cite

If you use Cholesky-Bench in your research, please cite:

@misc{coming_soon}

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.github/workflows		.github/workflows
hpx		hpx
openmp		openmp
reference		reference
.clang-format		.clang-format
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cholesky-Bench

Variants

OpenMP (`openmp/`)

HPX (`hpx/`)

Reference (`reference/`)

PLASMA descriptor int32 overflow

Dependencies

Build

CMake options

Run

Directly

Via SLURM

Command-line arguments

Output

Repository structure

Formatting

Contributing

How to cite

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cholesky-Bench

Variants

OpenMP (openmp/)

HPX (hpx/)

Reference (reference/)

PLASMA descriptor int32 overflow

Dependencies

Build

CMake options

Run

Directly

Via SLURM

Command-line arguments

Output

Repository structure

Formatting

Contributing

How to cite

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

OpenMP (`openmp/`)

HPX (`hpx/`)

Reference (`reference/`)

Packages