matrix experiments by bashbaug · Pull Request #167 · bashbaug/SimpleOpenCLSamples

bashbaug · 2026-06-02T05:24:31Z

This PR adds several samples that demonstrate various methods of computing a large matrix multiplication. There are currently three samples: one that computes the product of two bfloat16 matrices, another that computes the product of 8-bit integer matrices, and a third that computes a product of tf32 "TensorFloat-32" matrices.

Each sample includes a naive version for correctness checking that runs (usually slowly) on any OpenCL implementation, plus many other variants that demonstrate different extensions and tiling strategies. The samples are flexible and can accommodate other implementations as needed.
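As a rough illustration of what a naive reference version computes, here is a minimal host-side sketch in C++. It is not taken from the PR: the function name `naiveMatMul`, the row-major layout, and the use of `float` for every matrix are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not from the PR).
// Computes C = A * B with all matrices stored row-major:
// A is M x K, B is K x N, and the returned C is M x N.
std::vector<float> naiveMatMul(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K)
{
    assert(A.size() == M * K);
    assert(B.size() == K * N);

    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            // One output element is the dot product of row m of A
            // with column n of B.
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                sum += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = sum;
        }
    }
    return C;
}
```

A triple loop like this is slow, but it depends on no extensions, which is what makes it useful as a baseline for validating optimized variants.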

Copilot

Pull request overview

This PR adds a new “matrix experiments” sample suite under samples/20_* that demonstrates large matrix-multiplication implementations for bf16, int8, and tf32 in OpenCL, including naive reference kernels and multiple optimized subgroup/tiled variants.

Changes:

Adds three new sample executables (matrixexperiments-bf16, matrixexperiments-i8, matrixexperiments-tf32) with corresponding OpenCL kernel sources and README usage docs.
Introduces shared helper utility readStringFromFile() in include/util.hpp and removes duplicate per-sample implementations.
Adds a new include/bfloat16.hpp type helper.
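For readers unfamiliar with the format, the following is a minimal sketch of how a float can be converted to and from bfloat16 bits. It is not the contents of `include/bfloat16.hpp`: the function names `floatToBf16Bits` and `bf16BitsToFloat` are hypothetical, and round-to-nearest-even is an assumed rounding mode that the helper in the PR may or may not use.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Hypothetical helpers, not the PR's bfloat16 type.
// bfloat16 keeps the sign bit, the 8 exponent bits, and the top
// 7 mantissa bits of an IEEE-754 binary32 value.
inline std::uint16_t floatToBf16Bits(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if (std::isnan(f)) {
        // Set a mantissa bit that survives truncation, so the result
        // stays a NaN instead of collapsing to infinity.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Round to nearest, ties to even, on the 16 discarded bits.
    const std::uint32_t roundingBias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + roundingBias) >> 16);
}

inline float bf16BitsToFloat(std::uint16_t h)
{
    // Widening is exact: the bfloat16 bits become the upper half
    // of a binary32 value and the lower half is zero.
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Because bfloat16 shares its exponent width with float, the conversion back to float is a plain shift, and only the narrowing direction needs a rounding decision.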

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 29 comments.

File	Description
samples/CMakeLists.txt	Registers the three new matrix experiment sample subdirectories.
samples/20_matrixexperiments-tf32/README.md	Documents tf32 sample purpose, extensions, and CLI flags.
samples/20_matrixexperiments-tf32/matrix_kernels_tf32.cl	Adds tf32 naive + subgroup + tiled kernel variants.
samples/20_matrixexperiments-tf32/matrix_kernel_tiled_tf32.cl	Adds templated tf32 tiled kernel implementation.
samples/20_matrixexperiments-tf32/matrix_helpers_tf32.cl	Adds tf32 activation + subgroup load/store helpers.
samples/20_matrixexperiments-tf32/main.cpp	Adds tf32 host harness: argument parsing, build, run, validate, benchmark.
samples/20_matrixexperiments-tf32/CMakeLists.txt	Adds build rules for tf32 sample.
samples/20_matrixexperiments-i8/README.md	Documents int8 sample purpose, extensions, and CLI flags.
samples/20_matrixexperiments-i8/matrix_kernels_i8.cl	Adds int8 naive + subgroup + blockread kernel variants.
samples/20_matrixexperiments-i8/matrix_helpers_i8.cl	Adds int8 activation + dp4/dpas emulation + IO helpers.
samples/20_matrixexperiments-i8/main.cpp	Adds int8 host harness: argument parsing, build, run, validate, benchmark.
samples/20_matrixexperiments-i8/CMakeLists.txt	Adds build rules for int8 sample.
samples/20_matrixexperiments-bf16/README.md	Documents bf16 sample purpose, extensions, and CLI flags.
samples/20_matrixexperiments-bf16/matrix_kernels_bf16.cl	Adds bf16 naive + subgroup + tiled kernel variants.
samples/20_matrixexperiments-bf16/matrix_kernel_tiled_bf16.cl	Adds templated bf16 tiled + blockread tiled kernel implementation.
samples/20_matrixexperiments-bf16/matrix_helpers_bf16.cl	Adds bf16 conversion, activation, and subgroup load/store helpers.
samples/20_matrixexperiments-bf16/CMakeLists.txt	Adds build rules for bf16 sample.
samples/06_ndrangekernelfromfile/main.cpp	Removes local readStringFromFile implementation (now centralized).
samples/05_kernelfromfile/main.cpp	Removes local readStringFromFile implementation (now centralized).
include/util.hpp	Adds shared `readStringFromFile()` helper and `<fstream>` include.
include/bfloat16.hpp	Adds a C++ `bfloat16` helper type.
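
The table above describes the int8 helper file as adding dp4/dpas emulation. Assuming "dp4" refers to a four-element 8-bit dot product with accumulate, the operation can be sketched on the host as below. The names `packI8x4` and `dot4AccI8` are hypothetical and are not taken from `matrix_helpers_i8.cl`; the lane order (lane 0 in the low byte) is also an assumption.

```cpp
#include <cstdint>

// Hypothetical host-side sketch, not code from the PR.
// Packs four signed 8-bit values into one 32-bit word, lane 0 in the low byte.
inline std::uint32_t packI8x4(std::int8_t x0, std::int8_t x1,
                              std::int8_t x2, std::int8_t x3)
{
    return  static_cast<std::uint32_t>(static_cast<std::uint8_t>(x0))        |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x1)) << 8)  |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x2)) << 16) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x3)) << 24);
}

// Treats a and b as four packed signed 8-bit lanes and returns
// acc plus the sum of the four lane-wise products.
inline std::int32_t dot4AccI8(std::uint32_t a, std::uint32_t b, std::int32_t acc)
{
    for (int lane = 0; lane < 4; ++lane) {
        // Extract one byte and reinterpret it as a signed 8-bit value.
        const std::int8_t av = static_cast<std::int8_t>(
            static_cast<std::uint8_t>((a >> (8 * lane)) & 0xFFu));
        const std::int8_t bv = static_cast<std::int8_t>(
            static_cast<std::uint8_t>((b >> (8 * lane)) & 0xFFu));
        acc += static_cast<std::int32_t>(av) * static_cast<std::int32_t>(bv);
    }
    return acc;
}
```

Each product is at most 128 × 128 in magnitude, so the 32-bit accumulator has ample headroom for the four products of one call; overflow only becomes a concern over very long accumulation chains.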

+            auto localErr = std::fabs(C[index] - C_ref[index]) /
+                            std::max(std::fabs(C[index]),
+                                    std::fabs(C_ref[index]));
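
The excerpt above divides by the larger of the two magnitudes, which evaluates to 0/0 (NaN) when both the computed and reference values are exactly zero. Whether the surrounding code already handles that case is not visible from the excerpt. A minimal sketch of one way to guard it follows; the helper name `relativeError` is hypothetical, and treating two zeros as zero error is an assumption.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper, not code from the PR.
// Returns |value - reference| / max(|value|, |reference|), and
// treats the case where both inputs are exactly zero as zero error.
inline float relativeError(float value, float reference)
{
    const float denom = std::max(std::fabs(value), std::fabs(reference));
    if (denom == 0.0f) {
        // Both inputs are zero, so they agree exactly; avoid 0/0.
        return 0.0f;
    }
    return std::fabs(value - reference) / denom;
}
```

With this normalization the result always lies between 0 and 2, so a single threshold can be applied regardless of the magnitude of the matrix entries.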


bashbaug added 30 commits January 4, 2024 13:46

basic infrastructure, dpas version is working

472d64c

improved address arithmetic

16c343c

added vnni versions

1da0c2c

cleanup

11e0eef

host code cleanup

d637ee6

add SIMD16 versions and emulation

52b9550

add support for PVC, which does not support SIMD8

074e0a5

fix warning

1ca8f73

add 2D block read variants

f2b00f3

reenable all variants

0bb5529

add vnni block read variants

7b89cfe

fix typo in emulation path

ca4b3cd

start to add block tiled versions

3ffcf3e

improve block tiled versions

b469713

more improvements

098f339

add more block tiled variants

ce7866f

refactor device code into a helper file

48c3bc2

switch to timing using event profiling

feb1064

more refactorization and simplification

4e89026

Now have tiled implementations for SIMD16 as well.

add tiled block read kernels for PVC

c7edcd6

fix block read tiled kernels and execute them

a433769

fix typo affecting one of the SIMD16 kernels

d4eb405

fix a few more bugs and improve validation testing

b6be2d4

add support for a larger A matrix block read

d76df7e

switch the tiled dpas order

756d2e9

We want to prioritize reuse of the A matrix to make best use of read suppression buffers.

temporarily disable the large a matrix block load

4caea7b

This is not working (silently failing) with some recent drivers, so disable it for now. Ideally we will be able to reenable it shortly.

add support for launching more than one subgroup per work group

0fb3d66

This should enable better cache reuse across subgroups.

add support for split barriers

d09b982

This may also be helpful to keep subgroups running approximately together, which could also improve cache utilization.

add support for larger K values for some tiled kernels

031e076

rename tester host functions to match kernel names more closely

16b7cda

Also, remove tK from all host function output, since it is only used internally within the kernels.

bashbaug added 26 commits July 30, 2024 10:46

enable support for the native tf32 dpas

eff9d19

update tf32 function names to be closer to the final versions

af8b5e1

Merge branch 'main' into matrixperf

af02676

Merge branch 'main' into matrixperf

d927552

Merge branch 'matrixperf' into matrixperf-i8

6a72682

revert change to tf32 kernel

83a0690

fix typo

7e95831

Merge remote-tracking branch 'origin/main' into matrixperf-i8

dddfdf3

switch block read functions to the production names

41159a8

Tiled kernels still need to be enabled and ported.

add transpose block read variant

ddc93ff

Merge remote-tracking branch 'origin/main' into matrixperf-i8

fbb652f

switch to a more efficient sequence with conditional movs

0734097

cleanup

3324a53

Merge branch 'main' into matrixperf

ace054e

switch to production 2d block io functions

62a1fd8

switch more block reads to the production versions

badf4c2

Merge branch 'matrixperf-i8' into matrixperf-final

a58314a

integrate i8 matrix multiplication

9988e6d

switch to final directories and sample names

d680ff1

Merge branch 'main' into matrixperf-final

5fe7e09

Merge branch 'main' into matrixperf-final

a348dea

update copyright, add README

af28503

remove warning when split barriers are unsupported

7f5764c

fixes for CPU and more

4e31c01

switch to kernels that use integer dot products

ada8a4a

Merge branch 'main' into matrixperf-final

2d734c0

bashbaug requested a review from Copilot June 2, 2026 05:24

Copilot started reviewing on behalf of bashbaug June 2, 2026 05:24

Copilot AI reviewed Jun 2, 2026

minor fixes

2ae7f9b

bashbaug wants to merge 107 commits into main from matrixperf-final

2 participants
