
Conversation

am17an (Collaborator) commented Sep 10, 2025

Following #15767, I do not see a noticeable difference in performance, but this change has better memory coalescing and uses all available warps for finding slots. In general, this part of the code does not contribute significantly to the runtime in any case.

While looking at optimizing the kernel, I noticed that it is overall bound by register pressure, which affects occupancy. I tried adding #pragma unroll 1 to dial back some of the unrolling, but that only made performance worse.
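For illustration, a minimal sketch of the kind of experiment described above, with a made-up loop rather than the actual kernel code: #pragma unroll 1 keeps a loop rolled, which can lower register pressure (and raise occupancy) at the cost of instruction-level parallelism.

```cuda
// Hypothetical loop, not the real kernel: "#pragma unroll 1" asks nvcc to keep the
// loop rolled, trading instruction-level parallelism for lower register usage.
__global__ void sum_rows_sketch(const float * __restrict__ x, float * __restrict__ dst, const int n) {
    float sum = 0.0f;
#pragma unroll 1
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        sum += x[blockIdx.x*n + i];
    }
    atomicAdd(&dst[blockIdx.x], sum);
}
```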

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 10, 2025
JohannesGaessler (Collaborator) left a comment


My experience with mmq_ids_helper has been that the biggest speedup came from specifying the number of used experts at compile time in order to eliminate the inner loop over n_expert_used.
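A rough sketch of this idea, assuming a hypothetical kernel rather than the actual mmq_ids_helper signature: passing the number of used experts as a template parameter lets the compiler fully unroll the inner loop and drop the runtime bound.

```cuda
#include <cstdint>

// Hypothetical illustration of a compile-time n_expert_used; the real mmq_ids_helper
// has a different signature and body.
template <int n_expert_used_template>
__global__ void ids_copy_sketch(const int32_t * __restrict__ ids, int32_t * __restrict__ slots,
                                const int n_expert_used, const int n_tokens) {
    // Fall back to the runtime value when instantiated with 0.
    const int neu = n_expert_used_template == 0 ? n_expert_used : n_expert_used_template;

    const int token = blockIdx.x*blockDim.x + threadIdx.x;
    if (token >= n_tokens) {
        return;
    }

#pragma unroll
    for (int j = 0; j < neu; ++j) { // fully unrolled when neu is a compile-time constant
        slots[token*neu + j] = ids[token*neu + j];
    }
}
```

On the host side this would presumably be instantiated for the common expert counts and dispatched via a switch, with the 0 instantiation as a generic runtime fallback.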

am17an (Collaborator, Author) commented Sep 11, 2025

> My experience with mmq_ids_helper has been that the biggest speedup came from specifying the number of used experts at compile time in order to eliminate the inner loop over n_expert_used.

Unfortunately, I still don't see a speedup in my tests; I tried with granite-moe and also with test-backend-ops. I also saw that unrolling for 16-32 used experts has a detrimental effect on performance (measured on an RTX 3090) due to increased register pressure.

JohannesGaessler (Collaborator) commented

Regarding register pressure: that is always the biggest limitation for matrix multiplications. For MMF to scale properly to larger batch sizes the memory access patterns will need to be changed. Like in MMQ, it will be necessary to load the src0/src1 data into shared memory tiles, do a __syncthreads, and then do matrix-multiply-accumulate. The important difference vs. the current implementation is that the tiles would be much wider in ne01/ne11 and much shorter in ne00/ne10 and that the data loaded by one warp into shared memory would be used by other warps as well (hence the need for a __syncthreads).
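For readers following along, here is a minimal, generic sketch of that tiling pattern (plain FP32, square tiles, no tensor cores); it is not the actual MMF/MMQ code, where the tile shapes, data types, and matrix-multiply-accumulate path differ.

```cuda
// Minimal shared-memory-tiled matmul sketch: C = A (MxK) * B (KxN), row-major.
#define TILE 32

__global__ void matmul_tiled_sketch(const float * __restrict__ A, const float * __restrict__ B,
                                    float * __restrict__ C, const int M, const int N, const int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y*TILE + threadIdx.y;
    const int col = blockIdx.x*TILE + threadIdx.x;

    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Each thread loads one element of each tile; the loaded data is then used by
        // all warps of the block, which is why the __syncthreads is needed.
        As[threadIdx.y][threadIdx.x] = (row < M && k0 + threadIdx.x < K) ? A[row*K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < N && k0 + threadIdx.y < K) ? B[(k0 + threadIdx.y)*N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k]*Bs[k][threadIdx.x];
        }
        __syncthreads(); // make sure the tiles are no longer read before overwriting them
    }

    if (row < M && col < N) {
        C[row*N + col] = acc;
    }
}
```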

What you could do with less effort is extend the kernel to run more than one CUDA block in parallel for ne11. For MoE that should still be faster than going through synchronization + cuBLAS up to some batch size.
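And a hedged sketch of what the lower-effort option might look like on the launch side; the names, tile sizes, and kernel signature are placeholders, not the actual MMF code:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative launch-side change only: add a grid dimension over ne11 so that several
// CUDA blocks work on different slices of src1 columns in parallel.
constexpr int rows_per_block = 32; // hypothetical tile of src0 rows per block
constexpr int cols_per_block = 8;  // hypothetical tile of src1 columns per block

__global__ void mul_mat_f_sketch(const float * x, const float * y, float * dst,
                                 const int64_t ne01, const int64_t ne11) {
    // blockIdx.x selects the tile of src0 rows, blockIdx.y the slice of src1 columns;
    // the actual matrix-multiplication body is omitted here.
}

void launch_mmf_sketch(const float * x, const float * y, float * dst,
                       const int64_t ne01, const int64_t ne11, cudaStream_t stream) {
    const dim3 block_dims(128, 1, 1);
    const dim3 grid_dims((ne01 + rows_per_block - 1)/rows_per_block,
                         (ne11 + cols_per_block - 1)/cols_per_block, // parallelism over ne11
                         1);
    mul_mat_f_sketch<<<grid_dims, block_dims, 0, stream>>>(x, y, dst, ne01, ne11);
}
```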
