
Conversation

am17an (Collaborator) commented Sep 10, 2025

Following #15767, I do not see a noticeable difference in performance, but this change has better memory coalescing and uses all available warps for finding slots. In general, this part of the code does not contribute significantly to the runtime in any case.

While looking at optimizing the kernel, I noticed that it is overall bound by register pressure, which affects occupancy. I tried adding #pragma unroll 1 to dial back some of the unrolling, but that only made performance worse.
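For illustration, a minimal sketch of the kind of experiment described above, with a made-up loop rather than the actual kernel code: #pragma unroll 1 keeps a loop rolled, which can lower register pressure (and raise occupancy) at the cost of instruction-level parallelism.

```cuda
// Hypothetical loop, not the real kernel: "#pragma unroll 1" asks nvcc to keep the
// loop rolled, trading instruction-level parallelism for lower register usage.
__global__ void sum_rows_sketch(const float * __restrict__ x, float * __restrict__ dst, const int n) {
    float sum = 0.0f;
#pragma unroll 1
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        sum += x[blockIdx.x*n + i];
    }
    atomicAdd(&dst[blockIdx.x], sum);
}
```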

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 10, 2025
JohannesGaessler (Collaborator) left a comment


My experience with mmq_ids_helper has been that the biggest speedup came from specifying the number of used experts at compile time in order to eliminate the inner loop over n_expert_used.
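A rough sketch of this idea, assuming a hypothetical kernel rather than the actual mmq_ids_helper signature: passing the number of used experts as a template parameter lets the compiler fully unroll the inner loop and drop the runtime bound.

```cuda
#include <cstdint>

// Hypothetical illustration of a compile-time n_expert_used; the real mmq_ids_helper
// has a different signature and body.
template <int n_expert_used_template>
__global__ void ids_copy_sketch(const int32_t * __restrict__ ids, int32_t * __restrict__ slots,
                                const int n_expert_used, const int n_tokens) {
    // Fall back to the runtime value when instantiated with 0.
    const int neu = n_expert_used_template == 0 ? n_expert_used : n_expert_used_template;

    const int token = blockIdx.x*blockDim.x + threadIdx.x;
    if (token >= n_tokens) {
        return;
    }

#pragma unroll
    for (int j = 0; j < neu; ++j) { // fully unrolled when neu is a compile-time constant
        slots[token*neu + j] = ids[token*neu + j];
    }
}
```

On the host side this would presumably be instantiated for the common expert counts and dispatched via a switch, with the 0 instantiation as a generic runtime fallback.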

am17an (Collaborator, Author) commented Sep 11, 2025

> My experience with mmq_ids_helper has been that the biggest speedup came from specifying the number of used experts at compile time in order to eliminate the inner loop over n_expert_used.

Unfortunately, I still don't see a speedup in my tests; I tried with granite-moe and also with test-backend-ops. I also saw that unrolling for 16-32 used experts has a detrimental effect on performance (measured on an RTX 3090) due to increased register pressure.

JohannesGaessler (Collaborator) commented

Regarding register pressure: that is always the biggest limitation for matrix multiplications. For MMF to scale properly to larger batch sizes the memory access patterns will need to be changed. Like in MMQ, it will be necessary to load the src0/src1 data into shared memory tiles, do a __syncthreads, and then do matrix-multiply-accumulate. The important difference vs. the current implementation is that the tiles would be much wider in ne01/ne11 and much shorter in ne00/ne10 and that the data loaded by one warp into shared memory would be used by other warps as well (hence the need for a __syncthreads).
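For readers following along, here is a minimal, generic sketch of that tiling pattern (plain FP32, square tiles, no tensor cores); it is not the actual MMF/MMQ code, where the tile shapes, data types, and matrix-multiply-accumulate path differ.

```cuda
// Minimal shared-memory-tiled matmul sketch: C = A (MxK) * B (KxN), row-major.
#define TILE 32

__global__ void matmul_tiled_sketch(const float * __restrict__ A, const float * __restrict__ B,
                                    float * __restrict__ C, const int M, const int N, const int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y*TILE + threadIdx.y;
    const int col = blockIdx.x*TILE + threadIdx.x;

    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Each thread loads one element of each tile; the loaded data is then used by
        // all warps of the block, which is why the __syncthreads is needed.
        As[threadIdx.y][threadIdx.x] = (row < M && k0 + threadIdx.x < K) ? A[row*K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < N && k0 + threadIdx.y < K) ? B[(k0 + threadIdx.y)*N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k]*Bs[k][threadIdx.x];
        }
        __syncthreads(); // make sure the tiles are no longer read before overwriting them
    }

    if (row < M && col < N) {
        C[row*N + col] = acc;
    }
}
```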

What you could do with less effort is extend the kernel to run more than one CUDA block in parallel for ne11. For MoE that should still be faster than going through synchronization + cuBLAS up to some batch size.
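And a hedged sketch of what the lower-effort option might look like on the launch side; the names, tile sizes, and kernel signature are placeholders, not the actual MMF code:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative launch-side change only: add a grid dimension over ne11 so that several
// CUDA blocks work on different slices of src1 columns in parallel.
constexpr int rows_per_block = 32; // hypothetical tile of src0 rows per block
constexpr int cols_per_block = 8;  // hypothetical tile of src1 columns per block

__global__ void mul_mat_f_sketch(const float * x, const float * y, float * dst,
                                 const int64_t ne01, const int64_t ne11) {
    // blockIdx.x selects the tile of src0 rows, blockIdx.y the slice of src1 columns;
    // the actual matrix-multiplication body is omitted here.
}

void launch_mmf_sketch(const float * x, const float * y, float * dst,
                       const int64_t ne01, const int64_t ne11, cudaStream_t stream) {
    const dim3 block_dims(128, 1, 1);
    const dim3 grid_dims((ne01 + rows_per_block - 1)/rows_per_block,
                         (ne11 + cols_per_block - 1)/cols_per_block, // parallelism over ne11
                         1);
    mul_mat_f_sketch<<<grid_dims, block_dims, 0, stream>>>(x, y, dst, ne01, ne11);
}
```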
