vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations #11595
base: master
Conversation
llvmpipe seems to have issues with the shared memory table copy in `init_iq_shmem`; adding a bounds check makes it happy:

```glsl
shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    // copy the table into shared memory and sync
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        // bounds check: a workgroup larger than the table would
        // otherwise index past the end of the arrays
        if (i + gl_LocalInvocationIndex.x < iq2xxs_grid.length())
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
    }
    barrier();
}
```
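For context, a kernel using one of these tables would call the init function once at the top of `main`, before any reads from the shared copy. A minimal sketch (the surrounding kernel body is omitted):

```glsl
void main() {
    // populate the shared lookup table; init_iq_shmem() ends with a
    // barrier(), so the table is fully written before any thread reads it
    init_iq_shmem(gl_WorkGroupSize);
    // ... rest of the kernel reads iq2xxs_grid[...] from shared memory
}
```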
I didn't realize we were using such large workgroup sizes with these init functions for getrows. Maybe the branch condition should do something like the variant sketched below.
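One plausible form of that condition (a hypothetical sketch, not the exact snippet from the discussion), assuming the intent is to let the compiler drop the bounds check whenever the table size is an exact multiple of the workgroup size:

```glsl
// Hypothetical variant: the bounds check is only needed when the table
// size is not a multiple of wgsize.x, so writing the condition this way
// lets the compiler fold the check away in the common case (wgsize.x is
// a specialization constant here).
[[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
    if (iq2xxs_grid.length() % wgsize.x == 0 ||
        i + gl_LocalInvocationIndex.x < iq2xxs_grid.length()) {
        iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
    }
}
```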
That's why I love the llvmpipe test, as it finds all those issues that get ignored by regular GPUs or traditional subgroup sizes. BTW, have you noticed an improvement on your end with …?
In that case I believe the issue also appears on actual GPUs, but it is probably hidden by hardware bounds checking, which llvmpipe doesn't have.
(This is a draft written on top of #11501 and #11528.)
This PR introduces MMV kernels for IQ2 and IQ3 quantizations. It also includes optimizations suggested by @jeffbolznv (unrolled `init_iq_shmem` and 2x block size in `mul_mat_vec`).

After this PR the performance of IQ2/IQ3 seems in line with comparable K-quants (`model size × t/s` is similar). Note that the kernels for IQ1 quants are included in #11528.
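As a sanity check on that metric: `model size × t/s` approximates the effective memory bandwidth of token generation, since each generated token has to read the full weight set once. For example (illustrative numbers, not taken from the benchmarks below), a 2.5 GiB model decoding at 120 t/s implies roughly 2.5 GiB × 120/s ≈ 300 GiB/s of effective read bandwidth, so two quantizations with similar `model size × t/s` are making similarly good use of the memory bus.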
Performance before all optimizations
(both Mesa compilers for the AMD target are shown: ACO and LLVM)
(llama-bench output is annotated with the estimated bandwidth `model size × t/s`)
(Qwen IQ1 model files are from https://huggingface.co/legraphista/Qwen2.5-Coder-7B-Instruct-IMat-GGUF)
(model files from `bartowski/Mistral-Small-24B-Instruct-2501-GGUF` have the wrong name "llama 13B")

Performance after: