vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations #11595

Draft · wants to merge 7 commits into master
Conversation

@remyoudompheng (Contributor) commented Feb 2, 2025

(This is a draft written on top of #11501 and #11528.)

This PR introduces MMV kernels for IQ2 and IQ3 quantizations. It also includes optimizations suggested by @jeffbolznv (unrolled init_iq_shmem and 2x block size in mul_mat_vec).

After this PR, the performance of IQ2/IQ3 quants is in line with comparable K-quants (model size × t/s is similar).
Note that the kernels for IQ1 quants are included in #11528.

Performance before all optimizations
(both Mesa compilers for the AMD target are shown: ACO and LLVM)
(llama-bench output is annotated with the estimated bandwidth, model size × t/s; e.g. 8.88 GiB × 8.28 t/s ≈ 73.5 GiB/s for Q2_K below)
(Qwen IQ1 model files are from https://huggingface.co/legraphista/Qwen2.5-Coder-7B-Instruct-IMat-GGUF)
(model files from bartowski/Mistral-Small-24B-Instruct-2501-GGUF carry the wrong name "llama 13B")

Backend 1/2: Vulkan0
  Device description: AMD Radeon 780M (RADV GFX1103_R1)
  Device memory: 17066 MB (17066 MB free)

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):      41.57 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):      75.72 GFLOPS

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    450.75 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    349.44 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    274.34 GFLOPS

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   344.50 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   288.32 GFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 345.72 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  325.93 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   262.45 GFLOPS

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 358.35 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   310.26 GFLOPS

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  274.33 GFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  265.44 GFLOPS

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        238.80 ± 4.48 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         17.82 ± 0.38 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        233.74 ± 0.83 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         16.20 ± 0.03 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         59.33 ± 0.02 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.43 ± 0.07 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         59.93 ± 0.35 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.66 ± 0.02 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         55.63 ± 0.22 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.64 ± 0.10 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         56.05 ± 0.23 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          8.28 ± 0.06 | 73.5 GiB/s
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         47.16 ± 0.02 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.20 ± 0.03 | 71.5 GiB/s

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1 (LLVM 19.1.7)) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        133.73 ± 1.47 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         12.92 ± 0.00 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        128.73 ± 2.73 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         11.15 ± 0.02 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         40.82 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          3.49 ± 0.00 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         35.25 ± 0.19 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          2.00 ± 0.01 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         38.51 ± 0.02 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.03 ± 0.00 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         30.34 ± 0.03 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.08 ± 0.00 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         27.12 ± 0.01 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.56 ± 0.00 |

Performance after these optimizations:

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   707.53 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   639.12 GFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 524.20 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  507.47 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   458.70 GFLOPS

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 375.33 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   337.94 GFLOPS

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  257.80 GFLOPS

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        248.47 ± 0.47 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         34.39 ± 0.12 | 60.9 GiB/s
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        228.57 ± 6.27 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         32.25 ± 0.22 | 61.3 GiB/s
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         62.63 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |         10.06 ± 0.01 | 70.0 GiB/s
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         55.94 ± 0.29 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          8.75 ± 0.18 | 66.1 GiB/s
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         57.35 ± 0.05 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          7.61 ± 0.00 | 70.2 GiB/s

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1 (LLVM 19.1.7)) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        135.52 ± 0.62 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         31.07 ± 0.53 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        122.89 ± 0.04 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         28.14 ± 0.07 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         40.84 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.37 ± 0.01 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         35.53 ± 0.02 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.64 ± 0.00 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         39.29 ± 0.04 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.22 ± 0.00 |

@github-actions bot added the Vulkan (issues specific to the Vulkan backend), devops (improvements to build systems and GitHub Actions) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 2, 2025
@remyoudompheng (Contributor, Author) commented:

llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs

  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=1): OK
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=0): AddressSanitizer: CHECK failed: asan_allocator.cpp:190 "((old)) == ((kAllocBegMagic))" (0x2b2b2b1908081908, 0xcc6e96b9cc6e96b9) (tid=2409713)
    #0 0x56059d6dac9b in __asan::CheckUnwind() asan_rtl.cpp.o
    #1 0x56059d6fac00 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (llama.cpp/build/bin/test-backend-ops+0x15cc00) (BuildId: b8c3518bde2946e83d4f9b8f4732cf76ed58a79a)

Adding a bounds check makes it happy:

shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    // copy the table into shared memory and sync
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        if (i + gl_LocalInvocationIndex.x < iq2xxs_grid.length()) {
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
        }
    }
    barrier();
}

@jeffbolznv (Collaborator) commented:

> llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs
>
> adding a bounds check makes it happy

I didn't realize we were using such large workgroup sizes with these init functions for get_rows. Maybe the branch condition should include something like ((length % wgsize.x) != 0) && so it's optimized away in the mul_mat shaders.
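One way to read that suggestion is the following sketch (my reconstruction, not necessarily the code that was merged): guard the bounds check with the divisibility test, so that when the workgroup size evenly divides the table size, as in the mul_mat shaders where wgsize is a compile-time constant, the whole check is constant-folded away:

```glsl
shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        // If wgsize.x divides the table size, the modulus is a compile-time
        // zero and the bounds check disappears; otherwise the check keeps
        // trailing invocations from writing past the end of the table.
        if ((iq2xxs_grid.length() % wgsize.x) == 0 ||
            i + gl_LocalInvocationIndex.x < iq2xxs_grid.length()) {
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
        }
    }
    barrier();
}
```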

@netrunnereve (Collaborator) commented Feb 3, 2025

> llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs

That's why I love the llvmpipe test as it finds all those issues which get ignored by regular GPUs or traditional subgroup sizes.

BTW, have you noticed an improvement on your end with bitfieldExtract? I've tried it in the past but ended up not bothering with it, as the compiler was always smart enough to use the bfe hardware instruction instead of a shift and an and. At the same time I've also seen it mess up the ternary operator and insert real branches sometimes, which is why I got rid of all of them in #11081. Compilers are weird.

@remyoudompheng (Contributor, Author) commented:

In that case I believe the issue also appears on actual GPUs, but it is probably hidden by hardware bounds checking, which llvmpipe lacks.
I don't think bitfieldExtract is necessary here, but as a matter of personal taste it feels a bit clearer than shifts and masks (it avoids too many parentheses). Here the ternary-operator pattern is simple enough to compile to 2 instructions (test bit, then v_cndmask selecting between -x and x) on AMD.
