vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations #11595

Draft · wants to merge 7 commits into master
Conversation

@remyoudompheng (Contributor) commented Feb 2, 2025

(This is a draft written on top of #11501 and #11528.)

This PR introduces MMV kernels for IQ2 and IQ3 quantizations. It also includes optimizations suggested by @jeffbolznv (unrolled init_iq_shmem and 2x block size in mul_mat_vec).

After this PR, the performance of IQ2/IQ3 quants is in line with comparable K-quants (model size × t/s is similar).
Note that the kernels for IQ1 quants are included in #11528.

Performance before all optimizations
(both Mesa compilers for the AMD target are shown: ACO and LLVM)
(llama-bench output is annotated with the estimated bandwidth, model size × t/s; e.g. 8.88 GiB × 8.28 t/s ≈ 73.5 GiB/s for Q2_K below)
(Qwen IQ1 model files are from https://huggingface.co/legraphista/Qwen2.5-Coder-7B-Instruct-IMat-GGUF)
(model files from bartowski/Mistral-Small-24B-Instruct-2501-GGUF carry the wrong name "llama 13B")

Backend 1/2: Vulkan0
  Device description: AMD Radeon 780M (RADV GFX1103_R1)
  Device memory: 17066 MB (17066 MB free)

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):      41.57 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):      75.72 GFLOPS

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    450.75 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    349.44 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    274.34 GFLOPS

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   344.50 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   288.32 GFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 345.72 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  325.93 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   262.45 GFLOPS

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 358.35 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   310.26 GFLOPS

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  274.33 GFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  265.44 GFLOPS

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        238.80 ± 4.48 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         17.82 ± 0.38 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        233.74 ± 0.83 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         16.20 ± 0.03 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         59.33 ± 0.02 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.43 ± 0.07 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         59.93 ± 0.35 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.66 ± 0.02 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         55.63 ± 0.22 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.64 ± 0.10 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         56.05 ± 0.23 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          8.28 ± 0.06 | 73.5 GiB/s
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         47.16 ± 0.02 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.20 ± 0.03 | 71.5 GiB/s

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1 (LLVM 19.1.7)) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        133.73 ± 1.47 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         12.92 ± 0.00 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        128.73 ± 2.73 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         11.15 ± 0.02 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         40.82 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          3.49 ± 0.00 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         35.25 ± 0.19 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          2.00 ± 0.01 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         38.51 ± 0.02 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.03 ± 0.00 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         30.34 ± 0.03 |
| llama 13B Q2_K - Medium        |   8.88 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.08 ± 0.00 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         27.12 ± 0.01 |
| llama 13B Q3_K - Large         |  11.54 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.56 ± 0.00 |

Performance after these optimizations:

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   707.53 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   639.12 GFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 524.20 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  507.47 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   458.70 GFLOPS

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 375.33 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   337.94 GFLOPS

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  257.80 GFLOPS

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        248.47 ± 0.47 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         34.39 ± 0.12 | 60.9 GiB/s
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        228.57 ± 6.27 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         32.25 ± 0.22 | 61.3 GiB/s
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         62.63 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |         10.06 ± 0.01 | 70.0 GiB/s
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         55.94 ± 0.29 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          8.75 ± 0.18 | 66.1 GiB/s
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         57.35 ± 0.05 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          7.61 ± 0.00 | 70.2 GiB/s

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1 (LLVM 19.1.7)) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        135.52 ± 0.62 |
| qwen2 7B IQ1_S - 1.5625 bpw    |   1.77 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         31.07 ± 0.53 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         pp512 |        122.89 ± 0.04 |
| qwen2 7B IQ1_M - 1.75 bpw      |   1.90 GiB |     7.62 B | Vulkan     |  99 |         tg128 |         28.14 ± 0.07 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         40.84 ± 0.06 |
| llama 13B IQ2_S - 2.5 bpw      |   6.96 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.37 ± 0.01 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         35.53 ± 0.02 |
| llama 13B IQ2_M - 2.7 bpw      |   7.55 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          4.64 ± 0.00 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         pp512 |         39.29 ± 0.04 |
| llama 13B IQ3_XS - 3.3 bpw     |   9.22 GiB |    23.57 B | Vulkan     |  99 |         tg128 |          6.22 ± 0.00 |

@github-actions bot added the Vulkan (issues specific to the Vulkan backend), devops (improvements to build systems and GitHub Actions) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 2, 2025
@remyoudompheng (Contributor, Author) commented:

llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs

  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=1): OK
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=0): AddressSanitizer: CHECK failed: asan_allocator.cpp:190 "((old)) == ((kAllocBegMagic))" (0x2b2b2b1908081908, 0xcc6e96b9cc6e96b9) (tid=2409713)
    #0 0x56059d6dac9b in __asan::CheckUnwind() asan_rtl.cpp.o
    #1 0x56059d6fac00 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (llama.cpp/build/bin/test-backend-ops+0x15cc00) (BuildId: b8c3518bde2946e83d4f9b8f4732cf76ed58a79a)

Adding a bounds check makes it happy:

shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    // copy the table into shared memory and sync
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        if (i + gl_LocalInvocationIndex.x < iq2xxs_grid.length()) {
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
        }
    }
    barrier();
}

@jeffbolznv (Collaborator) commented:

> llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs
>
> adding a bounds check makes it happy

I didn't realize we were using such large workgroup sizes with these init functions for get_rows. Maybe the branch condition should include something like ((length % wgsize.x) != 0) && so it's optimized away in the mul_mat shaders.
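One way to read that suggestion is the following sketch (my reconstruction, not necessarily the code that was merged): guard the bounds check with the divisibility test, so that when the workgroup size evenly divides the table size, as in the mul_mat shaders where wgsize is a compile-time constant, the whole check is constant-folded away:

```glsl
shared uvec2 iq2xxs_grid[256];

void init_iq_shmem(uvec3 wgsize)
{
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        // If wgsize.x divides the table size, the modulus is a compile-time
        // zero and the bounds check disappears; otherwise the check keeps
        // trailing invocations from writing past the end of the table.
        if ((iq2xxs_grid.length() % wgsize.x) == 0 ||
            i + gl_LocalInvocationIndex.x < iq2xxs_grid.length()) {
            iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
        }
    }
    barrier();
}
```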

@netrunnereve (Collaborator) commented Feb 3, 2025

> llvmpipe seems to have issues with the init_iq_shmem of iq2_xxs

That's why I love the llvmpipe test as it finds all those issues which get ignored by regular GPUs or traditional subgroup sizes.

BTW, have you noticed an improvement on your end with bitfieldExtract? I've tried it in the past but ended up not bothering with it, as the compiler was always smart enough to use the bfe hardware instruction instead of a shift and an and. At the same time I've also seen it mess up the ternary operator and insert real branches sometimes, which is why I got rid of all of them in #11081. Compilers are weird.

@remyoudompheng (Contributor, Author) commented:

In that case I believe the issue also appears on actual GPUs, but it is probably hidden by hardware bounds checking, which llvmpipe lacks.
I don't think bitfieldExtract is necessary here, but as a matter of personal taste it feels a bit clearer than shifts and masks (it avoids too many parentheses). Here the ternary-operator pattern is simple enough to compile to 2 instructions (test bit, then v_cndmask selecting between -x and x) on AMD.
