Conversation

rmatif
Collaborator

@rmatif rmatif commented Sep 4, 2025

This PR adds a mat_mul variant designed for embedded GPUs, as the current shaders perform poorly there. Essentially I reused the approach from my OpenCL implementation #14535.

It has currently been tested only on a Mali GPU, but I believe it should suit others well too.

ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: KHR_coopmat

| Model | Test | Master (t/s) | PR (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| Qwen2 1.5B Q4_0 | pp512 | 2.81 ± 0.03 | 66.26 ± 0.06 | 23.58x |
| Llama 1B F16 | pp512 | 5.76 ± 0.08 | 95.83 ± 0.14 | 16.63x |

Master:

  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 2 runs - 634214.00 us/run -  60.13 GFLOP/run -  8.92 GFLOPS

PR:

  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 2 runs - 634214.00 us/run -  60.13 GFLOP/run -  150.01 GFLOPS

I started messing with coopmat, but didn't have enough time to include it in this PR. If we go this route for embedded GPUs, I plan to add variants for conv2d, mul_vec, and maybe FA in the future.

@rmatif rmatif requested a review from 0cc4m as a code owner September 4, 2025 16:25
@rmatif
Collaborator Author

rmatif commented Sep 4, 2025

I got two failures with iq2_s and iq3_s. Does anyone know why these two in particular are failing?

[MUL_MAT] NMSE = 0.363474692 > 0.000500000   MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL
[MUL_MAT] NMSE = 2.417624850 > 0.000500000   MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL

@jeffbolznv
Collaborator

Does anyone know why these two in particular are failing?

I think there's a pre-existing bug in the dequant functions for these types. I ran into this once before when working on get_rows (I think?); the bug didn't seem to be happening in any code paths that were getting hit in practice, and it wasn't obvious, so I didn't pursue it.

Can you explain what it is about your matmul shader that makes it faster for mobile? At first glance it doesn't seem fundamentally different from what the scalar path is doing, except that you're dequantizing the matrix as a separate pass, and maybe some minor differences in tile size. I'll leave some comments on the code in a little while.

@github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 4, 2025
@rmatif
Collaborator Author

rmatif commented Sep 4, 2025

Can you explain what it is about your matmul shader that makes it faster for mobile?

The main difference I'd say is the very low register pressure and the avoidance of spilling. Mali provides only 64 registers per thread, so the entire design was built around that constraint. I tried fine-tuning the existing shaders some time ago but without success. I also believe that, due to their simplicity, even outdated drivers on low-end hardware should handle these shaders more easily.
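To make that concrete, here is a minimal sketch of the idea; TM and TN are assumed values, not the PR's actual tile shape. The per-thread accumulator tile is sized so that it, plus loop temporaries, stays inside the 64-register budget and never spills.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

const uint TM = 4;   // rows of the output tile each thread accumulates (assumed)
const uint TN = 4;   // columns of the output tile each thread accumulates (assumed)

void main() {
    // 16 scalar accumulators stay resident in registers for the whole k-loop;
    // growing TM/TN grows this footprint and, past the budget, forces spills
    float sums[TM * TN];
    [[unroll]] for (uint i = 0; i < TM * TN; ++i) {
        sums[i] = 0.0;
    }
    // ... the k-loop would load A/B fragments and fma into sums[] here ...
}
```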

@jeffbolznv
Collaborator

But you can reduce the register usage by changing the tile size via spec constants. The other big difference I see is that you're using vec4s everywhere; I wonder if that's somehow related.

@rmatif
Collaborator Author

rmatif commented Sep 4, 2025

But you can reduce the register usage by changing the tile size via spec constants. The other big difference I see is that you're using vec4s everywhere; I wonder if that's somehow related.

I know, but for some reason that wasn't enough (though I was testing on an older and less powerful device, so I should give it another try). On Adreno, vec4 is much faster; on the latest-gen Mali it shouldn't make much of a difference, since according to Arm's documentation it should perform similarly to scalar code, so I kept it in case another device benefits from it.
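For illustration only, a hedged sketch of the scalar-vs-vec4 load contrast; the buffer names are made up and this is not the PR's actual load code.

```glsl
#version 450
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout (binding = 0, std430) readonly buffer BufScalar { float data_a[];  };
layout (binding = 1, std430) readonly buffer BufVec4   { vec4  data_a4[]; };

vec4 load_scalar(uint i) {
    // four separate 4-byte loads
    return vec4(data_a[4*i], data_a[4*i + 1], data_a[4*i + 2], data_a[4*i + 3]);
}

vec4 load_vec4(uint i) {
    // one 16-byte load: clearly faster on Adreno, roughly on par with the
    // scalar form on recent Mali according to Arm's documentation
    return data_a4[i];
}

void main() {}
```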

for (uint t = 0; t < num_k_tiles; t++) {
const uint k_tile_start = t * BK;

#pragma unroll
Collaborator


[[unroll]] is preferred.

Collaborator Author


Isn't #pragma unroll better for old compilers?

Collaborator


We use [[unroll]] in a bunch of shaders and haven't had any problems.
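For reference, a minimal example of the attribute form used throughout the existing ggml Vulkan shaders:

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

void main() {
    // [[unroll]] comes from GL_EXT_control_flow_attributes and replaces #pragma unroll
    [[unroll]] for (uint k = 0; k < 8; ++k) {
        // body the compiler is asked to fully unroll
    }
}
```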

Collaborator

@0cc4m 0cc4m left a comment


I don't have time right now for a full review, but I wanted to add this to Jeff's comments.

Edit: impressive results, nice work. It's great to add support to more kinds of devices.

@@ -2901,6 +2913,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_MM(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4].f32acc, matmul_id_mxfp4_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, 4, _id, 0);
}
}

if (device->vendor_id == VK_VENDOR_ID_ARM) {
Collaborator


I would prefer if this used the same codepath as the main shader, since it's duplicating quite a bit. If the push constants are the same, is there a reason not to just add an ARM path to the shader selection function and leave the dequant to the existing logic?

@netrunnereve
Collaborator

If we go this route for embedded GPUs, I plan to add variants for conv2d, mul_vec, and maybe FA in the future

If we go that route we'll need a better way of testing these shaders. As a start we can have an environment variable or compile flag to enable the embedded path, and while I don't see anything Arm-specific in the code, we should try it out on some desktop GPUs to see if it runs fine there.

@netrunnereve
Collaborator

Also I'm finding it a bit hard to believe that it's faster to dequant everything, write that to regular memory, and then read all that back into shared memory for the actual multiplication. That's basically limiting everything to your memory speed, like inference.

I don't think having fewer registers matters in this case, since there's not much difference in register count between writing the dequantized weights to regular memory versus shared memory. The mul mat shader can be treated as two sections, where one does the dequantization to shared memory and the other does the actual multiplication, and the register counts for each can be more or less independent. I'd be curious to see how long the dequantization takes compared to the multiplication if you run with GGML_VK_PERF_LOGGER.

And as a side note, our dequant functions are essentially triplicated across dequant_funcs.comp, mul_mm.comp, and dequant_q*.comp, and they really should be merged if possible.

@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

I'd be curious to see how long the dequantization takes compared to the multiplication if you run with GGML_VK_PERF_LOGGER

I will try to take a look.

I did a quick test by fine-tuning and matching the tile sizes of the current scalar shaders to this one, and I observed roughly the same results (a bit faster due to coopmat). Marking this as a draft for now, as I still need to test it on older devices and other vendors to see if it does any good.

@rmatif rmatif marked this pull request as draft September 5, 2025 06:33
@0cc4m
Collaborator

0cc4m commented Sep 5, 2025

The perf logger does not show the dequant and matmul dispatches separately; you'd need a profiler for that.

@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

The perf logger does not show the dequant and matmul dispatches separately; you'd need a profiler for that.

I don't know if Arm offers that. I will reach out to some of their engineers and ask them.

@0cc4m
Collaborator

0cc4m commented Sep 5, 2025

I've done most of my work without profilers; just logically, it should be better to dequant directly to shared memory, because you avoid the need for an intermediate buffer in VRAM and you avoid the global reads/writes. But if the regular mul_mm shader works with smaller tiles already, then you don't need to implement that, luckily.
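As a hedged sketch of what the fused version would look like (tile sizes, names, and the dequant call are placeholders, not actual ggml code): each workgroup dequantizes its own A-tile straight into shared memory, so the intermediate f16 buffer in VRAM and the extra global round-trip disappear.

```glsl
#version 450
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

// Placeholder tile sizes; a real shader would take these as spec constants.
const uint BM = 32;
const uint BK = 8;
const uint WG_SIZE = 128;

shared float tile_a[BM * BK];

void main() {
    const uint tid = gl_LocalInvocationID.x;

    // cooperative fill: each thread dequantizes a few elements of the tile
    // directly into shared memory instead of reading a pre-dequantized buffer
    for (uint i = tid; i < BM * BK; i += WG_SIZE) {
        tile_a[i] = 0.0; // stand-in for dequantizing the block covering element i
    }
    barrier();

    // ... the multiplication stage then reads tile_a from shared memory ...
}
```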

@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

I've done most of my work without profilers; just logically, it should be better to dequant directly to shared memory, because you avoid the need for an intermediate buffer in VRAM and you avoid the global reads/writes. But if the regular mul_mm shader works with smaller tiles already, then you don't need to implement that, luckily.

I spent quite a bit of time trying to do the dequant directly in shared memory, but it turned out to be too much work for me, so I dropped it and, out of "laziness", adopted this two-stage approach. That said, I do agree that the optimal solution is to do it directly.

@rmatif rmatif marked this pull request as ready for review September 5, 2025 21:27
@rmatif rmatif requested a review from jeffbolznv September 5, 2025 21:54
@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

I've unlocked a massive improvement that goes far beyond the fine-tuned existing shaders, which would justify the existence of this one. I've updated the performance numbers in the OP and will soon look into reports on older devices.

Turns out the compiler has some specific quirks. It generates much faster code from explicit, manually unrolled MADs, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability. I stopped here; I'm a bit short on time anyway.
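Roughly what is meant by flattening, sketched here with assumed names and an assumed unroll factor of 4 (not the PR's actual code):

```glsl
#version 450
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

// One k-step of the accumulation, written as an explicit chain of fused
// multiply-adds instead of a short inner loop over the four B components.
void accumulate(inout vec4 sum, vec4 a0, vec4 a1, vec4 a2, vec4 a3, vec4 b) {
    // instead of: for (uint i = 0; i < 4; ++i) sum = fma(a[i], vec4(b[i]), sum);
    sum = fma(a0, vec4(b.x), sum);
    sum = fma(a1, vec4(b.y), sum);
    sum = fma(a2, vec4(b.z), sum);
    sum = fma(a3, vec4(b.w), sum);
}

void main() {
    vec4 sum = vec4(0.0);
    accumulate(sum, vec4(1.0), vec4(2.0), vec4(3.0), vec4(4.0),
               vec4(0.5, 0.25, 0.125, 0.0625));
}
```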

@jeffbolznv If you have a chance, could you please review again? I believe I've addressed your comments, but let me know if I missed something.

As a start we can have an environment variable

I think using an env var is better. I don't see why it wouldn't work on a dGPU given the extreme simplicity, and honestly I don't see the usefulness of running it there. The compilers and architectures are so different that I don't think we'd gain much information from it.

Also I'm finding it a bit hard to believe that it's faster to dequant everything, write that to regular memory, and then read all that back into shared memory for the actual multiplication. That's basically limiting everything to your memory speed, like inference.

I haven't touched the dequant part; it's still the same. I'm writing to an f16 buffer, and the f16xf32/f32xf32 path is now much faster, hence the speedup. As I mentioned, I started experimenting with dequant in shared memory, but it turned out to be too much work for a first step. We can leave that as a future plan.

If someone has a Raspberry Pi lying around, I'd be very curious to see the performance gains on that kind of device.

}
}

const uint num_k_tiles = (p.K + BK - 1) / BK;
Collaborator Author


I think this is not robust enough and might be wrong for the Adreno case, but it passes the tests in test-backend-ops even though I feel like it shouldn't.

Collaborator


Shouldn't be hard to add a case or two with odd K. I suggest having relatively small M,N to avoid the error being hidden.

Collaborator Author


I misquoted; I was thinking more about the Adreno case:

BM = 32, BK = 8 -> VEC_K = 2
WG_SIZE = 128
A_LOADS_PER_THREAD = (32 * 2) / 128 = 64 / 128 = 0

So theoretically it shouldn't be able to load matrix A regardless of the dimensions, but the tests are passing, so I'm a bit confused.
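To restate the arithmetic above in code (constant names are assumptions, not the PR's actual ones), together with the kind of ceil-divide guard that would avoid the truncation:

```glsl
#version 450
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

const uint BM = 32, VEC_K = 2, WG_SIZE = 128;
const uint A_LOADS_PER_THREAD = (BM * VEC_K) / WG_SIZE;               // 64 / 128 == 0
const uint A_LOADS_CEIL       = (BM * VEC_K + WG_SIZE - 1) / WG_SIZE; // ceil-divide == 1

void main() {
    for (uint i = 0; i < A_LOADS_PER_THREAD; ++i) {
        // never executed when the tile is smaller than the workgroup, so
        // matrix A would seemingly stay unloaded
    }
    // a ceil-divide count plus a bounds check on the flat thread index would
    // let only the first BM * VEC_K threads issue a load instead
}
```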

@jeffbolznv
Collaborator

I think using an env var is better. I don't see why it wouldn't work on a dGPU given the extreme simplicity, and honestly I don't see the usefulness of running it there. The compilers and architectures are so different that I don't think we'd gain much information from it.

It's useful to be able to test it for correctness.

Collaborator

@jeffbolznv jeffbolznv left a comment


I'd still like to see this, at the least, using the same push constant and spec constant interface as the rest of the matmul shaders, and running through the same code paths in ggml_vk_mul_mat_q_f16. I think it would be nice if it could be folded into mul_mm, but I'm not sure we understand what specifically is causing the better perf.

return;
}

const std::vector<uint32_t> pc = { (uint32_t)M, (uint32_t)K, (uint32_t)K, (uint32_t)K, (uint32_t)(ggml_nelements(src0)) };
Collaborator


I think this path is not handling noncontiguous src0. Like @0cc4m said, it'll be better to let this run through the existing code paths rather than having this separate code path.

@jeffbolznv
Collaborator

I've made a PR to fix the failing dequant shaders.

@netrunnereve
Collaborator

Turns out the compiler has some specific quirks. It generates much faster code from explicit, manually unrolled MADs, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability.

At this point I think it's time to start looking into the Arm dev tools and assembly dumps, if they even have them 😉

@rmatif
Collaborator Author

rmatif commented Sep 8, 2025

Turns out the compiler has some specific quirks. It generates much faster code from explicit, manually unrolled MADs, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability.

At this point I think it's time to start looking into the Arm dev tools and assembly dumps, if they even have them 😉

I've tried reaching out to some Arm devs, so hopefully they'll take a look into it.

Speaking of compilers, the Adreno compiler crashes when running the existing mul_mat shaders, although it works fine with this one, so I think we'll need this variant anyway (performance is better than OpenCL, but I suspect a bug, so I won't jump to conclusions too quickly).

Sorry, I don't have much time to address all the concerns right now. When I have time I'll try to at least implement an env var to run the shaders on dGPUs.

EDIT: Just got confirmation from an Arm dev: they don't have a public disassembler, only a profiler.
