Conversation

rmatif
Collaborator

@rmatif rmatif commented Sep 4, 2025

This PR adds a mat_mul variant designed for embedded GPUs, as the current shaders perform poorly there. Essentially I reused the approach from my OpenCL implementation #14535.

It has currently been tested only on a Mali GPU, but I believe it should suit others well too.

ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: KHR_coopmat

| Model | Test | Master (t/s) | PR (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| Qwen2 1.5B Q4_0 | pp512 | 2.81 ± 0.03 | 66.26 ± 0.06 | 23.58x |
| Llama 1B F16 | pp512 | 5.76 ± 0.08 | 95.83 ± 0.14 | 16.63x |

Master:

  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 2 runs - 634214.00 us/run -  60.13 GFLOP/run -  8.92 GFLOPS

PR:

  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 2 runs - 634214.00 us/run -  60.13 GFLOP/run -  150.01 GFLOPS

I started messing with coopmat, but didn't have enough time to include it in this PR. If we go this route for embedded GPUs, I plan to add variants for conv2d, mul_vec, and maybe FA in the future.

@rmatif rmatif requested a review from 0cc4m as a code owner September 4, 2025 16:25
@rmatif
Collaborator Author

rmatif commented Sep 4, 2025

I got two failures with iq2_s and iq3_s. Does anyone know why these two in particular are failing?

[MUL_MAT] NMSE = 0.363474692 > 0.000500000   MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL
[MUL_MAT] NMSE = 2.417624850 > 0.000500000   MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL

@jeffbolznv
Collaborator

Does anyone know why these two in particular are failing?

I think there's a pre-existing bug in the dequant functions for these types. I ran into this once before when working on get_rows (I think?); the bug didn't seem to be happening in any code paths that were getting hit in practice, and it wasn't obvious, so I didn't pursue it.

Can you explain what it is about your matmul shader that makes it faster for mobile? At first glance it doesn't seem fundamentally different from what the scalar path is doing, except that you're dequantizing the matrix as a separate pass, and maybe some minor differences in tile size. I'll leave some comments on the code in a little while.

@github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 4, 2025
@rmatif
Collaborator Author

rmatif commented Sep 4, 2025

Can you explain what it is about your matmul shader that makes it faster for mobile?

The main difference I'd say is the very low register pressure and the avoidance of spilling. Mali provides only 64 registers per thread, so the entire design was built around that constraint. I tried fine-tuning the existing shaders some time ago but without success. I also believe that, due to their simplicity, even outdated drivers on low-end hardware should handle these shaders more easily.
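To make that concrete, here is a minimal sketch of the idea; TM and TN are assumed values, not the PR's actual tile shape. The per-thread accumulator tile is sized so that it, plus loop temporaries, stays inside the 64-register budget and never spills.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

const uint TM = 4;   // rows of the output tile each thread accumulates (assumed)
const uint TN = 4;   // columns of the output tile each thread accumulates (assumed)

void main() {
    // 16 scalar accumulators stay resident in registers for the whole k-loop;
    // growing TM/TN grows this footprint and, past the budget, forces spills
    float sums[TM * TN];
    [[unroll]] for (uint i = 0; i < TM * TN; ++i) {
        sums[i] = 0.0;
    }
    // ... the k-loop would load A/B fragments and fma into sums[] here ...
}
```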

@jeffbolznv
Collaborator

But you can reduce the register usage by changing the tile size via spec constants. The other big difference I see is that you're using vec4s everywhere; I wonder if that's somehow related.

@rmatif
Collaborator Author

rmatif commented Sep 4, 2025

But you can reduce the register usage by changing the tile size via spec constants. The other big difference I see is that you're using vec4s everywhere; I wonder if that's somehow related.

I know, but for some reason that wasn't enough (though I was testing on an older and less powerful device, so I should give it another try). On Adreno, vec4 is much faster; on the latest-gen Mali it shouldn't make much of a difference, since according to Arm's documentation it should perform similarly to scalar code, so I kept it in case another device benefits from it.
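For illustration only, a hedged sketch of the scalar-vs-vec4 load contrast; the buffer names are made up and this is not the PR's actual load code.

```glsl
#version 450
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout (binding = 0, std430) readonly buffer BufScalar { float data_a[];  };
layout (binding = 1, std430) readonly buffer BufVec4   { vec4  data_a4[]; };

vec4 load_scalar(uint i) {
    // four separate 4-byte loads
    return vec4(data_a[4*i], data_a[4*i + 1], data_a[4*i + 2], data_a[4*i + 3]);
}

vec4 load_vec4(uint i) {
    // one 16-byte load: clearly faster on Adreno, roughly on par with the
    // scalar form on recent Mali according to Arm's documentation
    return data_a4[i];
}

void main() {}
```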

for (uint t = 0; t < num_k_tiles; t++) {
const uint k_tile_start = t * BK;

#pragma unroll
Collaborator


[[unroll]] is preferred.

Collaborator Author


Isn't #pragma unroll better for old compilers?

Collaborator


We use [[unroll]] in a bunch of shaders and haven't had any problems.
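For reference, a minimal example of the attribute form used throughout the existing ggml Vulkan shaders:

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

void main() {
    // [[unroll]] comes from GL_EXT_control_flow_attributes and replaces #pragma unroll
    [[unroll]] for (uint k = 0; k < 8; ++k) {
        // body the compiler is asked to fully unroll
    }
}
```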

Collaborator

@0cc4m 0cc4m left a comment


I don't have time right now for a full review, but I wanted to add this to Jeff's comments.

Edit: impressive results, nice work. It's great to add support to more kinds of devices.

@@ -2901,6 +2913,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_MM(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4].f32acc, matmul_id_mxfp4_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, 4, _id, 0);
}
}

if (device->vendor_id == VK_VENDOR_ID_ARM) {
Collaborator


I would prefer if this used the same codepath as the main shader, since it's duplicating quite a bit. If the push constants are the same, is there a reason not to just add an ARM path to the shader selection function and leave the dequant to the existing logic?

@netrunnereve
Collaborator

If we go this route for embedded GPUs, I plan to add variants for conv2d, mul_vec, and maybe FA in the future

If we go that route we'll need a better way of testing these shaders. As a start we can have an environment variable or compile flag to enable the embedded path, and while I don't see anything Arm-specific in the code, we should try it out on some desktop GPUs to see if it runs fine there.

@netrunnereve
Collaborator

Also I'm finding it a bit hard to believe that it's faster to dequant everything, write that to regular memory, and then read all that back into shared memory for the actual multiplication. That's basically limiting everything to your memory speed, like inference.

I don't think having fewer registers matters in this case, since there's not much difference in register count between writing the dequantized weights to regular memory versus shared memory. The mul mat shader can be treated as two sections, where one does the dequantization to shared memory and the other does the actual multiplication, and the register counts for each can be more or less independent. I'd be curious to see how long the dequantization takes compared to the multiplication if you run with GGML_VK_PERF_LOGGER.

And as a side note, our dequant functions are essentially triplicated across dequant_funcs.comp, mul_mm.comp, and dequant_q*.comp, and they really should be merged if possible.

@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

I'd be curious to see how long the dequantization takes compared to the multiplication if you run with GGML_VK_PERF_LOGGER

I will try to take a look.

I did a quick test by fine-tuning and matching the tile sizes of the current scalar shaders to this one, and I observed roughly the same results (a bit faster due to coopmat). Marking this as a draft for now, as I still need to test it on older devices and other vendors to see if it does any good.

@rmatif rmatif marked this pull request as draft September 5, 2025 06:33
@0cc4m
Collaborator

0cc4m commented Sep 5, 2025

The perf logger does not show the dequant and matmul dispatches separately; you'd need a profiler for that.

@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

The perf logger does not show the dequant and matmul dispatches separately; you'd need a profiler for that.

I don't know if Arm offers that. I will reach out to some of their engineers and ask them.

@0cc4m
Collaborator

0cc4m commented Sep 5, 2025

I've done most of my work without profilers; just logically, it should be better to dequant directly to shared memory, because you avoid the need for an intermediate buffer in VRAM and you avoid the global reads/writes. But if the regular mul_mm shader works with smaller tiles already, then you don't need to implement that, luckily.
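As a hedged sketch of what the fused version would look like (tile sizes, names, and the dequant call are placeholders, not actual ggml code): each workgroup dequantizes its own A-tile straight into shared memory, so the intermediate f16 buffer in VRAM and the extra global round-trip disappear.

```glsl
#version 450
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

// Placeholder tile sizes; a real shader would take these as spec constants.
const uint BM = 32;
const uint BK = 8;
const uint WG_SIZE = 128;

shared float tile_a[BM * BK];

void main() {
    const uint tid = gl_LocalInvocationID.x;

    // cooperative fill: each thread dequantizes a few elements of the tile
    // directly into shared memory instead of reading a pre-dequantized buffer
    for (uint i = tid; i < BM * BK; i += WG_SIZE) {
        tile_a[i] = 0.0; // stand-in for dequantizing the block covering element i
    }
    barrier();

    // ... the multiplication stage then reads tile_a from shared memory ...
}
```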

@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

I've done most of my work without profilers; just logically, it should be better to dequant directly to shared memory, because you avoid the need for an intermediate buffer in VRAM and you avoid the global reads/writes. But if the regular mul_mm shader works with smaller tiles already, then you don't need to implement that, luckily.

I spent quite a bit of time trying to do the dequant directly in shared memory, but it turned out to be too much work for me, so I dropped it and, out of "laziness", adopted this two-stage approach. That said, I do agree that the optimal solution is to do it directly.

@rmatif rmatif marked this pull request as ready for review September 5, 2025 21:27
@rmatif rmatif requested a review from jeffbolznv September 5, 2025 21:54
@rmatif
Collaborator Author

rmatif commented Sep 5, 2025

I've unlocked a massive improvement that goes far beyond the fine-tuned existing shaders, which would justify the existence of this one. I've updated the performance numbers in the OP and will soon look into reports on older devices.

Turns out the compiler has some specific quirks. It generates much faster code from explicit, manually unrolled MADs, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability. I stopped here; I'm a bit short on time anyway.
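Roughly what is meant by flattening, sketched here with assumed names and an assumed unroll factor of 4 (not the PR's actual code):

```glsl
#version 450
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

// One k-step of the accumulation, written as an explicit chain of fused
// multiply-adds instead of a short inner loop over the four B components.
void accumulate(inout vec4 sum, vec4 a0, vec4 a1, vec4 a2, vec4 a3, vec4 b) {
    // instead of: for (uint i = 0; i < 4; ++i) sum = fma(a[i], vec4(b[i]), sum);
    sum = fma(a0, vec4(b.x), sum);
    sum = fma(a1, vec4(b.y), sum);
    sum = fma(a2, vec4(b.z), sum);
    sum = fma(a3, vec4(b.w), sum);
}

void main() {
    vec4 sum = vec4(0.0);
    accumulate(sum, vec4(1.0), vec4(2.0), vec4(3.0), vec4(4.0),
               vec4(0.5, 0.25, 0.125, 0.0625));
}
```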

@jeffbolznv If you have a chance, could you please review again? I believe I've addressed your comments, but let me know if I missed something.

As a start we can have an environment variable

I think using an env var is better. I don't see why it wouldn't work on a dGPU given the extreme simplicity, and honestly I don't see the usefulness of running it there. The compilers and architectures are so different that I don't think we'd gain much information from it.

Also I'm finding it a bit hard to believe that it's faster to dequant everything, write that to regular memory, and then read all that back into shared memory for the actual multiplication. That's basically limiting everything to your memory speed, like inference.

I haven't touched the dequant part; it's still the same. I'm writing to an f16 buffer, and the f16xf32/f32xf32 path is now much faster, hence the speedup. As I mentioned, I started experimenting with dequant in shared memory, but it turned out to be too much work for a first step. We can leave that as a future plan.

If someone has a Raspberry Pi lying around, I'd be very curious to see the performance gains on that kind of device.

}
}

const uint num_k_tiles = (p.K + BK - 1) / BK;
Collaborator Author


I think this is not robust enough and might be wrong for the Adreno case, but it passes the tests in test-backend-ops even though I feel like it shouldn't.

Collaborator


Shouldn't be hard to add a case or two with odd K. I suggest having relatively small M,N to avoid the error being hidden.

Collaborator Author


I misquoted; I was thinking more about the Adreno case:

BM = 32, BK = 8 -> VEC_K = 2
WG_SIZE = 128
A_LOADS_PER_THREAD = (32 * 2) / 128 = 64 / 128 = 0

So theoretically it shouldn't be able to load matrix A regardless of the dimensions, but the tests are passing, so I'm a bit confused.
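To restate the arithmetic above in code (constant names are assumptions, not the PR's actual ones), together with the kind of ceil-divide guard that would avoid the truncation:

```glsl
#version 450
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

const uint BM = 32, VEC_K = 2, WG_SIZE = 128;
const uint A_LOADS_PER_THREAD = (BM * VEC_K) / WG_SIZE;               // 64 / 128 == 0
const uint A_LOADS_CEIL       = (BM * VEC_K + WG_SIZE - 1) / WG_SIZE; // ceil-divide == 1

void main() {
    for (uint i = 0; i < A_LOADS_PER_THREAD; ++i) {
        // never executed when the tile is smaller than the workgroup, so
        // matrix A would seemingly stay unloaded
    }
    // a ceil-divide count plus a bounds check on the flat thread index would
    // let only the first BM * VEC_K threads issue a load instead
}
```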

@jeffbolznv
Collaborator

I think using an env var is better. I don't see why it wouldn't work on a dGPU given the extreme simplicity, and honestly I don't see the usefulness of running it there. The compilers and architectures are so different that I don't think we'd gain much information from it.

It's useful to be able to test it for correctness.

Collaborator

@jeffbolznv jeffbolznv left a comment


I'd still like to see this, at the least, using the same push constant and spec constant interface as the rest of the matmul shaders, and running through the same code paths in ggml_vk_mul_mat_q_f16. I think it would be nice if it could be folded into mul_mm, but I'm not sure we understand what specifically is causing the better perf.

return;
}

const std::vector<uint32_t> pc = { (uint32_t)M, (uint32_t)K, (uint32_t)K, (uint32_t)K, (uint32_t)(ggml_nelements(src0)) };
Collaborator


I think this path is not handling noncontiguous src0. Like @0cc4m said, it'll be better to let this run through the existing code paths rather than having this separate code path.

@jeffbolznv
Collaborator

I've made a PR to fix the failing dequant shaders.

@netrunnereve
Collaborator

Turns out the compiler has some specific quirks. It generates much faster code from explicit, manually unrolled MADs, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability.

At this point I think it's time to start looking into the Arm dev tools and assembly dumps, if they even have them 😉

@rmatif
Collaborator Author

rmatif commented Sep 8, 2025

Turns out the compiler has some specific quirks. It generates much faster code from explicit, manually unrolled MADs, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability.

At this point I think it's time to start looking into the Arm dev tools and assembly dumps, if they even have them 😉

I've tried reaching out to some Arm devs, so hopefully they'll take a look into it.

Speaking of compilers, the Adreno compiler crashes when running the existing mul_mat shaders, although it works fine with this one, so I think we'll need this variant anyway (performance is better than OpenCL, but I suspect a bug, so I won't jump to conclusions too quickly).

Sorry, I don't have much time to address all the concerns right now. When I have time I'll try to at least implement an env var to run the shaders on dGPUs.

EDIT: Just got confirmation from an Arm dev: they don't have a public disassembler, only a profiler.
