vulkan: add mul_mat variant for embedded gpus #15800
I got two failures with iq2_s and iq3_s. Does anyone know why these two in particular are failing?

```
[MUL_MAT] NMSE = 0.363474692 > 0.000500000 MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL
[MUL_MAT] NMSE = 2.417624850 > 0.000500000 MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): FAIL
```
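For reference, the NMSE that test-backend-ops reports is (as far as I can tell) the squared error normalized by the reference magnitude:

$$\mathrm{NMSE}(a,b) = \frac{\sum_i (a_i - b_i)^2}{\sum_i b_i^2}$$

so values like 0.36 and 2.4 against a 0.0005 threshold mean the outputs are badly wrong, not merely imprecise.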
I think there's a pre-existing bug in the dequant functions for these types. I ran into this once before when working on get_rows (I think?); the bug didn't seem to be happening in any code paths that were getting hit in practice, and it wasn't obvious, so I didn't pursue it. Can you explain what it is about your matmul shader that makes it faster for mobile? At first glance it doesn't seem fundamentally different from what the scalar path is doing, except that you're dequantizing the matrix as a separate pass, and maybe some minor differences in tile size. I'll leave some comments on the code in a little while.
The main difference I'd say is the very low register pressure and avoiding spilling. Mali provides only 64 registers per thread, so the entire design was built around that constraint. I tried fine-tuning the existing shaders some time ago but without success. I also believe that due to the simplicity, even outdated drivers on low-end hardware should handle them more easily.
But you can reduce the register usage by changing the tile size via spec constants. The other big difference I see is that you're using vec4s everywhere; I wonder if that's somehow related.
I know, but for some reason that wasn't enough (though I was testing on an older and less powerful device, so I should give it another try). On Adreno vec4 is much faster, but on the latest-gen Mali it shouldn't make much difference: according to Arm's documentation it should perform similarly to scalar. I kept it in case another device requires it.
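For reference, this is roughly how tile sizes get exposed as specialization constants so the host can shrink them at pipeline-creation time. A sketch only: the constant IDs and names here are illustrative, not the exact ones from mul_mm.comp.

```glsl
#version 450

// Illustrative spec constants; the host picks these per device.
layout (constant_id = 0) const uint BM = 32;  // rows of the A tile
layout (constant_id = 1) const uint BN = 32;  // cols of the B tile
layout (constant_id = 2) const uint BK = 8;   // K-slice per iteration

layout (local_size_x = 128) in;

// Shared tiles sized from the spec constants: shrinking BM/BN/BK cuts
// shared-memory and register pressure without recompiling the source.
shared float buf_a[BM * BK];
shared float buf_b[BN * BK];

void main() {}
```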
```glsl
for (uint t = 0; t < num_k_tiles; t++) {
    const uint k_tile_start = t * BK;

    #pragma unroll
```
`[[unroll]]` is preferred.
Isn't `#pragma unroll` better for old compilers?
We use `[[unroll]]` in a bunch of shaders and haven't had any problems.
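For reference, the attribute form needs GL_EXT_control_flow_attributes; a minimal sketch:

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout (local_size_x = 64) in;

shared float acc[64];

void main() {
    float s = 0.0;
    // Attribute spelling used across the existing Vulkan shaders; a
    // compiler lacking the extension would need "#pragma unroll"
    // (or no hint at all) instead.
    [[unroll]] for (uint i = 0; i < 8; i++) {
        s += float(i);
    }
    acc[gl_LocalInvocationID.x] = s;
}
```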
I don't have time right now for a full review, but I wanted to add this to Jeff's comments.
Edit: impressive results, nice work. It's great to add support to more kinds of devices.
ggml/src/ggml-vulkan/ggml-vulkan.cpp:

```cpp
@@ -2901,6 +2913,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
        CREATE_MM(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4].f32acc, matmul_id_mxfp4_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, 4, _id, 0);
    }
}

if (device->vendor_id == VK_VENDOR_ID_ARM) {
```
I would prefer if this used the same codepath as the main shader, since it's duplicating quite a bit. If the push constants are the same, is there a reason not to just add an ARM path to the shader selection function and leave the dequant to the existing logic?
If we go that route, we'll need a better way of testing these shaders. As a start we can have an environment variable or compile flag to enable the embedded path, and while I don't see anything ARM-specific in the code, we should try it out on some desktop GPUs to see if it runs fine there.
Also I'm finding it a bit hard to believe that it's faster to dequant everything, write that to regular memory, and then read all that back into shared memory for the actual multiplication. That basically limits everything to your memory speed, like inference. I don't think having fewer registers matters in this case, since there's not much difference in register count between writing the dequantized weights to regular memory versus shared memory. The mul_mat shader can be treated as two sections, where one does the dequantization to shared memory and the other does the actual multiplication, and the register counts for each can be more or less independent. I'd be curious to see how much longer the dequantization takes compared to the multiplication if you run with the perf logger.

And as a side note, our dequant functions are triplicated across dequant_funcs.comp, mul_mm.comp, and dequant_q*.comp, and they really should be merged if possible.
I will try to take a look.
The perf logger does not show the dequant and matmul dispatches separately; you'd need a profiler for that.
I don't know if Arm offers that. I will reach out to some of their engineers and ask them.
I've done most of my work without profilers. Just logically, it should be better to dequant directly to shared memory, because you avoid the need for an intermediate buffer in VRAM and you avoid the global reads/writes. But if the regular mul_mm shader already works with smaller tiles, then you don't need to implement that, luckily.
I spent quite a bit of time trying to do the dequant directly in shared memory, but it turned out to be too much work for me, so I dropped it and out of "laziness" adopted this two-stage approach. That said, I do agree that the optimal solution is to do it directly.
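For reference, the single-pass structure being discussed would look roughly like this. Everything here is a sketch, not code from the PR: the two helpers are placeholders standing in for the per-type dequant and the fma inner loop.

```glsl
#version 450

layout (local_size_x = 128) in;

const uint BK = 8;
shared float buf_a[32 * BK];  // A tile, reused for every K-slice

// Placeholder: cooperative dequant of one K-slice of A into buf_a.
void dequant_tile_to_shared(uint t) { /* per-quant-type decode */ }

// Placeholder: fma inner loop reading buf_a.
void multiply_tile() { /* accumulate into registers */ }

void main() {
    const uint num_k_tiles = 4;  // illustrative; really (K + BK - 1) / BK
    for (uint t = 0; t < num_k_tiles; t++) {
        dequant_tile_to_shared(t);  // stage 1: quantized A -> shared memory
        barrier();                  // tile visible to the whole workgroup
        multiply_tile();            // stage 2: multiply out of shared memory
        barrier();                  // before the next tile overwrites buf_a
    }
}
```

This removes the intermediate f16 buffer in VRAM and the extra global round trip entirely, which is the bandwidth concern raised above.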
I've unlocked a massive improvement that goes far beyond the fine-tuned existing shaders, which would justify the existence of this one. I've updated the performance numbers in the OP and will soon look into reports on older devices.

Turns out the compiler has some specific quirks: it generates much faster code from explicit, manually unrolled mads, so I've flattened the inner loops. While further unrolling is tempting, it drastically increases register pressure and may cause instability. I stopped here; I'm a bit short on time anyway.

@jeffbolznv If you have a chance, could you please review again? I believe I've addressed your comments, but let me know if I missed something.
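To illustrate the flattening (a sketch, not the PR's exact code):

```glsl
float dot4(vec4 a, vec4 b) {
    float acc = 0.0;
    // Rolled form the compiler handled poorly:
    //   for (uint i = 0; i < 4; i++) acc = fma(a[i], b[i], acc);
    // Flattened form: an explicit mad chain, which the Mali compiler
    // turned into noticeably faster code in this case.
    acc = fma(a.x, b.x, acc);
    acc = fma(a.y, b.y, acc);
    acc = fma(a.z, b.z, acc);
    acc = fma(a.w, b.w, acc);
    return acc;
}
```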
I think using an env var is better. I don't see why it wouldn't work on a dGPU given the extreme simplicity, and honestly I don't see the usefulness of running it there. The compilers and architectures are so different that I don't think we'd gain much information from it.
I haven't touched the dequant part; it's still the same. I'm writing to an f16 buffer, and the f16xf32/f32xf32 path is now much faster, hence the speedup. As I mentioned, I started experimenting with dequant in shared memory, but it turned out to be too much work for a first step. We can leave that as a future plan.

If someone has a Raspberry Pi lying around, I'd be very curious to see the performance gains on that kind of device.
```glsl
    }
}

const uint num_k_tiles = (p.K + BK - 1) / BK;
```
I think this is not robust enough and might be wrong for the Adreno case, but it passes the tests in test-backend-ops. It feels like it shouldn't.
Shouldn't be hard to add a case or two with odd K. I suggest having relatively small M,N to avoid the error being hidden.
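For what it's worth, the usual fix for the rounded-up tile count is to guard the tail loads. A sketch, with `data_a`, `buf_a`, and `p` standing in for the shader's actual buffers and push constants:

```glsl
layout (binding = 0) readonly buffer A { float data_a[]; };
layout (push_constant) uniform P { uint M; uint K; } p;

shared float buf_a[32 * 8];

// num_k_tiles = (p.K + BK - 1) / BK rounds up, so when K % BK != 0 the
// last tile would read past the end of the row unless loads are guarded;
// zero-filling keeps the fma accumulation correct.
void load_a_guarded(uint row, uint k_tile_start, uint local_k, uint local_idx) {
    const uint k = k_tile_start + local_k;
    buf_a[local_idx] = (k < p.K) ? data_a[row * p.K + k] : 0.0;
}
```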
I misquoted; I was thinking more about the Adreno case:

```
BM = 32, BK = 8 -> VEC_K = 2
WG_SIZE = 128
A_LOADS_PER_THREAD = (32 * 2) / 128 = 64 / 128 = 0
```

So theoretically it shouldn't be able to load matrix A regardless of the dimensions, but the tests are passing, so I'm a bit confused.
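For reference, the zero comes from truncating integer division; a round-up division gives each thread at least one (guarded) load instead. Illustrative constants:

```glsl
const uint BM = 32, VEC_K = 2, WG_SIZE = 128;

// Truncating division: (32 * 2) / 128 == 0, so no thread would load A.
const uint LOADS_TRUNC = (BM * VEC_K) / WG_SIZE;

// Round-up division: == 1 here; threads beyond the tile then need to
// be masked off inside the load loop.
const uint LOADS_CEIL = (BM * VEC_K + WG_SIZE - 1) / WG_SIZE;
```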
It's useful to be able to test it for correctness.
I'd still like to see this, at the least, using the same push constant and spec constant interface as the rest of the matmul shaders, and running through the same code paths in ggml_vk_mul_mat_q_f16. I think it would be nice if it could be folded into mul_mm, but I'm not sure we understand what specifically is causing the better perf.
```cpp
    return;
}

const std::vector<uint32_t> pc = { (uint32_t)M, (uint32_t)K, (uint32_t)K, (uint32_t)K, (uint32_t)(ggml_nelements(src0)) };
```
I think this path is not handling noncontiguous src0. Like @0cc4m said, it'll be better to let this run through the existing code paths rather than having this separate code path.
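For context, the shared matmul code paths pass a row stride separately from K, which is what makes permuted/noncontiguous src0 work. A sketch with illustrative field names (the real vk_mat_mat_push_constants carries more fields, such as batch strides):

```glsl
layout (push_constant) uniform P {
    uint M;
    uint K;
    uint stride_a;  // row stride of A; equals K only when A is contiguous
} p;

uint a_index(uint row, uint k) {
    // Indexing with p.stride_a instead of p.K is what keeps a permuted
    // or otherwise noncontiguous src0 reading the right elements.
    return row * p.stride_a + k;
}
```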
I've made a PR to fix the failing dequant shaders.
At this point I think it's time to start looking into the Arm dev tools and assembly dumps, if they even have them 😉
I've tried reaching out to some Arm devs, so hopefully they'll take a look at it.

Speaking of compilers, the Adreno compilers crash when running the existing mul_mat shaders, although they work fine with this one, so I think we'll need this variant anyway (performance is better than OpenCL, but I suspect a bug, so I won't jump to conclusions too quickly).

Sorry, I don't have much time to address all the concerns right now. I'll try, when I have time, at least to implement an env var to run the shaders on the dGPU.

EDIT: Just got confirmation from an Arm dev; they don't have a public disassembler, only a profiler.
This PR adds a mul_mat variant designed for embedded GPUs, as the current shaders perform poorly on them. Essentially I reused the approach from my OpenCL implementation #14535.

It has currently been tested only on a Mali GPU, but I believe it should suit others well too.
```
ggml_vulkan: 0 = Mali-G715 (Mali-G715) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: KHR_coopmat
```
Master:
PR:
I started messing with coopmat, but didn't have enough time to include it in this PR. If we go this route for embedded GPUs, I plan to add variants for conv2d, mul_vec, and maybe fa in the future.