cuda : support Falcon-H1 state size for SSM_SCAN #14602
Conversation
Impressive speedup, thank you @compilade !!
@@ -215,10 +215,21 @@ static void ssm_scan_f32_cuda(const float * src0, const float * src1, const floa
            src0, src1, src2, src3, src4, src5, src6, dst,
            src0_nb2, src0_nb3, src1_nb2, src1_nb3, src2_nb1, src2_nb2, src3_nb1,
            src4_nb2, src4_nb3, src5_nb2, src5_nb3, s_off, n_head, head_dim, n_group, n_tok);
    } else if (d_state == 256) { // Falcon-H1
        const int threads = 256;
For learning purposes: the difference between the two calls is the number of threads used to launch the CUDA kernel. Is there any implementation difference in the kernel itself for the different values of d_state?
@younesbelkada Basically, the kernel currently assumes the number of threads in a block is the same as d_state.

It could also have been handled by restructuring the kernel to make each thread handle more than one intermediate state in the reduction (in the dot product with C), which might or might not be faster.

Each thread technically already handles multiple intermediate states by reducing over multiple head elements at once (i.e. splitH). This also allows calling expf less often per head.
I didn't particularly optimize the kernel, so there's most likely room for improvement.
(It could potentially be faster to use the semi-structured matrices implementation of Mamba-2 for better prompt processing speed, but from my (maybe wrong) understanding, that only allows starting from a blank state.)
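To make the thread/state mapping described above concrete, here is a heavily simplified, hypothetical sketch (my own illustration, not the actual ggml-cuda kernel; the tensor layouts, parameter names, and the atomic reduction are all assumptions): each thread in a block owns one element of the d_state dimension, so the block size must equal d_state, and the output is a reduction (the dot product with C) across the block.

```cuda
// Hypothetical, simplified sketch of the mapping described above (NOT the real
// ggml-cuda kernel): blockDim.x == d_state, one thread per state element.
template <int d_state>
__global__ void ssm_scan_sketch(
        const float * s_in,  // previous states: [n_head, head_dim, d_state]
        const float * x,     // input:           [n_head, head_dim]
        const float * dt,    // time step:       [n_head]
        const float * A,     // per-head decay:  [n_head]
        const float * B,     // input proj:      [n_group, d_state]
        const float * C,     // output proj:     [n_group, d_state]
        float       * s_out, // updated states:  [n_head, head_dim, d_state]
        float       * y,     // output:          [n_head, head_dim], zero-initialized
        const int head_dim, const int heads_per_group) {
    const int head = blockIdx.x;
    const int i    = threadIdx.x;              // this thread's state element
    const int g    = head / heads_per_group;   // group index for B and C

    const float dA = expf(dt[head] * A[head]); // computed once per head, not per state element

    // each thread also loops over the head elements (similar in spirit to splitH)
    for (int d = 0; d < head_dim; ++d) {
        const int   idx = (head * head_dim + d) * d_state + i;
        const float s   = s_in[idx] * dA + dt[head] * B[g * d_state + i] * x[head * head_dim + d];
        s_out[idx] = s;

        // dot product with C over the d_state dimension; a real kernel would use
        // warp shuffles / shared memory instead of atomics for this reduction
        atomicAdd(&y[head * head_dim + d], s * C[g * d_state + i]);
    }
}

// launch: one block per head, block size tied to the state size, e.g.
//   ssm_scan_sketch<256><<<n_head, 256>>>(...);  // Falcon-H1
//   ssm_scan_sketch<128><<<n_head, 128>>>(...);  // original Mamba-2 models
```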
Thank you for explaining, this is very clear @compilade!
Massive thanks, @compilade! I tried these new changes and the difference is huge. I wanted to ask you something since I’m interested in debugging. Yesterday, before this PR, I was testing the inference and I always monitor my system with htop (for CPU) and nvtop (for GPU). I noticed my CPU was heavily loaded. My question is: is there a way to debug at the CUDA kernel level in GGML, or any tricks for deeper inspection? Have you used tools like NVIDIA Nsight Compute for this? If you have some time, could you share how you usually debug? Any tips or tricks you have would be amazing!
Sorry to bother you again, @compilade. PS: Sorry to address this issue here.
@ibrahimkhadraoui I do want to learn to use NVIDIA Nsight Compute eventually.
The first step is always to locate the source of the problem. My method isn't the best; it's mostly about thinking through the problem, especially with CUDA, since I don't have persistent access to an NVIDIA GPU (yet). To minimize the time I rent a GPU instance, my first draft is based on how I think it would work. I sometimes draw diagrams on paper if it helps. Then I test what I've written and iterate on that.

In this case, I had written the Mamba-2 SSM_SCAN kernel in #9126 relatively recently, and so the assumptions of the kernel are still mostly clear to me. When I saw that Falcon-H1 used a different state size, I was a bit surprised (I only noticed it the other day), but I knew this change here would need to happen. All Mamba-1 models and derivatives use a state size of 16, so I was assuming Mamba-2 would also pretty much always be used with a state size of 128 (like the original Mamba-2 models), but apparently I was wrong.

When trying to figure out the reason for crashes, I rely on […]. For CPU code, I like to use perf:

$ perf record --call-graph=fp -- ./bin/llama-bench -m /path/to/model.gguf
$ perf report -M intel
I never personally tried to generate such graphs, but if I search for "plot the" in the source tree:

$ rg -F -A3 'plot the'
src/llama-context.cpp
1045:    // plot the computation graph in dot format (for debugging purposes)
1046-    //if (n_past%100 == 0) {
1047-    //    ggml_graph_dump_dot(gf, NULL, "llama.dot");
1048-    //}

It's not in the correct place, though. This should work (on 4a5686d, at least):

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 06e93b19c..964f255b3 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -7,6 +7,7 @@
 #include "llama-mmap.h"
 #include "llama-model.h"
 
+#include <algorithm>
 #include <cinttypes>
 #include <cstring>
 #include <limits>
@@ -709,6 +710,11 @@ llm_graph_result_ptr llama_context::process_ubatch(const llama_ubatch & ubatch,
 
     res->set_inputs(&ubatch);
 
+    // plot the computation graph in dot format (for debugging purposes)
+    if (std::find(ubatch.pos, ubatch.pos + ubatch.n_tokens, 100) != ubatch.pos + ubatch.n_tokens) {
+        ggml_graph_dump_dot(gf, NULL, "llama.dot");
+    }
+
     const auto status = graph_compute(gf, ubatch.n_tokens > 1);
     if (status != GGML_STATUS_SUCCESS) {
         LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status);
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index c21cc2880..397416c59 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -5215,7 +5215,7 @@ struct llm_build_llama : public llm_graph_context {
 
         ggml_tensor * inp_out_ids = build_inp_out_ids();
 
-        for (int il = 0; il < n_layer; ++il) {
+        for (int il = n_layer - 1; il < n_layer; ++il) {
             ggml_tensor * inpSA = inpL;
 
             // norm

Put the layer skip in the graph of the model type you want to generate a graph of (the above patch assumes the llm_build_llama graph).
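As an illustrative aside (not from the comments above), since the question also asked about debugging at the CUDA kernel level specifically: one low-tech option, before reaching for Nsight Compute, is a guarded device-side printf, which works in any __global__ function. The kernel below is a made-up example, not ggml code:

```cuda
#include <cstdio>

// Illustrative kernel (not from ggml): print from a single thread so the
// output stays readable, e.g. to spot-check inputs and outputs of a suspicious op.
__global__ void scale_kernel(const float * x, float * y, const int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = 2.0f * x[i];
        if (i == 0) {
            printf("scale_kernel: n = %d, x[0] = %f -> y[0] = %f\n", n, x[0], y[i]);
        }
    }
}
```

Beyond that, running the binary under compute-sanitizer, or with CUDA_LAUNCH_BLOCKING=1 so asynchronous errors surface at the offending launch, can help narrow down where a kernel goes wrong.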
* origin/master:
  cmake : do not search for curl libraries by ourselves (ggml-org#14613)
  SYCL: Initial set_rows kernel implementation (ggml-org#14562)
  llama : minor coding style fix for smollm3 (ggml-org#14605)
  cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
  cmake : llguidance build parser library only (ggml-org#14608)
  cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)

Signed-off-by: Gabe Goodhart <[email protected]>
Falcon-H1 (see #14534) has Mamba-2 layers, but uses a bigger state size than the original Mamba-2 models (256 instead of 128).
The CUDA implementation of SSM_SCAN is specific to the state size, and so the bigger state size needs to be explicitly supported.
I've tested this with https://huggingface.co/tiiuae/Falcon-H1-7B-Instruct-GGUF.
Before this PR:
cc @younesbelkada @ibrahimkhadraoui
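To illustrate why each state size needs explicit support (a hypothetical sketch of my own, not the actual ggml-cuda dispatch code; names and the stub kernel body are assumptions): the kernel is instantiated per d_state, with the launch configuration tied to it, so an unsupported size has no instantiation to launch.

```cuda
#include <cstdio>

// Hypothetical stand-in for the real kernel: compiled once per supported d_state,
// with the block size equal to the state size.
template <int d_state>
__global__ void ssm_scan_stub(float * states, const int n_elems) {
    const int i = blockIdx.x * d_state + threadIdx.x;
    if (i < n_elems) {
        states[i] *= 0.5f;  // placeholder for the actual scan update
    }
}

// Host-side dispatch: each new state size (e.g. Falcon-H1's 256) needs its own branch.
static void ssm_scan_dispatch(float * states, const int n_elems, const int d_state, cudaStream_t stream) {
    const int blocks = (n_elems + d_state - 1) / d_state;
    if (d_state == 128) {         // original Mamba-2 models
        ssm_scan_stub<128><<<blocks, 128, 0, stream>>>(states, n_elems);
    } else if (d_state == 256) {  // Falcon-H1
        ssm_scan_stub<256><<<blocks, 256, 0, stream>>>(states, n_elems);
    } else {
        fprintf(stderr, "unsupported d_state: %d\n", d_state);
    }
}
```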