cuda : support Falcon-H1 state size for SSM_SCAN #14602

Merged
merged 1 commit into master on Jul 10, 2025

Conversation

@compilade (Collaborator) commented on Jul 9, 2025

Falcon-H1 (see #14534) has Mamba-2 layers, but uses a bigger state size than the original Mamba-2 models (256 instead of 128).

The CUDA implementation of SSM_SCAN is specific to the state size, and so the bigger state size needs to be explicitly supported.
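For illustration, the shape of that dispatch is roughly as in the minimal sketch below (hypothetical names, not the actual ggml code; the real launcher is ssm_scan_f32_cuda, shown in the review discussion further down): the kernel is instantiated per state size, and the host side has to list every supported d_state explicitly.

#include <cstdio>
#include <cuda_runtime.h>

// One block per head, one thread per state element: blockDim.x must equal d_state.
template <int d_state>
__global__ void ssm_scan_sketch(const float * s, float * y) {
    const int h = blockIdx.x;   // head index
    const int i = threadIdx.x;  // state element index, 0 <= i < d_state
    y[h * d_state + i] = s[h * d_state + i]; // stand-in for the real scan recurrence
}

static void ssm_scan_sketch_cuda(const float * s, float * y, int n_head, int d_state) {
    if (d_state == 128) {        // original Mamba-2 models
        ssm_scan_sketch<128><<<n_head, 128>>>(s, y);
    } else if (d_state == 256) { // Falcon-H1 (what this PR adds)
        ssm_scan_sketch<256><<<n_head, 256>>>(s, y);
    } else {
        fprintf(stderr, "unsupported d_state for SSM_SCAN: %d\n", d_state);
    }
}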

I've tested this with https://huggingface.co/tiiuae/Falcon-H1-7B-Instruct-GGUF.

$ ./bin/llama-bench -m ~/Falcon-H1-7B-Instruct-Q4_0.gguf -t 8
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           pp512 |       2384.28 ± 7.20 |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           tg128 |         70.36 ± 0.21 |

build: 1180752

Before this PR:

$ ./bin/llama-bench -m ~/Falcon-H1-7B-Instruct-Q4_0.gguf -t 8
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no                
ggml_cuda_init: found 1 CUDA devices:                                         
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes 
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           pp512 |        534.36 ± 8.63 |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           tg128 |         31.05 ± 0.30 |
                                                                                                                                                            
build: 26a48ad

cc @younesbelkada @ibrahimkhadraoui


@github-actions bot added the testing, Nvidia GPU, and ggml labels on Jul 9, 2025
@younesbelkada (Contributor) left a comment

Impressive speedup, thank you @compilade !!

@@ -215,10 +215,21 @@ static void ssm_scan_f32_cuda(const float * src0, const float * src1, const floa
src0, src1, src2, src3, src4, src5, src6, dst,
src0_nb2, src0_nb3, src1_nb2, src1_nb3, src2_nb1, src2_nb2, src3_nb1,
src4_nb2, src4_nb3, src5_nb2, src5_nb3, s_off, n_head, head_dim, n_group, n_tok);
} else if (d_state == 256) { // Falcon-H1
const int threads = 256;
@younesbelkada (Contributor)

For learning purposes: the difference between the two calls is the number of threads used to launch the CUDA kernel. Is there any implementation difference in the kernel itself for the different values of d_state?

@compilade (Collaborator, Author) replied on Jul 9, 2025

@younesbelkada Basically, the kernel currently assumes the number of threads in a block is the same as d_state.

It could also have been handled by restructuring the kernel to make each thread handle more than one intermediate state in the reduction (in the dot product with C), which might or might not be faster.

Each thread technically already handles multiple intermediate states by reducing over multiple head elements at once (i.e. splitH). This also allows calling expf less often per head.

I didn't particularly optimize the kernel, so there's most likely room for improvement.

(It could potentially be faster to use the semi-structured matrices implementation of Mamba-2 for better prompt processing speed, but from my (maybe wrong) understanding, that only allows starting from a blank state.)
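To make the thread-to-state mapping concrete, here is a rough sketch (hypothetical, and much simpler than the real kernel) of "one thread per state element, reduce across the block for the dot product with C":

#include <cuda_runtime.h>

// blockDim.x == d_state: thread i owns state element i of this block's head element.
template <int d_state>
__global__ void dot_with_C_sketch(const float * state, const float * C, float * y) {
    __shared__ float partial[d_state];

    const int i = threadIdx.x;
    partial[i] = state[blockIdx.x * d_state + i] * C[i]; // per-thread contribution
    __syncthreads();

    // tree reduction across the block -> dot(state, C); works because d_state is a power of two
    for (int stride = d_state / 2; stride > 0; stride /= 2) {
        if (i < stride) {
            partial[i] += partial[i + stride];
        }
        __syncthreads();
    }

    if (i == 0) {
        y[blockIdx.x] = partial[0];
    }
}

Restructuring it so that each thread covers several state elements would mean fewer threads per block but more work (and a two-step reduction) per thread, which is the trade-off mentioned above.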

@younesbelkada (Contributor)

Thank you for explaining, this is very clear @compilade!

@ibrahimkhadraoui (Contributor)

Massive thanks, @compilade! I tried these new changes and the difference is huge. I wanted to ask you something since I’m interested in debugging. Yesterday, before this PR, I was testing the inference and I always monitor my system with htop (for CPU) and nvtop (for GPU). I noticed my CPU was heavily loaded.

My question is: is there a way to debug at the CUDA kernel level in GGML, or any tricks for deeper inspection? Have you used tools like NVIDIA Nsight Compute for this? If you have some time, could you share how you usually debug? Any tips or tricks you have would be amazing!

@ibrahimkhadraoui (Contributor)

Sorry to bother you again, @compilade.
I have a quick question about this amazing feature: https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-%26-Tricks
It was extremely helpful when I was integrating FalconH1 using an older commit of llama.cpp (from November 2024). However, after the recent refactoring, it seems like it needs to be called somewhere else. If you could give me a hand figuring out where to fix it, I’d really appreciate it! 🙏

PS: Sorry to address this issue here

@compilade (Collaborator, Author) commented on Jul 10, 2025

Have you used tools like NVIDIA Nsight Compute for this?

@ibrahimkhadraoui
I did not, which is why I wrote in #14602 (comment) that it could likely be optimized further.

I do want to learn to use NVIDIA Nsight Compute eventually.

could you share how you usually debug?

The first step is always to locate the source of the problem.

My method isn't the best; it's mostly about thinking through the problem, especially with CUDA, since I don't have persistent access to an NVIDIA GPU (yet). To minimize the time I rent a GPU instance for, my first draft is based on how I think it would work. I sometimes draw diagrams on paper if it helps. Then I test what I've written and iterate on that.

In this case, I had written the Mamba-2 SSM_SCAN kernel in #9126 relatively recently, and so the assumptions of the kernel are still mostly clear to me. When I saw that Falcon-H1 used a different state size, I was a bit surprised (I only noticed it the other day), but I knew this change here would need to happen. All Mamba-1 models and derivatives use a state size of 16, so I was assuming Mamba-2 would also be pretty much always used with a state size of 128 (like the original Mamba-2 models), but apparently I was wrong.


When trying to figure out the reason for crashes, I rely on coredumpctl debug --debugger=lldb a lot (with systemd-coredump). I also usually compile with -DCMAKE_BUILD_TYPE=RelWithDebInfo.
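As a concrete example, that workflow could look roughly like this (assuming systemd-coredump and lldb are set up; the cmake flags other than the build type are illustrative):

$ cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_CUDA=ON
$ cmake --build build -j
$ ./build/bin/llama-bench -m /path/to/model.gguf   # suppose this crashes and leaves a core dump
$ coredumpctl list                                 # find the crashed process
$ coredumpctl debug --debugger=lldb                # open the most recent core dump in lldb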

For CPU code, I like to use perf. It's a sampling profiler, and can work at the instruction level. In this case, it could likely tell you that a good portion of the CPU time was spent on ggml_compute_forward_ssm_scan_f32, which would indicate it was not running on the GPU.

$ perf record --call-graph=fp -- ./bin/llama-bench -m /path/to/model.gguf
$ perf report -M intel

I have a quick question about this amazing feature: https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-%26-Tricks

I never personally tried to generate such graphs, but if I search for "plot the" in llama.cpp, I find the section referred to by the wiki page. It looks like it's in src/llama-context.cpp.

$ rg -F -A3 'plot the'
src/llama-context.cpp
1045:        // plot the computation graph in dot format (for debugging purposes)
1046-        //if (n_past%100 == 0) {
1047-        //    ggml_graph_dump_dot(gf, NULL, "llama.dot");
1048-        //}

It's not in the correct place, though.

This should work (on 4a5686d, at least):

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 06e93b19c..964f255b3 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -7,6 +7,7 @@
 #include "llama-mmap.h"
 #include "llama-model.h"
 
+#include <algorithm>
 #include <cinttypes>
 #include <cstring>
 #include <limits>
@@ -709,6 +710,11 @@ llm_graph_result_ptr llama_context::process_ubatch(const llama_ubatch & ubatch,
 
     res->set_inputs(&ubatch);
 
+    // plot the computation graph in dot format (for debugging purposes)
+    if (std::find(ubatch.pos, ubatch.pos + ubatch.n_tokens, 100) != ubatch.pos + ubatch.n_tokens) {
+        ggml_graph_dump_dot(gf, NULL, "llama.dot");
+    }
+
     const auto status = graph_compute(gf, ubatch.n_tokens > 1);
     if (status != GGML_STATUS_SUCCESS) {
         LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status);
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index c21cc2880..397416c59 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -5215,7 +5215,7 @@ struct llm_build_llama : public llm_graph_context {
 
         ggml_tensor * inp_out_ids = build_inp_out_ids();
 
-        for (int il = 0; il < n_layer; ++il) {
+        for (int il = n_layer - 1; il < n_layer; ++il) {
             ggml_tensor * inpSA = inpL;
 
             // norm

Put the layer skip in the graph builder of the model type you want to visualize (the above patch assumes the llama arch is used). Then generating 100 tokens (you can change this number) in any manner should result in a llama.dot file.
There should be a log entry suggesting to run dot -Tpng llama.dot -o llama.dot.png to generate a PNG of the graph (assuming graphviz is installed).

@compilade merged commit a57d1bc into master on Jul 10, 2025
48 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Jul 10, 2025
* origin/master:
cmake : do not search for curl libraries by ourselves (ggml-org#14613)
SYCL: Initial set_rows kernel implementation (ggml-org#14562)
llama : minor coding style fix for smollm3 (ggml-org#14605)
cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
cmake : llguidance build parser library only (ggml-org#14608)
cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)

Signed-off-by: Gabe Goodhart <[email protected]>