cuda : support Falcon-H1 state size for SSM_SCAN #14602

Merged
merged 1 commit into master on Jul 10, 2025

Conversation

@compilade (Collaborator) commented on Jul 9, 2025

Falcon-H1 (see #14534) has Mamba-2 layers, but uses a bigger state size than the original Mamba-2 models (256 instead of 128).

The CUDA implementation of SSM_SCAN is specific to the state size, and so the bigger state size needs to be explicitly supported.
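For illustration, the shape of that dispatch is roughly as in the minimal sketch below (hypothetical names, not the actual ggml code; the real launcher is ssm_scan_f32_cuda, shown in the review discussion further down): the kernel is instantiated per state size, and the host side has to list every supported d_state explicitly.

#include <cstdio>
#include <cuda_runtime.h>

// One block per head, one thread per state element: blockDim.x must equal d_state.
template <int d_state>
__global__ void ssm_scan_sketch(const float * s, float * y) {
    const int h = blockIdx.x;   // head index
    const int i = threadIdx.x;  // state element index, 0 <= i < d_state
    y[h * d_state + i] = s[h * d_state + i]; // stand-in for the real scan recurrence
}

static void ssm_scan_sketch_cuda(const float * s, float * y, int n_head, int d_state) {
    if (d_state == 128) {        // original Mamba-2 models
        ssm_scan_sketch<128><<<n_head, 128>>>(s, y);
    } else if (d_state == 256) { // Falcon-H1 (what this PR adds)
        ssm_scan_sketch<256><<<n_head, 256>>>(s, y);
    } else {
        fprintf(stderr, "unsupported d_state for SSM_SCAN: %d\n", d_state);
    }
}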

I've tested this with https://huggingface.co/tiiuae/Falcon-H1-7B-Instruct-GGUF.

$ ./bin/llama-bench -m ~/Falcon-H1-7B-Instruct-Q4_0.gguf -t 8
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           pp512 |       2384.28 ± 7.20 |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           tg128 |         70.36 ± 0.21 |

build: 1180752

Before this PR:

$ ./bin/llama-bench -m ~/Falcon-H1-7B-Instruct-Q4_0.gguf -t 8
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no                
ggml_cuda_init: found 1 CUDA devices:                                         
  Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes 
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           pp512 |        534.36 ± 8.63 |
| falcon-h1 7B Q4_0              |   4.07 GiB |     7.59 B | CUDA       |  99 |       8 |           tg128 |         31.05 ± 0.30 |
                                                                                                                                                            
build: 26a48ad

cc @younesbelkada @ibrahimkhadraoui


@github-actions bot added the testing, Nvidia GPU, and ggml labels on Jul 9, 2025
@younesbelkada (Contributor) left a comment

Impressive speedup, thank you @compilade !!

@@ -215,10 +215,21 @@ static void ssm_scan_f32_cuda(const float * src0, const float * src1, const floa
src0, src1, src2, src3, src4, src5, src6, dst,
src0_nb2, src0_nb3, src1_nb2, src1_nb3, src2_nb1, src2_nb2, src3_nb1,
src4_nb2, src4_nb3, src5_nb2, src5_nb3, s_off, n_head, head_dim, n_group, n_tok);
} else if (d_state == 256) { // Falcon-H1
const int threads = 256;
@younesbelkada (Contributor)

For learning purposes: the difference between the two calls is the number of threads used to launch the CUDA kernel. Is there any implementation difference in the kernel itself for the different values of d_state?

@compilade (Collaborator, Author) replied on Jul 9, 2025

@younesbelkada Basically, the kernel currently assumes the number of threads in a block is the same as d_state.

It could also have been handled by restructuring the kernel to make each thread handle more than one intermediate state in the reduction (in the dot product with C), which might or might not be faster.

Each thread technically already handles multiple intermediate states by reducing over multiple head elements at once (i.e. splitH). This also allows calling expf less often per head.

I didn't particularly optimize the kernel, so there's most likely room for improvement.

(It could potentially be faster to use the semi-structured matrices implementation of Mamba-2 for better prompt processing speed, but from my (maybe wrong) understanding, that only allows starting from a blank state.)
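To make the thread-to-state mapping concrete, here is a rough sketch (hypothetical, and much simpler than the real kernel) of "one thread per state element, reduce across the block for the dot product with C":

#include <cuda_runtime.h>

// blockDim.x == d_state: thread i owns state element i of this block's head element.
template <int d_state>
__global__ void dot_with_C_sketch(const float * state, const float * C, float * y) {
    __shared__ float partial[d_state];

    const int i = threadIdx.x;
    partial[i] = state[blockIdx.x * d_state + i] * C[i]; // per-thread contribution
    __syncthreads();

    // tree reduction across the block -> dot(state, C); works because d_state is a power of two
    for (int stride = d_state / 2; stride > 0; stride /= 2) {
        if (i < stride) {
            partial[i] += partial[i + stride];
        }
        __syncthreads();
    }

    if (i == 0) {
        y[blockIdx.x] = partial[0];
    }
}

Restructuring it so that each thread covers several state elements would mean fewer threads per block but more work (and a two-step reduction) per thread, which is the trade-off mentioned above.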

@younesbelkada (Contributor)

Thank you for explaining, this is very clear @compilade!

@ibrahimkhadraoui (Contributor)

Massive thanks, @compilade! I tried these new changes and the difference is huge. I wanted to ask you something since I’m interested in debugging. Yesterday, before this PR, I was testing the inference and I always monitor my system with htop (for CPU) and nvtop (for GPU). I noticed my CPU was heavily loaded.

My question is: is there a way to debug at the CUDA kernel level in GGML, or any tricks for deeper inspection? Have you used tools like NVIDIA Nsight Compute for this? If you have some time, could you share how you usually debug? Any tips or tricks you have would be amazing!

@ibrahimkhadraoui (Contributor)

Sorry to bother you again, @compilade.
I have a quick question about this amazing feature: https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-%26-Tricks
It was extremely helpful when I was integrating FalconH1 using an older commit of llama.cpp (from November 2024). However, after the recent refactoring, it seems like it needs to be called somewhere else. If you could give me a hand figuring out where to fix it, I’d really appreciate it! 🙏

PS: Sorry to address this issue here

@compilade (Collaborator, Author) commented on Jul 10, 2025

Have you used tools like NVIDIA Nsight Compute for this?

@ibrahimkhadraoui
I did not, which is why I wrote in #14602 (comment) that it could likely be optimized further.

I do want to learn to use NVIDIA Nsight Compute eventually.

could you share how you usually debug?

The first step is always to locate the source of the problem.

My method isn't the best; it's mostly about thinking through the problem, especially with CUDA, since I don't have persistent access to an NVIDIA GPU (yet). To minimize the time I rent a GPU instance for, my first draft is based on how I think it would work. I sometimes draw diagrams on paper if it helps. Then I test what I've written and iterate on that.

In this case, I had written the Mamba-2 SSM_SCAN kernel in #9126 relatively recently, and so the assumptions of the kernel are still mostly clear to me. When I saw that Falcon-H1 used a different state size, I was a bit surprised (I only noticed it the other day), but I knew this change here would need to happen. All Mamba-1 models and derivatives use a state size of 16, so I was assuming Mamba-2 would also be pretty much always used with a state size of 128 (like the original Mamba-2 models), but apparently I was wrong.


When trying to figure out the reason for crashes, I rely on coredumpctl debug --debugger=lldb a lot (with systemd-coredump). I also usually compile with -DCMAKE_BUILD_TYPE=RelWithDebInfo.
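As a concrete example, that workflow could look roughly like this (assuming systemd-coredump and lldb are set up; the cmake flags other than the build type are illustrative):

$ cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_CUDA=ON
$ cmake --build build -j
$ ./build/bin/llama-bench -m /path/to/model.gguf   # suppose this crashes and leaves a core dump
$ coredumpctl list                                 # find the crashed process
$ coredumpctl debug --debugger=lldb                # open the most recent core dump in lldb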

For CPU code, I like to use perf. It's a sampling profiler, and can work at the instruction level. In this case, it could likely tell you that a good portion of the CPU time was spent on ggml_compute_forward_ssm_scan_f32, which would indicate it was not running on the GPU.

$ perf record --call-graph=fp -- ./bin/llama-bench -m /path/to/model.gguf
$ perf report -M intel

I have a quick question about this amazing feature: https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-%26-Tricks

I never personally tried to generate such graphs, but if I search for "plot the" in llama.cpp, I find the section referred to by the wiki page. It looks like it's in src/llama-context.cpp.

$ rg -F -A3 'plot the'
src/llama-context.cpp
1045:        // plot the computation graph in dot format (for debugging purposes)
1046-        //if (n_past%100 == 0) {
1047-        //    ggml_graph_dump_dot(gf, NULL, "llama.dot");
1048-        //}

It's not in the correct place, though.

This should work (on 4a5686d, at least):

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 06e93b19c..964f255b3 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -7,6 +7,7 @@
 #include "llama-mmap.h"
 #include "llama-model.h"
 
+#include <algorithm>
 #include <cinttypes>
 #include <cstring>
 #include <limits>
@@ -709,6 +710,11 @@ llm_graph_result_ptr llama_context::process_ubatch(const llama_ubatch & ubatch,
 
     res->set_inputs(&ubatch);
 
+    // plot the computation graph in dot format (for debugging purposes)
+    if (std::find(ubatch.pos, ubatch.pos + ubatch.n_tokens, 100) != ubatch.pos + ubatch.n_tokens) {
+        ggml_graph_dump_dot(gf, NULL, "llama.dot");
+    }
+
     const auto status = graph_compute(gf, ubatch.n_tokens > 1);
     if (status != GGML_STATUS_SUCCESS) {
         LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status);
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index c21cc2880..397416c59 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -5215,7 +5215,7 @@ struct llm_build_llama : public llm_graph_context {
 
         ggml_tensor * inp_out_ids = build_inp_out_ids();
 
-        for (int il = 0; il < n_layer; ++il) {
+        for (int il = n_layer - 1; il < n_layer; ++il) {
             ggml_tensor * inpSA = inpL;
 
             // norm

Put the layer skip in the graph builder of the model type you want to visualize (the above patch assumes the llama arch is used). Then generating 100 tokens (you can change this number) in any manner should result in a llama.dot file.
There should be a log entry suggesting to run dot -Tpng llama.dot -o llama.dot.png to generate a PNG of the graph (assuming graphviz is installed).

@compilade merged commit a57d1bc into master on Jul 10, 2025
48 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Jul 10, 2025
* origin/master:
cmake : do not search for curl libraries by ourselves (ggml-org#14613)
SYCL: Initial set_rows kernel implementation (ggml-org#14562)
llama : minor coding style fix for smollm3 (ggml-org#14605)
cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
cmake : llguidance build parser library only (ggml-org#14608)
cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)

Signed-off-by: Gabe Goodhart <[email protected]>