
Conversation

ggerganov
Member

@ggerganov ggerganov commented Sep 10, 2025

ref 4f81b33#commitcomment-165464897

I'm not able to measure any significant impact on the overall performance from bumping this value, so I think it's OK to do so. We should probably refactor the implementation to support configuring LLAMA_MAX_SEQ dynamically, which would remove any remaining performance concerns.

Here is a basic command to test the parallel text generation performance for various batch sizes:

llama-batched-bench -hf ggml-org/gemma-3-4b-it-GGUF:Q8_0 -c 150000 -npp 512 -ntg 64 -npl 1,2,4,8,16,32,64,80,96,112,128,144,160,186,192,208,224,240,256
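(In this command, -c sets the total context size shared by the parallel sequences, -npp and -ntg set the prompt and generated-token counts per sequence, and -npl lists the parallel-sequence counts to sweep.)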

Sample output on M2 Ultra:

main: n_kv_max = 196608, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     64 |    1 |    576 |    0.197 |  2594.53 |    0.654 |    97.86 |    0.851 |   676.58 |
|   512 |     64 |    2 |   1152 |    0.387 |  2648.99 |    0.712 |   179.89 |    1.098 |  1049.06 |
|   512 |     64 |    4 |   2304 |    0.770 |  2659.82 |    0.951 |   269.05 |    1.721 |  1338.40 |
|   512 |     64 |    8 |   4608 |    1.542 |  2656.28 |    1.296 |   395.01 |    2.838 |  1623.59 |
|   512 |     64 |   16 |   9216 |    3.078 |  2661.10 |    3.221 |   317.89 |    6.300 |  1462.93 |
|   512 |     64 |   32 |  18432 |    6.175 |  2653.36 |    3.422 |   598.44 |    9.597 |  1920.60 |
|   512 |     64 |   64 |  36864 |   12.318 |  2660.25 |    4.063 |  1008.05 |   16.381 |  2250.42 |
|   512 |     64 |   80 |  46080 |   15.393 |  2661.00 |    5.032 |  1017.58 |   20.424 |  2256.14 |
|   512 |     64 |   96 |  55296 |   18.473 |  2660.69 |    5.210 |  1179.32 |   23.683 |  2334.83 |
|   512 |     64 |  112 |  64512 |   21.548 |  2661.17 |    6.157 |  1164.16 |   27.706 |  2328.48 |
|   512 |     64 |  128 |  73728 |   24.635 |  2660.23 |    6.365 |  1287.09 |   31.000 |  2378.31 |
|   512 |     64 |  144 |  82944 |   27.716 |  2660.13 |    7.384 |  1248.14 |   35.100 |  2363.09 |
|   512 |     64 |  160 |  92160 |   30.802 |  2659.55 |    7.688 |  1331.86 |   38.491 |  2394.35 |
|   512 |     64 |  186 | 107136 |   38.703 |  2460.56 |    8.963 |  1328.18 |   47.666 |  2247.64 |
|   512 |     64 |  192 | 110592 |   39.155 |  2510.66 |    9.029 |  1360.96 |   48.184 |  2295.22 |
|   512 |     64 |  208 | 119808 |   40.047 |  2659.27 |    9.854 |  1350.91 |   49.901 |  2400.91 |
|   512 |     64 |  224 | 129024 |   43.099 |  2661.06 |   10.151 |  1412.32 |   53.249 |  2423.02 |
|   512 |     64 |  240 | 138240 |   46.167 |  2661.63 |   11.240 |  1366.60 |   57.407 |  2408.08 |
|   512 |     64 |  256 | 147456 |   49.251 |  2661.33 |   11.502 |  1424.50 |   60.752 |  2427.17 |

ggerganov referenced this pull request Sep 10, 2025
* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci
Contributor

@matiaslin matiaslin left a comment


LGTM. Thank you @ggerganov for the quick reply and proposal.

Using the provided llama-batched-bench command on an NVIDIA L40S GPU, we observe results similar to your sample output:

main: n_kv_max = 196608, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, n_gpu_layers = 100, n_threads = 2, n_threads_batch = 2

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     64 |    1 |    576 |    0.240 |  2134.38 |    0.567 |   112.85 |    0.807 |   713.75 |
|   512 |     64 |    2 |   1152 |    0.059 | 17264.34 |    0.628 |   203.83 |    0.687 |  1676.18 |
|   512 |     64 |    4 |   2304 |    0.115 | 17803.28 |    0.616 |   415.66 |    0.731 |  3152.19 |
|   512 |     64 |    8 |   4608 |    0.229 | 17864.38 |    0.722 |   708.96 |    0.951 |  4843.06 |
|   512 |     64 |   16 |   9216 |    0.459 | 17844.70 |    1.004 |  1019.98 |    1.463 |  6299.33 |
|   512 |     64 |   32 |  18432 |    0.906 | 18090.51 |    1.354 |  1512.27 |    2.260 |  8156.01 |
|   512 |     64 |   64 |  36864 |    1.826 | 17949.42 |    2.090 |  1959.79 |    3.916 |  9414.68 |
|   512 |     64 |   80 |  46080 |    2.269 | 18052.32 |    2.454 |  2086.00 |    4.723 |  9755.65 |
|   512 |     64 |   96 |  55296 |    2.736 | 17966.10 |    2.817 |  2181.17 |    5.553 |  9958.49 |
|   512 |     64 |  112 |  64512 |    3.204 | 17898.85 |    3.228 |  2220.42 |    6.432 | 10029.85 |
|   512 |     64 |  128 |  73728 |    3.671 | 17851.78 |    3.550 |  2307.66 |    7.221 | 10210.17 |
|   512 |     64 |  144 |  82944 |    4.140 | 17806.97 |    3.867 |  2383.19 |    8.007 | 10358.31 |
|   512 |     64 |  160 |  92160 |    4.604 | 17794.63 |    4.178 |  2450.88 |    8.782 | 10494.53 |
|   512 |     64 |  186 | 107136 |    5.369 | 17738.38 |    4.751 |  2505.67 |   10.120 | 10587.06 |
|   512 |     64 |  192 | 110592 |    5.552 | 17707.19 |    4.877 |  2519.68 |   10.428 | 10604.84 |
|   512 |     64 |  208 | 119808 |    6.055 | 17587.03 |    5.272 |  2525.00 |   11.327 | 10576.78 |
|   512 |     64 |  224 | 129024 |    6.534 | 17552.09 |    5.600 |  2559.89 |   12.134 | 10632.93 |
|   512 |     64 |  240 | 138240 |    7.020 | 17505.06 |    5.956 |  2579.03 |   12.975 | 10653.99 |
|   512 |     64 |  256 | 147456 |    7.506 | 17463.02 |    6.267 |  2614.53 |   13.772 | 10706.78 |

I believe this is a good step forward, and I agree that the best solution would still be to support dynamic configuration of LLAMA_MAX_SEQ to maximize flexibility for any use case.

I wasn't able to find a GH issue/discussion tracking the progress of this enhancement. I'm happy to start one, if that's OK with you?

@WilliamTambellini
Contributor

Thanks @ggerganov
We confirm that for some LLMs, today's NVIDIA GPUs (L40, H20, H100, ...), and our type of prompts, we will be able to go beyond 256 sequences with further speedup.
Would you consider/review a later PR that converts LLAMA_MAX_SEQ into a CMake option (i.e. a 'define')?
Best
William
(PS: Matias works for my team)

@ggerganov
Member Author

A CMake option is not desired. We should instead try to make this parameter dynamic, based on the context input parameters, rather than a hardcoded constant.
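A rough sketch of that direction (illustrative only; the struct and member names below are not actual llama.cpp internals) could size the per-sequence bookkeeping from the n_seq_max field of llama_context_params at context creation time, instead of from the compile-time constant:

// Illustrative sketch only - not actual llama.cpp internals.
// Idea: size per-sequence bookkeeping from the n_seq_max context parameter
// at context creation time, instead of from the LLAMA_MAX_SEQ macro.

#include <cstdint>
#include <vector>

struct seq_bookkeeping {
    int32_t pos_min = -1;
    int32_t pos_max = -1;
};

struct context_state {
    uint32_t n_seq_max;                 // copied from llama_context_params::n_seq_max
    std::vector<seq_bookkeeping> seqs;  // dynamically sized, replaces a fixed array[LLAMA_MAX_SEQ]

    explicit context_state(uint32_t n_seq_max_in)
        : n_seq_max(n_seq_max_in), seqs(n_seq_max_in) {}

    // batch validation checks against the runtime limit, not the macro
    bool is_valid_seq_id(int32_t seq_id) const {
        return seq_id >= 0 && (uint32_t) seq_id < n_seq_max;
    }
};

Batch validation would then check sequence ids against the runtime limit rather than the macro.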

@JohannesGaessler
Collaborator

We confirm that for some LLMs, today's NVIDIA GPUs (L40, H20, H100, ...), and our type of prompts, we will be able to go beyond 256 sequences with further speedup.

Is that for MoE models running at FP16/BF16/FP32 precision? If yes, the issue is that the CUDA backend lacks a GEMM kernel that can directly handle MoE for those datatypes (for >16 slots), so there is some extra overhead per eval that amortizes with larger batch sizes. For quantized datatypes such a kernel exists, so using e.g. q8_0 precision should be faster for MoE.

Generally speaking, unless the context for each request is very short, the runtime should be dominated by the attention rather than by the weights, so I don't think there should be much speedup beyond 256 slots. As it is, that many slots are also just impractical with llama.cpp because you have to split the context between slots.
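For a concrete example from the run above: a 150000-token context split across 256 slots leaves roughly 150000 / 256 ≈ 586 tokens per slot, which is only just enough for the 512 + 64 = 576 tokens each sequence uses in that benchmark.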

@ggerganov ggerganov merged commit e58174c into master Sep 18, 2025
54 of 55 checks passed
@ggerganov ggerganov deleted the gg/llama-bump-max-seq branch September 18, 2025 09:47