llama : bump max seq limit from 64 to 256 #15916
* llama : validate seq id batch input (ggml-ci)
* cont : fix the fix (ggml-ci)
LGTM. Thank you @ggerganov for the quick reply and proposal. Using the provided `llama-batched-bench` command on an NVIDIA L40S GPU, we observe results similar to your sample output:
main: n_kv_max = 196608, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, n_gpu_layers = 100, n_threads = 2, n_threads_batch = 2
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 64 | 1 | 576 | 0.240 | 2134.38 | 0.567 | 112.85 | 0.807 | 713.75 |
| 512 | 64 | 2 | 1152 | 0.059 | 17264.34 | 0.628 | 203.83 | 0.687 | 1676.18 |
| 512 | 64 | 4 | 2304 | 0.115 | 17803.28 | 0.616 | 415.66 | 0.731 | 3152.19 |
| 512 | 64 | 8 | 4608 | 0.229 | 17864.38 | 0.722 | 708.96 | 0.951 | 4843.06 |
| 512 | 64 | 16 | 9216 | 0.459 | 17844.70 | 1.004 | 1019.98 | 1.463 | 6299.33 |
| 512 | 64 | 32 | 18432 | 0.906 | 18090.51 | 1.354 | 1512.27 | 2.260 | 8156.01 |
| 512 | 64 | 64 | 36864 | 1.826 | 17949.42 | 2.090 | 1959.79 | 3.916 | 9414.68 |
| 512 | 64 | 80 | 46080 | 2.269 | 18052.32 | 2.454 | 2086.00 | 4.723 | 9755.65 |
| 512 | 64 | 96 | 55296 | 2.736 | 17966.10 | 2.817 | 2181.17 | 5.553 | 9958.49 |
| 512 | 64 | 112 | 64512 | 3.204 | 17898.85 | 3.228 | 2220.42 | 6.432 | 10029.85 |
| 512 | 64 | 128 | 73728 | 3.671 | 17851.78 | 3.550 | 2307.66 | 7.221 | 10210.17 |
| 512 | 64 | 144 | 82944 | 4.140 | 17806.97 | 3.867 | 2383.19 | 8.007 | 10358.31 |
| 512 | 64 | 160 | 92160 | 4.604 | 17794.63 | 4.178 | 2450.88 | 8.782 | 10494.53 |
| 512 | 64 | 186 | 107136 | 5.369 | 17738.38 | 4.751 | 2505.67 | 10.120 | 10587.06 |
| 512 | 64 | 192 | 110592 | 5.552 | 17707.19 | 4.877 | 2519.68 | 10.428 | 10604.84 |
| 512 | 64 | 208 | 119808 | 6.055 | 17587.03 | 5.272 | 2525.00 | 11.327 | 10576.78 |
| 512 | 64 | 224 | 129024 | 6.534 | 17552.09 | 5.600 | 2559.89 | 12.134 | 10632.93 |
| 512 | 64 | 240 | 138240 | 7.020 | 17505.06 | 5.956 | 2579.03 | 12.975 | 10653.99 |
| 512 | 64 | 256 | 147456 | 7.506 | 17463.02 | 6.267 | 2614.53 | 13.772 | 10706.78 |
I believe this is a good step forward, and I agree that the best solution would still be to support dynamic configuration of `LLAMA_MAX_SEQ` to maximize flexibility for any use case.
I wasn't able to find a GH issue/discussion tracking the progress of this enhancement; I'm happy to start one if that's OK with you?
Thanks @ggerganov
A CMake option is not desired. We should instead try to make this parameter dynamic, based on the context input parameters, rather than a hardcoded constant.
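As a rough illustration of that direction (hypothetical names only, not the actual llama.cpp internals), a minimal sketch where the sequence cap is taken from the context parameters at init time instead of a compile-time `LLAMA_MAX_SEQ` constant could look like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: the requested number of parallel sequences comes in
// through the context parameters rather than a #define.
struct context_params_sketch {
    uint32_t n_seq_max; // requested number of parallel sequences
};

struct kv_cache_sketch {
    uint32_t n_seq_max;            // runtime cap, previously a hardcoded constant
    std::vector<int32_t> last_pos; // per-sequence bookkeeping, sized at init

    explicit kv_cache_sketch(const context_params_sketch & params)
        : n_seq_max(params.n_seq_max),
          last_pos(params.n_seq_max, -1) {}

    // the validation that used to compare against the hardcoded constant
    bool seq_id_valid(int32_t seq_id) const {
        return seq_id >= 0 && (uint32_t) seq_id < n_seq_max;
    }
};
```

The point of such a refactor would be that per-sequence bookkeeping is sized at context creation, so memory scales with the number of sequences actually requested rather than with a fixed upper bound.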
Is that for MoE models running at FP16/BF16/FP32 precision? If yes, the issue is that the CUDA backend lacks a GEMM kernel that can directly handle MoE for those data types (for >16 slots), so there is some extra overhead per eval that amortizes with larger batch sizes. For quantized data types such a kernel exists, so using e.g. q8_0 precision should be faster for MoE.
Generally speaking, unless the context for each request is very short, the runtime should be dominated by the attention rather than the weights, so I don't think there would be much speedup beyond 256 slots. As it is, that many slots are also just impractical with llama.cpp because you have to split the context between slots.
ref 4f81b33#commitcomment-165464897
I'm not able to measure any significant impact on the overall performance from bumping this value, so I guess it's OK to do so. We should probably refactor the implementation to support dynamically configuring `LLAMA_MAX_SEQ` in order to remove any potential concerns about performance.
Here is a basic command to test the parallel text generation performance for various batch sizes:
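A representative `llama-batched-bench` invocation along these lines, with the model path, context size, and batch-size list as placeholders chosen to mirror the runs reported above:

```sh
# placeholders: pick a model and adjust -c / -ngl / -npl to your setup
./llama-batched-bench -m model.gguf \
    -c 196608 -b 2048 -ub 512 -ngl 99 -t 2 \
    -npp 512 -ntg 64 \
    -npl 1,2,4,8,16,32,64,128,256
```

Here `-npp` and `-ntg` set the prompt and generation lengths per sequence, and `-npl` lists the parallel batch sizes to sweep.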
Sample output on M2 Ultra: