llama : bump max seq limit from 64 to 256 #15916
* llama : validate seq id batch input (ggml-ci)
* cont : fix the fix (ggml-ci)
LGTM. Thank you @ggerganov for the quick reply and proposal. Using the provided `llama-batched-bench` command on an NVIDIA L40S GPU, we observe results similar to your sample output:
main: n_kv_max = 196608, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, n_gpu_layers = 100, n_threads = 2, n_threads_batch = 2
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 64 | 1 | 576 | 0.240 | 2134.38 | 0.567 | 112.85 | 0.807 | 713.75 |
| 512 | 64 | 2 | 1152 | 0.059 | 17264.34 | 0.628 | 203.83 | 0.687 | 1676.18 |
| 512 | 64 | 4 | 2304 | 0.115 | 17803.28 | 0.616 | 415.66 | 0.731 | 3152.19 |
| 512 | 64 | 8 | 4608 | 0.229 | 17864.38 | 0.722 | 708.96 | 0.951 | 4843.06 |
| 512 | 64 | 16 | 9216 | 0.459 | 17844.70 | 1.004 | 1019.98 | 1.463 | 6299.33 |
| 512 | 64 | 32 | 18432 | 0.906 | 18090.51 | 1.354 | 1512.27 | 2.260 | 8156.01 |
| 512 | 64 | 64 | 36864 | 1.826 | 17949.42 | 2.090 | 1959.79 | 3.916 | 9414.68 |
| 512 | 64 | 80 | 46080 | 2.269 | 18052.32 | 2.454 | 2086.00 | 4.723 | 9755.65 |
| 512 | 64 | 96 | 55296 | 2.736 | 17966.10 | 2.817 | 2181.17 | 5.553 | 9958.49 |
| 512 | 64 | 112 | 64512 | 3.204 | 17898.85 | 3.228 | 2220.42 | 6.432 | 10029.85 |
| 512 | 64 | 128 | 73728 | 3.671 | 17851.78 | 3.550 | 2307.66 | 7.221 | 10210.17 |
| 512 | 64 | 144 | 82944 | 4.140 | 17806.97 | 3.867 | 2383.19 | 8.007 | 10358.31 |
| 512 | 64 | 160 | 92160 | 4.604 | 17794.63 | 4.178 | 2450.88 | 8.782 | 10494.53 |
| 512 | 64 | 186 | 107136 | 5.369 | 17738.38 | 4.751 | 2505.67 | 10.120 | 10587.06 |
| 512 | 64 | 192 | 110592 | 5.552 | 17707.19 | 4.877 | 2519.68 | 10.428 | 10604.84 |
| 512 | 64 | 208 | 119808 | 6.055 | 17587.03 | 5.272 | 2525.00 | 11.327 | 10576.78 |
| 512 | 64 | 224 | 129024 | 6.534 | 17552.09 | 5.600 | 2559.89 | 12.134 | 10632.93 |
| 512 | 64 | 240 | 138240 | 7.020 | 17505.06 | 5.956 | 2579.03 | 12.975 | 10653.99 |
| 512 | 64 | 256 | 147456 | 7.506 | 17463.02 | 6.267 | 2614.53 | 13.772 | 10706.78 |
I believe this is a good step forward, and I agree that the best solution would still be to support dynamic configuration of `LLAMA_MAX_SEQ` to maximize flexibility for any use case.
I wasn't able to find a GH issue/discussion tracking the progress of this enhancement; I'm happy to start one if that's OK with you?
Thanks @ggerganov
A CMake option is not desired. We should instead try to make this parameter dynamic, based on the context input parameters, rather than a hardcoded constant.
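As a rough illustration of that direction (hypothetical names only, not the actual llama.cpp internals), a minimal sketch where the sequence cap is taken from the context parameters at init time instead of a compile-time `LLAMA_MAX_SEQ` constant could look like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: the requested number of parallel sequences comes in
// through the context parameters rather than a #define.
struct context_params_sketch {
    uint32_t n_seq_max; // requested number of parallel sequences
};

struct kv_cache_sketch {
    uint32_t n_seq_max;            // runtime cap, previously a hardcoded constant
    std::vector<int32_t> last_pos; // per-sequence bookkeeping, sized at init

    explicit kv_cache_sketch(const context_params_sketch & params)
        : n_seq_max(params.n_seq_max),
          last_pos(params.n_seq_max, -1) {}

    // the validation that used to compare against the hardcoded constant
    bool seq_id_valid(int32_t seq_id) const {
        return seq_id >= 0 && (uint32_t) seq_id < n_seq_max;
    }
};
```

The point of such a refactor would be that per-sequence bookkeeping is sized at context creation, so memory scales with the number of sequences actually requested rather than with a fixed upper bound.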
Is that for MoE models running at FP16/BF16/FP32 precision? If yes, the issue is that the CUDA backend lacks a GEMM kernel that can directly handle MoE for those data types (for >16 slots), so there is some extra overhead per eval that amortizes with larger batch sizes. For quantized data types such a kernel exists, so using e.g. q8_0 precision should be faster for MoE.
Generally speaking, unless the context for each request is very short, the runtime should be dominated by the attention rather than the weights, so I don't think there would be much speedup beyond 256 slots. As it is, that many slots are also just impractical with llama.cpp because you have to split the context between slots.
ref 4f81b33#commitcomment-165464897
I'm not able to measure any significant impact on the overall performance from bumping this value, so I guess it's OK to do so. We should probably refactor the implementation to support dynamically configuring `LLAMA_MAX_SEQ` in order to remove any potential concerns about performance.
Here is a basic command to test the parallel text generation performance for various batch sizes:
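A representative `llama-batched-bench` invocation along these lines, with the model path, context size, and batch-size list as placeholders chosen to mirror the runs reported above:

```sh
# placeholders: pick a model and adjust -c / -ngl / -npl to your setup
./llama-batched-bench -m model.gguf \
    -c 196608 -b 2048 -ub 512 -ngl 99 -t 2 \
    -npp 512 -ntg 64 \
    -npl 1,2,4,8,16,32,64,128,256
```

Here `-npp` and `-ntg` set the prompt and generation lengths per sequence, and `-npl` lists the parallel batch sizes to sweep.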
Sample output on M2 Ultra: