kv-cache : pad the cache size to 256 for performance #17046
Conversation
tools/llama-bench/llama-bench.cpp (outdated)

```diff
 llama_context_params cparams = llama_context_default_params();

-cparams.n_ctx = n_prompt + n_gen + n_depth;
+cparams.n_ctx = GGML_PAD(n_prompt + n_gen + n_depth, 256);
```
I don't think this is good, applications shouldn't be responsible for this padding.
Move the padding to llama_context constructor?
If there is a significant performance advantage from doing that then yes, that seems like a good thing to do. Otherwise, every single application would need to pad n_ctx manually.
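For concreteness, a minimal standalone sketch of that idea (not the actual llama_context constructor): the hypothetical pad_up helper mirrors the rounding that GGML_PAD performs, and the numbers are only illustrative.

```cpp
#include <cstdint>
#include <cstdio>

// Round x up to the next multiple of n (the same rounding GGML_PAD performs).
static uint32_t pad_up(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

int main() {
    // a requested context size that is not a multiple of 256,
    // e.g. a llama-bench run with 512 prompt + 32 gen + 4096 depth tokens
    const uint32_t requested = 512 + 32 + 4096;         // 4640
    const uint32_t effective = pad_up(requested, 256);  // 4864

    // the effective size is never smaller than what the application asked for
    printf("requested n_ctx: %u, padded n_ctx: %u\n", requested, effective);
    return 0;
}
```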
To clarify, the padding in llama-bench here is important because we do the TG test at full context. So we hit the bounds of the unpadded cache buffer when generating tokens (given that the resulting cparams.n_ctx is not a multiple of 256):
src/llama-kv-cache.cpp, lines 975 to 989 at 2fd7b77:

```cpp
uint32_t llama_kv_cache::get_n_kv(const slot_info & sinfo) const {
    uint32_t result = 0;

    // pad the n_kv value so that the graph remains constant across batches and can be reused
    // note: this also helps some backends with performance (f.ex https://github.com/ggml-org/llama.cpp/pull/16812#issuecomment-3455112220)
    const uint32_t n_pad_cur = std::max(n_pad, 256u);

    for (uint32_t s = 0; s < sinfo.n_stream(); ++s) {
        const auto & cells = v_cells[sinfo.strm[s]];

        result = std::max(std::min(cells.size(), std::max(n_pad_cur, GGML_PAD(cells.used_max_p1(), n_pad_cur))), result);
    }

    return result;
}
```
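As a worked illustration of the clamp above (with hypothetical numbers, not values taken from a real run): once GGML_PAD(used_max_p1, 256) exceeds an unpadded cache size, the std::min clamps n_kv to the cache size and it stops being a multiple of 256.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t n_pad_cur  = 256;
    const uint32_t cache_size = 4640; // an unpadded cparams.n_ctx (512 + 32 + 4096)

    const uint32_t used_values[] = {4500, 4620}; // hypothetical cells.used_max_p1() values
    for (uint32_t used_max_p1 : used_values) {
        const uint32_t padded = ((used_max_p1 + n_pad_cur - 1) / n_pad_cur) * n_pad_cur; // GGML_PAD
        const uint32_t n_kv   = std::min(cache_size, std::max(n_pad_cur, padded));
        printf("used_max_p1 = %4u -> n_kv = %u\n", used_max_p1, n_kv);
    }
    // prints 4608 (a multiple of 256) for 4500, but 4640 (clamped, not round) for 4620
    return 0;
}
```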
Normally, applications will rarely experience this problem, because it shows up only when you are close to the full context.
I think it's ok for it to be in the constructor. The drawback is that you could have your requested n_ctx mutated which is not ideal. But that's probably less of a problem than the current one.
> The drawback is that you could have your requested n_ctx mutated which is not ideal.
As long as the value is never below what the application requested, it should be fine. The application can choose to use llama_n_ctx to take advantage of every slot, but it doesn't have to. I think this was already the situation before the padding was removed.
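A minimal sketch of that usage pattern, assuming the public llama.h API (llama_context_default_params, llama_init_from_model, llama_n_ctx, llama_free); the requested size here is only an example.

```cpp
#include "llama.h"

#include <cstdio>

void print_effective_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4640; // request a size that is not a multiple of 256

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        return;
    }

    // if the library rounds the size up internally, this can be larger than the
    // requested value, but it should never be smaller
    printf("effective n_ctx: %u\n", llama_n_ctx(ctx));

    llama_free(ctx);
}
```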
(force-pushed from 2fd7b77 to d2c30c6)
src/llama-kv-cache-iswa.cpp (outdated)

```cpp
// note: the SWA cache is always padded to 256 for performance
// https://github.com/ggml-org/llama.cpp/issues/17037
uint32_t size_swa = std::min(size_base, GGML_PAD(hparams.n_swa*(unified ? n_seq_max : 1) + n_ubatch, 256));
```
Nit, I think this relies on kv_size being padded to 256. The GGML_PAD could be moved outside the std::min to ensure that it is still padded if this changes in the future.
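A standalone sketch of that suggestion (not necessarily the code that was merged); the swa_cache_size helper and its parameters are hypothetical stand-ins for size_base, hparams.n_swa, n_seq_max and n_ubatch.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Apply the padding after the std::min, so the SWA cache size stays a multiple
// of 256 even if the base cache size (size_base) stops being padded in the future.
static uint32_t swa_cache_size(uint32_t size_base, uint32_t n_swa, uint32_t n_seq, uint32_t n_ubatch) {
    const uint32_t raw = n_swa*n_seq + n_ubatch;            // (unified ? n_seq_max : 1) collapsed into n_seq
    return ((std::min(size_base, raw) + 255u) / 256u) * 256u; // GGML_PAD(std::min(...), 256)
}

int main() {
    // hypothetical values: an unpadded base size and a small SWA window
    printf("size_swa = %u\n", swa_cache_size(/*size_base=*/4640, /*n_swa=*/1024, /*n_seq=*/1, /*n_ubatch=*/512));
    return 0;
}
```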
(force-pushed from 5108c7a to 7b2551a)
fix #17037 #17058
cont #16812
The small SWA caches can be padded to 256 without concerns about memory usage.

Pad the cache size to 256. This is friendly for the CUDA backend since the FA implementation benefits from round sizes of the K/V tensors. It can also help other backends. This is essentially a partial revert of #16812.
Comparing with before the regression in #16812:
```sh
GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b d2c30c61a llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 32 -mmp 0
```

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
```