kv-cache : use ggml_set_rows #14285
Conversation
I tried this PR with the following change in the RPC backend:

```diff
diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));
```

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash. But in any case, this looks like great progress!
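For context, `fnv_hash` here is just a fingerprint of the serialized graph bytes. A minimal sketch of that kind of FNV-1a hashing (the constants and function name below are illustrative, not necessarily identical to the helper in `ggml-rpc.cpp`):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// 64-bit FNV-1a over a byte buffer: xor each byte into the state, then
// multiply by the FNV prime. Used here purely as a cheap graph fingerprint.
static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}

int main() {
    std::vector<uint8_t> input = {0x01, 0x02, 0x03}; // stand-in for the serialized graph
    printf("hash = 0x%016llx, size = %zu\n",
           (unsigned long long) fnv1a_64(input.data(), input.size()), input.size());
    return 0;
}
```

Two graphs that differ only in embedded tensor addresses produce different hashes even if they are structurally identical, which is consistent with the gemma3 observation above.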
Yes, this is expected. I've applied the change only for the unified cache. For the unified+iswa cache, it is still using the original update path.
Should work with Gemma now as well.
The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with a CPU-only build, so I think the implementation is good enough.
@ggerganov the following test segfaults on my machine:
Apply this patch:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 735ef3f01..cc9b922fa 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -416,6 +416,7 @@ static bool ggml_backend_cpu_device_supports_op(ggml_backend_dev_t dev, const st
     switch (op->op) {
         case GGML_OP_CPY:
+        case GGML_OP_SET_ROWS:
             return
                 op->type != GGML_TYPE_IQ3_XXS &&
                 op->type != GGML_TYPE_IQ3_S &&
```
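For context, a simplified sketch of the kind of check the patch extends (the real `ggml_backend_cpu_device_supports_op` handles many more ops and a longer type list): the CPU device reports per-op, per-type support, and `GGML_OP_SET_ROWS` needs the same destination-type restrictions as `GGML_OP_CPY`.

```cpp
#include "ggml.h"

// Illustrative supports_op-style filter, not the actual function from
// ggml-cpu.cpp: GGML_OP_SET_ROWS shares the same destination-type
// blacklist as GGML_OP_CPY on the CPU backend.
static bool cpu_supports_op_sketch(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_CPY:
        case GGML_OP_SET_ROWS:
            // writing into these quant types is not supported here
            return op->type != GGML_TYPE_IQ3_XXS &&
                   op->type != GGML_TYPE_IQ3_S;  // the real list continues
        default:
            return true;
    }
}
```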
@slaren This is ready for a detailed review. I've prototyped 2 important use cases that can be enabled by adopting this change.

These still need some more work which, if we accept the current PR, I will do in next PRs. For now, this PR updates the KV cache logic to start using `ggml_set_rows()`. Without setting the `LLAMA_SET_ROWS` environment variable, the old update path is used.
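As a side note, a minimal sketch of how such an environment-variable gate can look (illustrative only; the variable name `LLAMA_SET_ROWS` is from this PR, while the helper name and the exact check are made up):

```cpp
#include <cstdlib>

// Return true when LLAMA_SET_ROWS is set to a non-empty, non-"0" value,
// i.e. when the new ggml_set_rows() KV-cache update path should be used
// instead of the view + cpy fallback.
static bool kv_use_set_rows() {
    const char * env = std::getenv("LLAMA_SET_ROWS");
    return env != nullptr && env[0] != '\0' && env[0] != '0';
}
```

With a gate like this, something along the lines of `LLAMA_SET_ROWS=1 ./llama-cli -m model.gguf ...` exercises the new path while the default behaviour stays unchanged.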
```cpp
slot_info res;
// ...
res.idxs.resize(n_tokens);
```
Replacing this with `reserve` and using `push_back` would remove the possibility of an OOB write.
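A small self-contained illustration of the suggestion (the `slot_info` shape here is a simplified assumption, not the actual struct from the PR):

```cpp
#include <cstdint>
#include <vector>

using idx_vec_t = std::vector<uint32_t>; // assumed alias

struct slot_info {
    idx_vec_t idxs;
};

// resize() + indexed writes: a bad index silently becomes an OOB write.
static slot_info fill_with_resize(uint32_t n_tokens) {
    slot_info res;
    res.idxs.resize(n_tokens);
    for (uint32_t i = 0; i < n_tokens; ++i) {
        res.idxs[i] = i; // stand-in for the chosen cache cell
    }
    return res;
}

// reserve() + push_back(): the vector only ever holds values that were
// actually produced, so an out-of-bounds write cannot happen while filling.
static slot_info fill_with_push_back(uint32_t n_tokens) {
    slot_info res;
    res.idxs.reserve(n_tokens);
    for (uint32_t i = 0; i < n_tokens; ++i) {
        res.idxs.push_back(i); // stand-in for the chosen cache cell
    }
    return res;
}
```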
```cpp
idx_vec_t idxs;
// ...
uint32_t head() const {
    return idxs[0];
}
```
Might be better to use `at` here for safety.
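The difference in a minimal sketch (simplified types, not the PR code): `operator[]` on an empty vector is undefined behaviour, while `at()` fails loudly:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

using idx_vec_t = std::vector<uint32_t>; // assumed alias

struct slot_info_sketch {
    idx_vec_t idxs;

    // idxs[0] on an empty vector is undefined behaviour (silent OOB read);
    // idxs.at(0) throws std::out_of_range instead, which is easier to debug.
    uint32_t head() const {
        return idxs.at(0);
    }
};
```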
cont #14274

Utilize `ggml_set_rows()` for updating the KV cache. This removes the reliance on the `head` offset.

Currently enabled only if the environment variable `LLAMA_SET_ROWS` is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of continuous slots (sketched below). This is needed until the `ggml_set_rows()` implementation is finalized and supported by all backends.

Testing

Next PRs

- `llama_kv_cache_unified` to support virtual sequences
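For illustration, a rough sketch of the two update paths described above, expressed as ggml graph ops (tensor names and shapes are invented for the example; this is not the actual `llama_kv_cache_unified` code):

```cpp
#include "ggml.h"

// Old path: the new K rows must occupy a contiguous slot range starting at
// `head`, so they are copied into a 1D view of the cache buffer.
static struct ggml_tensor * kv_update_view_cpy(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // [n_embd_k, kv_size]
        struct ggml_tensor  * k_cur,     // [n_embd_k, n_tokens]
        int64_t               head) {
    const int64_t n_embd_k = k_cur->ne[0];
    const int64_t n_tokens = k_cur->ne[1];

    struct ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
            n_tokens*n_embd_k,
            head*n_embd_k*ggml_element_size(k_cache));

    return ggml_cpy(ctx, k_cur, k_view);
}

// New path: the destination rows are given explicitly via an I64 index
// tensor, so the slots no longer need to be contiguous and no `head`
// offset has to be tracked.
static struct ggml_tensor * kv_update_set_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // [n_embd_k, kv_size]
        struct ggml_tensor  * k_cur,     // [n_embd_k, n_tokens]
        struct ggml_tensor  * k_idxs) {  // [n_tokens], GGML_TYPE_I64
    return ggml_set_rows(ctx, k_cache, k_cur, k_idxs);
}
```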