kv-cache : use ggml_set_rows #14285
Conversation
I tried this PR with the following change in the RPC backend:

```diff
diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));
```

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash. But in any case, this looks like great progress!
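For context, `fnv_hash` here is just a fingerprint of the serialized graph bytes. A minimal sketch of that kind of FNV-1a hashing (the constants and function name below are illustrative, not necessarily identical to the helper in `ggml-rpc.cpp`):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// 64-bit FNV-1a over a byte buffer: xor each byte into the state, then
// multiply by the FNV prime. Used here purely as a cheap graph fingerprint.
static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}

int main() {
    std::vector<uint8_t> input = {0x01, 0x02, 0x03}; // stand-in for the serialized graph
    printf("hash = 0x%016llx, size = %zu\n",
           (unsigned long long) fnv1a_64(input.data(), input.size()), input.size());
    return 0;
}
```

Two graphs that differ only in embedded tensor addresses produce different hashes even if they are structurally identical, which is consistent with the gemma3 observation above.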
Yes, this is expected. I've applied the change only for the unified cache. For the unified+iswa cache, it is still using the original update path.
Should work with Gemma now as well.
The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with a CPU-only build, so I think the implementation is good enough.
@ggerganov the following test segfaults on my machine:
Apply this patch:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 735ef3f01..cc9b922fa 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -416,6 +416,7 @@ static bool ggml_backend_cpu_device_supports_op(ggml_backend_dev_t dev, const st
     switch (op->op) {
         case GGML_OP_CPY:
+        case GGML_OP_SET_ROWS:
             return
                 op->type != GGML_TYPE_IQ3_XXS &&
                 op->type != GGML_TYPE_IQ3_S &&
```
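For context, a simplified sketch of the kind of check the patch extends (the real `ggml_backend_cpu_device_supports_op` handles many more ops and a longer type list): the CPU device reports per-op, per-type support, and `GGML_OP_SET_ROWS` needs the same destination-type restrictions as `GGML_OP_CPY`.

```cpp
#include "ggml.h"

// Illustrative supports_op-style filter, not the actual function from
// ggml-cpu.cpp: GGML_OP_SET_ROWS shares the same destination-type
// blacklist as GGML_OP_CPY on the CPU backend.
static bool cpu_supports_op_sketch(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_CPY:
        case GGML_OP_SET_ROWS:
            // writing into these quant types is not supported here
            return op->type != GGML_TYPE_IQ3_XXS &&
                   op->type != GGML_TYPE_IQ3_S;  // the real list continues
        default:
            return true;
    }
}
```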
@slaren This is ready for a detailed review. I've prototyped 2 important use cases that can be enabled by adopting this change.

These still need some more work which, if we accept the current PR, I will do in next PRs. For now, this PR updates the KV cache logic to start using `ggml_set_rows()`. Without setting the `LLAMA_SET_ROWS` environment variable, the old update path is used.
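As a side note, a minimal sketch of how such an environment-variable gate can look (illustrative only; the variable name `LLAMA_SET_ROWS` is from this PR, while the helper name and the exact check are made up):

```cpp
#include <cstdlib>

// Return true when LLAMA_SET_ROWS is set to a non-empty, non-"0" value,
// i.e. when the new ggml_set_rows() KV-cache update path should be used
// instead of the view + cpy fallback.
static bool kv_use_set_rows() {
    const char * env = std::getenv("LLAMA_SET_ROWS");
    return env != nullptr && env[0] != '\0' && env[0] != '0';
}
```

With a gate like this, something along the lines of `LLAMA_SET_ROWS=1 ./llama-cli -m model.gguf ...` exercises the new path while the default behaviour stays unchanged.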
```cpp
slot_info res;
// ...
res.idxs.resize(n_tokens);
```
Replacing this with `reserve` and using `push_back` would remove the possibility of an OOB write.
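A small self-contained illustration of the suggestion (the `slot_info` shape here is a simplified assumption, not the actual struct from the PR):

```cpp
#include <cstdint>
#include <vector>

using idx_vec_t = std::vector<uint32_t>; // assumed alias

struct slot_info {
    idx_vec_t idxs;
};

// resize() + indexed writes: a bad index silently becomes an OOB write.
static slot_info fill_with_resize(uint32_t n_tokens) {
    slot_info res;
    res.idxs.resize(n_tokens);
    for (uint32_t i = 0; i < n_tokens; ++i) {
        res.idxs[i] = i; // stand-in for the chosen cache cell
    }
    return res;
}

// reserve() + push_back(): the vector only ever holds values that were
// actually produced, so an out-of-bounds write cannot happen while filling.
static slot_info fill_with_push_back(uint32_t n_tokens) {
    slot_info res;
    res.idxs.reserve(n_tokens);
    for (uint32_t i = 0; i < n_tokens; ++i) {
        res.idxs.push_back(i); // stand-in for the chosen cache cell
    }
    return res;
}
```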
```cpp
idx_vec_t idxs;
// ...
uint32_t head() const {
    return idxs[0];
}
```
Might be better to use `at` here for safety.
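The difference in a minimal sketch (simplified types, not the PR code): `operator[]` on an empty vector is undefined behaviour, while `at()` fails loudly:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

using idx_vec_t = std::vector<uint32_t>; // assumed alias

struct slot_info_sketch {
    idx_vec_t idxs;

    // idxs[0] on an empty vector is undefined behaviour (silent OOB read);
    // idxs.at(0) throws std::out_of_range instead, which is easier to debug.
    uint32_t head() const {
        return idxs.at(0);
    }
};
```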
cont #14274

Utilize `ggml_set_rows()` for updating the KV cache. This removes the reliance on the `head` offset.

Currently enabled only if the environment variable `LLAMA_SET_ROWS` is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of continuous slots (sketched below). This is needed until the `ggml_set_rows()` implementation is finalized and supported by all backends.

Testing

Next PRs

- `llama_kv_cache_unified` to support virtual sequences
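For illustration, a rough sketch of the two update paths described above, expressed as ggml graph ops (tensor names and shapes are invented for the example; this is not the actual `llama_kv_cache_unified` code):

```cpp
#include "ggml.h"

// Old path: the new K rows must occupy a contiguous slot range starting at
// `head`, so they are copied into a 1D view of the cache buffer.
static struct ggml_tensor * kv_update_view_cpy(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // [n_embd_k, kv_size]
        struct ggml_tensor  * k_cur,     // [n_embd_k, n_tokens]
        int64_t               head) {
    const int64_t n_embd_k = k_cur->ne[0];
    const int64_t n_tokens = k_cur->ne[1];

    struct ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
            n_tokens*n_embd_k,
            head*n_embd_k*ggml_element_size(k_cache));

    return ggml_cpy(ctx, k_cur, k_view);
}

// New path: the destination rows are given explicitly via an I64 index
// tensor, so the slots no longer need to be contiguous and no `head`
// offset has to be tracked.
static struct ggml_tensor * kv_update_set_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // [n_embd_k, kv_size]
        struct ggml_tensor  * k_cur,     // [n_embd_k, n_tokens]
        struct ggml_tensor  * k_idxs) {  // [n_tokens], GGML_TYPE_I64
    return ggml_set_rows(ctx, k_cache, k_cur, k_idxs);
}
```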