Description
Name and Version
% ~/projects/llama.gguf/llama-cli --version
version: 5033 (f01bd02)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
Operating systems
Mac
GGML backends
Metal
Hardware
Apple M2 Max
macOS 15.3
Models
https://huggingface.co/sydneyfong/Qwerky-QwQ-32B-Q6_K-GGUF
The above model was created using https://huggingface.co/spaces/ggml-org/gguf-my-repo
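For what it's worth, here is a rough sketch of how the GGUF could be re-created locally with the standard llama.cpp conversion flow, in case gguf-my-repo itself is a factor. The upstream model directory is a placeholder and I haven't re-verified these exact paths:
# convert the upstream HF checkpoint to an f16 GGUF (path is a placeholder)
python convert_hf_to_gguf.py <path-to-original-Qwerky-QwQ-32B> --outfile qwerky-qwq-32b-f16.gguf --outtype f16
# quantize to Q6_K to match the uploaded file
./build/bin/llama-quantize qwerky-qwq-32b-f16.gguf qwerky-qwq-32b-q6_k.gguf Q6_K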
Problem description & steps to reproduce
During inference in llama-cli, I get a wall of these log messages:
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Note that this model uses an RWKV-based architecture, which is apparently still relatively experimental, and the model originally failed to convert properly due to #12662, so there may be other issues at play.
First Bad Commit
Presumably the issue is related to #12695, since the logs say so.
Using a build from before PR #12695 seems to make the warning go away (as expected). However, the inference speed is abysmally slow. (This may or may not be due to the inherent nature of the RWKV arch.)
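For reference, roughly how a pre-#12695 build can be checked out and built to reproduce the comparison (sketch only, not necessarily the exact commands I ran; the merge commit hash is a placeholder, and the grep relies on llama.cpp's squash-merge subjects containing the PR number):
# locate the merge commit of PR #12695 and check out the commit just before it
git log --oneline --grep '#12695'
git checkout <merge-commit-of-12695>~1
# standard Release build; Metal is enabled by default on Apple Silicon
cmake -B build && cmake --build build --config Release -j
# same invocation as in the log below
./build/bin/llama-cli --no-escape -cnv -c 8192 -m ~/Downloads/qwerky-qwq-32b-q6_k.gguf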
Relevant log output
~/projects/llama.gguf/llama-cli --no-escape -cnv -c 8192 -m ~/Downloads/qwerky-qwq-32b-q6_k.gguf
build: 5033 (f01bd023) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 89999 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 1283 tensors from /Users/sidney_fong/Downloads/qwerky-qwq-32b-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = rwkv6qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwerky QwQ 32B
llama_model_loader: - kv 3: general.basename str = Qwerky-QwQ
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: rwkv6qwen2.context_length u32 = 1048576
llama_model_loader: - kv 7: rwkv6qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 8: rwkv6qwen2.block_count u32 = 64
llama_model_loader: - kv 9: rwkv6qwen2.wkv.head_size u32 = 128
llama_model_loader: - kv 10: rwkv6qwen2.time_mix_extra_dim u32 = 128
llama_model_loader: - kv 11: rwkv6qwen2.time_decay_extra_dim u32 = 128
llama_model_loader: - kv 12: rwkv6qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 13: rwkv6qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: rwkv6qwen2.token_shift_count u32 = 1
llama_model_loader: - kv 15: rwkv6qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 16: rwkv6qwen2.attention.head_count u32 = 0
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 26: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 18
llama_model_loader: - type f32: 769 tensors
llama_model_loader: - type q6_K: 514 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 28.20 GiB (6.93 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = rwkv6qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 0
print_info: n_head_kv = 8
print_info: n_rot = 0
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 0
print_info: n_embd_head_v = 0
print_info: n_gqa = 0
print_info: n_embd_k_gqa = 0
print_info: n_embd_v_gqa = 0
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = -1
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 34.95 B
print_info: general.name = Qwerky QwQ 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: Metal_Mapped model buffer size = 28876.19 MiB
load_tensors: CPU_Mapped model buffer size = 609.08 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = false
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 94371.84 MB
llama_context: CPU output buffer size = 0.58 MiB
init: kv_size = 1, offload = 1, type_k = 'f32', type_v = 'f32', n_layer = 64, can_shift = 0
init: Metal KV buffer size = 161.25 MiB
llama_context: KV self size = 161.25 MiB, K (f32): 1.25 MiB, V (f32): 160.00 MiB
llama_context: Metal compute buffer size = 317.00 MiB
llama_context: CPU compute buffer size = 12.50 MiB
llama_context: graph nodes = 5702
llama_context: graph splits = 259
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
main: llama threadpool init, n_threads = 8
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | BF16 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 860079672
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
> hello
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
<think>commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Okaycommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
,commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
thecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
usercommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
sentcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
acommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
"commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
hellocommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
".commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Icommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
shouldcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
respondcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
politelycommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
.commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Letcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
mecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
seecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
...commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Maybecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
saycommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
hellocommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
backcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)