Description
Name and Version
% ~/projects/llama.gguf/llama-cli --version
version: 5033 (f01bd02)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
Operating systems
Mac
GGML backends
Metal
Hardware
Apple M2 Max
macOS 15.3
Models
https://huggingface.co/sydneyfong/Qwerky-QwQ-32B-Q6_K-GGUF
The above model was created using https://huggingface.co/spaces/ggml-org/gguf-my-repo
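For what it's worth, here is a rough sketch of how the GGUF could be re-created locally with the standard llama.cpp conversion flow, in case gguf-my-repo itself is a factor. The upstream model directory is a placeholder and I haven't re-verified these exact paths:
# convert the upstream HF checkpoint to an f16 GGUF (path is a placeholder)
python convert_hf_to_gguf.py <path-to-original-Qwerky-QwQ-32B> --outfile qwerky-qwq-32b-f16.gguf --outtype f16
# quantize to Q6_K to match the uploaded file
./build/bin/llama-quantize qwerky-qwq-32b-f16.gguf qwerky-qwq-32b-q6_k.gguf Q6_K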
Problem description & steps to reproduce
During inference in llama-cli, I get a wall of these log messages:
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Note that this model uses an RWKV-based architecture, which is apparently still relatively experimental, and the model originally failed to convert properly due to #12662, so there may be other issues at play.
First Bad Commit
Presumably the issue is related to #12695, since the logs say so.
Using a build from before PR #12695 seems to make the warning go away (as expected). However, the inference speed is abysmally slow. (This may or may not be due to the inherent nature of the RWKV arch.)
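For reference, roughly how a pre-#12695 build can be checked out and built to reproduce the comparison (sketch only, not necessarily the exact commands I ran; the merge commit hash is a placeholder, and the grep relies on llama.cpp's squash-merge subjects containing the PR number):
# locate the merge commit of PR #12695 and check out the commit just before it
git log --oneline --grep '#12695'
git checkout <merge-commit-of-12695>~1
# standard Release build; Metal is enabled by default on Apple Silicon
cmake -B build && cmake --build build --config Release -j
# same invocation as in the log below
./build/bin/llama-cli --no-escape -cnv -c 8192 -m ~/Downloads/qwerky-qwq-32b-q6_k.gguf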
Relevant log output
~/projects/llama.gguf/llama-cli --no-escape -cnv -c 8192 -m ~/Downloads/qwerky-qwq-32b-q6_k.gguf
build: 5033 (f01bd023) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 89999 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 1283 tensors from /Users/sidney_fong/Downloads/qwerky-qwq-32b-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = rwkv6qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwerky QwQ 32B
llama_model_loader: - kv 3: general.basename str = Qwerky-QwQ
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: rwkv6qwen2.context_length u32 = 1048576
llama_model_loader: - kv 7: rwkv6qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 8: rwkv6qwen2.block_count u32 = 64
llama_model_loader: - kv 9: rwkv6qwen2.wkv.head_size u32 = 128
llama_model_loader: - kv 10: rwkv6qwen2.time_mix_extra_dim u32 = 128
llama_model_loader: - kv 11: rwkv6qwen2.time_decay_extra_dim u32 = 128
llama_model_loader: - kv 12: rwkv6qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 13: rwkv6qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: rwkv6qwen2.token_shift_count u32 = 1
llama_model_loader: - kv 15: rwkv6qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 16: rwkv6qwen2.attention.head_count u32 = 0
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 26: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 18
llama_model_loader: - type f32: 769 tensors
llama_model_loader: - type q6_K: 514 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 28.20 GiB (6.93 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = rwkv6qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 0
print_info: n_head_kv = 8
print_info: n_rot = 0
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 0
print_info: n_embd_head_v = 0
print_info: n_gqa = 0
print_info: n_embd_k_gqa = 0
print_info: n_embd_v_gqa = 0
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = -1
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 34.95 B
print_info: general.name = Qwerky QwQ 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: Metal_Mapped model buffer size = 28876.19 MiB
load_tensors: CPU_Mapped model buffer size = 609.08 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = false
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 94371.84 MB
llama_context: CPU output buffer size = 0.58 MiB
init: kv_size = 1, offload = 1, type_k = 'f32', type_v = 'f32', n_layer = 64, can_shift = 0
init: Metal KV buffer size = 161.25 MiB
llama_context: KV self size = 161.25 MiB, K (f32): 1.25 MiB, V (f32): 160.00 MiB
llama_context: Metal compute buffer size = 317.00 MiB
llama_context: CPU compute buffer size = 12.50 MiB
llama_context: graph nodes = 5702
llama_context: graph splits = 259
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
main: llama threadpool init, n_threads = 8
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | BF16 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 860079672
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
> hello
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
<think>commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Okaycommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
,commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
thecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
usercommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
sentcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
acommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
"commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
hellocommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
".commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Icommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
shouldcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
respondcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
politelycommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
.commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Letcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
mecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
seecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
...commit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
Maybecommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
saycommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
hellocommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)
backcommit: no pending KV cache updates to commit - might indicate a bug (ref: https://github.com/ggml-org/llama.cpp/pull/12695)