
Compile bug: Vulkan cannot work on Android (cross-compilation from Linux) - Aborted without explanation #11327

Open
samkoesnadi opened this issue Jan 21, 2025 · 30 comments


@samkoesnadi

Git commit

2139667

Operating systems

Linux, Other? (Please let us know in description)

GGML backends

Vulkan

Problem description & steps to reproduce

I have followed all the instructions and all the existing solutions for building Vulkan on Android using the cross-compilation method. I just cannot seem to make it work. The CLI simply aborts without explanation.

My phone is a Redmi Note 13 Pro 5G, with a Qualcomm CPU and an Adreno GPU.
The operating system I use to cross-compile is Linux, although I also tried cross-compiling on Windows with the exact same issue.
NDK 26 and 28 give the same result.

I have attached the log output below. Thank you in advance!

First Bad Commit

No response

Compile command

cmake   -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake   -DANDROID_ABI=arm64-v8a   -DANDROID_PLATFORM=latest   -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod   -DGGML_VULKAN=ON   -DGGML_VULKAN_CHECK_RESULTS=OFF   -DGGML_VULKAN_DEBUG=ON   -DGGML_VULKAN_MEMORY_DEBUG=ON   -DGGML_VULKAN_SHADER_DEBUG_INFO=ON   -DGGML_VULKAN_PERF=OFF   -DGGML_VULKAN_VALIDATE=OFF   -DGGML_VULKAN_RUN_TESTS=OFF -DVK_USE_PLATFORM_ANDROID_KHR=ON  -B build-android
cmake --build build-android --config Release -j8
cmake --install build-android --prefix install-android --config Release
adb push install-android /data/local/tmp/
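
For reference, after pushing, the binaries are then launched from an adb shell roughly like this (the model path and -ngl value are only examples; the lib/ and bin/ directories follow the install layout above):

adb shell
cd /data/local/tmp/install-android
LD_LIBRARY_PATH=lib ./bin/llama-cli -m /data/local/tmp/model.gguf -p "Hello" -ngl 29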

Relevant log output

ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(0)
ggml_vulkan: 0 = Adreno (TM) 710 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: none
build: 4520 (2139667e) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 710) - 7301 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 338 tensors from /data/local/tmp/Qwen2-VL-2B-Instruct-Q4_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 2B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 VL 2B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-2B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv  14:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  15:                   qwen2vl.embedding_length u32              = 1536
llama_model_loader: - kv  16:                qwen2vl.feed_forward_length u32              = 8960
llama_model_loader: - kv  17:               qwen2vl.attention.head_count u32              = 12
llama_model_loader: - kv  18:            qwen2vl.attention.head_count_kv u32              = 2
llama_model_loader: - kv  19:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 15
llama_model_loader: - kv  22:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-2B-Instruct-GGUF...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   28 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 988.60 MiB (5.37 BPW) 
load: special tokens cache size = 14
ggml_vk_get_device(0)
Initializing new vk_device
load: token to piece cache size = 0.9309 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1536
ggml_vk_find_queue_family_index()print_info: n_layer          = 28

ggml_vk_find_queue_family_index()
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.5B
print_info: model params     = 1.54 B
print_info: general.name     = Qwen2 VL 2B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 148848 'ÄĬ'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: max token length = 256
ggml_vk_create_queue()
ggml_vk_load_shaders(Vulkan0)
ggml_vulkan: Compiling shadersggml_vk_create_pipeline(ggml_vk_create_pipeline(Vulkan0, matmul_f32_f32_m, main, 3ggml_vk_create_pipeline(, 56, (64Vulkan0, matmul_f32_f32_l, main, 3ggml_vk_create_pipeline(, 56, (128,64,Vulkan0ggml_vk_create_pipeline(,ggml_vk_create_pipeline(Vulkan0ggml_vk_create_pipeline(, matmul_f32_f32_aligned_s, main, matmul_f32_f32_s, Vulkan0ggml_vk_create_pipeline(1128,1), specialization_constants, 1Vulkan0, , , 3), specialization_constants, 1, main, 56, (32Vulkan0matmul_f32_f16_l, matmul_f32_f32_aligned_m, , main, 3, 0, 0, 00, 0, 3, mainmatmul_f32_f16_m, , 0)
, 56, (128,128,1), specialization_constants, 1, 0, 0Vulkan0, 0)
main, 3, 56, (56, (32,32,1), specialization_constants, 1, 0, 0, 0)
, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
)
, matmul_f32_f32_aligned_l, main, 3, 56,64,64,1), specialization_constants, 1, 0, 0, 0)
32,1), specialization_constants, 32, 0, , (128,128,1), specialization_constants, 128, 0, 0, 0)
0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_f32_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f32_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f32_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f32_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f16acc_l, main, 3, 56, (128,128,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f16acc_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_f16acc_l, main, 3, 56, (128,128,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_f16acc_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_f16_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_0_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_0_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_q4_0_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_0_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_0_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_0_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_1_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_1_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_1_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_1_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0ggml_vk_create_pipeline(Vulkan0, matmul_q4_1_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_1_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_q5_0_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_0_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_0_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_0_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_0_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_0_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_1_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_1_f32_f16acc_s, main, ggml_vk_create_pipeline(Vulkan0, matmul_q5_1_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_1_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_q5_1_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_1_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q8_0_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q8_0_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q8_0_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q8_0_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q8_0_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q8_0_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q2_k_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q2_k_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_q2_k_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q2_k_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q2_k_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q2_k_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q3_k_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q3_k_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q3_k_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q3_k_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q3_k_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q3_k_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_q4_k_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_k_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_k_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_k_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_k_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_k_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q4_k_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_k_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_k_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_k_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_q5_k_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q5_k_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q6_k_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q6_k_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q6_k_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q6_k_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q6_k_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_q6_k_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_iq4_nl_f32_f16acc_l, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_iq4_nl_f32_f16acc_m, main, 3, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
.ggml_vk_create_pipeline(Vulkan0, matmul_iq4_nl_f32_f16acc_s, main, 3, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_iq4_nl_f32_f16acc_aligned_l, main, 3, 56, (64,64,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_iq4_nl_f32_f16acc_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_iq4_nl_f32_f16acc_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_id_f32_f32_l, main, 4, 56, (128,128,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_id_f32_f32_m, main, 4, 56, (64,64,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_id_f32_f32_s, main, 4, 56, (32,32,1), specialization_constants, 1, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_id_f32_f32_aligned_l, main, 4, 56, (128,128,1), specialization_constants, 128, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_id_f32_f32_aligned_m, main, 4, 56, (64,64,1), specialization_constants, 64, 0, 0, 0)
ggml_vk_create_pipeline(Vulkan0, matmul_id_f32_f32_aligned_s, main, 4, 56, (32,32,1), specialization_constants, 32, 0, 0, 0)
Aborted
@jeffbolznv
Collaborator

It's hard to be sure with the multithreaded compiles, but it seems like it's having trouble with the matmul_id shaders. Can you try enabling the Vulkan Validation Layers? Also, can you try disabling mul_mat_id_l/m/s in ggml_vk_get_device?

@samkoesnadi
Author

It's hard to be sure with the multithreaded compiles, but it seems like it's having trouble with the matmul_id shaders. Can you try enabling the Vulkan Validation Layers? Also, can you try disabling mul_mat_id_l/m/s in ggml_vk_get_device?

So I set -DGGML_VULKAN_VALIDATE=ON, disabled mul_mat_id_l/m/s in ggml_vk_get_device in the ggml-vulkan.cpp file, and compiled without multi-threading.

It still aborts as before. Where it aborts (the last ggml_vk_create_pipeline) is different every time. Once in a while it also just hangs instead of aborting... Quirky...
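
For reference, that reconfigure and single-threaded rebuild would look roughly like this (same toolchain and debug flags as in the original compile command above):

cmake   -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake   -DANDROID_ABI=arm64-v8a   -DANDROID_PLATFORM=latest   -DGGML_VULKAN=ON   -DGGML_VULKAN_DEBUG=ON   -DGGML_VULKAN_VALIDATE=ON   -B build-android
cmake --build build-android --config Release -j1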

@0cc4m
Collaborator

0cc4m commented Jan 23, 2025

Qualcomm GPUs are known not to work with the Vulkan backend yet, there's a number of issues about it.

@samkoesnadi
Author

Qualcomm GPUs are known not to work with the Vulkan backend yet, there's a number of issues about it.

I see... I thought it had been fixed by now, as my impression was that some people have been able to find ways around it. Do we have some kind of ongoing discussion about it that I can join? I can contribute as well. Or are the GitHub Issues and Discussions here all we've got?

@samkoesnadi
Author

Qualcomm GPUs are known not to work with the Vulkan backend yet, there's a number of issues about it.

I wonder how it is that AI on mobile, such as utilizing the Qualcomm Adreno GPU, has not become a focus for more people. Running it on the edge seems to be the next big step in AI, no?

@max-krasnyansky
Collaborator

I believe we mentioned this before.
Please use the OpenCL backend with Adreno GPUs.

@0cc4m
Collaborator

0cc4m commented Jan 24, 2025

Qualcomm GPUs are known not to work with the Vulkan backend yet, there's a number of issues about it.

I see... I thought it had been fixed by now, as my impression was that some people have been able to find ways around it. Do we have some kind of ongoing discussion about it that I can join? I can contribute as well. Or are the GitHub Issues and Discussions here all we've got?

There's the OpenCL backend, but also @slp is looking into implementing/optimizing Vulkan for embedded GPUs. Github issues and discussions are all we got, basically.

@samkoesnadi
Author

I believe we mentioned this before. Please use the OpenCL backend with Adreno GPUs.

I did try your team's implementation before. It also had an issue: llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3769: GGML_ASSERT(ne10 == ne02) failed. This issue appears as soon as I activate GPU offload layers via -ngl.

I opted to focus on Vulkan, as it seems to have more future support for different kinds of GPUs, mainly Mali and Adreno...

@samkoesnadi
Author

Qualcomm GPUs are known not to work with the Vulkan backend yet, there's a number of issues about it.

I see... I thought it had been fixed by now, as my impression was that some people have been able to find ways around it. Do we have some kind of ongoing discussion about it that I can join? I can contribute as well. Or are the GitHub Issues and Discussions here all we've got?

There's the OpenCL backend, but also @slp is looking into implementing/optimizing Vulkan for embedded GPUs. Github issues and discussions are all we got, basically.

I can contact him to ask how far along the development is, and whether there is already a TODO list for the topic.

@slp
Collaborator

slp commented Jan 24, 2025

There's the OpenCL backend, but also @slp is looking into implementing/optimizing Vulkan for embedded GPUs. Github issues and discussions are all we got, basically.

FWIW, I'm getting something ready and I plan to open a PR for discussion next week. Hardest part is ensuring the code doesn't get too convoluted, but I think it can be done.

@samkoesnadi
Author

There's the OpenCL backend, but also @slp is looking into implementing/optimizing Vulkan for embedded GPUs. Github issues and discussions are all we got, basically.

FWIW, I'm getting something ready and I plan to open a PR for discussion next week. Hardest part is ensuring the code doesn't get too convoluted, but I think it can be done.

Is there a branch I can already take a look at? That sounds great, and let me know if there is something I can do apart from testing it on my device 😄

@max-krasnyansky
Collaborator

I believe we mentioned this before. Please use the OpenCL backend with Adreno GPUs.

I did try your team's implementation before. It also had an issue: llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3769: GGML_ASSERT(ne10 == ne02) failed. This issue appears as soon as I activate GPU offload layers via -ngl.

I opted to focus on Vulkan, as it seems to have more future support for different kinds of GPUs, mainly Mali and Adreno...

@samkoesnadi Which model gave you that error with OpenCL?
Please share more details on how you tried it (ideally the full llama-cli command, and llama-quantize if you quantized the model yourself).

@samkoesnadi
Author

I believe we mentioned this before. Please use the OpenCL backend with Adreno GPUs.

I did try your team's implementation before. It also had an issue: llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3769: GGML_ASSERT(ne10 == ne02) failed. This issue appears as soon as I activate GPU offload layers via -ngl.
I opted to focus on Vulkan, as it seems to have more future support for different kinds of GPUs, mainly Mali and Adreno...

@samkoesnadi Which model gave you that error with OpenCL? Please share more details on how you tried it (ideally the full llama-cli command, and llama-quantize if you quantized the model yourself).

I tried two models, with the following details:

LD_LIBRARY_PATH=lib ./bin/llama-simple -m '/data/local/tmp/Qwen2-VL-2B-Instruct-Q4_K_L.gguf' -p 'Describe this image.' --image "image.jpg" -ngl 10

This gave me the following error

llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp:3769: GGML_ASSERT(ne10 == ne02) failed
0: 0x7c2dcb81a4 
1: 0x7c2dcb810c ggml_abort
2: 0x7c2918e234 
3: 0x7c29181a2c _Z23ggml_cl_compute_forwardP12ggml_backendP11ggml_tensor
4: 0x7c2918f020 
5: 0x7c2dcce2dc ggml_backend_sched_graph_compute_async
6: 0x7c292f3864 
7: 0x7c292ce844 llama_decode
8: 0x59b11acb44 
9: 0x7c280c8bb0 __libc_init
Aborted 
I also tried the original Llama 3.2 4-bit quantized model (I forgot which HF repo I downloaded it from):
LD_LIBRARY_PATH=lib ./bin/llama-cli -m /data/local/tmp/llama.cpp/Llama-3.2-3B-Instruct-Q4_0.gguf -p How are you? -no-cnv -ngl 10

And this one only gave me "Illegal instruction".


After further investigation, I noticed this in the debug log:
ggml_opencl: Unsupported Adreno GPU: , using wave size 128, may not work as expected
and
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM)) - 0 MiB free

I wonder why it reports 0 MiB free, when there are still around 3 GiB of RAM available.


Here is the complete log:

ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: Unsupported Adreno GPU: , using wave size 128, may not work as expected
ggml_opencl: device OpenCL version: OpenCL 3.0 Adreno(TM) 710
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit #8f5499ec14 changeid #Ie6ef1a0a80 Date: 07/14/23 Fri Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.VENDOR.1.0.11.00.00.766.892 Compiler E031.38.11.14
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 1024
ggml_opencl: max mem alloc size: 912 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
build: 4520 (2139667e) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM)) - 0 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /data/local/tmp/llama.cpp/Llama-3.2-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_0:  193 tensors
llama_model_loader: - type q4_1:    3 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 1.78 GiB (4.77 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 128 'Ä'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1825.40 MiB
load_tensors:       OpenCL model buffer size =   540.24 MiB
load_tensors:  CPU_AARCH64 model buffer size =   931.50 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   288.00 MiB
llama_kv_cache_init:     OpenCL KV buffer size =   160.00 MiB
llama_init_from_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model:     OpenCL compute buffer size =   224.00 MiB
llama_init_from_model:        CPU compute buffer size =   256.50 MiB
llama_init_from_model: graph nodes  = 902
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Illegal instruction

@jeffbolznv
Collaborator

It might be better to have a separate issue to track the OpenCL backend failure.

For the Vulkan backend, please try #11406. It'll only compile the shaders that are actually needed and might dodge some problems.

@samkoesnadi
Author

It might be better to have a separate issue to track the OpenCL backend failure.

For the Vulkan backend, please try #11406. It'll only compile the shaders that are actually needed and might dodge some problems.

I will try that PR out, many thanks!

@samkoesnadi
Author

It might be better to have a separate issue to track the OpenCL backend failure.

For the Vulkan backend, please try #11406. It'll only compile the shaders that are actually needed and might dodge some problems.

This works as a fix for the LLM! There is only one small error, which happens when all of the offload layers are on the GPU:

Command prompt: LD_LIBRARY_PATH=lib ./bin/llama-cli -m /data/local/tmp/llama.cpp/Llama-3.2-3B-Instruct-Q4_0.gguf -p How are you? -no-cnv -ngl 29

print_info: LF token         = 128 'Ä'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors:      Vulkan0 model buffer size =  1825.40 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =   448.00 MiB
llama_init_from_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     0.49 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   256.50 MiB
llama_init_from_model: Vulkan_Host compute buffer size =    14.01 MiB
llama_init_from_model: graph nodes  = 902
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
Aborted

If I set -ngl to 28, then it works like a charm, with eval being faster than the CPU counterpart (2.25 vs 3.09 tokens/s). However, if -ngl is too low, like 1, then the speed is actually worse than CPU - around 1.2 tokens/s. I assume this is expected. Utilizing almost 100% of the GPU does have one downside: the phone becomes laggy.

@samkoesnadi
Author

The VLM itself is trickier to try at the moment, since the visual projector defaults to the CPU. Will look into it later.

@jeffbolznv
Collaborator

libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
Aborted

Hmm, I'm not able to reproduce this error on my system. I had a bug that caused this symptom in my first commit, but I had fixed it before asking you to test.

@jeffbolznv
Collaborator

#11406 is merged now. Are you still seeing the out_of_range exception at TOT? I tried for a while today and still haven't been able to reproduce that.

@samkoesnadi
Author

#11406 is merged now. Are you still seeing the out_of_range exception at TOT? I tried for a while today and still haven't been able to reproduce that.

I tried, and it is still there. Caches were already cleared and everything. I am trying to debug it now.

@jeffbolznv
Collaborator

Thanks. In an earlier version there was a pointer getting nulled out in load_shaders that was freeing a previously compiled pipeline. Maybe it's something like that, but I couldn't find anything by inspection.

@samkoesnadi
Author

Thanks. In an earlier version there was a pointer getting nulled out in load_shaders that was freeing a previously compiled pipeline. Maybe it's something like that, but I couldn't find anything by inspection.

I used lldb to debug it - log snippet below. Perhaps the first element (the key) of each pair in the device->pipeline_descriptor_set_requirements list has to match an entry in device->pipelines, which is currently not guaranteed in the code?

The fact that the issue happens on my system and not yours might also mean that the code behaves differently on my system, which does not support many shader features. This is as far as I can understand for now. But the fact that it works with partial layer offload and not with all layers offloaded is still something I cannot make sense of yet.

(lldb) frame select 8
frame #8: 0x0000007fef1e1ee4 libggml-vulkan.so`ggml_pipeline_allocate_descriptor_sets(device=nullptr) at ggml-vulkan.cpp:914:50
   911      std::lock_guard<std::mutex> guard(device->mutex);
   912 
   913      for (auto& pair : device->pipeline_descriptor_set_requirements) {
-> 914          vk_pipeline pipeline = device->pipelines.at(pair.first).lock();
   915          const uint64_t n = pair.second;
   916 
   917          VK_LOG_DEBUG("ggml_pipeline_allocate_descriptor_sets(" << pipeline->name << ", " << n << ")");
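
To illustrate that hypothesis: pipeline_descriptor_set_requirements appears to be keyed by pipeline name, and at() throws exactly this out_of_range when a requested name was never inserted into device->pipelines. A defensive rewrite of the lookup (only a sketch for illustration, not the actual fix; member names taken from the snippet above) would surface the missing key instead of aborting:

    for (auto& pair : device->pipeline_descriptor_set_requirements) {
        auto it = device->pipelines.find(pair.first);
        if (it == device->pipelines.end()) {
            // requested pipeline was never registered, e.g. because its compilation failed
            std::cerr << "ggml_vulkan: missing pipeline: " << pair.first << std::endl;
            continue;
        }
        vk_pipeline pipeline = it->second.lock();
        const uint64_t n = pair.second;
        // ... allocate descriptor sets as before ...
    }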

@jeffbolznv
Collaborator

Can you enable GGML_VULKAN_DEBUG and share the log?

@samkoesnadi
Author

Can you enable GGML_VULKAN_DEBUG and share the log?

Here you go... I attached the full log in a txt file. Below is a snippet of the last log lines before it aborts:

...
ggml_pipeline_allocate_descriptor_sets(mul_mat_vec_q4_0_f32_f32_1, 3)
ggml_pipeline_allocate_descriptor_sets(get_rows_f32_f32, 2)
ggml_pipeline_allocate_descriptor_sets(silu_f32, 28)
ggml_pipeline_allocate_descriptor_sets(cpy_f32_f32, 28)
ggml_pipeline_allocate_descriptor_sets(add_f32_norepeat, 56)
ggml_pipeline_allocate_descriptor_sets(matmul_f16_f32_f16acc_s, 28)
ggml_pipeline_allocate_descriptor_sets(soft_max_f32, 28)
ggml_pipeline_allocate_descriptor_sets(mul_f32_norepeat, 30)
ggml_pipeline_allocate_descriptor_sets(cpy_f16_f16, 56)
ggml_pipeline_allocate_descriptor_sets(rope_norm_f32, 56)
ggml_pipeline_allocate_descriptor_sets(contig_cpy_f32_f16, 28)
ggml_pipeline_allocate_descriptor_sets(mul_mat_vec_q4_1_f32_f32_2, 3)
ggml_pipeline_allocate_descriptor_sets(matmul_f16_s, 28)
ggml_pipeline_allocate_descriptor_sets(cpy_f32_f16, 56)
ggml_pipeline_allocate_descriptor_sets(mul_mat_vec_q4_0_f32_f32_2, 190)
libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
Aborted 

android_vulkan_log.txt

@jeffbolznv
Collaborator

Thanks. I see 18 unique pipelines being requested, all 18 call into ggml_vk_create_pipeline_func, and all but three of them made it to ggml_pipeline_allocate_descriptor_sets. Please also add a VK_LOG_DEBUG right before the crash to print out the name of the crashing pipeline. And please also add a log around this line to see if the pipeline is added to the map:

device->pipelines.insert({ pipeline->name, pipeline });

I still don't understand where things are going wrong. Maybe the compile fails and somehow doesn't end up in the map? Or maybe pipeline_descriptor_set_requirements is messed up somehow.
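
For reference, the two requested log statements could look roughly like this (a sketch only; VK_LOG_DEBUG is used the same way as in the existing ggml_pipeline_allocate_descriptor_sets logging shown above):

    // in ggml_pipeline_allocate_descriptor_sets, right before the crashing at():
    VK_LOG_DEBUG("requesting descriptor sets for pipeline " << pair.first);
    vk_pipeline pipeline = device->pipelines.at(pair.first).lock();

    // where the pipeline is registered, to confirm it actually reaches the map:
    VK_LOG_DEBUG("inserting pipeline " << pipeline->name);
    device->pipelines.insert({ pipeline->name, pipeline });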

@jeffbolznv
Collaborator

I have a pretty good guess as to what's happening. I think the pipeline creation fails, probably for the mul_mat_vec_q6_k_f32_f32_1 pipeline (the other two possibilities were mul_f32 and rms_norm_f32), vulkan.hpp turns the failure into an exception, std::future silently swallows the exception and the pipeline ends up not being in device->pipelines leading to the out of range exception.

The pipeline creation failure is likely a driver/compiler bug. I'll add some code to catch the exception and make the failure more obvious. If you want you could try experimenting with the shader to see if you can get it to successfully compile. It's a shame it's q6_k that's broken, I think that's pretty common as a final layer in many networks.
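
A minimal sketch of that idea (hypothetical code; the actual change landed as #11436, mentioned below, and the vulkan-hpp call site and member names here are assumptions) is to catch the vk::SystemError at the createComputePipeline call and report which pipeline failed, instead of letting std::future swallow the exception:

    try {
        // no pipeline cache; the create-info is assumed to be assembled earlier in the function
        pipeline->pipeline = device->device.createComputePipeline({}, compute_pipeline_create_info).value;
    } catch (const vk::SystemError& e) {
        std::cerr << "ggml_vulkan: Compute pipeline creation failed for " << pipeline->name << std::endl;
        std::cerr << "ggml_vulkan: " << e.what() << std::endl;
        throw;
    }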

@0cc4m
Collaborator

0cc4m commented Jan 26, 2025

Oh yeah, that's probably it. We've had issues with Qualcomm's shader compiler before. It crashed on (non-threadgroup-uniform) branches on loads/stores to/from global/shared memory, if I remember correctly. It started working after removing the branches, but probably came back after some optimization work on the mmv shaders.

@samkoesnadi
Author

The fact that it comes from Qualcomm's shader compiler just makes it a more difficult fix, I guess. I currently have some other todos, but will get to it afterwards...

@jeffbolznv
Collaborator

@samkoesnadi when you get a chance, please try #11436 and verify it prints a useful message like:

ggml_vulkan: Compute pipeline creation failed for mul_mat_vec_q6_k_f32_f32_1
ggml_vulkan: vk::Device::createComputePipeline: ErrorInitializationFailed

@a750sd

a750sd commented Jan 26, 2025

Termux

Developer options -> enable "Disable child process restrictions"

pkg install cmake
pkg install shaderc
pkg install vulkan-headers
pkg install vulkan-loader-android
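
Presumably the point of this list is that the backend can also be built natively on the device under Termux, instead of cross-compiling; a rough sketch of such a build (package names from the list above, cmake flag from earlier in the thread, compiler/git packages and repository URL assumed) might look like:

pkg install clang git cmake shaderc vulkan-headers vulkan-loader-android
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -DGGML_VULKAN=ON -B build
cmake --build build --config Release -j4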
