Bug Description
LocalAI's llama-cpp-avx2 gRPC backend binary segfaults when loading Qwen3-VL GGUF models. The same GGUF file loads successfully with the official ghcr.io/ggml-org/llama.cpp:server-cuda image on the same hardware.
Environment
- LocalAI version: v4.1.3 and master (tested both, same result)
- Image tags: localai/localai:v4.1.3-gpu-nvidia-cuda-12 and localai/localai:master-gpu-nvidia-cuda-12
- Hardware: Tesla V100-PCIE-32GB, Tesla P40, Intel Xeon E5-2680 v4
- Driver: NVIDIA 535.288.01, CUDA 12.2
Model
Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf from mradermacher/Qwen3-VL-8B-Thinking-abliterated-GGUF
Model YAML config
name: qwen3-vl-8b
backend: llama-cpp
parameters:
  model: Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf
context_size: 2048
gpu_layers: 0
f16: true
Note: crashes with both gpu_layers: 999 (GPU) and gpu_layers: 0 (CPU-only).
Steps to reproduce
- Place GGUF file in models directory
- Create YAML config above
- Set LOCALAI_DISABLE_GUESSING=true (the Go GGUF parser also panics on this file)
- Send a chat completion request
- Backend crashes with rpc error: code = Unavailable desc = error reading from server: EOF
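For step 4, the chat completion request can be sent against LocalAI's OpenAI-compatible API. This is a minimal sketch; it assumes the default LocalAI port 8080 (adjust if you mapped a different one), and the model name matches the YAML config above:

```shell
# Minimal chat completion request that triggers the backend load and crash.
# Assumes LocalAI is listening on localhost:8080 (the default).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-vl-8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Any request that forces the model to load reproduces the crash; the prompt content is irrelevant.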
Debug logs (with DEBUG=true)
The gRPC stderr shows a stack trace: the binary segfaults during model loading, with the topmost frames passing through __clone (thread creation), i.e. the fault occurs on a newly spawned worker thread:
GRPC stderr line="/backends/cuda12-llama-cpp/llama-cpp-avx2(+0x16440e7)"
...
GRPC stderr line="/backends/cuda12-llama-cpp/lib/libc.so.6(__clone+0x44)"
Failed to load model error=rpc error: code = Unavailable desc = error reading from server: EOF
Tested with both the bundled libc and the container's system libc — same crash.
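The raw offset in the first stderr frame can be resolved to a symbol inside the container, which would narrow down where in the loader the fault occurs. A sketch using binutils' addr2line, with the binary path and offset taken from the log above (this only yields a useful answer if llama-cpp-avx2 was built with symbols; a stripped binary prints ??:0):

```shell
# Resolve the crash offset in the backend binary to a function/source location.
# -f prints the function name, -C demangles C++ symbols, -e selects the binary.
addr2line -f -C -e /backends/cuda12-llama-cpp/llama-cpp-avx2 0x16440e7
```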
What works
The exact same GGUF file loads and runs correctly with the official llama.cpp server:
docker run --gpus all -v /models:/models ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf \
--host 0.0.0.0 --port 9999 -ngl 999 -c 2048
This confirms the issue is in LocalAI's custom llama-cpp-avx2 gRPC binary, not the model or hardware.
Additional note
guessDefaultsFromFile also panics on Qwen3-VL GGUF files (separate from the gRPC crash). Setting LOCALAI_DISABLE_GUESSING=true avoids that panic but does not fix the backend crash.
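Since both the Go parser and the C++ backend choke on this file while upstream llama.cpp accepts it, it may be worth ruling out a corrupted download. A quick sanity check is the GGUF magic: a valid file starts with the 4-byte ASCII sequence "GGUF". A minimal sketch (the model path matches the config above):

```shell
# check_gguf_magic FILE: succeeds (exit 0) iff FILE starts with the 4-byte
# GGUF magic; a failing check means the file is corrupt or not a GGUF model.
check_gguf_magic() { [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; }

check_gguf_magic /models/Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf \
  && echo "GGUF magic OK" || echo "bad or missing GGUF magic"
```

In this case the file passes in upstream llama.cpp, so the magic is almost certainly intact, but the check is cheap and removes one variable.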