llama-cpp-avx2 gRPC backend segfaults loading Qwen3-VL GGUF models #9279

@rmdevpro

Bug Description

LocalAI's llama-cpp-avx2 gRPC backend binary segfaults when loading Qwen3-VL GGUF models. The same GGUF file loads successfully with the official ghcr.io/ggml-org/llama.cpp:server-cuda image on the same hardware.

Environment

  • LocalAI version: v4.1.3 and master (tested both, same result)
  • Image tags: localai/localai:v4.1.3-gpu-nvidia-cuda-12 and localai/localai:master-gpu-nvidia-cuda-12
  • Hardware: Tesla V100-PCIE-32GB, Tesla P40, Intel Xeon E5-2680 v4
  • Driver: NVIDIA 535.288.01, CUDA 12.2

Model

Model YAML config

name: qwen3-vl-8b
backend: llama-cpp
parameters:
  model: Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf
context_size: 2048
gpu_layers: 0
f16: true

Note: crashes with both gpu_layers: 999 (GPU) and gpu_layers: 0 (CPU-only).

Steps to reproduce

  1. Place GGUF file in models directory
  2. Create YAML config above
  3. Set LOCALAI_DISABLE_GUESSING=true (the Go GGUF parser also panics on this file)
  4. Send a chat completion request
  5. Backend crashes with rpc error: code = Unavailable desc = error reading from server: EOF
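For step 4, a minimal OpenAI-style request is enough to trigger the load and crash. This is a sketch; the port assumes LocalAI's default 8080 and the prompt content is arbitrary:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-vl-8b", "messages": [{"role": "user", "content": "hi"}]}'

The crash happens during model loading, so the request body beyond the model name should not matter.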

Debug logs (with DEBUG=true)

The gRPC stderr shows a stack trace — the binary segfaults during model loading, crashing in __clone (thread creation):

GRPC stderr line="/backends/cuda12-llama-cpp/llama-cpp-avx2(+0x16440e7)"
...
GRPC stderr line="/backends/cuda12-llama-cpp/lib/libc.so.6(__clone+0x44)"
Failed to load model error=rpc error: code = Unavailable desc = error reading from server: EOF

Tested with both the bundled libc and the container's system libc — same crash.
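If it helps triage, the raw offset from the stderr line can be symbolized against the backend binary. This assumes binutils is available in the container and that the binary retains symbols; the container name is a placeholder:

# Hypothetical: resolve the +0x16440e7 offset from the GRPC stderr trace
docker exec <localai-container> addr2line -f -C \
  -e /backends/cuda12-llama-cpp/llama-cpp-avx2 0x16440e7

With a stripped release binary this may only print ??, in which case a debug build of the backend would be needed to get a useful frame.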

What works

The exact same GGUF file loads and runs correctly with the official llama.cpp server:

docker run --gpus all -v /models:/models ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf \
  --host 0.0.0.0 --port 9999 -ngl 999 -c 2048

This confirms the issue is in LocalAI's custom llama-cpp-avx2 gRPC binary, not the model or hardware.

Additional note

guessDefaultsFromFile also panics on Qwen3-VL GGUF files (separate from the gRPC crash). Setting LOCALAI_DISABLE_GUESSING=true avoids that panic but does not fix the backend crash.
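Since both the Go parser and the C++ backend choke on the same file while upstream llama.cpp loads it, file corruption seems unlikely, but the header is easy to sanity-check: a valid GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian uint32 version. The stand-in file below is created just for illustration; point the same commands at the real model file:

# Hypothetical stand-in with a GGUF v3 header (magic + version)
printf 'GGUF\003\000\000\000' > /tmp/fake.gguf
head -c 4 /tmp/fake.gguf              # prints: GGUF
od -An -tx1 -j4 -N4 /tmp/fake.gguf    # version bytes: 03 00 00 00 (v3, little-endian)

If the real model reports an unexpected magic or version, that would point at the file rather than the backend; here it presumably checks out, given the upstream server loads it.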
