Bug Description
LocalAI's llama-cpp-avx2 gRPC backend binary segfaults when loading Qwen3-VL GGUF models. The same GGUF file loads successfully with the official ghcr.io/ggml-org/llama.cpp:server-cuda image on the same hardware.
Environment
- LocalAI version: v4.1.3 and master (tested both, same result)
- Image tags: localai/localai:v4.1.3-gpu-nvidia-cuda-12 and localai/localai:master-gpu-nvidia-cuda-12
- Hardware: Tesla V100-PCIE-32GB, Tesla P40, Intel Xeon E5-2680 v4
- Driver: NVIDIA 535.288.01, CUDA 12.2
Model
Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf from mradermacher/Qwen3-VL-8B-Thinking-abliterated-GGUF
Model YAML config
name: qwen3-vl-8b
backend: llama-cpp
parameters:
  model: Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf
context_size: 2048
gpu_layers: 0
f16: true
Note: crashes with both gpu_layers: 999 (GPU) and gpu_layers: 0 (CPU-only).
Steps to reproduce
- Place GGUF file in models directory
- Create YAML config above
- Set LOCALAI_DISABLE_GUESSING=true (the Go GGUF parser also panics on this file)
- Send a chat completion request
- Backend crashes with rpc error: code = Unavailable desc = error reading from server: EOF
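For step 4, the chat completion request can be sent against LocalAI's OpenAI-compatible API. This is a minimal sketch; it assumes the default LocalAI port 8080 (adjust if you mapped a different one), and the model name matches the YAML config above:

```shell
# Minimal chat completion request that triggers the backend load and crash.
# Assumes LocalAI is listening on localhost:8080 (the default).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-vl-8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Any request that forces the model to load reproduces the crash; the prompt content is irrelevant.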
Debug logs (with DEBUG=true)
The gRPC stderr shows a stack trace: the binary segfaults during model loading, with the topmost frames passing through __clone (thread creation), i.e. the fault occurs on a newly spawned worker thread:
GRPC stderr line="/backends/cuda12-llama-cpp/llama-cpp-avx2(+0x16440e7)"
...
GRPC stderr line="/backends/cuda12-llama-cpp/lib/libc.so.6(__clone+0x44)"
Failed to load model error=rpc error: code = Unavailable desc = error reading from server: EOF
Tested with both the bundled libc and the container's system libc — same crash.
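The raw offset in the first stderr frame can be resolved to a symbol inside the container, which would narrow down where in the loader the fault occurs. A sketch using binutils' addr2line, with the binary path and offset taken from the log above (this only yields a useful answer if llama-cpp-avx2 was built with symbols; a stripped binary prints ??:0):

```shell
# Resolve the crash offset in the backend binary to a function/source location.
# -f prints the function name, -C demangles C++ symbols, -e selects the binary.
addr2line -f -C -e /backends/cuda12-llama-cpp/llama-cpp-avx2 0x16440e7
```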
What works
The exact same GGUF file loads and runs correctly with the official llama.cpp server:
docker run --gpus all -v /models:/models ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf \
--host 0.0.0.0 --port 9999 -ngl 999 -c 2048
This confirms the issue is in LocalAI's custom llama-cpp-avx2 gRPC binary, not the model or hardware.
Additional note
guessDefaultsFromFile also panics on Qwen3-VL GGUF files (separate from the gRPC crash). Setting LOCALAI_DISABLE_GUESSING=true avoids that panic but does not fix the backend crash.
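Since both the Go parser and the C++ backend choke on this file while upstream llama.cpp accepts it, it may be worth ruling out a corrupted download. A quick sanity check is the GGUF magic: a valid file starts with the 4-byte ASCII sequence "GGUF". A minimal sketch (the model path matches the config above):

```shell
# check_gguf_magic FILE: succeeds (exit 0) iff FILE starts with the 4-byte
# GGUF magic; a failing check means the file is corrupt or not a GGUF model.
check_gguf_magic() { [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; }

check_gguf_magic /models/Qwen3-VL-8B-Thinking-abliterated.Q8_0.gguf \
  && echo "GGUF magic OK" || echo "bad or missing GGUF magic"
```

In this case the file passes in upstream llama.cpp, so the magic is almost certainly intact, but the check is cheap and removes one variable.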