Conversation

@ggerganov
Member

fix #16980

The current implementation of speculative decoding in llama-server requires a separate draft llama_context for each slot. Combined with the new defaults from #16736, this results in extra draft contexts being allocated, increasing memory usage.

This PR updates the logic to not increase the default number of server slots when a draft model is specified.
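The change described above can be sketched as a small helper. This is an illustrative reconstruction, not llama.cpp's actual code; the function name `default_n_slots` and its parameters are hypothetical.

```cpp
#include <cassert>

// Hypothetical sketch of the slot-default logic this PR describes:
// each slot requires its own draft llama_context, so when a draft
// model is configured we keep the default at a single slot to avoid
// allocating extra draft contexts. Without a draft model, the new
// multi-slot default (e.g. 4, per the defaults changed in #16736)
// is used. Names here are illustrative, not the real API.
int default_n_slots(bool has_draft_model, int n_parallel_default) {
    if (has_draft_model) {
        return 1; // one slot -> one draft context -> no extra memory
    }
    return n_parallel_default; // fall back to the multi-slot default
}
```

A user-supplied slot count would still override this default; only the implicit default changes when a draft model is present.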

@pockers21
Contributor

Multiple PRs are hitting WebGPU-related CI failures; perhaps there is an issue in the master branch code.

@ggerganov ggerganov merged commit 13b339b into master Nov 5, 2025
64 of 71 checks passed
@ggerganov ggerganov deleted the server/fix-draft-slots branch November 5, 2025 12:33
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...
Successfully merging this pull request may close these issues.

Eval bug: Dense model with draft model cause crash

3 participants