Conversation

@ggerganov
Member

fix #16980

The current implementation of speculative decoding in llama-server requires a separate draft llama_context for each slot. Combined with the new defaults from #16736, this results in extra draft contexts being allocated, increasing memory usage.

This PR updates the logic to not increase the default number of server slots when a draft model is specified.
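The change described above can be sketched as a small helper. This is an illustrative reconstruction, not llama.cpp's actual code; the function name `default_n_slots` and its parameters are hypothetical.

```cpp
#include <cassert>

// Hypothetical sketch of the slot-default logic this PR describes:
// each slot requires its own draft llama_context, so when a draft
// model is configured we keep the default at a single slot to avoid
// allocating extra draft contexts. Without a draft model, the new
// multi-slot default (e.g. 4, per the defaults changed in #16736)
// is used. Names here are illustrative, not the real API.
int default_n_slots(bool has_draft_model, int n_parallel_default) {
    if (has_draft_model) {
        return 1; // one slot -> one draft context -> no extra memory
    }
    return n_parallel_default; // fall back to the multi-slot default
}
```

A user-supplied slot count would still override this default; only the implicit default changes when a draft model is present.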

@pockers21
Contributor

Multiple PRs are hitting WebGPU-related CI failures; perhaps there is an issue in the master branch code.

@ggerganov ggerganov merged commit 13b339b into master Nov 5, 2025
64 of 71 checks passed
@ggerganov ggerganov deleted the server/fix-draft-slots branch November 5, 2025 12:33
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...
Successfully merging this pull request may close these issues.

Eval bug: Dense model with draft model cause crash

3 participants