Summary
Speculative decoding does not expose the interface I expected.
My understanding is that a draft model should be specified, and that the draft model should itself be another LLM (typically a smaller one with the same or similar tokenization). This understanding comes from ggml-org/llama.cpp#2926, which is referenced in some llama-cpp-python issues, specifically #675. I've also come across #1120.
There does not seem to be a direct way to pass another LLM, yet Llama() has a "draft_model" argument which instead expects a LlamaPromptLookupDecoding object. This is where my confusion arises.
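For context, my reading of llama_cpp/llama_speculative.py (which may well be off) is that draft_model just needs to be an object with a __call__ that takes the tokens generated so far and returns an array of proposed draft tokens, roughly along these lines:

import abc
from typing import Any

import numpy as np
import numpy.typing as npt

class LlamaDraftModel(abc.ABC):
    # The contract draft_model appears to satisfy: map the current token ids
    # to an array of proposed draft tokens for the main model to verify.
    @abc.abstractmethod
    def __call__(
        self, input_ids: npt.NDArray[np.intc], /, **kwargs: Any
    ) -> npt.NDArray[np.intc]:
        raise NotImplementedError()

LlamaPromptLookupDecoding fits that contract because it proposes tokens from n-gram lookups over the prompt, not from a second model.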
Expected Behavior
Adjusting the current example on the main page, I would have expected speculative decoding to work something like the following:
from llama_cpp import Llama

llama_draft = Llama(
    model_path="path/to/small_draft_model.gguf"
)
llama = Llama(
    model_path="path/to/big_primary_model.gguf",
    draft_model=llama_draft
)
Current Behavior
This is the currently suggested usage:
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict; 10 is the default and
    # generally good for GPU, while 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)
Here, the draft_model is not another LLM.
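If my reading of the interface above is right, I imagine a workaround would be to wrap a second, smaller Llama instance in it myself, something like the untested sketch below (LlamaModelDraft is my own name, and I'm guessing that greedily sampling from Llama.generate() is an acceptable way to produce draft tokens), though I'd expect a built-in route to be more efficient:

import numpy as np
import numpy.typing as npt

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel

class LlamaModelDraft(LlamaDraftModel):
    # Untested sketch: use a second, smaller Llama as the draft model by
    # greedily sampling num_pred_tokens tokens from it at each step.
    def __init__(self, draft: Llama, num_pred_tokens: int = 10):
        self.draft = draft
        self.num_pred_tokens = num_pred_tokens

    def __call__(
        self, input_ids: npt.NDArray[np.intc], /, **kwargs
    ) -> npt.NDArray[np.intc]:
        draft_tokens = []
        # Llama.generate() yields one sampled token at a time for a token prompt
        for token in self.draft.generate(input_ids.tolist(), temp=0.0):
            draft_tokens.append(token)
            if len(draft_tokens) >= self.num_pred_tokens:
                break
        return np.array(draft_tokens, dtype=np.intc)

llama = Llama(
    model_path="path/to/big_primary_model.gguf",
    draft_model=LlamaModelDraft(Llama(model_path="path/to/small_draft_model.gguf")),
)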
Just asking for some clarification, and whether it is currently possible, or on the roadmap, to implement the expected behaviour.
cheers!