
Confusion regarding operation/terminology of speculative decoding and sampling #1865

@MushroomHunting

Description


Summary

The speculative decoding API does not work the way I expected.

My understanding is that a draft model should be specified, and that the draft model should itself be another LLM (typically a smaller one with the same or similar tokenizer). This understanding is derived from ggml-org/llama.cpp#2926, which is referenced in some llama-cpp-python issues, specifically #675. I've also come across #1120.

There does not seem to be a direct interface for specifying another LLM; instead, the "draft_model" argument of Llama() takes a LlamaPromptLookupDecoding object. This is where my confusion arises.
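For context, as far as I can tell the draft_model argument does not accept a model path or a second Llama; it expects an object implementing the LlamaDraftModel interface from llama_cpp.llama_speculative, which operates purely on token IDs. A rough sketch of that interface is below (the exact signature may differ between versions):

import abc
from typing import Any

import numpy as np
import numpy.typing as npt

class LlamaDraftModel(abc.ABC):
    @abc.abstractmethod
    def __call__(
        self, input_ids: npt.NDArray[np.intc], /, **kwargs: Any
    ) -> npt.NDArray[np.intc]:
        # Given the tokens generated so far, return an array of draft token IDs
        # for the primary model to verify.
        raise NotImplementedError()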

Expected Behavior

Adapting the current example on the main page, I would have expected speculative decoding to work something like the following:

from llama_cpp import Llama

llama_draft = Llama(
    model_path="path/to/small_draft_model.gguf"
)

llama = Llama(
    model_path="path/to/big_primary_model.gguf",
    draft_model=llama_draft
)

Current Behavior

This is the currently suggested usage:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)

Here, the draft_model is not another LLM.
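In principle it looks like a second, smaller Llama could be adapted to that interface by wrapping it in a class whose __call__ runs the small model and returns its next tokens as the draft. A purely hypothetical, untested sketch follows (the LlamaSmallModelDraft name is made up, and it assumes Llama.generate() yields token IDs one at a time):

import numpy as np

from llama_cpp import Llama

class LlamaSmallModelDraft:  # hypothetical; would implement the LlamaDraftModel interface above
    def __init__(self, draft_llm: Llama, num_pred_tokens: int = 10):
        self.draft_llm = draft_llm
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids, /, **kwargs):
        # Feed the tokens generated so far to the small model and take its
        # next num_pred_tokens tokens as the draft for the big model to verify.
        gen = self.draft_llm.generate(input_ids.tolist())
        draft = [next(gen) for _ in range(self.num_pred_tokens)]
        return np.array(draft, dtype=np.intc)

Both models would of course need compatible tokenizers for the draft tokens to be meaningful to the primary model.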

I'm just asking for some clarification, and whether it's currently possible, or on the roadmap, to implement the expected behaviour.

cheers!
