
Confusion regarding operation/terminology of speculative decoding and sampling #1865

@MushroomHunting

Description


Summary

The speculative decoding API does not work the way I expected.

My understanding is that a draft model should be specified, and that the draft model should itself be another LLM (typically a smaller one with the same or similar tokenizer). This understanding is derived from ggml-org/llama.cpp#2926, which is referenced in some llama-cpp-python issues, specifically #675. I've also come across #1120.

There does not seem to be a direct interface for specifying another LLM; instead, the "draft_model" argument of Llama() takes a LlamaPromptLookupDecoding object. This is where my confusion arises.
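For context, as far as I can tell the draft_model argument does not accept a model path or a second Llama; it expects an object implementing the LlamaDraftModel interface from llama_cpp.llama_speculative, which operates purely on token IDs. A rough sketch of that interface is below (the exact signature may differ between versions):

import abc
from typing import Any

import numpy as np
import numpy.typing as npt

class LlamaDraftModel(abc.ABC):
    @abc.abstractmethod
    def __call__(
        self, input_ids: npt.NDArray[np.intc], /, **kwargs: Any
    ) -> npt.NDArray[np.intc]:
        # Given the tokens generated so far, return an array of draft token IDs
        # for the primary model to verify.
        raise NotImplementedError()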

Expected Behavior

Adapting the current example on the main page, I would have expected speculative decoding to work something like the following:

from llama_cpp import Llama

llama_draft = Llama(
    model_path="path/to/small_draft_model.gguf"
)

llama = Llama(
    model_path="path/to/big_primary_model.gguf",
    draft_model=llama_draft
)

Current Behavior

This is the currently suggested usage:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)

Here, the draft_model is not another LLM.
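In principle it looks like a second, smaller Llama could be adapted to that interface by wrapping it in a class whose __call__ runs the small model and returns its next tokens as the draft. A purely hypothetical, untested sketch follows (the LlamaSmallModelDraft name is made up, and it assumes Llama.generate() yields token IDs one at a time):

import numpy as np

from llama_cpp import Llama

class LlamaSmallModelDraft:  # hypothetical; would implement the LlamaDraftModel interface above
    def __init__(self, draft_llm: Llama, num_pred_tokens: int = 10):
        self.draft_llm = draft_llm
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids, /, **kwargs):
        # Feed the tokens generated so far to the small model and take its
        # next num_pred_tokens tokens as the draft for the big model to verify.
        gen = self.draft_llm.generate(input_ids.tolist())
        draft = [next(gen) for _ in range(self.num_pred_tokens)]
        return np.array(draft, dtype=np.intc)

Both models would of course need compatible tokenizers for the draft tokens to be meaningful to the primary model.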

I'm just asking for some clarification, and whether it's currently possible, or on the roadmap, to implement the expected behaviour.

cheers!
