Can't infer with an "exclude_lm_head" model #1166
Comments
An alternative approach is to have both the last hidden states and the logits as outputs in the ONNX model. You can achieve that by passing include_hidden_states=1 in --extra_options when running the model builder.
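For reference, a quick way to check whether an exported model exposes both outputs is to list them with plain onnxruntime. This is a minimal sketch, not taken from the thread; the model path and the exact output names are assumptions.

import onnxruntime as ort

# Path is an assumption based on the -o folder used later in this thread
sess = ort.InferenceSession("row_llama3.2-3b-onnx-int4/model.onnx")

# With include_hidden_states=1 the output list should contain both a logits
# output and a hidden-states output (exact names may differ per builder version)
print("inputs :", [i.name for i in sess.get_inputs()])
print("outputs:", [o.name for o in sess.get_outputs()])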
Thank you for your reply. But there is no include_hidden_states option in the extra_options of the model builder that ships with the released package.
You can run the model builder from source to access that option. Here's how you can do that using your provided command:

# Clone the repo
$ git clone https://github.com/microsoft/onnxruntime-genai

# Navigate to the model builder
$ cd onnxruntime-genai/src/python/py/models/

# Run your command with `include_hidden_states`
$ python3 builder.py -m llama3.2-3b -o row_llama3.2-3b-onnx-int4 -p int4 -e cpu --extra_options int4_block_size=128 int4_accuracy_level=4 int4_op_types_to_quantize=MatMul/Gather include_hidden_states=1

Alternatively, you can wait for ONNX Runtime GenAI v0.6.0 to be released, since it is scheduled to come out this month.
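Since include_hidden_states=1 keeps the logits output (unlike exclude_lm_head, which removes it), a normal generation loop should still run against the rebuilt folder. The following is a sketch based on the v0.5.x Python examples, not output from this thread; the model folder and prompt are placeholders.

import onnxruntime_genai as og

model = og.Model("row_llama3.2-3b-onnx-int4")   # folder produced by builder.py above
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=64)
params.input_ids = tokenizer.encode("Hello, my name is")   # placeholder prompt

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()        # on an exclude_lm_head-only model there are no logits to read here
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)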
Describe the bug
Can't run inference with a model built with the "exclude_lm_head" option.
To Reproduce
Then you will see the bug:
onnxruntime-genai version: 0.5.2
OS: linux