Qualcomm AI Engine Direct - Enable Lookahead Decoding #11437

shewu-quic · 2025-06-06T07:31:41Z

Summary:

- Add new eval_mode: lookahead
- Add three arguments: ngram, window, gcap
- Add lhd_token_generator
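For readers new to the technique, the verification half of lookahead decoding can be sketched roughly as below. This is a toy illustration and not the `lhd_token_generator` code added in this PR: `verify_candidates`, `toy_model`, and the `pool` layout are hypothetical names chosen for the example, and the model is called once per token here, whereas a real implementation checks all candidates in a single forward pass. It assumes greedy decoding (temperature 0), and it assumes `gcap` caps how many pooled n-gram candidates are verified per step.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a greedy language model:
# maps a token prefix to the single next token it would emit.
NextToken = Callable[[Sequence[int]], int]


def verify_candidates(
    prefix: List[int],
    pool: Dict[int, List[Tuple[int, ...]]],
    next_token: NextToken,
    gcap: int,
) -> List[int]:
    """Return the tokens accepted in one decoding step.

    pool maps a token to guessed continuations (each of length ngram - 1)
    collected earlier. At most gcap candidates keyed by the last prefix
    token are checked against what greedy decoding would produce.
    """
    best: List[int] = []
    for candidate in pool.get(prefix[-1], [])[:gcap]:
        accepted: List[int] = []
        for guess in candidate:
            # Stop at the first guess that greedy decoding would not emit.
            if next_token(prefix + accepted) != guess:
                break
            accepted.append(guess)
        if len(accepted) > len(best):
            best = accepted
    # The model's own next token is always valid, so every step
    # yields at least one token even when no candidate matches.
    return best + [next_token(prefix + best)]


def toy_model(tokens: Sequence[int]) -> int:
    """Toy 'model' that always predicts the previous token plus one, modulo 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    pool = {3: [(4, 5), (4, 9), (7, 8)]}
    print(verify_candidates([1, 2, 3], pool, toy_model, gcap=2))
```

Because every accepted token is checked against the greedy prediction, the output matches plain greedy decoding; the gain is only in how many tokens a single step can emit.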

Command

```
python3 examples/qualcomm/oss_scripts/llama/llama.py \
  -b build-android \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_model tokenizer.model \
  --prompt "Once" \
  --temperature 0 \
  --tokenizer_bin tokenizer.bin \
  --llama_model stories110m \
  --model_mode lookahead \
  --ptq 16a4w \
  -m SM8650 \
  -H ${host} \
  -s ${device} \
  -a ${artifacts} \
  --max_seq_len 4096 \
  --kv_updater smart_mask \
  --prefill_ar_len 64 \
  --ngram 3 \
  --window 2 \
  --gcap 2
```

Test Results

QNN SDK: 2.28
Device: SM8650
max_seq_len: 4096

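Performance Improvement under different AR-N and different W/G/N

Llama 3.2 3B

![image](https://github.com/user-attachments/assets/98365f9f-ccb9-49b0-a4ab-e51b9880efc3)

Llama 3.2 1B

![image](https://github.com/user-attachments/assets/c21aba0a-ab2d-4f30-9fbe-cce439bd5f7e)

Story Llama 110M

![image](https://github.com/user-attachments/assets/debdf888-5ece-400f-b3ae-892c06ef352a)

Performance Improvement under different prompts

![image](https://github.com/user-attachments/assets/cd072fa5-1eda-4390-9748-882baab442e0)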

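Reference

- https://lmsys.org/blog/2023-11-21-lookahead-decoding/
- https://github.com/hao-ai-lab/LookaheadDecoding/tree/main/lade
- https://github.com/ggml-org/llama.cpp/blob/master/examples/lookahead/lookahead.cpp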

cc: @haowhsu-quic

pytorch-bot · 2025-06-06T07:31:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11437

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 6bd4645 with merge base 56392aa:

NEW FAILURE - The following job has failed:

Lint / lintrunner / linux-job (gh)
RuntimeError: Command docker exec -t e368a39b51fa7fd6f04d143c70d9df48f9a096e6c9c989f1abce042e72715bc5 /exec failed with exit code 127

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-06-06T07:32:20Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

cccclai · 2025-06-12T15:54:28Z

Hey, can you add a README to explain how this specific lookahead decoding works? Also, does 0/0/0 mean no lookahead decoding?

cccclai

It looks great!


shewu-quic · 2025-06-16T02:58:27Z

> Hey, can you add a README to explain how this specific lookahead decoding works? Also, does 0/0/0 mean no lookahead decoding?

Sure, let me work on it.
Yes, 0/0/0 means AR-N decoding without lookahead.
I manually changed `ar_len` of the decoding model in hybrid mode:

executorch/examples/qualcomm/oss_scripts/llama/llama.py, line 519 in 56392aa:

ar_len=1,
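As a rough illustration of why the 0/0/0 baseline needs a hand-picked `ar_len`: lookahead decoding is commonly described as feeding about (W + G) * (N - 1) tokens per step, W * (N - 1) for the lookahead branch and G * (N - 1) for the verification branch. The helper below is a hypothetical sketch of that arithmetic only; whether `llama.py` derives `ar_len` from `ngram`, `window`, and `gcap` in exactly this way is an assumption, not something stated in this thread.

```python
def lookahead_tokens_per_step(ngram: int, window: int, gcap: int) -> int:
    """Approximate tokens fed to the model per lookahead decoding step.

    Assumes the (W + G) * (N - 1) count: W * (N - 1) tokens for the
    lookahead branch plus G * (N - 1) tokens for the verification branch.
    This is a hypothetical helper, not a function from llama.py.
    """
    if min(ngram, window, gcap) < 0:
        raise ValueError("ngram, window and gcap must be non-negative")
    # With ngram == 0 there are no guessed positions, so clamp at zero.
    return (window + gcap) * max(ngram - 1, 0)


if __name__ == "__main__":
    # Values from the command in the PR description: --ngram 3 --window 2 --gcap 2
    print(lookahead_tokens_per_step(ngram=3, window=2, gcap=2))
    # The 0/0/0 baseline yields no lookahead tokens at all.
    print(lookahead_tokens_per_step(ngram=0, window=0, gcap=0))
```

Under this assumption the 0/0/0 setting gives a per-step size of zero, which would explain why the AR-N baseline has to be set by hand rather than derived from the three arguments.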

facebook-github-bot · 2025-06-16T03:22:08Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


shewu-quic requested review from jathu, larryliu0820, kirklandsign and cccclai as code owners June 6, 2025 07:31

facebook-github-bot added the CLA Signed label Jun 6, 2025

cccclai approved these changes Jun 12, 2025


Qualcomm AI Engine Direct - Enable Lookahead Decoding

12e1ae5

summary:
- Add new eval_mode: lookahead
- Add three arguments: ngram, window, gcap
- Add lhd_token_generator

shewu-quic force-pushed the dev1/hutton/enable_lookahead_decoding branch from 1bc9dcb to 12e1ae5 on June 16, 2025 02:56

Add the lookahead into Readme

6bd4645

cccclai merged commit 16ffc96 into pytorch:main Jun 16, 2025
101 of 104 checks passed
