Qualcomm AI Engine Direct - GA Static Gemma3-1B #14108
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14108
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job, 5 Unrelated Failures as of commit 9c80e2f with merge base 8496f27:
- NEW FAILURES - The following jobs have failed:
- CANCELLED JOB - The following job was cancelled. Please retry:
- FLAKY - The following job failed but was likely due to flakiness present on trunk:
- BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai, both accuracy and performance in Hybrid/KV mode are promising. cc: @haowhsu-quic
Are the changes in examples/models/gemma3 and examples/models/llama relevant?

Yes, because we reuse the config from the etllm for Qualcomm LLM models as well.
There are some merge conflicts, can you resolve them?
Summary:
- e2e script for GA Static Gemma3-1B
- perf: 16a4w block quant token rate in KV mode ≈ 110 tokens/sec (SM8750)
- acc: PPL on the wikitext dataset ≈ 21.375 (fp) -> 23.086 (htp)
- add model params config
- add end-to-end example in README
- new architecture:
  - add new class to support the global/local RoPE static llama architecture required by Gemma3 (see the sketch below)
  - enable global/local static llama architecture support in the runner
- refactoring:
  - refactor attention mask to improve integration with the global/local RoPE static llama model
  - refactor kv_inference and prefill_inference for better readability
- unit tests:
  - add unit test for Gemma3-1B
  - improve readability of memory size constants in unit tests
- LLM model config visualization:
  - support tabular LLMmodelConfig visualization
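As context for the global/local items above, here is a minimal, illustrative sketch (not the PR's actual implementation) of how a global causal mask differs from a local sliding-window mask in a Gemma3-style model, where some layers attend over the full context and others only over a fixed recent window. The function and parameter names are assumptions for illustration:

```python
import torch

def causal_mask(seq_len: int, window=None) -> torch.Tensor:
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    window=None -> global causal mask (standard lower-triangular).
    window=W    -> local causal mask limited to the most recent W positions.
    """
    pos = torch.arange(seq_len)
    # Causal constraint: a query may only see keys at or before its position.
    allowed = pos[None, :] <= pos[:, None]
    if window is not None:
        # Local constraint: additionally restrict keys to a sliding window.
        allowed &= (pos[:, None] - pos[None, :]) < window
    return allowed

print(causal_mask(8).int())            # global: full causal attention
print(causal_mask(8, window=4).int())  # local: sliding-window attention
```

Since Gemma3 interleaves the two kinds of layers, the static llama architecture and the runner both need to carry two mask variants, which is presumably what the attention-mask refactor above accommodates.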
Force-pushed from 2333ffc to 9c80e2f.
Looks good. Trunk errors look like flakes
There's a bug in this code that was uncovered in our internal testing:

`atten_mask` is sometimes a Tensor and sometimes an AttentionMask object. cc @cccclai

Can you fix this ASAP? Otherwise I'll revert this PR.
May I know the details of the internal test scenario? I tested the latest mainline without seeing this issue:

```python
tokens, atten_mask, pos_ids, k_caches, v_caches = model.get_example_inputs()
logits, new_k_caches, new_v_caches = module(
    tokens,
    *atten_mask,
    pos_ids,
    *k_caches,
    *v_caches,
)
```
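The type mismatch likely bites at the `*atten_mask` unpack above: star-unpacking an AttentionMask container and star-unpacking a plain Tensor (which iterates over its first dimension) expand to different arguments. A minimal defensive sketch, assuming a hypothetical `.masks` attribute on the AttentionMask object (not the runner's actual fix):

```python
from typing import Tuple

import torch

def as_mask_tensors(atten_mask) -> Tuple[torch.Tensor, ...]:
    """Return the mask(s) as a tuple of tensors regardless of input type."""
    if isinstance(atten_mask, torch.Tensor):
        # A bare tensor: wrap it so `*as_mask_tensors(m)` expands to one arg.
        return (atten_mask,)
    # Assumption: the AttentionMask-like container exposes its tensors via a
    # hypothetical `.masks` attribute; the real attribute name may differ.
    return tuple(atten_mask.masks)

# Usage with the snippet above:
#   logits, new_k_caches, new_v_caches = module(
#       tokens, *as_mask_tensors(atten_mask), pos_ids, *k_caches, *v_caches)
```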
I'll try to fix it; it looks like just an internal reference needs to be updated.