Qualcomm AI Engine Direct - GA Static Gemma3-1B (#14108)
Summary:
- e2e script for GA Static
[Gemma3-1B](https://huggingface.co/google/gemma-3-1b-it)
- perf: 16a4w block-quant token rate in KV mode: ~110 tokens/sec
(SM8750) with max_seq_len=1024
- acc: PPL on the wikitext dataset: fp 21.375 -> htp 23.086
- add model params config
- add End-to-End example in README
- add new architecture:
  - add a new class to support the global/local RoPE static llama
architecture required by Gemma3
  - enable global/local static llama architecture support in the runner
- refactoring:
  - refactor the attention mask to integrate more cleanly with the
global/local RoPE static llama model
  - refactor kv_inference and prefill_inference for better readability
- unit tests:
  - add a unit test for Gemma3-1B
  - improve readability of the memory-size constants in the unit test
- LLM model config visualization:
  - support tabular LLMmodelConfig visualization
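The global/local split described above can be sketched as follows. This is a minimal illustration of interleaved sliding-window (local) and full-causal (global) attention layers with per-layer RoPE bases; the layer pattern, window size, and RoPE base values here are illustrative assumptions, not values taken from this PR.

```python
# Sketch of a global/local attention layout in the style of Gemma3.
# The 6-layer pattern, window size, and RoPE bases below are assumed
# for illustration; they are not taken from this PR.
import numpy as np

LOCAL_WINDOW = 512               # sliding-window size for local layers (assumed)
LOCAL_ROPE_BASE = 10_000.0       # RoPE base for local layers (assumed)
GLOBAL_ROPE_BASE = 1_000_000.0   # RoPE base for global layers (assumed)


def is_global_layer(layer_idx, pattern=6):
    """Every `pattern`-th layer is global; the rest are local."""
    return (layer_idx + 1) % pattern == 0


def rope_base(layer_idx):
    """Select the RoPE base for a layer from its global/local role."""
    return GLOBAL_ROPE_BASE if is_global_layer(layer_idx) else LOCAL_ROPE_BASE


def causal_mask(seq_len, window=None):
    """Boolean mask: True where query position q may attend to key position k.

    window=None gives a full causal mask (global layer); an integer
    restricts attention to the last `window` positions (local layer).
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    mask = k <= q                     # causal constraint
    if window is not None:
        mask &= (q - k) < window      # sliding-window constraint
    return mask


# Example: with pattern=6, layer 5 (0-based) is global, layer 0 is local.
m_local = causal_mask(8, window=3)
m_global = causal_mask(8, window=None)
```

The point of the mask refactor is that local and global layers can share one mask builder, differing only in whether a window bound is applied.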
### Test plan
``` bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma3-1b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
```
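In hybrid mode the runner first prefills the prompt in fixed-size chunks of `--prefill_ar_len` tokens, then decodes one token at a time against the KV cache up to `--max_seq_len`. A minimal sketch of that control flow is below; the function names and the stub model are illustrative, not the runner's actual API.

```python
# Sketch of hybrid prefill + KV-mode decode control flow.
# hybrid_generate and step_fn are illustrative names, not the runner API.

def hybrid_generate(prompt_tokens, step_fn, prefill_ar_len=128, max_seq_len=1024):
    """Prefill the prompt in chunks, then decode one token per step.

    step_fn(tokens, start_pos) -> next_token stands in for a forward pass
    that also updates the KV cache for positions [start_pos, start_pos + len).
    """
    pos = 0
    next_tok = None

    # Prefill: feed the prompt in chunks of prefill_ar_len tokens.
    for i in range(0, len(prompt_tokens), prefill_ar_len):
        chunk = prompt_tokens[i:i + prefill_ar_len]
        next_tok = step_fn(chunk, pos)
        pos += len(chunk)

    # Decode (KV mode): one new token per step until the budget is spent.
    out = []
    while pos < max_seq_len:
        out.append(next_tok)
        next_tok = step_fn([next_tok], pos)
        pos += 1
    return out


# Toy step function: "predicts" the last input token + 1.
toy = lambda tokens, pos: tokens[-1] + 1
generated = hybrid_generate(list(range(10)), toy, prefill_ar_len=4, max_seq_len=14)
# generated == [10, 11, 12, 13]
```

Separating the chunked prefill loop from the single-token decode loop is what the kv_inference/prefill_inference refactor in this PR aims to make easier to read.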