[Docs]: Add Release Documentation for Version 1.20.0 #501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account


**Open**: wants to merge 36 commits into base `main`.

**Commits (36):**
- `6983a7a`: Main Readme updating for latest news (abukhoy, May 21, 2025)
- `84c7a43`: Main Readme updating for latest news (abukhoy, May 21, 2025)
- `b172f89`: Merge branch 'main' into docs-update (abukhoy, May 30, 2025)
- `bd1000c`: docs modified (abukhoy, May 30, 2025)
- `195740e`: Merge branch 'main' into docs-update (abukhoy, Jun 9, 2025)
- `de93706`: Merge branch 'main' into docs-update (abukhoy, Jun 19, 2025)
- `dc7ae55`: Readme update and validate (abukhoy, Jun 19, 2025)
- `aa5878b`: Merge branch 'main' into docs-update (abukhoy, Jun 23, 2025)
- `4cbc841`: Merge branch 'main' into docs-update (abukhoy, Jun 24, 2025)
- `50302ab`: supported features updated (abukhoy, Jun 24, 2025)
- `627f7a2`: Merge branch 'main' into docs-update (abukhoy, Jun 25, 2025)
- `c2280ba`: Merge branch 'main' into docs-update (abukhoy, Jun 27, 2025)
- `0ca718e`: CB, single and dual qpc column added in validation doc (abukhoy, Jun 27, 2025)
- `2353a76`: CB, single and dual qpc column added in validation doc (abukhoy, Jun 27, 2025)
- `2c42d36`: source/introduction modified (abukhoy, Jun 30, 2025)
- `8b3c362`: source/validate modified (abukhoy, Jun 30, 2025)
- `dfda020`: Merge branch 'main' into docs-update (abukhoy, Jul 2, 2025)
- `3e3656e`: Comments are addressed (abukhoy, Jul 2, 2025)
- `d86b836`: Comments are addressed (abukhoy, Jul 2, 2025)
- `56f56a9`: comments are adressed (abukhoy, Jul 2, 2025)
- `8352e14`: Merge branch 'quic:main' into docs-update (abukhoy, Jul 8, 2025)
- `b88d970`: release docs added and granite MOE removed from validate list (abukhoy, Jul 8, 2025)
- `7e46180`: release dcos modified (abukhoy, Jul 8, 2025)
- `50db4bc`: release docs added for 1.20 (abukhoy, Jul 8, 2025)
- `d16eeb3`: Merge branch 'main' into docs-update (abukhoy, Jul 10, 2025)
- `640a61a`: comments are adrressed (abukhoy, Jul 10, 2025)
- `03ccbb8`: Merge branch 'main' into docs-update (abukhoy, Jul 10, 2025)
- `cb566e8`: granite vision removed from docs (abukhoy, Jul 11, 2025)
- `271e623`: granite vision removed from docs (abukhoy, Jul 11, 2025)
- `effac64`: Comments Addressed (abukhoy, Jul 14, 2025)
- `aa77cc8`: Merge branch 'main' into docs-update (abukhoy, Jul 14, 2025)
- `01a07fa`: Comments Addressed (abukhoy, Jul 14, 2025)
- `cba26d3`: Comments Addressed (abukhoy, Jul 14, 2025)
- `2467cde`: Comments Addressed (abukhoy, Jul 14, 2025)
- `9cd323c`: Merge branch 'main' into docs-update (abukhoy, Jul 14, 2025)
- `fa848c8`: Merge branch 'main' into docs-update (abukhoy, Jul 14, 2025)
**README.md** (2 changes: 0 additions & 2 deletions)

@@ -15,8 +15,6 @@
<details>
<summary>More</summary>

- [04/2025] Added support for [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800)
- [04/2025] Added support for [Granite MOE models](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base)
- [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model
- [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
- [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365).
**docs/index.md** (9 changes: 7 additions & 2 deletions)

@@ -5,12 +5,17 @@

Welcome to Efficient-Transformers Documentation!
========================================


<!-- ```{include} ../README.md
:relative-images:
``` -->

```{toctree}
:caption: 'Release Documents'
:maxdepth: 2

source/release_docs
```


```{toctree}
:caption: 'Getting Started'
```
**docs/source/introduction.md** (2 changes: 0 additions & 2 deletions)

@@ -31,8 +31,6 @@ For other models, there is comprehensive documentation to inspire upon the chang
<details>
<summary>More</summary>

- [04/2025] Added support for [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800)
- [04/2025] Added support for [Granite MOE models](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base)
- [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model
- [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
- [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365).
**docs/source/release_docs.md** (62 changes: 62 additions & 0 deletions)

@@ -0,0 +1,62 @@
# 🚀 Efficient Transformer Library - Release 1.20.0 (Beta)

Welcome to the official release of **Efficient Transformer Library v1.20.0**! This release brings a host of new model integrations, performance enhancements, and fine-tuning capabilities to accelerate your AI development.

> ✅ All features and models listed below are available on the `release/1.20.0` branch and `mainline`.

---

## 🧠 Newly Supported Models

- **Llama-4-Scout-17B-16E-Instruct**
- Text & Image+Text support
- Chunk attention, Single/Dual QPC support
  - Multi-image prompts enabled via the vLLM interface
- [Llama4 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/llama4_example.py)

- **Grok-1**
- Executable via `QEffAutoModelForCausalLM`

- **Gemma3**
- Text & Image+Text support
- Sliding window support
- [Gemma3 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/gemma3_example/gemma3_mm.py)


- **SwiftKV (Llama-3.1-SwiftKV-8B-Instruct)**
- Supports both continuous and non-continuous batching
- Executable via `QEffAutoModelForCausalLM`

- **GGUF Models**
- Execution support (non-quantized)
- [Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/basic_gguf_models.py)

- **FP8 Compressed Quantization**
- Support for `Llama-3.3-70B-Instruct-FP8-Dynamic`

---

## ✨ Key Features & Enhancements

- **Transformers Upgrade**: now using `transformers` version `4.51.3`
- **SpD & Multi-Projection Heads**: Token speculation via post-attention projections
- **I/O Encryption**: `--io-encrypt` flag support in compile/infer APIs
- **Separate Prefill/Decode Compilation**: For disaggregated serving
- **On-Device Sampling**: supported via vLLM; reduces host-device latency for CausalLM models
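
The "SpD" entry above refers to draft-based speculative decoding. As a minimal toy sketch of the general idea only (not this library's implementation; `draft_next` and `target_next` are hypothetical stand-ins for real models), a cheap draft model proposes a few tokens ahead and the target model verifies them, keeping the longest agreeing prefix:

```python
# Toy illustration of draft-based speculative decoding.
# Both "models" here are deterministic stand-in functions, not real networks.

def draft_next(token):
    # Hypothetical cheap draft model: proposes the next token.
    return (token * 2 + 1) % 7

def target_next(token):
    # Hypothetical target model: agrees with the draft except on
    # tokens divisible by 3, where it diverges.
    return (token * 2 + 1) % 7 if token % 3 else (token + 1) % 7

def speculative_step(last_token, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    proposed, t = [], last_token
    for _ in range(k):
        t = draft_next(t)
        proposed.append(t)
    # 2) Verify with the target model: accept while it agrees,
    #    and substitute the target's token at the first mismatch.
    accepted, t = [], last_token
    for p in proposed:
        expected = target_next(t)
        if p != expected:
            accepted.append(expected)
            break
        accepted.append(p)
        t = p
    return accepted
```

Each step therefore yields at least one token (the target's own choice) and up to `k` tokens when the draft agrees, which is where the speedup comes from.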

---

## 🔍 Embedding Model Upgrades

- **Flexible Pooling**: Choose from standard or custom strategies
- **Sentence Embedding**: Now runs directly on AI100
- **Multi-Seq Length Compilation**: Auto-selects optimal graph at runtime
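
To illustrate the multi-seq-length idea, here is a minimal sketch of one plausible dispatch rule, assuming the runtime picks the smallest compiled specialization that fits the prompt (this is an illustration of the concept, not the library's actual dispatch code):

```python
# Pick the smallest compiled sequence length that can hold the prompt.

def select_graph(prompt_len, compiled_seq_lens):
    for sl in sorted(compiled_seq_lens):
        if prompt_len <= sl:
            return sl
    raise ValueError(
        f"prompt of length {prompt_len} exceeds all compiled graphs"
    )
```

For example, with graphs compiled at 128, 256, and 512, a 100-token prompt would run on the 128-length graph, avoiding the padding waste of always using the largest specialization.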

---

## 🛠️ Fine-Tuning Support

- BERT fine-tuning support with templates and documentation
- Gradient checkpointing, device-aware `GradScaler`, and CLI `--help` added

---
**docs/source/validate.md** (25 changes: 22 additions & 3 deletions)

@@ -8,7 +8,7 @@

| Architecture | Model Family | Representative Models | CB Support |
|-------------------------|--------------------|--------------------------------------------------------------------------------------|------------|
| **FalconForCausalLM** | Falcon | [tiiuae/falcon-40b]((https://huggingface.co/tiiuae/falcon-40b)) | ✔️ |
| **FalconForCausalLM** | Falcon | [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b) | ✔️ |
| **GemmaForCausalLM** | CodeGemma | [google/codegemma-2b](https://huggingface.co/google/codegemma-2b)<br>[google/codegemma-7b](https://huggingface.co/google/codegemma-7b) | ✔️ |
| | Gemma | [google/gemma-2b](https://huggingface.co/google/gemma-2b)<br>[google/gemma-7b](https://huggingface.co/google/gemma-7b)<br>[google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)<br>[google/gemma-2-9b](https://huggingface.co/google/gemma-2-9b)<br>[google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b) | ✔️ |
| **GPTBigCodeForCausalLM** | Starcoder1.5 | [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) | ✔️ |
@@ -17,8 +17,6 @@
| **GPT2LMHeadModel** | GPT-2 | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | ✔️ |
| **GraniteForCausalLM** | Granite 3.1 | [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)<br>[ibm-granite/granite-guardian-3.1-8b](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b) | ✔️ |
| | Granite 20B | [ibm-granite/granite-20b-code-base-8k](https://huggingface.co/ibm-granite/granite-20b-code-base-8k)<br>[ibm-granite/granite-20b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) | ✔️ |
| **GraniteMoeForCausalLM** | Granite 3.0 | [ibm-granite/granite-3.0-1b-a400m-base](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base) | ✔️ |
| | Granite 3.1 | [ibm-granite/granite-3.1-1b-a400m-base](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base) | ✔️ |
| **InternVLChatModel** | Intern-VL | [OpenGVLab/InternVL2_5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B) | |
| **LlamaForCausalLM** | CodeLlama | [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)<br>[codellama/CodeLlama-13b-hf](https://huggingface.co/codellama/CodeLlama-13b-hf)<br>[codellama/CodeLlama-34b-hf](https://huggingface.co/codellama/CodeLlama-34b-hf) | ✔️ |
| | DeepSeek-R1-Distill-Llama | [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) | ✔️ |
@@ -67,6 +65,27 @@
|**Llama4ForConditionalGeneration** | Llama-4-Scout | [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | ✕ | ✔️ | ✔️ |
|**Gemma3ForConditionalGeneration** | Gemma3 | [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)| ✕ | ✔️ | ✔️ |


**Dual QPC:**
In the dual QPC (Qualcomm Program Container) setup, the model is split across two programs:

- The **Vision Encoder** runs in one QPC.
- The **Language Model** (responsible for output generation) runs in a separate QPC.
- The Vision Encoder's outputs are transferred to the Language Model.
- This split adds the flexibility to run the vision and language components independently.

**Single QPC:**
In the single QPC (Qualcomm Program Container) setup, the entire model, including both image encoding and text generation, runs within a single QPC. There is no model splitting; all components operate within the same execution environment.

**Note:**
The choice between single and dual QPC is made at model instantiation via the `kv_offload` setting: `kv_offload=True` runs the model in dual-QPC mode, while `kv_offload=False` runs it in single-QPC mode.
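
As a toy sketch of the dataflow in the two modes (the helper functions below are hypothetical stand-ins, not the library's API): in dual-QPC mode the encoder's outputs cross a program boundary on their way to the language model, while in single-QPC mode both stages run inside one program:

```python
# Stand-in for the vision encoder stage (first QPC in dual mode).
def vision_encoder(image_pixels):
    return [p / 255.0 for p in image_pixels]  # fake "embeddings"

# Stand-in for the language model stage (second QPC in dual mode).
def language_model(vision_embeds, prompt_tokens):
    return f"generated({len(vision_embeds)}+{len(prompt_tokens)})"

def run(image_pixels, prompt_tokens, kv_offload=True):
    if kv_offload:
        # Dual QPC: encoder output is produced by one program and
        # transferred to the separately compiled language model.
        embeds = vision_encoder(image_pixels)
        return language_model(embeds, prompt_tokens)
    # Single QPC: both stages execute inside one program,
    # with no transfer step between them.
    return language_model(vision_encoder(image_pixels), prompt_tokens)
```

Either mode produces the same result; the difference is deployment flexibility (dual) versus a single self-contained artifact (single).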

---
### Audio Models
Automatic Speech Recognition (ASR): transcription task
**QEff Auto Class:** `QEFFAutoModelForSpeechSeq2Seq`