[Docs]: Add Release Documentation for Version 1.20.0 #501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account


**Open**: wants to merge 36 commits into base `main`.

**Commits (36):**
- `6983a7a`: Main Readme updating for latest news (abukhoy, May 21, 2025)
- `84c7a43`: Main Readme updating for latest news (abukhoy, May 21, 2025)
- `b172f89`: Merge branch 'main' into docs-update (abukhoy, May 30, 2025)
- `bd1000c`: docs modified (abukhoy, May 30, 2025)
- `195740e`: Merge branch 'main' into docs-update (abukhoy, Jun 9, 2025)
- `de93706`: Merge branch 'main' into docs-update (abukhoy, Jun 19, 2025)
- `dc7ae55`: Readme update and validate (abukhoy, Jun 19, 2025)
- `aa5878b`: Merge branch 'main' into docs-update (abukhoy, Jun 23, 2025)
- `4cbc841`: Merge branch 'main' into docs-update (abukhoy, Jun 24, 2025)
- `50302ab`: supported features updated (abukhoy, Jun 24, 2025)
- `627f7a2`: Merge branch 'main' into docs-update (abukhoy, Jun 25, 2025)
- `c2280ba`: Merge branch 'main' into docs-update (abukhoy, Jun 27, 2025)
- `0ca718e`: CB, single and dual qpc column added in validation doc (abukhoy, Jun 27, 2025)
- `2353a76`: CB, single and dual qpc column added in validation doc (abukhoy, Jun 27, 2025)
- `2c42d36`: source/introduction modified (abukhoy, Jun 30, 2025)
- `8b3c362`: source/validate modified (abukhoy, Jun 30, 2025)
- `dfda020`: Merge branch 'main' into docs-update (abukhoy, Jul 2, 2025)
- `3e3656e`: Comments are addressed (abukhoy, Jul 2, 2025)
- `d86b836`: Comments are addressed (abukhoy, Jul 2, 2025)
- `56f56a9`: comments are adressed (abukhoy, Jul 2, 2025)
- `8352e14`: Merge branch 'quic:main' into docs-update (abukhoy, Jul 8, 2025)
- `b88d970`: release docs added and granite MOE removed from validate list (abukhoy, Jul 8, 2025)
- `7e46180`: release dcos modified (abukhoy, Jul 8, 2025)
- `50db4bc`: release docs added for 1.20 (abukhoy, Jul 8, 2025)
- `d16eeb3`: Merge branch 'main' into docs-update (abukhoy, Jul 10, 2025)
- `640a61a`: comments are adrressed (abukhoy, Jul 10, 2025)
- `03ccbb8`: Merge branch 'main' into docs-update (abukhoy, Jul 10, 2025)
- `cb566e8`: granite vision removed from docs (abukhoy, Jul 11, 2025)
- `271e623`: granite vision removed from docs (abukhoy, Jul 11, 2025)
- `effac64`: Comments Addressed (abukhoy, Jul 14, 2025)
- `aa77cc8`: Merge branch 'main' into docs-update (abukhoy, Jul 14, 2025)
- `01a07fa`: Comments Addressed (abukhoy, Jul 14, 2025)
- `cba26d3`: Comments Addressed (abukhoy, Jul 14, 2025)
- `2467cde`: Comments Addressed (abukhoy, Jul 14, 2025)
- `9cd323c`: Merge branch 'main' into docs-update (abukhoy, Jul 14, 2025)
- `fa848c8`: Merge branch 'main' into docs-update (abukhoy, Jul 14, 2025)
**README.md** (2 changes: 0 additions & 2 deletions)

@@ -15,8 +15,6 @@
<details>
<summary>More</summary>

- [04/2025] Added support for [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800)
- [04/2025] Added support for [Granite MOE models](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base)
- [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model
- [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
- [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365).
**docs/index.md** (9 changes: 7 additions & 2 deletions)

@@ -5,12 +5,17 @@

Welcome to Efficient-Transformers Documentation!
========================================


<!-- ```{include} ../README.md
:relative-images:
``` -->

```{toctree}
:caption: 'Release Documents'
:maxdepth: 2

source/release_docs
```


```{toctree}
:caption: 'Getting Started'
```
**docs/source/introduction.md** (2 changes: 0 additions & 2 deletions)

@@ -31,8 +31,6 @@ For other models, there is comprehensive documentation to inspire upon the chang
<details>
<summary>More</summary>

- [04/2025] Added support for [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800)
- [04/2025] Added support for [Granite MOE models](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base)
- [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model
- [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
- [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365).
**docs/source/release_docs.md** (62 changes: 62 additions & 0 deletions)

@@ -0,0 +1,62 @@
# 🚀 Efficient Transformer Library - Release 1.20.0 (Beta)

Welcome to the official release of **Efficient Transformer Library v1.20.0**! This release brings a host of new model integrations, performance enhancements, and fine-tuning capabilities to accelerate your AI development.

> ✅ All features and models listed below are available on the `release/1.20.0` branch and `mainline`.

---

## 🧠 Newly Supported Models

- **Llama-4-Scout-17B-16E-Instruct**
- Text & Image+Text support
- Chunk attention, Single/Dual QPC support
  - Multi-image prompts enabled via the vLLM interface
- [Llama4 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/llama4_example.py)

- **Grok-1**
- Executable via `QEffAutoModelForCausalLM`

- **Gemma3**
- Text & Image+Text support
- Sliding window support
- [Gemma3 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/gemma3_example/gemma3_mm.py)


- **SwiftKV (Llama-3.1-SwiftKV-8B-Instruct)**
- Supports both continuous and non-continuous batching
- Executable via `QEffAutoModelForCausalLM`

- **GGUF Models**
- Execution support (non-quantized)
- [Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/basic_gguf_models.py)

- **FP8 Compressed Quantization**
- Support for `Llama-3.3-70B-Instruct-FP8-Dynamic`

---

## ✨ Key Features & Enhancements

- **Transformers Upgrade**: now using `transformers` version `4.51.3`
- **SpD & Multi-Projection Heads**: Token speculation via post-attention projections
- **I/O Encryption**: `--io-encrypt` flag support in compile/infer APIs
- **Separate Prefill/Decode Compilation**: For disaggregated serving
- **On-Device Sampling**: supported via vLLM; reduces host-device latency for CausalLM models
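
The "SpD" entry above refers to draft-based speculative decoding. As a minimal toy sketch of the general idea only (not this library's implementation; `draft_next` and `target_next` are hypothetical stand-ins for real models), a cheap draft model proposes a few tokens ahead and the target model verifies them, keeping the longest agreeing prefix:

```python
# Toy illustration of draft-based speculative decoding.
# Both "models" here are deterministic stand-in functions, not real networks.

def draft_next(token):
    # Hypothetical cheap draft model: proposes the next token.
    return (token * 2 + 1) % 7

def target_next(token):
    # Hypothetical target model: agrees with the draft except on
    # tokens divisible by 3, where it diverges.
    return (token * 2 + 1) % 7 if token % 3 else (token + 1) % 7

def speculative_step(last_token, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    proposed, t = [], last_token
    for _ in range(k):
        t = draft_next(t)
        proposed.append(t)
    # 2) Verify with the target model: accept while it agrees,
    #    and substitute the target's token at the first mismatch.
    accepted, t = [], last_token
    for p in proposed:
        expected = target_next(t)
        if p != expected:
            accepted.append(expected)
            break
        accepted.append(p)
        t = p
    return accepted
```

Each step therefore yields at least one token (the target's own choice) and up to `k` tokens when the draft agrees, which is where the speedup comes from.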

---

## 🔍 Embedding Model Upgrades

- **Flexible Pooling**: Choose from standard or custom strategies
- **Sentence Embedding**: Now runs directly on AI100
- **Multi-Seq Length Compilation**: Auto-selects optimal graph at runtime
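
To illustrate the multi-seq-length idea, here is a minimal sketch of one plausible dispatch rule, assuming the runtime picks the smallest compiled specialization that fits the prompt (this is an illustration of the concept, not the library's actual dispatch code):

```python
# Pick the smallest compiled sequence length that can hold the prompt.

def select_graph(prompt_len, compiled_seq_lens):
    for sl in sorted(compiled_seq_lens):
        if prompt_len <= sl:
            return sl
    raise ValueError(
        f"prompt of length {prompt_len} exceeds all compiled graphs"
    )
```

For example, with graphs compiled at 128, 256, and 512, a 100-token prompt would run on the 128-length graph, avoiding the padding waste of always using the largest specialization.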

---

## 🛠️ Fine-Tuning Support

- BERT fine-tuning support with templates and documentation
- Gradient checkpointing, device-aware `GradScaler`, and CLI `--help` added

---
**docs/source/validate.md** (25 changes: 22 additions & 3 deletions)

@@ -8,7 +8,7 @@

| Architecture | Model Family | Representative Models | CB Support |
|-------------------------|--------------------|--------------------------------------------------------------------------------------|------------|
| **FalconForCausalLM** | Falcon | [tiiuae/falcon-40b]((https://huggingface.co/tiiuae/falcon-40b)) | ✔️ |
| **FalconForCausalLM** | Falcon | [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b) | ✔️ |
| **GemmaForCausalLM** | CodeGemma | [google/codegemma-2b](https://huggingface.co/google/codegemma-2b)<br>[google/codegemma-7b](https://huggingface.co/google/codegemma-7b) | ✔️ |
| | Gemma | [google/gemma-2b](https://huggingface.co/google/gemma-2b)<br>[google/gemma-7b](https://huggingface.co/google/gemma-7b)<br>[google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)<br>[google/gemma-2-9b](https://huggingface.co/google/gemma-2-9b)<br>[google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b) | ✔️ |
| **GPTBigCodeForCausalLM** | Starcoder1.5 | [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) | ✔️ |
@@ -17,8 +17,6 @@
| **GPT2LMHeadModel** | GPT-2 | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | ✔️ |
| **GraniteForCausalLM** | Granite 3.1 | [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)<br>[ibm-granite/granite-guardian-3.1-8b](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b) | ✔️ |
| | Granite 20B | [ibm-granite/granite-20b-code-base-8k](https://huggingface.co/ibm-granite/granite-20b-code-base-8k)<br>[ibm-granite/granite-20b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) | ✔️ |
| **GraniteMoeForCausalLM** | Granite 3.0 | [ibm-granite/granite-3.0-1b-a400m-base](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base) | ✔️ |
| | Granite 3.1 | [ibm-granite/granite-3.1-1b-a400m-base](https://huggingface.co/ibm-granite/granite-3.0-1b-a400m-base) | ✔️ |
| **InternVLChatModel** | Intern-VL | [OpenGVLab/InternVL2_5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B) | |
| **LlamaForCausalLM** | CodeLlama | [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)<br>[codellama/CodeLlama-13b-hf](https://huggingface.co/codellama/CodeLlama-13b-hf)<br>[codellama/CodeLlama-34b-hf](https://huggingface.co/codellama/CodeLlama-34b-hf) | ✔️ |
| | DeepSeek-R1-Distill-Llama | [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) | ✔️ |
@@ -67,6 +65,27 @@
|**Llama4ForConditionalGeneration** | Llama-4-Scout | [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | ✕ | ✔️ | ✔️ |
|**Gemma3ForConditionalGeneration** | Gemma3 | [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)| ✕ | ✔️ | ✔️ |


**Dual QPC:**
In the dual QPC (Qualcomm Program Container) setup, the model is split across two programs:

- The **Vision Encoder** runs in one QPC.
- The **Language Model** (responsible for output generation) runs in a separate QPC.
- The Vision Encoder's outputs are transferred to the Language Model.
- This split adds the flexibility to run the vision and language components independently.

**Single QPC:**
In the single QPC (Qualcomm Program Container) setup, the entire model, including both image encoding and text generation, runs within a single QPC. There is no model splitting; all components operate within the same execution environment.

**Note:**
The choice between single and dual QPC is made at model instantiation via the `kv_offload` setting: `kv_offload=True` runs the model in dual-QPC mode, while `kv_offload=False` runs it in single-QPC mode.
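
As a toy sketch of the dataflow in the two modes (the helper functions below are hypothetical stand-ins, not the library's API): in dual-QPC mode the encoder's outputs cross a program boundary on their way to the language model, while in single-QPC mode both stages run inside one program:

```python
# Stand-in for the vision encoder stage (first QPC in dual mode).
def vision_encoder(image_pixels):
    return [p / 255.0 for p in image_pixels]  # fake "embeddings"

# Stand-in for the language model stage (second QPC in dual mode).
def language_model(vision_embeds, prompt_tokens):
    return f"generated({len(vision_embeds)}+{len(prompt_tokens)})"

def run(image_pixels, prompt_tokens, kv_offload=True):
    if kv_offload:
        # Dual QPC: encoder output is produced by one program and
        # transferred to the separately compiled language model.
        embeds = vision_encoder(image_pixels)
        return language_model(embeds, prompt_tokens)
    # Single QPC: both stages execute inside one program,
    # with no transfer step between them.
    return language_model(vision_encoder(image_pixels), prompt_tokens)
```

Either mode produces the same result; the difference is deployment flexibility (dual) versus a single self-contained artifact (single).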

---
### Audio Models
Automatic Speech Recognition (ASR): transcription task
**QEff Auto Class:** `QEFFAutoModelForSpeechSeq2Seq`