10 changes: 10 additions & 0 deletions Llama/Llama3.1.md
@@ -0,0 +1,10 @@
# Quick Start Recipe for Llama 3.1 on vLLM

## Introduction

This quick start recipe provides step-by-step instructions for running the Llama 3.1 Instruct model using vLLM. The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference on the targeted accelerated stack.

### TPU Deployment

- [Llama3.1-70B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.1)
- [Llama3.1-8B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.1)
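
For a quick functional check outside the TPU recipes above, a minimal GPU serve command might look like the following; the model ID and flag values are illustrative assumptions rather than a tuned configuration:

```bash
# Minimal sketch: serve Llama 3.1 8B Instruct with vLLM defaults on a single GPU.
# Assumes access to the gated meta-llama repository on Hugging Face.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192
```

The server then exposes an OpenAI-compatible API on port 8000 by default.
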
49 changes: 26 additions & 23 deletions Qwen/Qwen2.5-VL.md
@@ -1,18 +1,23 @@
# Qwen2.5-VL Usage Guide

This guide describes how to run Qwen2.5-VL series with native BF16 on NVIDIA GPUs.
This guide describes how to run the Qwen2.5-VL series on the targeted accelerated stack.
Since BF16 is the precision commonly used for Qwen2.5-VL training, using BF16 for inference preserves the best accuracy.

## TPU Deployment

## Installing vLLM
- [Qwen2.5-VL on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Qwen2.5-VL)

## GPU Deployment

### Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```

## Running Qwen2.5-VL with BF16 on 4xA100
### Running Qwen2.5-VL with BF16 on 4xA100

There are two ways to parallelize the model over multiple GPUs: (1) tensor parallelism (TP) and (2) data parallelism (DP). Each has its own advantages: tensor parallelism is usually more beneficial for low-latency, low-load scenarios, while data parallelism works better for high-throughput scenarios with heavy loads.
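
At the flag level, the difference is simply which parallel size you increase; the complete, tuned commands follow below, so the two lines here are only an illustrative sketch:

```bash
# Tensor parallelism: shard one model replica across 4 GPUs (better latency per request).
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --tensor-parallel-size 4

# Data parallelism: run 4 independent replicas, one per GPU (better aggregate throughput).
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --data-parallel-size 4
```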

@@ -29,14 +34,15 @@ vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--limit-mm-per-prompt '{"image":2,"video":0}' \

```
### Tips

#### Tips

- You can set `--max-model-len` to reduce memory usage. By default the model's context length is 128K, but `--max-model-len=65536` is usually sufficient for most scenarios.
- You can set `--tensor-parallel-size` and `--data-parallel-size` to adjust the parallelism strategy, but TP should be larger than 2 on A100-80GB devices to avoid OOM.
- You can set `--limit-mm-per-prompt` to limit how many multimodal items are allowed per prompt. This is useful if you want to control the incoming traffic of multimodal requests.
- `--mm-encoder-tp-mode` is set to "data" to deploy the multimodal encoder in DP fashion for better performance. The multimodal encoder is very small compared to the language decoder (a 675M ViT vs. a 72B LM in Qwen2.5-VL-72B), so TP on the ViT provides little gain while incurring significant communication overhead.
- vLLM conservatively uses 90% of GPU memory by default. You can set `--gpu-memory-utilization=0.95` to maximize the KV cache. These flags are combined in the sketch after these tips.
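
A minimal sketch that combines the tips above for the 72B model on 4xA100; the values are the ones suggested in the tips, not a benchmarked configuration:

```bash
# Sketch: 72B server with the memory and parallelism tips applied.
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95 \
  --mm-encoder-tp-mode data \
  --limit-mm-per-prompt '{"image":2,"video":0}'
```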


For medium-sized models like Qwen2.5-VL-7B, data parallelism usually provides better performance, since it boosts throughput without the heavy communication costs incurred by tensor parallelism. Here is an example of how to launch the server using DP=4:

```bash
@@ -49,11 +55,11 @@ vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--limit-mm-per-prompt '{"image":2,"video":0}' \
```

## Benchmarking
### Benchmarking

For benchmarking, you first need to launch the server with prefix caching disabled by adding `--no-enable-prefix-caching` to the server command.
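
For example, the 72B server configured above can be relaunched for benchmarking as follows; the flags simply repeat the serving configuration with prefix caching turned off, so adjust them to match whatever you actually served with:

```bash
# Sketch: same 72B serving configuration as above, with prefix caching disabled.
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt '{"image":2,"video":0}' \
  --no-enable-prefix-caching
```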

### Qwen2.5VL-72B Benchmark on VisionArena-Chat Dataset
#### Qwen2.5VL-72B Benchmark on VisionArena-Chat Dataset

Once the server for the 72B model is running, open another terminal and run the benchmark client:

@@ -69,10 +75,10 @@ vllm bench serve \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 128
```
* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512

#### Expected Output
* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
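
One convenient way to sweep these batch sizes is a small shell loop around the client command; every flag other than `--num-prompts` is an assumption here and should mirror the full `vllm bench serve` command shown above:

```bash
# Sketch: one benchmark run per batch size against the running 72B server.
for n in 1 16 32 64 128 256 512; do
  vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts "$n"
done
```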

##### Expected Output

```shell
============ Serving Benchmark Result ============
@@ -99,7 +105,7 @@ P99 ITL (ms): 614.47

```

### Qwen2.5VL-72B Benchmark on Random Synthetic Dataset
#### Qwen2.5VL-72B Benchmark on Random Synthetic Dataset

Once the server for the 72B model is running, open another terminal and run the benchmark client:

@@ -114,15 +120,14 @@ vllm bench serve \
--num-prompts 128
```

* Test different workloads by adjusting input/output lengths via the `--random-input-len` and `--random-output-len` arguments:
- **Prompt-heavy**: 8000 input / 1000 output
- **Decode-heavy**: 1000 input / 8000 output
- **Balanced**: 1000 input / 1000 output

* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
- Test different workloads by adjusting input/output lengths via the `--random-input-len` and `--random-output-len` arguments (a prompt-heavy example is sketched after this list):
- **Prompt-heavy**: 8000 input / 1000 output
- **Decode-heavy**: 1000 input / 8000 output
- **Balanced**: 1000 input / 1000 output

- Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
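
For instance, a prompt-heavy run against the 72B server might look like the following; flags other than the input/output lengths are assumptions and should mirror the full command shown above:

```bash
# Sketch: prompt-heavy workload (8000 input / 1000 output tokens) on the random dataset.
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 128
```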

#### Expected Output
##### Expected Output

```shell
============ Serving Benchmark Result ============
@@ -148,9 +153,7 @@ P99 ITL (ms): 558.30
==================================================
```



### Qwen2.5VL-7B Benchmark on VisionArena-Chat Dataset
#### Qwen2.5VL-7B Benchmark on VisionArena-Chat Dataset

Once the server for the 7B model is running, open another terminal and run the benchmark client:

@@ -167,7 +170,7 @@ vllm bench serve \
--num-prompts 128
```

#### Expected Output
##### Expected Output

```shell
============ Serving Benchmark Result ============
@@ -193,7 +196,7 @@ P99 ITL (ms): 653.85
==================================================
```

### Qwen2.5VL-7B Benchmark on Random Synthetic Dataset
#### Qwen2.5VL-7B Benchmark on Random Synthetic Dataset

Once the server for the 7B model is running, open another terminal and run the benchmark client:

@@ -208,7 +211,7 @@ vllm bench serve \
--num-prompts 128
```

#### Expected Output
##### Expected Output

```shell
============ Serving Benchmark Result ============
```
10 changes: 10 additions & 0 deletions Qwen/Qwen3.md
@@ -0,0 +1,10 @@
# Qwen3 Usage Guide

## Introduction

This guide provides step-by-step instructions for running the Qwen3 series using vLLM. The guide is intended for developers and practitioners seeking high-throughput or low-latency inference on the targeted accelerated stack.

### TPU Deployment

- [Qwen3-32B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Qwen3)
- [Qwen3-4B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Qwen3)
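
As a quick single-GPU sanity check before moving to the tuned TPU recipes above, a minimal serve command might look like this; the model ID and flag values are illustrative assumptions:

```bash
# Minimal sketch: serve Qwen3-4B with vLLM defaults on a single GPU.
vllm serve Qwen/Qwen3-4B \
  --max-model-len 8192
```
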
14 changes: 14 additions & 0 deletions README.md
@@ -6,44 +6,57 @@ This repo intends to host community maintained common recipes to run vLLM answer
## Guides

### DeepSeek <img src="https://avatars.githubusercontent.com/u/148330874?s=200&v=4" alt="DeepSeek" width="16" height="16" style="vertical-align:middle;">

- [DeepSeek-V3.2-Exp](DeepSeek/DeepSeek-V3_2-Exp.md)
- [DeepSeek-V3.1](DeepSeek/DeepSeek-V3_1.md)
- [DeepSeek-V3, DeepSeek-R1](DeepSeek/DeepSeek-V3.md)

### Ernie <img src="https://avatars.githubusercontent.com/u/13245940?v=4" alt="Ernie" width="16" height="16" style="vertical-align:middle;">

- [Ernie4.5](Ernie/Ernie4.5.md)
- [Ernie4.5-VL](Ernie/Ernie4.5-VL.md)

### GLM <img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" alt="GLM" width="16" height="16" style="vertical-align:middle;">

- [GLM-4.5/GLM-4.6, GLM-4.5-Air](GLM/GLM-4.5.md)
- [GLM-4.5V](GLM/GLM-4.5V.md)

### InternVL <img src="https://github.com/user-attachments/assets/930e6814-8a9f-43e1-a284-118a5732daa4" alt="InternVL" width="64" height="16">

- [InternVL3.5](InternVL/InternVL3_5.md)

### InternLM <img src="https://avatars.githubusercontent.com/u/135356492?s=200&v=4" alt="InternLM" width="16" height="16" style="vertical-align:middle;">

- [Intern-S1](InternLM/Intern-S1.md)

### Llama

- [Llama4-Scout](Llama/Llama4-Scout.md)
- [Llama3.3-70B](Llama/Llama3.3-70B.md)
- [Llama3.1](Llama/Llama3.1.md)

### OpenAI <img src="https://avatars.githubusercontent.com/u/14957082?v=4" alt="OpenAI" width="16" height="16" style="vertical-align:middle;">

- [gpt-oss](OpenAI/GPT-OSS.md)

### Qwen <img src="https://qwenlm.github.io/favicon.png" alt="Qwen" width="16" height="16" style="vertical-align:middle;">

- [Qwen3](Qwen/Qwen3.md)
- [Qwen3-VL](Qwen/Qwen3-VL.md)
- [Qwen3-Next](Qwen/Qwen3-Next.md)
- [Qwen3-Coder-480B-A35B](Qwen/Qwen3-Coder-480B-A35B.md)
- [Qwen2.5-VL](Qwen/Qwen2.5-VL.md)

### Seed <img src="https://avatars.githubusercontent.com/u/4158466?s=200&v=4" alt="Seed" width="16" height="16" style="vertical-align:middle;">

- [Seed-OSS-36B](Seed/Seed-OSS-36B.md)

### Moonshotai <img src="https://avatars.githubusercontent.com/u/129152888?v=4" alt="Moonshotai" width="16" height="16" style="vertical-align:middle;">

- [Kimi-K2](moonshotai/Kimi-K2.md)

## Contributing

Please feel free to contribute by adding a new recipe or improving an existing one; just send us a PR!

While the repo is designed to be directly viewable on GitHub (Markdown files are first-class citizens), you can also build the docs as web pages locally.
@@ -56,4 +69,5 @@ uv run mkdocs serve
```

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/vllm-project/recipes/blob/main/LICENSE) file for details.