10 changes: 10 additions & 0 deletions Llama/Llama3.1.md
@@ -0,0 +1,10 @@
# Quick Start Recipe for Llama 3.1 on vLLM

## Introduction

This quick start recipe provides step-by-step instructions for running the Llama 3.1 Instruct model using vLLM. The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference on the targeted accelerated stack.

### TPU Deployment

- [Llama3.1-70B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.1)
- [Llama3.1-8B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.1)
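
For a quick functional check outside the TPU recipes above, a minimal GPU serve command might look like the following; the model ID and flag values are illustrative assumptions rather than a tuned configuration:

```bash
# Minimal sketch: serve Llama 3.1 8B Instruct with vLLM defaults on a single GPU.
# Assumes access to the gated meta-llama repository on Hugging Face.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192
```

The server then exposes an OpenAI-compatible API on port 8000 by default.
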
49 changes: 26 additions & 23 deletions Qwen/Qwen2.5-VL.md
@@ -1,18 +1,23 @@
# Qwen2.5-VL Usage Guide

This guide describes how to run Qwen2.5-VL series with native BF16 on NVIDIA GPUs.
This guide describes how to run the Qwen2.5-VL series on the targeted accelerated stack.
Since BF16 is the precision commonly used for Qwen2.5-VL training, using BF16 for inference preserves the best accuracy.

## TPU Deployment

## Installing vLLM
- [Qwen2.5-VL on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Qwen2.5-VL)

## GPU Deployment

### Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```

## Running Qwen2.5-VL with BF16 on 4xA100
### Running Qwen2.5-VL with BF16 on 4xA100

There are two ways to parallelize the model over multiple GPUs: (1) tensor parallelism (TP) and (2) data parallelism (DP). Each has its own advantages: tensor parallelism is usually more beneficial for low-latency, low-load scenarios, while data parallelism works better for high-throughput scenarios with heavy loads.
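
At the flag level, the difference is simply which parallel size you increase; the complete, tuned commands follow below, so the two lines here are only an illustrative sketch:

```bash
# Tensor parallelism: shard one model replica across 4 GPUs (better latency per request).
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --tensor-parallel-size 4

# Data parallelism: run 4 independent replicas, one per GPU (better aggregate throughput).
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --data-parallel-size 4
```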

@@ -29,14 +34,15 @@ vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--limit-mm-per-prompt '{"image":2,"video":0}' \

```
### Tips

#### Tips

- You can set `--max-model-len` to reduce memory usage. By default the model's context length is 128K, but `--max-model-len=65536` is usually sufficient for most scenarios.
- You can set `--tensor-parallel-size` and `--data-parallel-size` to adjust the parallelism strategy, but TP should be larger than 2 on A100-80GB devices to avoid OOM.
- You can set `--limit-mm-per-prompt` to limit how many multimodal items are allowed per prompt. This is useful if you want to control the incoming traffic of multimodal requests.
- `--mm-encoder-tp-mode` is set to "data" to deploy the multimodal encoder in DP fashion for better performance. The multimodal encoder is very small compared to the language decoder (a 675M ViT vs. a 72B LM in Qwen2.5-VL-72B), so TP on the ViT provides little gain while incurring significant communication overhead.
- vLLM conservatively uses 90% of GPU memory by default. You can set `--gpu-memory-utilization=0.95` to maximize the KV cache. These flags are combined in the sketch after these tips.
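
A minimal sketch that combines the tips above for the 72B model on 4xA100; the values are the ones suggested in the tips, not a benchmarked configuration:

```bash
# Sketch: 72B server with the memory and parallelism tips applied.
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95 \
  --mm-encoder-tp-mode data \
  --limit-mm-per-prompt '{"image":2,"video":0}'
```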


For medium-sized models like Qwen2.5-VL-7B, data parallelism usually provides better performance, since it boosts throughput without the heavy communication costs incurred by tensor parallelism. Here is an example of how to launch the server using DP=4:

```bash
@@ -49,11 +55,11 @@ vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--limit-mm-per-prompt '{"image":2,"video":0}' \
```

## Benchmarking
### Benchmarking

For benchmarking, you first need to launch the server with prefix caching disabled by adding `--no-enable-prefix-caching` to the server command.
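
For example, the 72B server configured above can be relaunched for benchmarking as follows; the flags simply repeat the serving configuration with prefix caching turned off, so adjust them to match whatever you actually served with:

```bash
# Sketch: same 72B serving configuration as above, with prefix caching disabled.
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt '{"image":2,"video":0}' \
  --no-enable-prefix-caching
```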

### Qwen2.5VL-72B Benchmark on VisionArena-Chat Dataset
#### Qwen2.5VL-72B Benchmark on VisionArena-Chat Dataset

Once the server for the 72B model is running, open another terminal and run the benchmark client:

@@ -69,10 +75,10 @@ vllm bench serve \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 128
```
* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512

#### Expected Output
* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
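
One convenient way to sweep these batch sizes is a small shell loop around the client command; every flag other than `--num-prompts` is an assumption here and should mirror the full `vllm bench serve` command shown above:

```bash
# Sketch: one benchmark run per batch size against the running 72B server.
for n in 1 16 32 64 128 256 512; do
  vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts "$n"
done
```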

##### Expected Output

```shell
============ Serving Benchmark Result ============
@@ -99,7 +105,7 @@ P99 ITL (ms): 614.47

```

### Qwen2.5VL-72B Benchmark on Random Synthetic Dataset
#### Qwen2.5VL-72B Benchmark on Random Synthetic Dataset

Once the server for the 72B model is running, open another terminal and run the benchmark client:

@@ -114,15 +120,14 @@ vllm bench serve \
--num-prompts 128
```

* Test different workloads by adjusting input/output lengths via the `--random-input-len` and `--random-output-len` arguments:
- **Prompt-heavy**: 8000 input / 1000 output
- **Decode-heavy**: 1000 input / 8000 output
- **Balanced**: 1000 input / 1000 output

* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
- Test different workloads by adjusting input/output lengths via the `--random-input-len` and `--random-output-len` arguments (a prompt-heavy example is sketched after this list):
- **Prompt-heavy**: 8000 input / 1000 output
- **Decode-heavy**: 1000 input / 8000 output
- **Balanced**: 1000 input / 1000 output

- Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512
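
For instance, a prompt-heavy run against the 72B server might look like the following; flags other than the input/output lengths are assumptions and should mirror the full command shown above:

```bash
# Sketch: prompt-heavy workload (8000 input / 1000 output tokens) on the random dataset.
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 128
```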

#### Expected Output
##### Expected Output

```shell
============ Serving Benchmark Result ============
@@ -148,9 +153,7 @@ P99 ITL (ms): 558.30
==================================================
```



### Qwen2.5VL-7B Benchmark on VisionArena-Chat Dataset
#### Qwen2.5VL-7B Benchmark on VisionArena-Chat Dataset

Once the server for the 7B model is running, open another terminal and run the benchmark client:

@@ -167,7 +170,7 @@ vllm bench serve \
--num-prompts 128
```

#### Expected Output
##### Expected Output

```shell
============ Serving Benchmark Result ============
@@ -193,7 +196,7 @@ P99 ITL (ms): 653.85
==================================================
```

### Qwen2.5VL-7B Benchmark on Random Synthetic Dataset
#### Qwen2.5VL-7B Benchmark on Random Synthetic Dataset

Once the server for the 7B model is running, open another terminal and run the benchmark client:

@@ -208,7 +211,7 @@ vllm bench serve \
--num-prompts 128
```

#### Expected Output
##### Expected Output

```shell
============ Serving Benchmark Result ============
```
10 changes: 10 additions & 0 deletions Qwen/Qwen3.md
@@ -0,0 +1,10 @@
# Qwen3 Usage Guide

## Introduction

This guide provides step-by-step instructions for running the Qwen3 series using vLLM. The guide is intended for developers and practitioners seeking high-throughput or low-latency inference on the targeted accelerated stack.

### TPU Deployment

- [Qwen3-32B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Qwen3)
- [Qwen3-4B on Trillium (v6e)](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Qwen3)
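
As a quick single-GPU sanity check before moving to the tuned TPU recipes above, a minimal serve command might look like this; the model ID and flag values are illustrative assumptions:

```bash
# Minimal sketch: serve Qwen3-4B with vLLM defaults on a single GPU.
vllm serve Qwen/Qwen3-4B \
  --max-model-len 8192
```
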
14 changes: 14 additions & 0 deletions README.md
@@ -6,44 +6,57 @@ This repo intends to host community maintained common recipes to run vLLM answer
## Guides

### DeepSeek <img src="https://avatars.githubusercontent.com/u/148330874?s=200&v=4" alt="DeepSeek" width="16" height="16" style="vertical-align:middle;">

- [DeepSeek-V3.2-Exp](DeepSeek/DeepSeek-V3_2-Exp.md)
- [DeepSeek-V3.1](DeepSeek/DeepSeek-V3_1.md)
- [DeepSeek-V3, DeepSeek-R1](DeepSeek/DeepSeek-V3.md)

### Ernie <img src="https://avatars.githubusercontent.com/u/13245940?v=4" alt="Ernie" width="16" height="16" style="vertical-align:middle;">

- [Ernie4.5](Ernie/Ernie4.5.md)
- [Ernie4.5-VL](Ernie/Ernie4.5-VL.md)

### GLM <img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" alt="GLM" width="16" height="16" style="vertical-align:middle;">

- [GLM-4.5/GLM-4.6, GLM-4.5-Air](GLM/GLM-4.5.md)
- [GLM-4.5V](GLM/GLM-4.5V.md)

### InternVL <img src="https://github.com/user-attachments/assets/930e6814-8a9f-43e1-a284-118a5732daa4" alt="InternVL" width="64" height="16">

- [InternVL3.5](InternVL/InternVL3_5.md)

### InternLM <img src="https://avatars.githubusercontent.com/u/135356492?s=200&v=4" alt="InternLM" width="16" height="16" style="vertical-align:middle;">

- [Intern-S1](InternLM/Intern-S1.md)

### Llama

- [Llama4-Scout](Llama/Llama4-Scout.md)
- [Llama3.3-70B](Llama/Llama3.3-70B.md)
- [Llama3.1](Llama/Llama3.1.md)

### OpenAI <img src="https://avatars.githubusercontent.com/u/14957082?v=4" alt="OpenAI" width="16" height="16" style="vertical-align:middle;">

- [gpt-oss](OpenAI/GPT-OSS.md)

### Qwen <img src="https://qwenlm.github.io/favicon.png" alt="Qwen" width="16" height="16" style="vertical-align:middle;">

- [Qwen3](Qwen/Qwen3.md)
- [Qwen3-VL](Qwen/Qwen3-VL.md)
- [Qwen3-Next](Qwen/Qwen3-Next.md)
- [Qwen3-Coder-480B-A35B](Qwen/Qwen3-Coder-480B-A35B.md)
- [Qwen2.5-VL](Qwen/Qwen2.5-VL.md)

### Seed <img src="https://avatars.githubusercontent.com/u/4158466?s=200&v=4" alt="Seed" width="16" height="16" style="vertical-align:middle;">

- [Seed-OSS-36B](Seed/Seed-OSS-36B.md)

### Moonshotai <img src="https://avatars.githubusercontent.com/u/129152888?v=4" alt="Moonshotai" width="16" height="16" style="vertical-align:middle;">

- [Kimi-K2](moonshotai/Kimi-K2.md)

## Contributing

Please feel free to contribute by adding a new recipe or improving an existing one; just send us a PR!

While the repo is designed to be directly viewable on GitHub (Markdown files are first-class citizens), you can also build the docs as web pages locally.
@@ -56,4 +69,5 @@ uv run mkdocs serve
```

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/vllm-project/recipes/blob/main/LICENSE) file for details.