
Commit f717702: Sync with upstream @ v0.7.3
2 parents bbbb8cc + ed6e907

604 files changed, +43992 / -9646 lines

Diff for: .buildkite/nightly-benchmarks/README.md (+18, -28)

````diff
@@ -1,15 +1,13 @@
 # vLLM benchmark suite
 
-
 ## Introduction
 
 This directory contains two sets of benchmark for vllm.
+
 - Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
 - Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-
-See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
-
+See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
 
 ## Performance benchmark quick overview
 
@@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 **For benchmarking developers**: please try your best to constraint the duration of benchmarking to about 1 hr so that it won't take forever to run.
 
-
 ## Nightly benchmark quick overview
 
-**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
+**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
 
 **Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
 
 **Benchmarking Duration**: about 3.5hrs.
 
-
-
 ## Trigger the benchmark
 
 Performance benchmark will be triggered when:
@@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
 Nightly benchmark will be triggered when:
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
-
-
-
 ## Performance benchmark details
 
-
 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
-
-#### Latency test
+### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
@@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
 ```
 
 In this example:
-- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+
+- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
 
+### Throughput test
 
-#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
-#### Serving test
+### Serving test
+
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
-```
+```json
 [
   {
     "test_name": "serving_llama8B_tp1_sharegpt",
@@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
 ```
 
 Inside this example:
+
 - The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
 - The `server-parameters` includes the command line arguments for vLLM server.
 - The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
@@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-#### Visualizing the results
+### Visualizing the results
+
 The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
 
-
-
 ## Nightly test details
 
 See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
 
+### Workflow
 
-#### Workflow
-
-- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
+- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
 - Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
 - The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
 - At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
 
-#### Nightly tests
+### Nightly tests
 
 In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking commands, together with the benchmarking test cases. The format is highly similar to performance benchmark.
 
-#### Docker containers
+### Docker containers
 
 The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
 
 WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
 
 WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
-
````
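
The underscore-to-dash convention described in the latency-test notes above can be pictured with a small jq-based helper. This is a hedged sketch only: the function name `json2args` and the exact jq program are illustrative assumptions, not necessarily how `run-performance-benchmarks.sh` implements the conversion.

```bash
#!/bin/bash
# Sketch (assumed helper): turn a "parameters" JSON object into CLI flags,
# replacing "_" with "-" in the keys, as the README notes describe.
json2args() {
  local json_string=$1
  echo "$json_string" | jq -r '
    to_entries
    | map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring))
    | join(" ")'
}

params='{"model": "meta-llama/Meta-Llama-3-8B", "tensor_parallel_size": 1, "load_format": "dummy", "num_iters_warmup": 5, "num_iters": 15}'
echo "python3 benchmark_latency.py $(json2args "$params")"
# -> python3 benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15
```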

Diff for: .buildkite/nightly-benchmarks/benchmark-pipeline.yaml (+93, -1)

```diff
@@ -10,12 +10,18 @@ steps:
           - image: badouralix/curl-jq
             command:
             - sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
-
+  - label: "Cleanup H100"
+    agents:
+      queue: H100
+    depends_on: ~
+    command: docker system prune -a --volumes --force
+
   - label: "A100"
     # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
     agents:
       queue: A100
     depends_on: wait-for-container-image
+    if: build.branch == "main"
     plugins:
     - kubernetes:
         podSpec:
@@ -50,6 +56,7 @@ steps:
     agents:
       queue: H200
     depends_on: wait-for-container-image
+    if: build.branch == "main"
     plugins:
     - docker#v5.12.0:
         image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
@@ -75,6 +82,7 @@ steps:
     agents:
       queue: H100
     depends_on: wait-for-container-image
+    if: build.branch == "main"
     plugins:
     - docker#v5.12.0:
         image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
@@ -90,3 +98,87 @@ steps:
         environment:
         - VLLM_USAGE_SOURCE
         - HF_TOKEN
+
+  # Premerge benchmark
+  - label: "A100"
+    # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
+    agents:
+      queue: A100
+    depends_on: wait-for-container-image
+    if: build.branch != "main"
+    plugins:
+    - kubernetes:
+        podSpec:
+          priorityClassName: perf-benchmark
+          containers:
+          - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+            command:
+            - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
+            resources:
+              limits:
+                nvidia.com/gpu: 8
+            volumeMounts:
+            - name: devshm
+              mountPath: /dev/shm
+            env:
+            - name: VLLM_USAGE_SOURCE
+              value: ci-test
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: token
+          nodeSelector:
+            nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
+          volumes:
+          - name: devshm
+            emptyDir:
+              medium: Memory
+
+  - label: "H200"
+    # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
+    agents:
+      queue: H200
+    depends_on: wait-for-container-image
+    if: build.branch != "main"
+    plugins:
+    - docker#v5.12.0:
+        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        command:
+        - bash
+        - .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
+        mount-buildkite-agent: true
+        propagate-environment: true
+        ipc: host
+        gpus: 4,5,6,7
+        volumes:
+        - /data/benchmark-hf-cache:/root/.cache/huggingface
+        environment:
+        - VLLM_USAGE_SOURCE
+        - HF_TOKEN
+
+  #- block: "Run H100 Benchmark"
+  #key: block-h100
+  #depends_on: ~
+
+  - label: "H100"
+    # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
+    agents:
+      queue: H100
+    depends_on: wait-for-container-image
+    if: build.branch != "main"
+    plugins:
+    - docker#v5.12.0:
+        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        command:
+        - bash
+        - .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
+        mount-buildkite-agent: true
+        propagate-environment: true
+        ipc: host
+        gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
+        volumes:
+        - /data/benchmark-hf-cache:/root/.cache/huggingface
+        environment:
+        - VLLM_USAGE_SOURCE
+        - HF_TOKEN
```

Diff for: .buildkite/nightly-benchmarks/nightly-annotation.md (+10, -11)

````diff
@@ -9,20 +9,19 @@ This file contains the downloading link for benchmarking results.
 
 Please download the visualization scripts in the post
 
-
 ## Results reproduction
 
 - Find the docker we use in `benchmarking pipeline`
 - Deploy the docker, and inside the docker:
-  - Download `nightly-benchmarks.zip`.
-  - In the same folder, run the following code
-  ```
-  export HF_TOKEN=<your HF token>
-  apt update
-  apt install -y git
-  unzip nightly-benchmarks.zip
-  VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
-  ```
+  - Download `nightly-benchmarks.zip`.
+  - In the same folder, run the following code:
 
-And the results will be inside `./benchmarks/results`.
+```console
+export HF_TOKEN=<your HF token>
+apt update
+apt install -y git
+unzip nightly-benchmarks.zip
+VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+```
 
+And the results will be inside `./benchmarks/results`.
````

Diff for: .buildkite/nightly-benchmarks/nightly-descriptions.md (+3, -3)

```diff
@@ -2,14 +2,14 @@
 # Nightly benchmark
 
 This benchmark aims to:
+
 - Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
 - Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
 
 Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
 
 Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
 
-
 ## Setup
 
 - Docker images:
@@ -33,7 +33,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
 - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
 - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
 
-# Known issues
+## Known issues
 
 - TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
-- TGI does not support `ignore-eos` flag.
+- TGI does not support `ignore-eos` flag.
```

Diff for: .buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md (+2, -8)

````diff
@@ -7,10 +7,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
-
 {latency_tests_markdown_table}
 
-
 ## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -19,10 +17,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
-
 {throughput_tests_markdown_table}
 
-
 ## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -33,13 +29,11 @@
 - We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
-
 {serving_tests_markdown_table}
 
-
 ## json version of the benchmarking tables
 
-This section contains the data of the markdown tables above in JSON format.
+This section contains the data of the markdown tables above in JSON format.
 You can load the benchmarking tables into pandas dataframes as follows:
 
 ```python
@@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
 ```
 
 The json string for all benchmarking tables:
+
 ```json
 {benchmarking_results_in_json_string}
 ```
 
 You can also check the raw experiment data in the Artifact tab of the Buildkite page.
-
````

Diff for: .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh (+5)

```diff
@@ -345,6 +345,11 @@ main() {
   check_gpus
   check_hf_token
 
+  # Set to v1 to run v1 benchmark
+  if [[ "${ENGINE_VERSION:-v0}" == "v1" ]]; then
+    export VLLM_USE_V1=1
+  fi
+
   # dependencies
   (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
   (which jq) || (apt-get update && apt-get -y install jq)
```
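
Given the snippet above, a v1 run would presumably be launched by setting `ENGINE_VERSION` before invoking the script; a minimal example (the invocation style is illustrative, the flag semantics come from the diff):

```bash
# ENGINE_VERSION defaults to v0; setting it to v1 makes the script export VLLM_USE_V1=1.
ENGINE_VERSION=v1 bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```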

Diff for: .buildkite/nightly-benchmarks/scripts/wait-for-image.sh (+5, -1)

```diff
@@ -1,6 +1,10 @@
 #!/bin/sh
 TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
-URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
+if [[ "$BUILDKITE_BRANCH" == "main" ]]; then
+  URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
+else
+  URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
+fi
 
 TIMEOUT_SECONDS=10
 
```
Diff for: .buildkite/nightly-benchmarks/tests/latency-tests.json (+1, -1)

```diff
@@ -29,4 +29,4 @@
       "num-iters": 15
     }
   }
-]
+]
```
