README.md (1 addition, 1 deletion)
@@ -22,7 +22,7 @@ Inference Perf is a GenAI inference performance benchmarking tool that allows yo
  * Supports benchmarking large deployments with frameworks like [llm-d](https://llm-d.ai/), [Dynamo](https://docs.nvidia.com/dynamo/latest/) and [Inference Gateway](https://gateway-api-inference-extension.sigs.k8s.io/).
  * Supports specifying an exact input and output distribution to simulate different scenarios - Gaussian distribution, fixed length, min-max cases are all supported.
  * Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
- * Supprots Multi-turn chat conversations, it can keep context of a series of messages to simulate a conversation. A request in each chat round will keep previouse messages as prefix. see example [config-multi-turn](examples/vllm/config-shared-prefix-multi-turn.yml)
+ * Supports multi-turn chat conversations: it can keep the context of a series of messages to simulate a conversation. A request in each chat round keeps the previous messages as a prefix. See the example [config-multi-turn](examples/vllm/config-shared-prefix-multi-turn.yml).
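For readers skimming the diff, here is a minimal sketch of the prefix behavior the new bullet describes: each chat round's request replays the accumulated conversation as a prefix. The concrete prompt layout below is an illustrative assumption, not taken from the repository.

```yaml
# Hypothetical session for one user; the exact concatenation format is
# assumed for illustration only.
round_1_prompt: "<shared system prompt> <question 1>"
round_2_prompt: "<shared system prompt> <question 1> <answer 1> <question 2>"
round_3_prompt: "<shared system prompt> <question 1> <answer 1> <question 2> <answer 2> <question 3>"
```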
examples/vllm/config-shared-prefix-multi-turn.yml (5 additions, 5 deletions)
@@ -3,8 +3,8 @@ load:
   num_workers: 2
   worker_max_concurrency: 10
   stages:
-  - rate: 5
-    duration: 10
+  - rate: 20  # Send all 20 users' requests per second
+    duration: 5
 api:
   type: completion
 server:
@@ -17,12 +17,12 @@ tokenizer:
 data:
   type: shared_prefix
   shared_prefix:
-    num_groups: 2  # Number of distinct users
-    num_prompts_per_group: 25  # Number of unique questions per user
+    num_groups: 2  # Number of distinct prefixes. Note: the number of users is num_groups * num_prompts_per_group
+    num_prompts_per_group: 10  # Number of unique questions per group (prefix)
     system_prompt_len: 100  # Length of the first prefix (in tokens); simulates initialization of a system prompt
     question_len: 50  # Length of the unique question part (in tokens)
     output_len: 50  # Target length for the model's generated output (in tokens)
-    enable_multi_turn_chat: true  # enable multi-turn chat, create user session for each group. The chat context will be appended for the each request in the group.
+    enable_multi_turn_chat: true  # Enable multi-turn chat; a user session is created to keep the conversation, and the chat context is appended to each request.
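A quick sizing sketch tying the two hunks together, assuming `rate` is the aggregate request rate across all users (the comments above imply this but do not state it):

```yaml
# Derived from the values in this config; the totals are arithmetic,
# not figures stated in the source.
# users (sessions)  = num_groups * num_prompts_per_group = 2 * 10 = 20
# requests in stage ~= rate * duration = 20 req/s * 5 s = 100
```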