- Request dispatcher supports assigning a request to a specific worker.
- Multi-turn chat enhanced with load balancing at both the worker and user-session level.
- Introduced to standardize the lazy loading of inference data. This replaces the previous implementation and provides a cleaner, extensible design for data handling between the data generator, load generator, and API data layers.
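The lazy-loading design above can be sketched as an iterator that produces requests on demand rather than materializing the whole dataset up front. This is a minimal illustration, not the project's actual API: the `InferenceData` and `LazyDataGenerator` names and fields are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class InferenceData:
    """Hypothetical record exchanged between the data and load generators."""
    prompt: str
    max_output_tokens: int


class LazyDataGenerator:
    """Sketch: yields inference requests one at a time, so the load
    generator can start issuing traffic before generation finishes."""

    def __init__(self, num_prompts: int):
        self.num_prompts = num_prompts

    def __iter__(self) -> Iterator[InferenceData]:
        for i in range(self.num_prompts):
            # Nothing is produced until the consumer actually iterates.
            yield InferenceData(prompt=f"prompt-{i}", max_output_tokens=50)


gen = LazyDataGenerator(num_prompts=3)
prompts = [d.prompt for d in gen]
```

Because the generator is lazy, swapping in a different data source only requires providing another iterable of the same record type, which is the kind of extensibility the new design aims for.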
README.md

Inference Perf is a GenAI inference performance benchmarking tool.
* Supports benchmarking large deployments with frameworks like [llm-d](https://llm-d.ai/), [Dynamo](https://docs.nvidia.com/dynamo/latest/) and [Inference Gateway](https://gateway-api-inference-extension.sigs.k8s.io/).
* Supports specifying an exact input and output distribution to simulate different scenarios - Gaussian distribution, fixed length, min-max cases are all supported.
* Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
* Supports multi-turn chat conversations: it keeps the context of a series of messages to simulate a conversation, and each request in a chat round includes the previous messages as a prefix. See the example [config-multi-turn](examples/vllm/config-shared-prefix-multi-turn.yml):
```yaml
num_prompts_per_group: 25     # Number of unique questions per user
system_prompt_len: 100        # Length of the first prefix (in tokens); simulates a system prompt
question_len: 50              # Length of the unique question part (in tokens)
output_len: 50                # Target length for the model's generated output (in tokens)
enable_multi_turn_chat: true  # Enable multi-turn chat; creates a user session per group, and the chat context is appended for each request in the group
```
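The context-accumulation behavior described above can be sketched as a small session object: each new request carries all previous messages as a prefix. This is an illustrative sketch only; the `ChatSession` class and its method names are assumptions, not the tool's actual implementation.

```python
class ChatSession:
    """Hypothetical per-user session for multi-turn chat benchmarking:
    every request includes the accumulated conversation as a prefix."""

    def __init__(self, system_prompt: str):
        # The system prompt is the first, shared prefix of every round.
        self.messages: list[tuple[str, str]] = [("system", system_prompt)]

    def build_request(self, question: str) -> list[tuple[str, str]]:
        # Append the new question; the request sent to the server contains
        # the full context so far (prior rounds act as a shared prefix).
        self.messages.append(("user", question))
        return list(self.messages)

    def record_reply(self, reply: str) -> None:
        # Model output from this round becomes part of the next round's prefix.
        self.messages.append(("assistant", reply))


session = ChatSession("You are a helpful assistant.")
r1 = session.build_request("Q1")   # system prompt + first question
session.record_reply("A1")
r2 = session.build_request("Q2")   # previous round is now a prefix of r2
```

Prefix accumulation like this is what lets multi-turn benchmarks exercise prefix caching: each round's request shares a growing common prefix with the previous one.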