Feat: Add user session to support Multi-turn chat (#179) #257
base: main
Conversation
Testing the result from
Thanks for sending this out! My take on the open questions:
It'd be good to truncate the oldest, since we generally need the more recent context for better responses. What is the thinking on how you decide when to truncate? Get the actual context length for the model, or have another config field for max_context_length? I think it can be in a separate change too.
This seems like a good default behavior to go with.
I actually prefer keeping the context local to the worker. I don't think there is a need for global conversation state. Only complicates things unnecessarily.
@vMaroon PTAL if you have time. Would be good to get some feedback on whether this solves your use case.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: huaxig
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Summary for new changes
CC: @jjk-g
README.md
Outdated
* Supports benchmarking large deployments with frameworks like [llm-d](https://llm-d.ai/), [Dynamo](https://docs.nvidia.com/dynamo/latest/) and [Inference Gateway](https://gateway-api-inference-extension.sigs.k8s.io/).
* Supports specifying an exact input and output distribution to simulate different scenarios - Gaussian distribution, fixed length, min-max cases are all supported.
* Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
* Supprots Multi-turn chat conversations, it can keep context of a series of messages to simulate a conversation. A request in each chat round will keep previouse messages as prefix. see example [config-multi-turn](examples/vllm/config-shared-prefix-multi-turn.yml)
s/supprots/supports
self._in_flight.release()

class UserSessionCompletionAPIData(CompletionAPIData):
One thing: this will also need to support the chat API, which more natively supports multi-turn.
I think that is OK to be a follow-up item, but it may cause some changes to the initial approach.
Since the shared-prefix datagen only supports CompletionAPIData, only the Completion API is implemented for now. I left a TODO comment to implement the Chat API when needed.
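For illustration, the payload-shape difference the reviewers are pointing at, following the standard OpenAI-style Completion and Chat schemas (the concrete values here are made up):

```python
# Completion API: multi-turn context must be flattened into a single prompt
# string, which is what the shared-prefix implementation produces today.
completion_payload = {
    "model": "example-model",
    "prompt": "System: <shared prefix>\nUser: turn 1\nAssistant: reply 1\nUser: turn 2",
}

# Chat API: turns stay structured as a messages list, which is why it more
# natively supports multi-turn conversations.
chat_payload = {
    "model": "example-model",
    "messages": [
        {"role": "system", "content": "<shared prefix>"},
        {"role": "user", "content": "turn 1"},
        {"role": "assistant", "content": "reply 1"},
        {"role": "user", "content": "turn 2"},
    ],
}
```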
if self.enable_multi_turn_chat:
    # Create user session and store prefix as context (system prompt)
    self.user_sessions.append(
        LocalUserSession(user_session_id=f"user_session_{group_id}", context=shared_prefix_text)
    )
Is the intention to create a user per group?
This would mean each user would have unique prefixes, but users should share prefixes.
Good point. I based the implementation on the existing behavior where each group has a unique prefix. So, yes, with multi-turn chat, each group (and thus each user session) gets its own prefix.
Should all users share the same prefix instead? I can adjust the implementation.
Done. Adjusted the number of users; it is now based on the number of groups * the number of prompts per group.
- Request dispatcher supports assigning a request to a specific worker.
- Multi-turn chat enhanced with load balancing at both the worker and user-session level.
- Introduced a new abstraction to standardize the lazy loading of inference data. This replaces the previous implementation and provides a cleaner, extensible design for data handling between the data generator, load generator, and API data layers.
- Number of users = Number of groups * Number of prompts per group
Summary
Implementation
User Session Management: A new `LocalUserSession` class manages the context of a conversation. It ensures that each request includes the context from previous turns, and only one request per user session is processed at a time, which maintains the conversational state.
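A minimal sketch of the idea (only `LocalUserSession`, `user_session_id`, and `context` come from the PR; the method names and locking scheme are assumptions):

```python
import asyncio


class LocalUserSession:
    """Sketch: holds the conversation context for one simulated user."""

    def __init__(self, user_session_id: str, context: str = ""):
        self.user_session_id = user_session_id
        self.context = context  # conversation accumulated so far
        # Lock so only one request per session is in flight at a time.
        self._lock = asyncio.Lock()

    async def next_prompt(self, prompt: str) -> str:
        """Claim the session and build the full prompt (context + new turn)."""
        await self._lock.acquire()
        return f"{self.context}\n{prompt}" if self.context else prompt

    def complete_turn(self, prompt: str, response: str) -> None:
        """On success, fold the turn into the context and release the session."""
        self.context = f"{self.context}\n{prompt}\n{response}".strip()
        self._lock.release()

    def fail_turn(self) -> None:
        """On failure, release without growing the context, so the next round
        continues from the last successful one."""
        self._lock.release()
```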
`shared_prefix_datagen` Enhancement: The `shared_prefix_datagen` can now group prompts into user sessions. When the new `group_as_user_session` flag is enabled in the configuration, each group of prompts simulates a multi-turn conversation (to support a variety of QPS values, the prompts in each group are treated as an infinite loop). The shared prefix acts as the initial system prompt, and subsequent prompts in the group are treated as conversational turns. In mp mode (multiple workers), each worker holds an ISOLATED set of user sessions to avoid the communication overhead of syncing conversations; a rough sketch follows below.
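Reusing the `LocalUserSession` sketch above, the grouping could look roughly like this (the `groups` structure and helper names are assumptions, not the PR's actual code):

```python
from itertools import cycle


def build_user_sessions(groups):
    """Sketch: one session per (group, prompt) pair, so every session in a
    group starts from the same shared prefix."""
    sessions = []
    for group_id, (shared_prefix_text, prompts) in enumerate(groups):
        for i, _ in enumerate(prompts):
            sessions.append(
                LocalUserSession(
                    user_session_id=f"user_session_{group_id}_{i}",
                    context=shared_prefix_text,  # prefix acts as system prompt
                )
            )
    return sessions


def group_turns(prompts):
    """Prompts in a group repeat as an infinite loop, so a session can keep
    producing turns at any QPS."""
    return cycle(prompts)
```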
API and Configuration Changes:
- The `to_payload` method in the API data classes is now asynchronous to support the new user session logic.
- A `process_failure` method has been added to the base API data class to gracefully handle request failures and ensure the user session context is managed correctly; it also provides `InferenceInfo` when a request fails.
- A `group_as_user_session` boolean flag is added to the `shared_prefix` data configuration to enable this new functionality. A sketch of how these pieces fit together is below.
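A hedged sketch of the async payload path; the real `CompletionAPIData` base class and `InferenceInfo` type in the PR may differ, so both are simplified or stubbed here:

```python
from dataclasses import dataclass


@dataclass
class InferenceInfo:  # stub for illustration; the PR defines the real type
    error: str | None = None


class UserSessionCompletionAPIData:  # extends CompletionAPIData in the PR
    def __init__(self, session: LocalUserSession, prompt: str):
        self.session = session
        self.prompt = prompt

    async def to_payload(self) -> dict:
        # Async so building the payload can wait until the previous turn of
        # this session has finished.
        full_prompt = await self.session.next_prompt(self.prompt)
        return {"prompt": full_prompt}

    def process_failure(self, error: Exception) -> InferenceInfo:
        # Release the session without growing its context and report the
        # failure to the caller.
        self.session.fail_turn()
        return InferenceInfo(error=str(error))
```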
Open Questions
Context Length Management: The context for a user session grows with each turn. With a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or the newest parts of the context?
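For the oldest-first option, a character-level sketch (a real version would count tokens against the model's context length or a `max_context_length` config field, as suggested above; `max_context_chars` is a made-up knob):

```python
def truncate_context(context: str, prompt: str, max_context_chars: int) -> str:
    """Crop the oldest part of the context so context + prompt fits the budget."""
    budget = max_context_chars - len(prompt)
    if budget <= 0:
        return ""  # the prompt alone fills the window; drop all history
    return context[-budget:]
```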
Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?
Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development efforts but provide more flexibility.