Feat: Add user session to support Multi-turn chat (#179) #257
base: main
Conversation
Testing the result from
Thanks for sending this out! My take on the open questions:
It'd be good to truncate the oldest, since we generally need the more recent context for better responses. What is the thinking on how you decide when to truncate? Get the actual context length for the model, or have another config field for max_context_length? I think it can be in a separate change too.
This seems like a good default behavior to go with.
I actually prefer keeping the context local to the worker. I don't think there is a need for global conversation state. Only complicates things unnecessarily.
@vMaroon PTAL if you have time. Would be good to get some feedback on whether this solves your use case.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: huaxig
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Summary for new changes
CC: @jjk-g
README.md
Outdated
* Supports benchmarking large deployments with frameworks like [llm-d](https://llm-d.ai/), [Dynamo](https://docs.nvidia.com/dynamo/latest/) and [Inference Gateway](https://gateway-api-inference-extension.sigs.k8s.io/).
* Supports specifying an exact input and output distribution to simulate different scenarios - Gaussian distribution, fixed length, min-max cases are all supported.
* Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
* Supprots Multi-turn chat conversations, it can keep context of a series of messages to simulate a conversation. A request in each chat round will keep previouse messages as prefix. see example [config-multi-turn](examples/vllm/config-shared-prefix-multi-turn.yml)
s/supprots/supports
self._in_flight.release()

class UserSessionCompletionAPIData(CompletionAPIData):
One thing: this will also need to support the chat API, which more natively supports multi-turn.
I think that is OK to be a follow-up item, but it may cause some changes to the initial approach.
Since the shared-prefix datagen only supports CompletionAPIData, only the Completion API is implemented for now. I left a TODO comment to implement the Chat API when needed.
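For illustration, the payload-shape difference the reviewers are pointing at, following the standard OpenAI-style Completion and Chat schemas (the concrete values here are made up):

```python
# Completion API: multi-turn context must be flattened into a single prompt
# string, which is what the shared-prefix implementation produces today.
completion_payload = {
    "model": "example-model",
    "prompt": "System: <shared prefix>\nUser: turn 1\nAssistant: reply 1\nUser: turn 2",
}

# Chat API: turns stay structured as a messages list, which is why it more
# natively supports multi-turn conversations.
chat_payload = {
    "model": "example-model",
    "messages": [
        {"role": "system", "content": "<shared prefix>"},
        {"role": "user", "content": "turn 1"},
        {"role": "assistant", "content": "reply 1"},
        {"role": "user", "content": "turn 2"},
    ],
}
```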
if self.enable_multi_turn_chat:
    # Create user session and store prefix as context (system prompt)
    self.user_sessions.append(
        LocalUserSession(user_session_id=f"user_session_{group_id}", context=shared_prefix_text)
    )
Is the intention to create a user per group?
This would mean each user would have unique prefixes, but users should share prefixes.
Good point. I based the implementation on the existing behavior where each group has a unique prefix. So, yes, with multi-turn chat, each group (and thus each user session) gets its own prefix.
Should all users share the same prefix instead? I can adjust the implementation.
Done. Adjusted the number of users; it is now based on the number of groups * the number of prompts per group.
- Request dispatcher supports assigning a request to a specific worker.
- Multi-turn chat enhanced with load balancing at both the worker and user-session level.
- Introduced a new abstraction to standardize the lazy loading of inference data. This replaces the previous implementation and provides a cleaner, extensible design for data handling between the data generator, load generator, and API data layers.
- Number of users = Number of groups * Number of prompts per group
Summary
Implementation
User Session Management: A new `LocalUserSession` class manages the context of a conversation. It ensures that each request includes the context from previous turns, and only one request per user session is processed at a time, which maintains the conversational state.
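A minimal sketch of the idea (only `LocalUserSession`, `user_session_id`, and `context` come from the PR; the method names and locking scheme are assumptions):

```python
import asyncio


class LocalUserSession:
    """Sketch: holds the conversation context for one simulated user."""

    def __init__(self, user_session_id: str, context: str = ""):
        self.user_session_id = user_session_id
        self.context = context  # conversation accumulated so far
        # Lock so only one request per session is in flight at a time.
        self._lock = asyncio.Lock()

    async def next_prompt(self, prompt: str) -> str:
        """Claim the session and build the full prompt (context + new turn)."""
        await self._lock.acquire()
        return f"{self.context}\n{prompt}" if self.context else prompt

    def complete_turn(self, prompt: str, response: str) -> None:
        """On success, fold the turn into the context and release the session."""
        self.context = f"{self.context}\n{prompt}\n{response}".strip()
        self._lock.release()

    def fail_turn(self) -> None:
        """On failure, release without growing the context, so the next round
        continues from the last successful one."""
        self._lock.release()
```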
`shared_prefix_datagen` Enhancement: The `shared_prefix_datagen` can now group prompts into user sessions. When the new `group_as_user_session` flag is enabled in the configuration, each group of prompts simulates a multi-turn conversation (to support a variety of QPS values, the prompts in each group are treated as an infinite loop). The shared prefix acts as the initial system prompt, and subsequent prompts in the group are treated as conversational turns. In mp mode (multiple workers), each worker holds an ISOLATED set of user sessions to avoid the communication overhead of syncing conversations; a rough sketch follows below.
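Reusing the `LocalUserSession` sketch above, the grouping could look roughly like this (the `groups` structure and helper names are assumptions, not the PR's actual code):

```python
from itertools import cycle


def build_user_sessions(groups):
    """Sketch: one session per (group, prompt) pair, so every session in a
    group starts from the same shared prefix."""
    sessions = []
    for group_id, (shared_prefix_text, prompts) in enumerate(groups):
        for i, _ in enumerate(prompts):
            sessions.append(
                LocalUserSession(
                    user_session_id=f"user_session_{group_id}_{i}",
                    context=shared_prefix_text,  # prefix acts as system prompt
                )
            )
    return sessions


def group_turns(prompts):
    """Prompts in a group repeat as an infinite loop, so a session can keep
    producing turns at any QPS."""
    return cycle(prompts)
```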
API and Configuration Changes:
- The `to_payload` method in the API data classes is now asynchronous to support the new user session logic.
- A `process_failure` method has been added to the base API data class to gracefully handle request failures and ensure the user session context is managed correctly; it also provides `InferenceInfo` when a request fails.
- A `group_as_user_session` boolean flag is added to the `shared_prefix` data configuration to enable this new functionality. A sketch of how these pieces fit together is below.
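A hedged sketch of the async payload path; the real `CompletionAPIData` base class and `InferenceInfo` type in the PR may differ, so both are simplified or stubbed here:

```python
from dataclasses import dataclass


@dataclass
class InferenceInfo:  # stub for illustration; the PR defines the real type
    error: str | None = None


class UserSessionCompletionAPIData:  # extends CompletionAPIData in the PR
    def __init__(self, session: LocalUserSession, prompt: str):
        self.session = session
        self.prompt = prompt

    async def to_payload(self) -> dict:
        # Async so building the payload can wait until the previous turn of
        # this session has finished.
        full_prompt = await self.session.next_prompt(self.prompt)
        return {"prompt": full_prompt}

    def process_failure(self, error: Exception) -> InferenceInfo:
        # Release the session without growing its context and report the
        # failure to the caller.
        self.session.fail_turn()
        return InferenceInfo(error=str(error))
```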
Open Questions
Context Length Management: The context for a user session grows with each turn. With a high QPS, the combined prompt (session context + current prompt) could exceed the model's maximum sequence length. Should we implement a truncation strategy? If so, should we crop the oldest or the newest parts of the context?
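For the oldest-first option, a character-level sketch (a real version would count tokens against the model's context length or a `max_context_length` config field, as suggested above; `max_context_chars` is a made-up knob):

```python
def truncate_context(context: str, prompt: str, max_context_chars: int) -> str:
    """Crop the oldest part of the context so context + prompt fits the budget."""
    budget = max_context_chars - len(prompt)
    if budget <= 0:
        return ""  # the prompt alone fills the window; drop all history
    return context[-budget:]
```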
Error Handling Strategy: In the current implementation, if a round in a conversation fails, it is ignored, and the next round proceeds using the context from the last successful round. Is this the desired behavior? What alternative error handling strategies should be considered for failed rounds in a sequential conversation?
Global Conversation State: The current implementation stores conversation state and context locally within each worker. This is efficient but means that a user session is tied to a specific worker. Should we consider a global conversation state management system that would allow any worker to handle any turn for a given user session? This would increase communication overhead and development efforts but provide more flexibility.