Trace load gen #198
base: main
Conversation
…f into trace-load-gen
- add trace readers for AzurePublicDataset format and JSON lines
- add TraceReplayLoadTimer
- add support for generating requests from trace file for Random DataGen
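For reference, a minimal sketch of what an AzurePublicDataset-format reader could look like, assuming the CSV header TIMESTAMP,ContextTokens,GeneratedTokens that appears later in this PR; the name read_azure_trace is illustrative, not the PR's actual TraceStreamReader API:

```python
import csv
from pathlib import Path
from typing import Iterator, Tuple


def read_azure_trace(trace_file: Path) -> Iterator[Tuple[str, int, int]]:
    """Yield (timestamp, input_tokens, output_tokens) for each trace row."""
    with trace_file.open(newline="") as f:
        for row in csv.DictReader(f):
            yield row["TIMESTAMP"], int(row["ContextTokens"]), int(row["GeneratedTokens"])
```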
)
else:
    # let's read the trace file and get the input and output lengths
    if self.trace.format == TraceFormat.AZURE_PUBLIC_DATASET:
To support get_request with random datagen, we can preprocess the input/output lengths.
Can we test what the preprocessing time is for the implemented datasets? The current implementation is incompatible with multiprocess loadgen.
An alternative to preprocessing is having each worker read the trace up to request N via get_request.
The current implementation is compatible with multiprocess loadgen, because:
- request_time for every request is relative to the start_time. The request_time for the first trace entry will be equal to the start_time; for the second entry it will be start_time + (t2 - t1), with t_i being the timestamp of the i-th entry.
- Individual workers can pick requests from the queue and sleep until the request_time, if required. This guarantees all requests are correctly timed even if executed by multiple workers (see the sketch below).
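A minimal sketch of the scheme described above; the helper names are illustrative, not the PR's actual loadgen code:

```python
import time
from typing import Iterator, Sequence


def request_times(start_time: float, trace_timestamps: Sequence[float]) -> Iterator[float]:
    """Schedule each entry at start_time + (t_i - t_1), as described above."""
    t1 = trace_timestamps[0]
    for ti in trace_timestamps:
        yield start_time + (ti - t1)


def wait_until(request_time: float) -> None:
    """A worker sleeps until the scheduled request_time before sending."""
    delay = request_time - time.monotonic()
    if delay > 0:
        time.sleep(delay)
```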
The current implementation will run into an issue with get_request being defined but raising ValueError.
Added support for get_request now. Please check.
Can we test what the preprocessing time is for the implemented datasets? The current implementation is incompatible with multiprocess loadgen.
The Azure Public Dataset is only 0.7 MB; it took ~1 second to load the dataset.
I am, however, seeing more inference request timeouts when I use get_request as compared to get_data.
Do you have details on what timeouts you are seeing?
Documentation says the default timeout for aiohttp is 5 minutes.
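For context, a hedged example of raising that limit with standard aiohttp; this is generic aiohttp usage and not necessarily how the openai_client in this repo configures its session:

```python
import aiohttp


async def make_session() -> aiohttp.ClientSession:
    # aiohttp's default is ClientTimeout(total=300), i.e. 5 minutes.
    timeout = aiohttp.ClientTimeout(total=600)  # allow up to 10 minutes per request
    return aiohttp.ClientSession(timeout=timeout)
```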
The following is the stack trace for the timeout error:
2025-08-27 18:49:07,567 - inference_perf.client.modelserver.openai_client - ERROR - error occured during request processing:
Traceback (most recent call last):
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 347, in _wait
await waiter
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/inference_perf/client/modelserver/openai_client.py", line 94, in process_request
response_info = await data.process_response(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/inference_perf/apis/completion.py", line 51, in process_response
async for chunk_bytes in response.content:
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 52, in __anext__
rv = await self.read_func()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 352, in readline
return await self.readuntil()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 386, in readuntil
await self._wait("readuntil")
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 346, in _wait
with self._timer:
^^^^^^^^^^^
File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/helpers.py", line 685, in __exit__
raise asyncio.TimeoutError from exc_val
TimeoutError
This is happening because of a context length violation. Some trace records exceed the model's context length, causing unexpected behaviour.
inference_perf/loadgen/load_timer.py
Outdated
    yield next_time


class TraceReplayLoadTimer(LoadTimer):
    def __init__(self, trace_reader: TraceStreamReader, trace_file: Path, has_header: bool = False) -> None:
How is has_header intended to be processed/identified? Is this something that the load timer / trace_reader can handle automatically?
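One way this could be detected automatically is with the standard library's csv.Sniffer; this is a sketch of a possible approach, not what the PR implements:

```python
import csv
from pathlib import Path


def detect_csv_header(trace_file: Path, sample_bytes: int = 4096) -> bool:
    """Heuristically guess whether the first row of a CSV trace is a header."""
    with trace_file.open(newline="") as f:
        sample = f.read(sample_bytes)
    return csv.Sniffer().has_header(sample)
```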
…f into trace-load-gen
…f into trace-load-gen
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: aish1331. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
SachinVarghese
left a comment
Changes look good. Will wait for the other comments to resolve. (cc @jjk-g)
    i += 1
else:
    for _, input_tokens, output_tokens in self.trace_reader.load_traces(Path(self.trace.file)):
        yield self.get_data_with_token_count(input_tokens, output_tokens)
Is it possible to add some test cases for this?
@@ -0,0 +1,21 @@
TIMESTAMP,ContextTokens,GeneratedTokens
Can we move this file to examples?
@aish1331 can you please rebase this PR?
SachinVarghese
left a comment
Changes look good to me. @jjk-g can you also review?
jjk-g
left a comment
Two comments that I think can be follow-ups to this. Thanks!
num_requests = int(rate * duration)

if self.datagen.trace is not None:
    num_requests = self.datagen.get_request_count()
We do now support timing out a stage; maybe it makes sense for the trace loadgen config to support a duration as well.
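A rough sketch of what a duration cap on trace replay could look like; the names and types here are assumptions, not the PR's config API:

```python
from typing import Iterable, Iterator, Tuple

TraceEntry = Tuple[float, int, int]  # (timestamp_s, input_tokens, output_tokens)


def limit_trace_to_duration(entries: Iterable[TraceEntry], duration: float) -> Iterator[TraceEntry]:
    """Stop replaying once a trace entry falls outside the configured duration."""
    first_ts = None
    for ts, input_tokens, output_tokens in entries:
        if first_ts is None:
            first_ts = ts
        if ts - first_ts > duration:
            break
        yield ts, input_tokens, output_tokens
```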
request_queue.put((stage_id, next(data_generator), next(time_generator)))
if self.datagen.trace is None:
    for _ in range(num_requests):
        request_queue.put((stage_id, next(data_generator), next(time_generator)))
This may be fine for now, but we should consider sharding the datagen to the worker procs instead of putting the request data on the mp.Queue.
Not blocking this PR, just noting for posterity.
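As a sketch of that sharding idea, each worker process could read the trace itself and keep only the entries assigned to its index, instead of receiving every request over the mp.Queue; names here are illustrative, not the PR's implementation:

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_worker(entries: Iterable[T], worker_id: int, num_workers: int) -> Iterator[T]:
    """Keep only every num_workers-th entry, starting at this worker's index."""
    for i, entry in enumerate(entries):
        if i % num_workers == worker_id:
            yield entry
```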
/lgtm
fixes #151
Azure Public Dataset Trace File: https://github.com/Azure/AzurePublicDataset/blob/master/data/AzureLLMInferenceTrace_conv.csv
Test Report
The trace file has requests longer than the model's context length. I removed the trace records that expected more than 3000 tokens; all of the remaining 17518 requests succeeded.
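A minimal sketch of that filtering step, assuming the 3000-token cut applies to the sum of ContextTokens and GeneratedTokens (the exact criterion used is not shown in the PR):

```python
import csv
from pathlib import Path


def filter_trace(src: Path, dst: Path, max_tokens: int = 3000) -> None:
    """Copy src to dst, dropping rows whose token budget exceeds max_tokens."""
    with src.open(newline="") as f_in, dst.open("w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        assert reader.fieldnames is not None
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if int(row["ContextTokens"]) + int(row["GeneratedTokens"]) <= max_tokens:
                writer.writerow(row)
```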