Conversation

Contributor

@aish1331 aish1331 commented Aug 22, 2025

fixes #151

Azure Public Dataset Trace File: https://github.com/Azure/AzurePublicDataset/blob/master/data/AzureLLMInferenceTrace_conv.csv

Test Report

The trace file contains requests that exceed the model's context length. I removed the trace records that expected more than 3000 tokens; all of the remaining 17,518 requests succeeded.

{
  "load_summary": {
    "count": 17518,
    "schedule_delay": {
      "mean": 0.0038941450141547224,
      "min": -0.000428968109190464,
      "p0.1": -0.0003514465345069766,
      "p1": -0.00022776252590119838,
      "p5": 5.2208686247468e-05,
      "p10": 0.0002528715878725052,
      "p25": 0.0007598574738949537,
      "median": 0.0016416306607425213,
      "p75": 0.0036420959513634443,
      "p90": 0.00773212863132358,
      "p95": 0.013648053491488099,
      "p99": 0.03760503900237381,
      "p99.9": 0.09964766396023333,
      "max": 0.10172302927821875
    }
  },
  "successes": {
    "count": 17518,
    "latency": {
      "request_latency": {
        "mean": 0.3852642482062197,
        "min": 0.02228191588073969,
        "p0.1": 0.025185687606222928,
        "p1": 0.041831506118178374,
        "p5": 0.09624171699397266,
        "p10": 0.11918671336025,
        "p25": 0.162477568956092,
        "median": 0.2608607765287161,
        "p75": 0.5738963254261762,
        "p90": 0.7831971692852678,
        "p95": 0.8988658351823686,
        "p99": 1.2026585213374292,
        "p99.9": 1.6638832915909645,
        "max": 2.2230533827096224
      },
      "normalized_time_per_output_token": {
        "mean": 0.0017131633373396113,
        "min": 0.00107030687490251,
        "p0.1": 0.0011094655735838432,
        "p1": 0.0011520489046723888,
        "p5": 0.001204314429686832,
        "p10": 0.0012406965250769495,
        "p25": 0.001343184999392376,
        "median": 0.0016254163478646435,
        "p75": 0.0019303479891673862,
        "p90": 0.002336927560319102,
        "p95": 0.0026510381400908943,
        "p99": 0.003188511360533828,
        "p99.9": 0.004289053759807231,
        "max": 0.006321610284170934
      },
      "time_per_output_token": {
        "mean": 0.0012508450036761499,
        "min": 3.1649445494016014e-05,
        "p0.1": 0.0005287176074010545,
        "p1": 0.0005745075594888447,
        "p5": 0.0006241381773050281,
        "p10": 0.0006723809513224494,
        "p25": 0.000810776702213215,
        "median": 0.0010303862940881882,
        "p75": 0.0014689079074219603,
        "p90": 0.0020096032967015826,
        "p95": 0.0026503655151319354,
        "p99": 0.003857872629252278,
        "p99.9": 0.0062211680792491385,
        "max": 0.00817854833184627
      },
      "time_to_first_token": {
        "mean": 0.02186970480206602,
        "min": 0.007391426712274551,
        "p0.1": 0.007969980274327099,
        "p1": 0.008895396078005434,
        "p5": 0.010010676877573132,
        "p10": 0.010918199457228184,
        "p25": 0.013869379879906774,
        "median": 0.018542722798883915,
        "p75": 0.02646389533765614,
        "p90": 0.03599134040996433,
        "p95": 0.04314675396308302,
        "p99": 0.062196204951032945,
        "p99.9": 0.12398426335956936,
        "max": 0.3043529372662306
      },
      "inter_token_latency": {
        "mean": 0.0010914778317252171,
        "min": 1.1203810572624207e-06,
        "p0.1": 1.5422701835632324e-06,
        "p1": 2.319924533367157e-06,
        "p5": 9.783543646335602e-06,
        "p10": 1.1195428669452667e-05,
        "p25": 1.4699995517730713e-05,
        "median": 2.903677523136139e-05,
        "p75": 0.001606527715921402,
        "p90": 0.0026655855588614942,
        "p95": 0.004029201343655586,
        "p99": 0.008012686055153609,
        "p99.9": 0.015245404774323285,
        "max": 0.09582039061933756
      }
    },
    "throughput": {
      "input_tokens_per_sec": 4427.192913056539,
      "output_tokens_per_sec": 1189.3258759420728,
      "total_tokens_per_sec": 5616.518788998612,
      "requests_per_sec": 5.002251041765731
    },
    "prompt_len": {
      "mean": 885.0401301518439,
      "min": 3.0,
      "p0.1": 9.0,
      "p1": 29.0,
      "p5": 190.0,
      "p10": 215.0,
      "p25": 411.0,
      "median": 1032.0,
      "p75": 1167.0,
      "p90": 1407.0,
      "p95": 2114.0,
      "p99": 2750.1499999999905,
      "p99.9": 3007.9660000000003,
      "max": 3068.0
    },
    "output_len": {
      "mean": 237.7581344902386,
      "min": 9.0,
      "p0.1": 14.0,
      "p1": 18.0,
      "p5": 54.0,
      "p10": 71.0,
      "p25": 97.0,
      "median": 154.0,
      "p75": 417.0,
      "p90": 448.0,
      "p95": 482.1499999999978,
      "p99": 635.0,
      "p99.9": 1002.9320000000007,
      "max": 1060.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

- add trace readers for AzurePublicDataset format and JSON lines
- add TraceReplayLoadTimer
- add support for generating requests from a trace file for the Random DataGen
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 22, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 22, 2025
            )
        else:
            # let's read the trace file and get the input and output lengths
            if self.trace.format == TraceFormat.AZURE_PUBLIC_DATASET:
Collaborator

To support get_request with the random datagen, we can preprocess the input/output lengths.

Can we test what the preprocessing time is for the implemented datasets? The current implementation is incompatible with the multiprocess loadgen.

An alternative to preprocessing is having each worker read the trace up to request N via get_request.
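For illustration, a minimal sketch of that alternative, assuming a load_traces generator that yields (timestamp, input_tokens, output_tokens) tuples (the name is a placeholder, not this PR's API):

from itertools import islice
from pathlib import Path

def read_trace_from_request_n(load_traces, trace_file: Path, n: int):
    # Skip ahead lazily to the n-th record (0-based) so each worker can
    # start at its own offset without preprocessing the whole trace.
    yield from islice(load_traces(trace_file), n, None)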

Contributor Author

The current implementation is compatible with multiprocess loadgen, because:

  1. request_time for every request is relative to the start_time. The request_time for the first trace entry will be equal to the start_time. For the second entry, it will be start_time + (t2 - t1), t_i being the timestamp for the i-th entry.
  2. Individual workers can pick requests from the queue and sleep until the request_time, if required. This guarantees all requests are correctly timed even if they are executed by multiple workers (see the sketch below).
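
For illustration, a minimal sketch of that timing scheme (load_traces and the tuple layout are placeholders, not this PR's actual API):

from pathlib import Path

def trace_replay_times(load_traces, trace_file: Path, start_time: float):
    # Schedule the first trace entry at start_time and every later entry at
    # start_time + (t_i - t_1), where t_i is the i-th trace timestamp.
    first_ts = None
    for ts, _input_tokens, _output_tokens in load_traces(trace_file):
        if first_ts is None:
            first_ts = ts
        yield start_time + (ts - first_ts)

# A worker that pops (data, request_time) off the queue can then sleep until
# request_time before sending, which keeps the timing correct across workers.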

Contributor Author
@aish1331 aish1331 Aug 27, 2025

Added support for get_request now. Please check.

> Can we test what the preprocessing time is for the implemented datasets? The current implementation is incompatible with the multiprocess loadgen.

The Azure Public Dataset trace is only about 0.7 MB; it took ~1 second to load the dataset.

However, I am seeing more inference request timeouts when I use get_request compared to get_data.

Collaborator

Do you have details on what timeouts you are seeing?

Contributor Author

The documentation says the default timeout for aiohttp is 5 minutes.

The following is the stack trace for the timeout error:

2025-08-27 18:49:07,567 - inference_perf.client.modelserver.openai_client - ERROR - error occured during request processing:
Traceback (most recent call last):
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 347, in _wait
    await waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/inference_perf/client/modelserver/openai_client.py", line 94, in process_request
    response_info = await data.process_response(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/inference_perf/apis/completion.py", line 51, in process_response
    async for chunk_bytes in response.content:
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 52, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 352, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 386, in readuntil
    await self._wait("readuntil")
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 346, in _wait
    with self._timer:
         ^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/helpers.py", line 685, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
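
For reference, the aiohttp timeout is configurable per session; a generic sketch (not this PR's client code) showing how the 5-minute default total timeout could be raised:

import aiohttp

# Raise the total timeout from the default 300 s (5 minutes) to 900 s.
TIMEOUT = aiohttp.ClientTimeout(total=900)

async def post_completion(url: str, payload: dict) -> str:
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        async with session.post(url, json=payload) as response:
            return await response.text()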

Contributor Author

This is happening because of a context length violation: some trace records exceed the model's context length, which causes unexpected behaviour.
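
For illustration, the kind of pre-filtering used for the test report above; whether the 3000-token cutoff applies to the prompt, the output, or their sum is an assumption here, and load_traces is a placeholder name:

from pathlib import Path

MAX_CONTEXT_LEN = 3000  # cutoff used for the test report above

def filter_trace(load_traces, trace_file: Path, max_len: int = MAX_CONTEXT_LEN):
    # Drop trace records whose prompt plus expected output would exceed the
    # model's context length, so no request can violate it at runtime.
    for ts, input_tokens, output_tokens in load_traces(trace_file):
        if input_tokens + output_tokens <= max_len:
            yield ts, input_tokens, output_tokens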

        yield next_time

class TraceReplayLoadTimer(LoadTimer):
    def __init__(self, trace_reader: TraceStreamReader, trace_file: Path, has_header: bool = False) -> None:
Collaborator

How is has_header intended to be processed / identified? Is this something that the load timer / trace_reader can handle automatically?
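
One option (a sketch using only the standard library, not this PR's implementation) is to sniff the header instead of taking a has_header flag:

import csv
from pathlib import Path

def has_csv_header(trace_file: Path, sample_size: int = 4096) -> bool:
    # Guess whether the first row of the trace CSV is a header row.
    with trace_file.open(newline="") as f:
        sample = f.read(sample_size)
    return csv.Sniffer().has_header(sample)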

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 26, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 11, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aish1331
Once this PR has been reviewed and has the lgtm label, please assign jjk-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 2, 2025
Contributor
@SachinVarghese SachinVarghese left a comment

Changes look good. Will wait for the other comments to resolve. (cc @jjk-g)

                i += 1
        else:
            for _, input_tokens, output_tokens in self.trace_reader.load_traces(Path(self.trace.file)):
                yield self.get_data_with_token_count(input_tokens, output_tokens)
Contributor

Is it possible to add some test cases for this?
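
For example, a small pytest sketch along these lines; the local load_traces below is only a stand-in for the reader added in this PR, whose real import path and API may differ:

import csv
from pathlib import Path

def load_traces(trace_file: Path):
    # Stand-in for the PR's trace reader; a real test would import and
    # exercise the TraceStreamReader from this PR instead.
    with trace_file.open(newline="") as f:
        for row in csv.DictReader(f):
            yield row["TIMESTAMP"], int(row["ContextTokens"]), int(row["GeneratedTokens"])

def test_load_traces_yields_token_counts(tmp_path: Path) -> None:
    trace_file = tmp_path / "trace.csv"
    trace_file.write_text(
        "TIMESTAMP,ContextTokens,GeneratedTokens\n"
        "2023-11-16 18:01:50.000000,100,20\n"
        "2023-11-16 18:01:51.500000,250,40\n"
    )
    records = list(load_traces(trace_file))
    assert records == [
        ("2023-11-16 18:01:50.000000", 100, 20),
        ("2023-11-16 18:01:51.500000", 250, 40),
    ]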

@@ -0,0 +1,21 @@
TIMESTAMP,ContextTokens,GeneratedTokens
Contributor

Can we move this file to examples?

@SachinVarghese SachinVarghese requested a review from jjk-g November 6, 2025 19:46
@SachinVarghese
Contributor

@aish1331 can you please rebase this PR?

Contributor
@SachinVarghese SachinVarghese left a comment

Changes look good to me. @jjk-g can you also review?

Collaborator
@jjk-g jjk-g left a comment

Two comments that I think can be follow-ups to this. Thanks!

        num_requests = int(rate * duration)

        if self.datagen.trace is not None:
            num_requests = self.datagen.get_request_count()
Collaborator

We do now support timing out a stage; maybe it makes sense for the trace loadgen config to support a duration as well.
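
For illustration, the trace time generator could stop once a configured duration of trace time has elapsed (names are placeholders, not this PR's config):

from pathlib import Path
from typing import Iterator, Optional

def trace_replay_times_capped(
    load_traces, trace_file: Path, start_time: float, duration: Optional[float] = None
) -> Iterator[float]:
    # Yield scheduled request times, stopping after `duration` seconds of
    # trace time relative to the first trace timestamp.
    first_ts = None
    for ts, _in_tokens, _out_tokens in load_traces(trace_file):
        if first_ts is None:
            first_ts = ts
        offset = ts - first_ts
        if duration is not None and offset > duration:
            break
        yield start_time + offset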

            request_queue.put((stage_id, next(data_generator), next(time_generator)))
        if self.datagen.trace is None:
            for _ in range(num_requests):
                request_queue.put((stage_id, next(data_generator), next(time_generator)))
Collaborator

This may be fine for now, but we should consider sharding the datagen to the worker procs instead of putting the request data on the mp.Queue.

Not blocking this PR, just noting for posterity.
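
A rough illustration of that sharding idea (placeholder names): each worker reads the trace itself and keeps only every num_workers-th record, so request data never has to go on the mp.Queue.

from pathlib import Path

def shard_trace(load_traces, trace_file: Path, worker_index: int, num_workers: int):
    # Each worker owns the records whose index is congruent to its own
    # worker_index modulo num_workers.
    for i, record in enumerate(load_traces(trace_file)):
        if i % num_workers == worker_index:
            yield record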

@jjk-g
Collaborator

jjk-g commented Nov 9, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2025
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Support open-sourced traces in data and load generators
