Conversation

Contributor

@aish1331 aish1331 commented Aug 22, 2025

fixes #151

Azure Public Dataset Trace File: https://github.com/Azure/AzurePublicDataset/blob/master/data/AzureLLMInferenceTrace_conv.csv

Test Report

The trace file contains requests that exceed the model's context length. I removed the trace records that expected more than 3000 tokens; all of the remaining 17,518 requests succeeded.

{
  "load_summary": {
    "count": 17518,
    "schedule_delay": {
      "mean": 0.0038941450141547224,
      "min": -0.000428968109190464,
      "p0.1": -0.0003514465345069766,
      "p1": -0.00022776252590119838,
      "p5": 5.2208686247468e-05,
      "p10": 0.0002528715878725052,
      "p25": 0.0007598574738949537,
      "median": 0.0016416306607425213,
      "p75": 0.0036420959513634443,
      "p90": 0.00773212863132358,
      "p95": 0.013648053491488099,
      "p99": 0.03760503900237381,
      "p99.9": 0.09964766396023333,
      "max": 0.10172302927821875
    }
  },
  "successes": {
    "count": 17518,
    "latency": {
      "request_latency": {
        "mean": 0.3852642482062197,
        "min": 0.02228191588073969,
        "p0.1": 0.025185687606222928,
        "p1": 0.041831506118178374,
        "p5": 0.09624171699397266,
        "p10": 0.11918671336025,
        "p25": 0.162477568956092,
        "median": 0.2608607765287161,
        "p75": 0.5738963254261762,
        "p90": 0.7831971692852678,
        "p95": 0.8988658351823686,
        "p99": 1.2026585213374292,
        "p99.9": 1.6638832915909645,
        "max": 2.2230533827096224
      },
      "normalized_time_per_output_token": {
        "mean": 0.0017131633373396113,
        "min": 0.00107030687490251,
        "p0.1": 0.0011094655735838432,
        "p1": 0.0011520489046723888,
        "p5": 0.001204314429686832,
        "p10": 0.0012406965250769495,
        "p25": 0.001343184999392376,
        "median": 0.0016254163478646435,
        "p75": 0.0019303479891673862,
        "p90": 0.002336927560319102,
        "p95": 0.0026510381400908943,
        "p99": 0.003188511360533828,
        "p99.9": 0.004289053759807231,
        "max": 0.006321610284170934
      },
      "time_per_output_token": {
        "mean": 0.0012508450036761499,
        "min": 3.1649445494016014e-05,
        "p0.1": 0.0005287176074010545,
        "p1": 0.0005745075594888447,
        "p5": 0.0006241381773050281,
        "p10": 0.0006723809513224494,
        "p25": 0.000810776702213215,
        "median": 0.0010303862940881882,
        "p75": 0.0014689079074219603,
        "p90": 0.0020096032967015826,
        "p95": 0.0026503655151319354,
        "p99": 0.003857872629252278,
        "p99.9": 0.0062211680792491385,
        "max": 0.00817854833184627
      },
      "time_to_first_token": {
        "mean": 0.02186970480206602,
        "min": 0.007391426712274551,
        "p0.1": 0.007969980274327099,
        "p1": 0.008895396078005434,
        "p5": 0.010010676877573132,
        "p10": 0.010918199457228184,
        "p25": 0.013869379879906774,
        "median": 0.018542722798883915,
        "p75": 0.02646389533765614,
        "p90": 0.03599134040996433,
        "p95": 0.04314675396308302,
        "p99": 0.062196204951032945,
        "p99.9": 0.12398426335956936,
        "max": 0.3043529372662306
      },
      "inter_token_latency": {
        "mean": 0.0010914778317252171,
        "min": 1.1203810572624207e-06,
        "p0.1": 1.5422701835632324e-06,
        "p1": 2.319924533367157e-06,
        "p5": 9.783543646335602e-06,
        "p10": 1.1195428669452667e-05,
        "p25": 1.4699995517730713e-05,
        "median": 2.903677523136139e-05,
        "p75": 0.001606527715921402,
        "p90": 0.0026655855588614942,
        "p95": 0.004029201343655586,
        "p99": 0.008012686055153609,
        "p99.9": 0.015245404774323285,
        "max": 0.09582039061933756
      }
    },
    "throughput": {
      "input_tokens_per_sec": 4427.192913056539,
      "output_tokens_per_sec": 1189.3258759420728,
      "total_tokens_per_sec": 5616.518788998612,
      "requests_per_sec": 5.002251041765731
    },
    "prompt_len": {
      "mean": 885.0401301518439,
      "min": 3.0,
      "p0.1": 9.0,
      "p1": 29.0,
      "p5": 190.0,
      "p10": 215.0,
      "p25": 411.0,
      "median": 1032.0,
      "p75": 1167.0,
      "p90": 1407.0,
      "p95": 2114.0,
      "p99": 2750.1499999999905,
      "p99.9": 3007.9660000000003,
      "max": 3068.0
    },
    "output_len": {
      "mean": 237.7581344902386,
      "min": 9.0,
      "p0.1": 14.0,
      "p1": 18.0,
      "p5": 54.0,
      "p10": 71.0,
      "p25": 97.0,
      "median": 154.0,
      "p75": 417.0,
      "p90": 448.0,
      "p95": 482.1499999999978,
      "p99": 635.0,
      "p99.9": 1002.9320000000007,
      "max": 1060.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

- add trace readers for AzurePublicDataset format and JSON lines
- add TraceReplayLoadTimer
- add support for generating requests from a trace file for the Random DataGen
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 22, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 22, 2025
            )
        else:
            # let's read the trace file and get the input and output lengths
            if self.trace.format == TraceFormat.AZURE_PUBLIC_DATASET:
Collaborator

To support get_request with the random datagen, we can preprocess the input/output lengths.

Can we test what the preprocessing time is for the implemented datasets? The current implementation is incompatible with the multiprocess loadgen.

An alternative to preprocessing is having each worker read the trace up to request N via get_request.
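For illustration, a minimal sketch of that alternative, assuming a load_traces generator that yields (timestamp, input_tokens, output_tokens) tuples (the name is a placeholder, not this PR's API):

from itertools import islice
from pathlib import Path

def read_trace_from_request_n(load_traces, trace_file: Path, n: int):
    # Skip ahead lazily to the n-th record (0-based) so each worker can
    # start at its own offset without preprocessing the whole trace.
    yield from islice(load_traces(trace_file), n, None)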

Contributor Author

The current implementation is compatible with multiprocess loadgen, because:

  1. request_time for every request is relative to the start_time. The request_time for the first trace entry will be equal to the start_time. For the second entry, it will be start_time + (t2 - t1), t_i being the timestamp for the i-th entry.
  2. Individual workers can pick requests from the queue and sleep until the request_time, if required. This guarantees all requests are correctly timed even if they are executed by multiple workers (see the sketch below).
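
For illustration, a minimal sketch of that timing scheme (load_traces and the tuple layout are placeholders, not this PR's actual API):

from pathlib import Path

def trace_replay_times(load_traces, trace_file: Path, start_time: float):
    # Schedule the first trace entry at start_time and every later entry at
    # start_time + (t_i - t_1), where t_i is the i-th trace timestamp.
    first_ts = None
    for ts, _input_tokens, _output_tokens in load_traces(trace_file):
        if first_ts is None:
            first_ts = ts
        yield start_time + (ts - first_ts)

# A worker that pops (data, request_time) off the queue can then sleep until
# request_time before sending, which keeps the timing correct across workers.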

Contributor Author
@aish1331 aish1331 Aug 27, 2025

Added support for get_request now. Please check.

> Can we test what the preprocessing time is for the implemented datasets? The current implementation is incompatible with the multiprocess loadgen.

The Azure Public Dataset trace is only about 0.7 MB; it took ~1 second to load the dataset.

However, I am seeing more inference request timeouts when I use get_request compared to get_data.

Collaborator

Do you have details on what timeouts you are seeing?

Contributor Author

The documentation says the default timeout for aiohttp is 5 minutes.

The following is the stack trace for the timeout error:

2025-08-27 18:49:07,567 - inference_perf.client.modelserver.openai_client - ERROR - error occured during request processing:
Traceback (most recent call last):
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 347, in _wait
    await waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/inference_perf/client/modelserver/openai_client.py", line 94, in process_request
    response_info = await data.process_response(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/inference_perf/apis/completion.py", line 51, in process_response
    async for chunk_bytes in response.content:
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 52, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 352, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 386, in readuntil
    await self._wait("readuntil")
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/streams.py", line 346, in _wait
    with self._timer:
         ^^^^^^^^^^^
  File "/home/ubuntu/forks/inference-perf/inference-perf/lib/python3.12/site-packages/aiohttp/helpers.py", line 685, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
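
For reference, the aiohttp timeout is configurable per session; a generic sketch (not this PR's client code) showing how the 5-minute default total timeout could be raised:

import aiohttp

# Raise the total timeout from the default 300 s (5 minutes) to 900 s.
TIMEOUT = aiohttp.ClientTimeout(total=900)

async def post_completion(url: str, payload: dict) -> str:
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        async with session.post(url, json=payload) as response:
            return await response.text()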

Contributor Author

This is happening because of a context length violation: some trace records exceed the model's context length, which causes unexpected behaviour.
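
For illustration, the kind of pre-filtering used for the test report above; whether the 3000-token cutoff applies to the prompt, the output, or their sum is an assumption here, and load_traces is a placeholder name:

from pathlib import Path

MAX_CONTEXT_LEN = 3000  # cutoff used for the test report above

def filter_trace(load_traces, trace_file: Path, max_len: int = MAX_CONTEXT_LEN):
    # Drop trace records whose prompt plus expected output would exceed the
    # model's context length, so no request can violate it at runtime.
    for ts, input_tokens, output_tokens in load_traces(trace_file):
        if input_tokens + output_tokens <= max_len:
            yield ts, input_tokens, output_tokens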

        yield next_time

class TraceReplayLoadTimer(LoadTimer):
    def __init__(self, trace_reader: TraceStreamReader, trace_file: Path, has_header: bool = False) -> None:
Collaborator

How is has_header intended to be processed / identified? Is this something that the load timer / trace_reader can handle automatically?
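
One option (a sketch using only the standard library, not this PR's implementation) is to sniff the header instead of taking a has_header flag:

import csv
from pathlib import Path

def has_csv_header(trace_file: Path, sample_size: int = 4096) -> bool:
    # Guess whether the first row of the trace CSV is a header row.
    with trace_file.open(newline="") as f:
        sample = f.read(sample_size)
    return csv.Sniffer().has_header(sample)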

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 26, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 11, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aish1331
Once this PR has been reviewed and has the lgtm label, please assign jjk-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 2, 2025
Contributor
@SachinVarghese SachinVarghese left a comment

Changes look good. Will wait for the other comments to resolve. (cc @jjk-g)

                i += 1
        else:
            for _, input_tokens, output_tokens in self.trace_reader.load_traces(Path(self.trace.file)):
                yield self.get_data_with_token_count(input_tokens, output_tokens)
Contributor

Is it possible to add some test cases for this?
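
For example, a small pytest sketch along these lines; the local load_traces below is only a stand-in for the reader added in this PR, whose real import path and API may differ:

import csv
from pathlib import Path

def load_traces(trace_file: Path):
    # Stand-in for the PR's trace reader; a real test would import and
    # exercise the TraceStreamReader from this PR instead.
    with trace_file.open(newline="") as f:
        for row in csv.DictReader(f):
            yield row["TIMESTAMP"], int(row["ContextTokens"]), int(row["GeneratedTokens"])

def test_load_traces_yields_token_counts(tmp_path: Path) -> None:
    trace_file = tmp_path / "trace.csv"
    trace_file.write_text(
        "TIMESTAMP,ContextTokens,GeneratedTokens\n"
        "2023-11-16 18:01:50.000000,100,20\n"
        "2023-11-16 18:01:51.500000,250,40\n"
    )
    records = list(load_traces(trace_file))
    assert records == [
        ("2023-11-16 18:01:50.000000", 100, 20),
        ("2023-11-16 18:01:51.500000", 250, 40),
    ]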

@@ -0,0 +1,21 @@
TIMESTAMP,ContextTokens,GeneratedTokens
Contributor

Can we move this file to examples?

@SachinVarghese SachinVarghese requested a review from jjk-g November 6, 2025 19:46
@SachinVarghese
Contributor

@aish1331 can you please rebase this PR?

Contributor
@SachinVarghese SachinVarghese left a comment

Changes look good to me. @jjk-g can you also review?

Collaborator
@jjk-g jjk-g left a comment

Two comments that I think can be follow-ups to this. Thanks!

        num_requests = int(rate * duration)

        if self.datagen.trace is not None:
            num_requests = self.datagen.get_request_count()
Collaborator

We do now support timing out a stage; maybe it makes sense for the trace loadgen config to support a duration as well.
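
For illustration, the trace time generator could stop once a configured duration of trace time has elapsed (names are placeholders, not this PR's config):

from pathlib import Path
from typing import Iterator, Optional

def trace_replay_times_capped(
    load_traces, trace_file: Path, start_time: float, duration: Optional[float] = None
) -> Iterator[float]:
    # Yield scheduled request times, stopping after `duration` seconds of
    # trace time relative to the first trace timestamp.
    first_ts = None
    for ts, _in_tokens, _out_tokens in load_traces(trace_file):
        if first_ts is None:
            first_ts = ts
        offset = ts - first_ts
        if duration is not None and offset > duration:
            break
        yield start_time + offset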

            request_queue.put((stage_id, next(data_generator), next(time_generator)))
        if self.datagen.trace is None:
            for _ in range(num_requests):
                request_queue.put((stage_id, next(data_generator), next(time_generator)))
Collaborator

This may be fine for now, but we should consider sharding the datagen to the worker procs instead of putting the request data on the mp.Queue.

Not blocking this PR, just noting for posterity.
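
A rough illustration of that sharding idea (placeholder names): each worker reads the trace itself and keeps only every num_workers-th record, so request data never has to go on the mp.Queue.

from pathlib import Path

def shard_trace(load_traces, trace_file: Path, worker_index: int, num_workers: int):
    # Each worker owns the records whose index is congruent to its own
    # worker_index modulo num_workers.
    for i, record in enumerate(load_traces(trace_file)):
        if i % num_workers == worker_index:
            yield record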

@jjk-g
Collaborator

jjk-g commented Nov 9, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2025
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Support open-sourced traces in data and load generators
