Loadgen concurrent load type #263
base: main
Conversation
/assign @achandrasekar
jjk-g left a comment:
Thank you for adding this!
Latest test: validation test for loadgen config

Misconfigured YAML:

load:
  type: constant
  stages:
  - rate: 50.0
    duration: 1
    num_requests: 50
    concurrency_level: 6
  - rate: 25.0
    duration: 1
    num_requests: 25
    concurrency_level: 2
api:
  type: completion
  streaming: true
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false

python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-10-30 14:48:15,299 - inference_perf.config - INFO - Using configuration from: config.yml
Traceback (most recent call last):
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 332, in <module>
main_cli()
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 118, in main_cli
config = read_config(args.config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/config.py", line 298, in read_config
converted_stages.append(StandardLoadStage(**stage))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/pydantic/main.py", line 253, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for StandardLoadStage
num_requests
Input should be None [type=none_required, input_value=50, input_type=int]
For further information visit https://errors.pydantic.dev/2.11/v/none_required
concurrency_level
Input should be None [type=none_required, input_value=6, input_type=int]
For further information visit https://errors.pydantic.dev/2.11/v/none_required

Functional test (running inference):

stage_0_lifecycle_metrics.json
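For reference, the validation errors above say that num_requests and concurrency_level must be unset for the constant load type; they belong to the concurrent type, as the functional-test config further down also shows. A load section along these lines should pass validation (a minimal sketch inferred from the error messages and that config, not taken from the project's docs):

load:
  type: concurrent
  stages:
  - num_requests: 50
    concurrency_level: 6
  - num_requests: 25
    concurrency_level: 2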
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: changminbark

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-push: d807819 to 4240859
New changes are detected. LGTM label has been removed.
Just did a rebase and tested again:

Test: misconfigured YAML

load:
  type: constant
  stages:
  - num_requests: 50
    concurrency_level: 6
    rate: 50.0
    duration: 1
  - num_requests: 25
    concurrency_level: 2
    rate: 25.0
    duration: 1
api:
  type: completion
  streaming: true
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
(venv) chang-min@chang-min-GE66-Raider-10SF:~/Desktop/OpenSource/k8s/inference-perf$ python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-11-14 23:28:58,812 - inference_perf.config - INFO - Using configuration from: config.yml
Traceback (most recent call last):
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 331, in <module>
main_cli()
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 118, in main_cli
config = read_config(args.config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/config.py", line 310, in read_config
converted_stages.append(StandardLoadStage(**stage))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/pydantic/main.py", line 253, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for StandardLoadStage
Value error, num_requests should not be set for CONSTANT/POISSON load types [type=value_error, input_value={'num_requests': 50, 'con...e': 50.0, 'duration': 1}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.11/v/value_error

Functional test:

api:
  type: completion
  streaming: true
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
  trace: null
load:
  type: concurrent
  interval: 1.0
  stages:
  - num_requests: 50
    concurrency_level: 6
    rate: 50.0
    duration: 1
  - num_requests: 25
    concurrency_level: 2
    rate: 25.0
    duration: 1
  sweep: null
  num_workers: 16
  worker_max_concurrency: 0
  worker_max_tcp_connections: 2500
  trace: null
  circuit_breakers: []
  request_timeout: null
metrics:
  type: prometheus
  prometheus:
    scrape_interval: 15
    url: http://localhost:9090/
    filters: []
    google_managed: false
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251114-231857
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
  api_key: null
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
  trust_remote_code: null
  token: null
circuit_breakers: null
stage_0_lifecycle_metrics.json
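The improved error above ("num_requests should not be set for CONSTANT/POISSON load types") suggests the config model now cross-checks stage fields against the load type. A minimal sketch of how such a check can be expressed with a pydantic model_validator; the class and field names here are illustrative, not necessarily the PR's actual code:

from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, model_validator


class LoadType(str, Enum):
    CONSTANT = "constant"
    POISSON = "poisson"
    CONCURRENT = "concurrent"


class StageSketch(BaseModel):
    # Timed load types (constant/poisson) are driven by rate + duration.
    rate: Optional[float] = None
    duration: Optional[int] = None
    # Concurrent load is driven by a fixed number of in-flight requests.
    num_requests: Optional[int] = None
    concurrency_level: Optional[int] = None


class LoadSketch(BaseModel):
    type: LoadType
    stages: List[StageSketch]

    @model_validator(mode="after")
    def check_stage_fields(self) -> "LoadSketch":
        for stage in self.stages:
            if self.type in (LoadType.CONSTANT, LoadType.POISSON) and (
                stage.num_requests is not None or stage.concurrency_level is not None
            ):
                raise ValueError("num_requests should not be set for CONSTANT/POISSON load types")
            if self.type is LoadType.CONCURRENT and stage.concurrency_level is None:
                raise ValueError("concurrency_level is required for CONCURRENT load")
        return self

Raising ValueError inside a model_validator is what surfaces as the pydantic ValidationError shown in the traceback above.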
@jjk-g can you review the concurrency load gen pattern here when you get a chance?
Force-push: 78d2482 to 6ca2a9c
PR Template
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a way to produce a constant level of concurrency per stage, which is needed to understand how the system performs under sustained concurrent load. It works by capping the max concurrency of the workers in every stage so that the desired concurrency level is maintained.
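Conceptually, a per-stage concurrency cap behaves like the sketch below: a semaphore sized to the stage's concurrency_level keeps at most that many requests in flight until num_requests have been issued. This is a simplified illustration with hypothetical names, not the PR's actual worker code:

import asyncio
from typing import Awaitable, Callable


async def run_stage(
    send_request: Callable[[int], Awaitable[None]],
    num_requests: int,
    concurrency_level: int,
) -> None:
    # The semaphore caps the number of in-flight requests, so the stage
    # holds a roughly constant concurrency of `concurrency_level`.
    semaphore = asyncio.Semaphore(concurrency_level)

    async def bounded(i: int) -> None:
        async with semaphore:
            await send_request(i)

    # As each request finishes it releases a slot, letting the next one start.
    await asyncio.gather(*(bounded(i) for i in range(num_requests)))

With this pattern, throughput floats with server latency while concurrency stays fixed, the inverse of the rate-driven constant/poisson load types.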
Which issue(s) this PR fixes:
Fixes #252
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Testing
Testing was done using the config.yml file shown below, with the necessary services running (vLLM serving HuggingFaceTB/SmolLM2-135M-Instruct and a local Prometheus).
Functional test output:
config.yaml
stage_0_lifecycle_metrics.json
stage_1_lifecycle_metrics.json
summary_lifecycle_metrics.json
summary_prometheus_metrics.json