A tool for evaluating the performance of LLM APIs.
To install from source:
git clone https://github.com/ray-project/llmperf.git
cd llmperf
python3 -m venv venv
source venv/bin/activate
pip install -e .
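After installing, a quick optional check that the editable install is importable. This is not part of the upstream instructions and assumes the package installs under the llmperf module name:

# Optional sanity check that the editable install worked. Assumes the package
# installs under the "llmperf" module name; adjust if your checkout differs.
import llmperf
print("llmperf imported from", llmperf.__file__)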
The load test spawns a number of concurrent requests to the LLM API and measures the inter-token latency and generation throughput per request and across concurrent requests. The prompt that is sent with each request is of the format:
Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...
Where the lines are randomly sampled from a collection of lines from Shakespeare sonnets. Tokens are counted using the LlamaTokenizer
regardless of which LLM API is being tested. This is to ensure that the prompts are consistent across different LLM APIs.
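For reference, here is a minimal sketch of building such a prompt and counting its tokens with the Hugging Face LlamaTokenizerFast. The hf-internal-testing/llama-tokenizer checkpoint and the sampling logic are illustrative assumptions; the script's own prompt construction may differ.

# Sketch: assemble a prompt from sampled sonnet lines and count its tokens.
# Assumes `pip install transformers sentencepiece` and network access to the
# hf-internal-testing/llama-tokenizer checkpoint; llmperf's internals may differ.
import random
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")

sonnet_lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]

prompt = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
prompt += "\n".join(random.choice(sonnet_lines) + "," for _ in range(5))

print(len(tokenizer.encode(prompt)), "prompt tokens")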
To run the most basic load test, you can run the token_benchmark_ray script.
A few caveats and disclaimers:
- The endpoint provider's backend may vary widely, so the results are not a reflection of how the software runs on any particular hardware.
- The results may vary with time of day.
- The results may vary with the load.
- The results may not correlate with users’ workloads.
export OPENAI_API_KEY=t
export OPENAI_API_BASE="http://<bp-endpoint>.nip.io/v1"
python token_benchmark_ray.py \
--model "openai/gpt-oss-120b" \
--mean-input-tokens 128 128 2048 2048 \
--stddev-input-tokens 10 \
--mean-output-tokens 128 2048 128 2048 \
--stddev-output-tokens 10 \
--timeout 3600 \
--num-concurrent-requests 1 5 10 25 50 100 \
--max-num-completed-requests 100 \
--num-warmup-requests 10 \
--results-dir "result_outputs" \
--llm-api openai \
--mlflow-uri https://mlflow.<bp-endpoint>.nip.io \
--tensor-parallel-size 8 \
--gpu-name b200
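Passing multiple values to --mean-input-tokens, --mean-output-tokens, and --num-concurrent-requests runs a test matrix. The sketch below shows how the example above could expand, assuming the input and output token lists are paired positionally and each pair is swept across every concurrency level, with completed requests scaling with concurrency as described in the help text below; the script's actual iteration order may differ.

# Sketch: enumerate the test matrix implied by the example flags above.
from itertools import product

input_means = [128, 128, 2048, 2048]
output_means = [128, 2048, 128, 2048]
concurrency_levels = [1, 5, 10, 25, 50, 100]
base_completed = 100  # --max-num-completed-requests

for (inp, out), conc in product(zip(input_means, output_means), concurrency_levels):
    print(f"~{inp} input / ~{out} output tokens, concurrency {conc}: "
          f"{base_completed * conc} requests to complete")

This yields 4 input/output combinations across 6 concurrency levels, i.e. 24 benchmark runs.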
usage: token_benchmark_ray.py [-h] --model MODEL [--mean-input-tokens MEAN_INPUT_TOKENS [MEAN_INPUT_TOKENS ...]]
[--stddev-input-tokens STDDEV_INPUT_TOKENS]
[--mean-output-tokens MEAN_OUTPUT_TOKENS [MEAN_OUTPUT_TOKENS ...]]
[--stddev-output-tokens STDDEV_OUTPUT_TOKENS]
[--num-concurrent-requests NUM_CONCURRENT_REQUESTS [NUM_CONCURRENT_REQUESTS ...]] [--timeout TIMEOUT]
[--max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS] [--num-warmup-requests NUM_WARMUP_REQUESTS]
[--additional-sampling-params ADDITIONAL_SAMPLING_PARAMS] [--results-dir RESULTS_DIR]
[--llm-api LLM_API] [--metadata METADATA] [--mlflow-uri MLFLOW_URI]
[--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--gpu-name GPU_NAME] [--model-header MODEL_HEADER]
[--api-key API_KEY]
Run a token throughput and latency benchmark.
optional arguments:
-h, --help show this help message and exit
--model MODEL The model to use for this load test.
--mean-input-tokens MEAN_INPUT_TOKENS [MEAN_INPUT_TOKENS ...]
The mean number of tokens to send in the prompt for the request. Can specify multiple values to run a test
matrix. (default: [550])
--stddev-input-tokens STDDEV_INPUT_TOKENS
The standard deviation of number of tokens to send in the prompt for the request. (default: 150)
--mean-output-tokens MEAN_OUTPUT_TOKENS [MEAN_OUTPUT_TOKENS ...]
The mean number of tokens to generate from each llm request. This is the max_tokens param for the
completions API. Note that this is not always the number of tokens returned. Can specify multiple values to
run a test matrix. (default: [150])
--stddev-output-tokens STDDEV_OUTPUT_TOKENS
The standard deviation of the number of tokens to generate per llm request. (default: 80)
--num-concurrent-requests NUM_CONCURRENT_REQUESTS [NUM_CONCURRENT_REQUESTS ...]
The number of concurrent requests to send. Can specify multiple values to run a test matrix. (default: [10])
--timeout TIMEOUT The amount of time to run the load test for. (default: 90)
--max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS
The base number of requests to complete before finishing the test. This will be multiplied by the
concurrency level for each test (e.g., base=100, concurrency=5 -> 500 total requests). Note that it's
possible for the test to time out first. (default: 10)
--num-warmup-requests NUM_WARMUP_REQUESTS
The base number of warmup requests to send before starting the benchmark. This will be multiplied by the
concurrency level for each test (e.g., base=10, concurrency=5 -> 50 warmup requests). (default: 0)
--additional-sampling-params ADDITIONAL_SAMPLING_PARAMS
Additional sampling params to send with each request to the LLM API. (default: {}) By default, no
additional sampling params are sent.
--results-dir RESULTS_DIR
The directory to save the results to. (default: ) If not specified, no results are saved.
--llm-api LLM_API The name of the llm api to use. Can select from ['openai', 'anthropic', 'litellm'] (default: openai)
--metadata METADATA A comma separated list of metadata to include in the results, e.g. name=foo,bar=1. These will be added to
the metadata field of the results.
--mlflow-uri MLFLOW_URI
MLflow tracking URI to log results to (e.g., http://localhost:5000). If not provided, results will not be
logged to MLflow.
--tensor-parallel-size TENSOR_PARALLEL_SIZE
The number of tensor parallel processes to use. (default: 0)
--gpu-name GPU_NAME The name of the GPU to use for this load test. (default: gpu)
--model-header MODEL_HEADER
The model header to use for this load test. (default: )
--api-key API_KEY The api key to use for this load test. (default: )
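When --results-dir is set, each run writes its results into that directory. The sketch below skims the JSON files a run leaves behind; the exact file names and fields are assumptions here, so inspect your own output rather than relying on the keys printed.

# Sketch: list the JSON result files written to --results-dir and show their
# top-level structure. File naming and schema are assumptions; check your output.
import json
from pathlib import Path

results_dir = Path("result_outputs")  # matches --results-dir in the example above

for path in sorted(results_dir.glob("*.json")):
    with path.open() as f:
        data = json.load(f)
    if isinstance(data, dict):
        print(path.name, "->", sorted(data.keys()))
    else:
        print(path.name, "->", f"list of {len(data)} records")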