LLMPerf

A tool for evaluating the performance of LLM APIs.

Installation

git clone https://github.com/ray-project/llmperf.git
cd llmperf
python3 -m venv venv
source venv/bin/activate
pip install -e .

Basic Usage

Load test

The load test spawns a number of concurrent requests to the LLM API and measures the inter-token latency and generation throughput per request and across concurrent requests. The prompt that is sent with each request is of the format:

Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...

Where the lines are randomly sampled from a collection of lines from Shakespeare sonnets. Tokens are counted using the LlamaTokenizer regardless of which LLM API is being tested. This is to ensure that the prompts are consistent across different LLM APIs.
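
To make this concrete, below is a minimal sketch of how such a prompt could be assembled and measured. It is illustrative only: the tokenizer checkpoint, the sonnet.txt file name, and the build_prompt helper are assumptions rather than the repository's actual code.

import random
from transformers import LlamaTokenizerFast

# Assumption: a local text file with one sonnet line per line.
SONNET_FILE = "sonnet.txt"

# A Llama tokenizer is used for counting, regardless of which API is under test.
tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")

def build_prompt(target_input_tokens: int) -> str:
    """Append randomly sampled sonnet lines until the prompt reaches roughly the target token count."""
    with open(SONNET_FILE) as f:
        lines = [line.strip() for line in f if line.strip()]
    prompt = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
    while len(tokenizer.encode(prompt)) < target_input_tokens:
        prompt += random.choice(lines) + ",\n"
    return prompt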

To run the most basic load test, you can run the token_benchmark_ray script.
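
For example (the model name below is a placeholder; the openai backend also expects the environment variables shown in the OpenAI Compatible APIs section):

python token_benchmark_ray.py \
--model "openai/gpt-oss-120b" \
--llm-api openai \
--results-dir "result_outputs"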

Caveats and Disclaimers

  • The endpoint provider's backend may vary widely, so the results are not a reflection of how the software runs on any particular hardware.
  • The results may vary with time of day.
  • The results may vary with the load.
  • The results may not correlate with users’ workloads.

OpenAI Compatible APIs

export OPENAI_API_KEY=t
export OPENAI_API_BASE="http://<bp-endpoint>.nip.io/v1"

python token_benchmark_ray.py \
--model "openai/gpt-oss-120b" \
--mean-input-tokens 128 128 2048 2048 \
--stddev-input-tokens 10 \
--mean-output-tokens 128 2048 128 2048 \
--stddev-output-tokens 10 \
--timeout 3600 \
--num-concurrent-requests 1 5 10 25 50 100 \
--max-num-completed-requests 100 \
--num-warmup-requests 10 \
--results-dir "result_outputs" \
--llm-api openai \
--mlflow-uri https://mlflow.<bp-endpoint>.nip.io \
--tensor-parallel-size 8 \
--gpu-name b200
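
Passing multiple values for the token means and the concurrency levels produces a test matrix, and (per the parameter help below) the completed-request and warmup counts are bases that scale with each concurrency level. The sketch below shows roughly how the example command expands, assuming the input and output token means are paired positionally; the script's actual iteration order may differ.

# Rough sketch of the test matrix implied by the example command above.
token_pairs = list(zip([128, 128, 2048, 2048], [128, 2048, 128, 2048]))
concurrencies = [1, 5, 10, 25, 50, 100]
base_completed, base_warmup = 100, 10  # --max-num-completed-requests, --num-warmup-requests

configs = [(i, o, c) for (i, o) in token_pairs for c in concurrencies]
print(len(configs))  # 4 token pairs x 6 concurrency levels = 24 benchmark runs

for inp, out, conc in configs:
    # The help text states both counts are multiplied by the concurrency level.
    print(f"in={inp} out={out} conc={conc}: "
          f"{base_warmup * conc} warmup + {base_completed * conc} measured requests")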

Parameter help

usage: token_benchmark_ray.py [-h] --model MODEL [--mean-input-tokens MEAN_INPUT_TOKENS [MEAN_INPUT_TOKENS ...]]
                              [--stddev-input-tokens STDDEV_INPUT_TOKENS]
                              [--mean-output-tokens MEAN_OUTPUT_TOKENS [MEAN_OUTPUT_TOKENS ...]]
                              [--stddev-output-tokens STDDEV_OUTPUT_TOKENS]
                              [--num-concurrent-requests NUM_CONCURRENT_REQUESTS [NUM_CONCURRENT_REQUESTS ...]] [--timeout TIMEOUT]
                              [--max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS] [--num-warmup-requests NUM_WARMUP_REQUESTS]
                              [--additional-sampling-params ADDITIONAL_SAMPLING_PARAMS] [--results-dir RESULTS_DIR]
                              [--llm-api LLM_API] [--metadata METADATA] [--mlflow-uri MLFLOW_URI]
                              [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--gpu-name GPU_NAME] [--model-header MODEL_HEADER]
                              [--api-key API_KEY]

Run a token throughput and latency benchmark.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         The model to use for this load test.
  --mean-input-tokens MEAN_INPUT_TOKENS [MEAN_INPUT_TOKENS ...]
                        The mean number of tokens to send in the prompt for the request. Can specify multiple values to run a test
                        matrix. (default: [550])
  --stddev-input-tokens STDDEV_INPUT_TOKENS
                        The standard deviation of number of tokens to send in the prompt for the request. (default: 150)
  --mean-output-tokens MEAN_OUTPUT_TOKENS [MEAN_OUTPUT_TOKENS ...]
                        The mean number of tokens to generate from each llm request. This is the max_tokens param for the
                        completions API. Note that this is not always the number of tokens returned. Can specify multiple values to
                        run a test matrix. (default: [150])
  --stddev-output-tokens STDDEV_OUTPUT_TOKENS
                        The standard deviation of the number of tokens to generate per llm request. (default: 80)
  --num-concurrent-requests NUM_CONCURRENT_REQUESTS [NUM_CONCURRENT_REQUESTS ...]
                        The number of concurrent requests to send. Can specify multiple values to run a test matrix. (default: [10])
  --timeout TIMEOUT     The amount of time to run the load test for. (default: 90)
  --max-num-completed-requests MAX_NUM_COMPLETED_REQUESTS
                        The base number of requests to complete before finishing the test. This will be multiplied by the
                        concurrency level for each test (e.g., base=100, concurrency=5 -> 500 total requests). Note that it's
                        possible for the test to time out first. (default: 10)
  --num-warmup-requests NUM_WARMUP_REQUESTS
                        The base number of warmup requests to send before starting the benchmark. This will be multiplied by the
                        concurrency level for each test (e.g., base=10, concurrency=5 -> 50 warmup requests). (default: 0)
  --additional-sampling-params ADDITIONAL_SAMPLING_PARAMS
                        Additional sampling params to send with each request to the LLM API. (default: {}) No additional
                        sampling params are sent.
  --results-dir RESULTS_DIR
                        The directory to save the results to. (default: ) No results are saved.
  --llm-api LLM_API     The name of the llm api to use. Can select from ['openai', 'anthropic', 'litellm'] (default: openai)
  --metadata METADATA   A comma separated list of metadata to include in the results, e.g. name=foo,bar=1. These will be added to
                        the metadata field of the results.
  --mlflow-uri MLFLOW_URI
                        MLflow tracking URI to log results to (e.g., http://localhost:5000). If not provided, results will not be
                        logged to MLflow.
  --tensor-parallel-size TENSOR_PARALLEL_SIZE
                        The number of tensor parallel processes to use. (default: 0)
  --gpu-name GPU_NAME   The name of the GPU to use for this load test. (default: gpu)
  --model-header MODEL_HEADER
                        The model header to use for this load test. (default: )
  --api-key API_KEY     The api key to use for this load test. (default: )
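
For the free-form flags, --metadata takes a comma-separated key=value list as shown above, and --additional-sampling-params appears to accept a JSON object, given its {} default. An illustrative invocation with placeholder values:

python token_benchmark_ray.py \
--model "openai/gpt-oss-120b" \
--llm-api openai \
--additional-sampling-params '{"temperature": 0.0}' \
--metadata "run=nightly,gpu_count=8" \
--results-dir "result_outputs"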
