A comprehensive benchmarking platform for comparing LLM (Large Language Model) performance across different GPU configurations and inference engines. Currently supports RunPod, Lightning AI, and Scaleway as cloud infrastructure providers.
We aim to let users compare the performance of different LLM models across various GPU configurations and inference engines. The system currently supports:
- Cloud Providers: RunPod, Lightning AI, Scaleway
- Inference Engines: Ollama, SGLang, vLLM, LMStudio, MLX-lm
- Deployment: Same LLM models with different inference engines on the same GPU for fair performance comparison
📊 Visit dria.co/inference-benchmark to view and compare benchmark results in real-time!
The web platform provides:
- Interactive benchmark comparisons
- Real-time GPU pricing
- Performance analytics and insights
- Community discussions and comments
- AI-powered recommendations
After setting up a server on the selected cloud provider, we collect comprehensive data (an example record is sketched after this list), including:
- Server setup time
- LLM upload time
- Model loading time
- Benchmark execution time
- Performance metrics
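As a rough illustration, a single run could be represented by a record like the following. This is a minimal sketch: the field names and values are assumptions for illustration, not the platform's actual schema.

```python
# Hypothetical shape of one benchmark run record; field names and timing values
# are illustrative placeholders, not the platform's actual schema.
example_run = {
    "provider": "runpod",
    "gpu_id": "NVIDIA H200",
    "gpu_count": 1,
    "inference_engine": "ollama",
    "llm_id": "qwen2:7b",
    # Setup timings collected for the run (seconds, placeholder values)
    "server_setup_time_s": 180.0,
    "llm_upload_time_s": 90.0,
    "model_load_time_s": 40.0,
    "benchmark_execution_time_s": 330.0,
    # One entry per benchmark run (see the benchmark types and metrics below)
    "benchmarks": [
        {"benchmark_type": "concurrent", "rate": 1, "requests_per_second": 0.51},
        # ...
        {"benchmark_type": "throughput", "rate": None, "tokens_per_second": 163.25},
    ],
}
```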
We use an extended version of GuideLLM (customized for different inference engines) to create comprehensive benchmarks. Two benchmark types are run.

Concurrent benchmark:
- Purpose: Tests fixed concurrency levels
- Range: Rates 1 to 9 concurrent requests
- Description: Runs a fixed number of request streams in parallel (see the sketch below)
- Usage: `--rate` must be set to the desired concurrency level (number of streams)
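To make the concurrent mode concrete, here is a minimal sketch of the idea (not the extended GuideLLM implementation): `rate` independent streams each send requests back-to-back, so the number of in-flight requests stays roughly constant. The endpoint, model name, and payload are assumptions.

```python
# Minimal sketch of a fixed-concurrency ("concurrent") benchmark.
# Assumes an OpenAI-compatible server at localhost:8000; illustrative only.
import asyncio
import time

import httpx


async def stream_worker(client: httpx.AsyncClient, stop_at: float, results: list) -> None:
    # One stream: send requests back-to-back until the time budget runs out.
    while time.monotonic() < stop_at:
        start = time.monotonic()
        resp = await client.post(
            "/v1/completions",
            json={"model": "qwen2:7b", "prompt": "Hello", "max_tokens": 128},
        )
        results.append({"latency_s": time.monotonic() - start, "ok": resp.status_code == 200})


async def run_concurrent(rate: int, duration_s: float = 30.0) -> list[dict]:
    results: list[dict] = []
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=60.0) as client:
        stop_at = time.monotonic() + duration_s
        # `rate` parallel streams => a fixed concurrency level of `rate`.
        await asyncio.gather(*(stream_worker(client, stop_at, results) for _ in range(rate)))
    return results
```

For a rate-4 run, `asyncio.run(run_concurrent(rate=4))` would keep roughly four requests in flight for about 30 seconds.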
Throughput benchmark:
- Purpose: Measures maximum processing capacity
- Description: A special test type designed to measure the maximum processing capacity of an LLM inference engine (sketched below)
- Metrics:
  - Requests per second (RPS)
  - Maximum token generation speed (tokens per second, TPS)
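As a rough illustration (again, not the extended GuideLLM implementation), a throughput-style run saturates the server with a large batch of uncapped requests and reports aggregate rates. The endpoint, payload, and the simple rate formulas below are assumptions; the real tool may exclude warmup requests, so its reported numbers will differ.

```python
# Illustrative throughput-style run: fire a large batch of requests with no
# concurrency cap and measure aggregate rates over the whole batch.
import asyncio
import time

import httpx


async def run_throughput(total_requests: int = 200) -> dict:
    limits = httpx.Limits(max_connections=None)  # do not cap concurrency client-side
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=300.0, limits=limits) as client:
        start = time.monotonic()
        responses = await asyncio.gather(*(
            client.post(
                "/v1/completions",
                json={"model": "qwen2:7b", "prompt": "Hello", "max_tokens": 128},
            )
            for _ in range(total_requests)
        ))
        duration_s = time.monotonic() - start

    ok = [r for r in responses if r.status_code == 200]
    # Assumes the server returns OpenAI-style usage info with completion token counts.
    output_tokens = sum(r.json().get("usage", {}).get("completion_tokens", 0) for r in ok)
    return {
        "requests_per_second": len(ok) / duration_s,              # RPS over the batch
        "output_tokens_per_second": output_tokens / duration_s,   # generation TPS
    }


if __name__ == "__main__":
    print(asyncio.run(run_throughput()))
```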
For each configuration (model, GPU, and inference engine), we run 10 different benchmarks:
- 1 throughput benchmark (maximum capacity test)
- 9 concurrent benchmarks (rates 1-9)

All 10 benchmarks are recorded and displayed to users for comprehensive performance analysis.
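Put differently, every model/GPU/engine combination runs the same fixed matrix of benchmarks, sketched here with illustrative field names:

```python
# The per-configuration benchmark matrix: nine fixed-concurrency runs plus one
# throughput run. Field names are illustrative, not the tool's config format.
benchmark_matrix = [{"benchmark_type": "throughput", "rate": None}] + [
    {"benchmark_type": "concurrent", "rate": rate} for rate in range(1, 10)
]
assert len(benchmark_matrix) == 10
```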
Here's an example of the benchmark data structure and metrics collected:
benchmark_type | rate | max_number | warmup_number | benchmark_duration | total_requests | successful_requests | requests_per_second | request_concurrency | request_latency | prompt_token_count | output_token_count | time_to_first_token_ms | time_per_output_token_ms | inter_token_latency_ms | output_tokens_per_second | tokens_per_second |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
concurrent | 1 | - | - | 33.31 | 13 | 12 | 0.51 | 0.99 | 1.97 | 94.69 | 183.46 | 574.77 | 7.49 | 7.53 | 92.92 | 140.39 |
concurrent | 2 | - | - | 33.06 | 17 | 15 | 0.64 | 1.93 | 3.04 | 89.06 | 276.56 | 543.92 | 9.62 | 9.65 | 165.50 | 221.53 |
concurrent | 3 | - | - | 32.90 | 16 | 13 | 0.62 | 2.84 | 4.57 | 82.38 | 298.53 | 1630.01 | 10.10 | 10.14 | 174.02 | 224.67 |
concurrent | 4 | - | - | 32.67 | 20 | 16 | 0.78 | 3.91 | 5.05 | 80.65 | 265.39 | 2829.71 | 10.10 | 10.14 | 185.14 | 246.97 |
concurrent | 5 | - | - | 33.34 | 18 | 13 | 0.83 | 4.40 | 5.30 | 73.22 | 241.93 | 3018.37 | 9.88 | 9.92 | 156.22 | 216.39 |
concurrent | 6 | - | - | 33.01 | 18 | 12 | 0.73 | 5.12 | 6.97 | 68.39 | 259.14 | 4731.11 | 10.12 | 10.16 | 148.01 | 197.67 |
concurrent | 7 | - | - | 33.26 | 21 | 14 | 0.92 | 5.97 | 6.52 | 68.10 | 206.00 | 4762.71 | 10.15 | 10.20 | 143.64 | 205.29 |
concurrent | 8 | - | - | 32.87 | 21 | 13 | 0.89 | 6.73 | 7.55 | 62.76 | 219.60 | 6454.77 | 10.17 | 10.22 | 139.71 | 195.00 |
concurrent | 9 | - | - | 32.88 | 21 | 12 | 0.83 | 7.13 | 8.54 | 58.62 | 206.57 | 8087.35 | 10.35 | 10.40 | 114.89 | 163.25 |
throughput | - | - | - | 32.88 | 21 | 12 | 0.83 | 7.13 | 8.54 | 58.62 | 206.57 | 8087.35 | 10.35 | 10.40 | 114.89 | 163.25 |
- requests_per_second: Number of requests processed per second
- request_latency: Average response time in seconds
- time_to_first_token_ms: Time to receive the first token (milliseconds)
- output_tokens_per_second: Tokens generated per second
- tokens_per_second: Total tokens (input + output) processed per second
- request_concurrency: Average number of concurrent requests during the test
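For intuition, these aggregate metrics are roughly consistent with one another. The sketch below checks a few approximate relationships against the rate-1 row of the sample table; the exact formulas in the extended GuideLLM may differ (e.g. warmup handling, failed requests).

```python
# Approximate relationships between the reported metrics, illustrated with the
# rate-1 row of the sample table above. These are sanity checks, not the tool's
# exact formulas.
request_concurrency = 0.99
request_latency_s = 1.97
prompt_token_count = 94.69          # average prompt tokens per request
output_token_count = 183.46         # average output tokens per request
time_to_first_token_s = 0.57477
time_per_output_token_s = 0.00749
output_tokens_per_second = 92.92

# Little's law: throughput ~= concurrency / latency.
approx_rps = request_concurrency / request_latency_s                      # ~0.50 (table: 0.51)

# Total token throughput ~= output token throughput + prompt tokens ingested per second.
approx_tps = output_tokens_per_second + approx_rps * prompt_token_count   # ~140.5 (table: 140.39)

# End-to-end latency ~= time to first token + one token interval per output token.
approx_latency_s = time_to_first_token_s + time_per_output_token_s * output_token_count  # ~1.95 (table: 1.97)
```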
We use `uv` for the benchmark. Sync your environment via:

```bash
uv sync
```
To run the benchmark, first set the following environment variables:

- `MONGODB_URL` (e.g. `"mongodb://<username>:<password>@<host>:<port>/<database>?options"`): MongoDB connection URL used by the benchmark and API utilities.
- `HF_TOKEN` (e.g. `"123123"`): Hugging Face access token, required for accessing private models.
- `RUNPOD_API_KEY` (e.g. `"sk_12321313"`): RunPod API key for creating and managing pods.
- `LIGHTNING_USER_ID` (e.g. `"123123"`): Lightning AI user ID.
- `LIGHTNING_API_KEY` (e.g. `"123123"`): Lightning AI API key.
- `NGROK_AUTH_TOKEN` (e.g. `"123123"`): ngrok authentication token for creating public tunnels.
- `SCW_ACCESS_KEY` (e.g. `"123123"`): Scaleway access key.
- `SCW_SECRET_KEY` (e.g. `"123123"`): Scaleway secret key.
- `SCW_DEFAULT_ORGANIZATION_ID` (e.g. `"123123"`): Scaleway default organization ID.
- `SCW_DEFAULT_PROJECT_ID` (e.g. `"123123"`): Scaleway default project ID.
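A `.env` file would then look like the following (all values are placeholders; substitute your own credentials):

```
MONGODB_URL="mongodb://<username>:<password>@<host>:<port>/<database>?options"
HF_TOKEN="hf_xxxxxxxxxxxx"
RUNPOD_API_KEY="sk_xxxxxxxxxxxx"
LIGHTNING_USER_ID="xxxxxxxx"
LIGHTNING_API_KEY="xxxxxxxx"
NGROK_AUTH_TOKEN="xxxxxxxx"
SCW_ACCESS_KEY="xxxxxxxx"
SCW_SECRET_KEY="xxxxxxxx"
SCW_DEFAULT_ORGANIZATION_ID="xxxxxxxx"
SCW_DEFAULT_PROJECT_ID="xxxxxxxx"
```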
After creating a `.env` file with these variables, run:
```bash
uv run --env-file=.env python <runner-file-path>.py --inference_engine ollama \
  <arguments defined in the runner file>
```
Example:
```bash
uv run --env-file=.env python runpod_runner.py --inference_engine ollama \
  --gpu_id "NVIDIA H200" \
  --volume_in_gb 1000 \
  --container_disk_in_gb 500 \
  --llm_id "qwen2:7b" \
  --llm_parameter_size "7b" \
  --llm_common_name "Qwen2 7b" \
  --gpu_count 1
```
For SGLang and vLLM, pass the Hugging Face model path as `llm_id`:
```bash
uv run --env-file=.env python <runner-file-path>.py --inference_engine vllm \
  --gpu_id "NVIDIA H200" \
  --volume_in_gb 1000 \
  --container_disk_in_gb 500 \
  --llm_id "Qwen/Qwen2-7B" \
  --llm_parameter_size "7b" \
  --llm_common_name "Qwen2 7b" \
  --gpu_count 1
```
Will be added soon <>
- Web Platform: dria.co/inference-benchmark
- Dria Main Site: dria.co
- Documentation: Available at `/docs` when running the API server