A comprehensive benchmarking platform for comparing LLM (Large Language Model) performance across different GPU configurations and inference engines. Currently supports RunPod, Lightning AI, and Scaleway as cloud infrastructure providers.
We aim to let users compare the performance of different LLM models across various GPU configurations and inference engines. The system currently supports:
- Cloud Providers: RunPod, Lightning AI, Scaleway
- Inference Engines: Ollama, SGLang, vLLM, LMStudio, MLX-lm
- Deployment: Same LLM models with different inference engines on the same GPU for fair performance comparison
📊 Visit dria.co/inference-benchmark to view and compare benchmark results in real-time!
The web platform provides:
- Interactive benchmark comparisons
- Real-time GPU pricing
- Performance analytics and insights
- Community discussions and comments
- AI-powered recommendations
After setting up a server on the selected cloud provider, we collect comprehensive data (an example record is sketched after this list), including:
- Server setup time
- LLM upload time
- Model loading time
- Benchmark execution time
- Performance metrics
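As a rough illustration, a single run could be represented by a record like the following. This is a minimal sketch: the field names and values are assumptions for illustration, not the platform's actual schema.

```python
# Hypothetical shape of one benchmark run record; field names and timing values
# are illustrative placeholders, not the platform's actual schema.
example_run = {
    "provider": "runpod",
    "gpu_id": "NVIDIA H200",
    "gpu_count": 1,
    "inference_engine": "ollama",
    "llm_id": "qwen2:7b",
    # Setup timings collected for the run (seconds, placeholder values)
    "server_setup_time_s": 180.0,
    "llm_upload_time_s": 90.0,
    "model_load_time_s": 40.0,
    "benchmark_execution_time_s": 330.0,
    # One entry per benchmark run (see the benchmark types and metrics below)
    "benchmarks": [
        {"benchmark_type": "concurrent", "rate": 1, "requests_per_second": 0.51},
        # ...
        {"benchmark_type": "throughput", "rate": None, "tokens_per_second": 163.25},
    ],
}
```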
We use an extended version of GuideLLM (customized for different inference engines) to create comprehensive benchmarks. Two benchmark types are run.

Concurrent benchmark:
- Purpose: Tests fixed concurrency levels
- Range: Rates 1 to 9 concurrent requests
- Description: Runs a fixed number of request streams in parallel (see the sketch below)
- Usage: `--rate` must be set to the desired concurrency level (number of streams)
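To make the concurrent mode concrete, here is a minimal sketch of the idea (not the extended GuideLLM implementation): `rate` independent streams each send requests back-to-back, so the number of in-flight requests stays roughly constant. The endpoint, model name, and payload are assumptions.

```python
# Minimal sketch of a fixed-concurrency ("concurrent") benchmark.
# Assumes an OpenAI-compatible server at localhost:8000; illustrative only.
import asyncio
import time

import httpx


async def stream_worker(client: httpx.AsyncClient, stop_at: float, results: list) -> None:
    # One stream: send requests back-to-back until the time budget runs out.
    while time.monotonic() < stop_at:
        start = time.monotonic()
        resp = await client.post(
            "/v1/completions",
            json={"model": "qwen2:7b", "prompt": "Hello", "max_tokens": 128},
        )
        results.append({"latency_s": time.monotonic() - start, "ok": resp.status_code == 200})


async def run_concurrent(rate: int, duration_s: float = 30.0) -> list[dict]:
    results: list[dict] = []
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=60.0) as client:
        stop_at = time.monotonic() + duration_s
        # `rate` parallel streams => a fixed concurrency level of `rate`.
        await asyncio.gather(*(stream_worker(client, stop_at, results) for _ in range(rate)))
    return results
```

For a rate-4 run, `asyncio.run(run_concurrent(rate=4))` would keep roughly four requests in flight for about 30 seconds.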
Throughput benchmark:
- Purpose: Measures maximum processing capacity
- Description: A special test type designed to measure the maximum processing capacity of an LLM inference engine (sketched below)
- Metrics:
  - Requests per second (RPS)
  - Maximum token generation speed (tokens per second, TPS)
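As a rough illustration (again, not the extended GuideLLM implementation), a throughput-style run saturates the server with a large batch of uncapped requests and reports aggregate rates. The endpoint, payload, and the simple rate formulas below are assumptions; the real tool may exclude warmup requests, so its reported numbers will differ.

```python
# Illustrative throughput-style run: fire a large batch of requests with no
# concurrency cap and measure aggregate rates over the whole batch.
import asyncio
import time

import httpx


async def run_throughput(total_requests: int = 200) -> dict:
    limits = httpx.Limits(max_connections=None)  # do not cap concurrency client-side
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=300.0, limits=limits) as client:
        start = time.monotonic()
        responses = await asyncio.gather(*(
            client.post(
                "/v1/completions",
                json={"model": "qwen2:7b", "prompt": "Hello", "max_tokens": 128},
            )
            for _ in range(total_requests)
        ))
        duration_s = time.monotonic() - start

    ok = [r for r in responses if r.status_code == 200]
    # Assumes the server returns OpenAI-style usage info with completion token counts.
    output_tokens = sum(r.json().get("usage", {}).get("completion_tokens", 0) for r in ok)
    return {
        "requests_per_second": len(ok) / duration_s,              # RPS over the batch
        "output_tokens_per_second": output_tokens / duration_s,   # generation TPS
    }


if __name__ == "__main__":
    print(asyncio.run(run_throughput()))
```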
For each configuration (model, GPU, and inference engine), we run 10 different benchmarks:
- 1 throughput benchmark (maximum capacity test)
- 9 concurrent benchmarks (rates 1-9)

All 10 benchmarks are recorded and displayed to users for comprehensive performance analysis.
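Put differently, every model/GPU/engine combination runs the same fixed matrix of benchmarks, sketched here with illustrative field names:

```python
# The per-configuration benchmark matrix: nine fixed-concurrency runs plus one
# throughput run. Field names are illustrative, not the tool's config format.
benchmark_matrix = [{"benchmark_type": "throughput", "rate": None}] + [
    {"benchmark_type": "concurrent", "rate": rate} for rate in range(1, 10)
]
assert len(benchmark_matrix) == 10
```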
Here's an example of the benchmark data structure and metrics collected:
benchmark_type | rate | max_number | warmup_number | benchmark_duration | total_requests | successful_requests | requests_per_second | request_concurrency | request_latency | prompt_token_count | output_token_count | time_to_first_token_ms | time_per_output_token_ms | inter_token_latency_ms | output_tokens_per_second | tokens_per_second |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
concurrent | 1 | - | - | 33.31 | 13 | 12 | 0.51 | 0.99 | 1.97 | 94.69 | 183.46 | 574.77 | 7.49 | 7.53 | 92.92 | 140.39 |
concurrent | 2 | - | - | 33.06 | 17 | 15 | 0.64 | 1.93 | 3.04 | 89.06 | 276.56 | 543.92 | 9.62 | 9.65 | 165.50 | 221.53 |
concurrent | 3 | - | - | 32.90 | 16 | 13 | 0.62 | 2.84 | 4.57 | 82.38 | 298.53 | 1630.01 | 10.10 | 10.14 | 174.02 | 224.67 |
concurrent | 4 | - | - | 32.67 | 20 | 16 | 0.78 | 3.91 | 5.05 | 80.65 | 265.39 | 2829.71 | 10.10 | 10.14 | 185.14 | 246.97 |
concurrent | 5 | - | - | 33.34 | 18 | 13 | 0.83 | 4.40 | 5.30 | 73.22 | 241.93 | 3018.37 | 9.88 | 9.92 | 156.22 | 216.39 |
concurrent | 6 | - | - | 33.01 | 18 | 12 | 0.73 | 5.12 | 6.97 | 68.39 | 259.14 | 4731.11 | 10.12 | 10.16 | 148.01 | 197.67 |
concurrent | 7 | - | - | 33.26 | 21 | 14 | 0.92 | 5.97 | 6.52 | 68.10 | 206.00 | 4762.71 | 10.15 | 10.20 | 143.64 | 205.29 |
concurrent | 8 | - | - | 32.87 | 21 | 13 | 0.89 | 6.73 | 7.55 | 62.76 | 219.60 | 6454.77 | 10.17 | 10.22 | 139.71 | 195.00 |
concurrent | 9 | - | - | 32.88 | 21 | 12 | 0.83 | 7.13 | 8.54 | 58.62 | 206.57 | 8087.35 | 10.35 | 10.40 | 114.89 | 163.25 |
throughput | - | - | - | 32.88 | 21 | 12 | 0.83 | 7.13 | 8.54 | 58.62 | 206.57 | 8087.35 | 10.35 | 10.40 | 114.89 | 163.25 |
- requests_per_second: Number of requests processed per second
- request_latency: Average response time in seconds
- time_to_first_token_ms: Time to receive the first token (milliseconds)
- output_tokens_per_second: Tokens generated per second
- tokens_per_second: Total tokens (input + output) processed per second
- request_concurrency: Average number of concurrent requests during the test
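For intuition, these aggregate metrics are roughly consistent with one another. The sketch below checks a few approximate relationships against the rate-1 row of the sample table; the exact formulas in the extended GuideLLM may differ (e.g. warmup handling, failed requests).

```python
# Approximate relationships between the reported metrics, illustrated with the
# rate-1 row of the sample table above. These are sanity checks, not the tool's
# exact formulas.
request_concurrency = 0.99
request_latency_s = 1.97
prompt_token_count = 94.69          # average prompt tokens per request
output_token_count = 183.46         # average output tokens per request
time_to_first_token_s = 0.57477
time_per_output_token_s = 0.00749
output_tokens_per_second = 92.92

# Little's law: throughput ~= concurrency / latency.
approx_rps = request_concurrency / request_latency_s                      # ~0.50 (table: 0.51)

# Total token throughput ~= output token throughput + prompt tokens ingested per second.
approx_tps = output_tokens_per_second + approx_rps * prompt_token_count   # ~140.5 (table: 140.39)

# End-to-end latency ~= time to first token + one token interval per output token.
approx_latency_s = time_to_first_token_s + time_per_output_token_s * output_token_count  # ~1.95 (table: 1.97)
```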
We use `uv` for the benchmark. Sync your environment via:

```bash
uv sync
```
To run the benchmark, first set the following environment variables:

- `MONGODB_URL` (e.g. `"mongodb://<username>:<password>@<host>:<port>/<database>?options"`): MongoDB connection URL used by the benchmark and API utilities.
- `HF_TOKEN` (e.g. `"123123"`): Hugging Face access token, required for accessing private models.
- `RUNPOD_API_KEY` (e.g. `"sk_12321313"`): RunPod API key for creating and managing pods.
- `LIGHTNING_USER_ID` (e.g. `"123123"`): Lightning AI user ID.
- `LIGHTNING_API_KEY` (e.g. `"123123"`): Lightning AI API key.
- `NGROK_AUTH_TOKEN` (e.g. `"123123"`): ngrok authentication token for creating public tunnels.
- `SCW_ACCESS_KEY` (e.g. `"123123"`): Scaleway access key.
- `SCW_SECRET_KEY` (e.g. `"123123"`): Scaleway secret key.
- `SCW_DEFAULT_ORGANIZATION_ID` (e.g. `"123123"`): Scaleway default organization ID.
- `SCW_DEFAULT_PROJECT_ID` (e.g. `"123123"`): Scaleway default project ID.
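A `.env` file would then look like the following (all values are placeholders; substitute your own credentials):

```
MONGODB_URL="mongodb://<username>:<password>@<host>:<port>/<database>?options"
HF_TOKEN="hf_xxxxxxxxxxxx"
RUNPOD_API_KEY="sk_xxxxxxxxxxxx"
LIGHTNING_USER_ID="xxxxxxxx"
LIGHTNING_API_KEY="xxxxxxxx"
NGROK_AUTH_TOKEN="xxxxxxxx"
SCW_ACCESS_KEY="xxxxxxxx"
SCW_SECRET_KEY="xxxxxxxx"
SCW_DEFAULT_ORGANIZATION_ID="xxxxxxxx"
SCW_DEFAULT_PROJECT_ID="xxxxxxxx"
```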
After creating a `.env` file with these variables, run:
```bash
uv run --env-file=.env python <runner-file-path>.py --inference_engine ollama \
  <arguments defined in the runner file>
```
Example:
```bash
uv run --env-file=.env python runpod_runner.py --inference_engine ollama \
  --gpu_id "NVIDIA H200" \
  --volume_in_gb 1000 \
  --container_disk_in_gb 500 \
  --llm_id "qwen2:7b" \
  --llm_parameter_size "7b" \
  --llm_common_name "Qwen2 7b" \
  --gpu_count 1
```
For SGLang and vLLM, pass the Hugging Face model path as `llm_id`:
```bash
uv run --env-file=.env python <runner-file-path>.py --inference_engine vllm \
  --gpu_id "NVIDIA H200" \
  --volume_in_gb 1000 \
  --container_disk_in_gb 500 \
  --llm_id "Qwen/Qwen2-7B" \
  --llm_parameter_size "7b" \
  --llm_common_name "Qwen2 7b" \
  --gpu_count 1
```
Will be added soon <>
- Web Platform: dria.co/inference-benchmark
- Dria Main Site: dria.co
- Documentation: Available at `/docs` when running the API server