Dria Benchmark System

A comprehensive benchmarking platform for comparing LLM (Large Language Model) performance across different GPU configurations and inference engines. Currently supports RunPod, Lightning AI, and Scaleway as cloud infrastructure providers.

🎯 Project Aim

We aim to let users compare the performance of different LLM models across various GPU configurations and inference engines. The system currently supports:

  • Cloud Providers: RunPod, Lightning AI, Scaleway
  • Inference Engines: Ollama, SGLang, vLLM, LMStudio, MLX-lm
  • Deployment: The same LLM models are run with different inference engines on the same GPU for a fair performance comparison

🌐 View Benchmark Results

📊 Visit dria.co/inference-benchmark to view and compare benchmark results in real-time!

The web platform provides:

  • Interactive benchmark comparisons
  • Real-time GPU pricing
  • Performance analytics and insights
  • Community discussions and comments
  • AI-powered recommendations

📊 Benchmarking Methodology

Overview

After setting up a server on the chosen cloud provider, we collect comprehensive data, including:

  • Server setup time
  • LLM upload time
  • Model loading time
  • Benchmark execution time
  • Performance metrics

We use an extended version of GuideLLM (customized for different inference engines) to create comprehensive benchmarks.
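
As a rough picture of what a single run record looks like once these timings and metrics are gathered, here is a minimal sketch; the field names mirror the lists above but are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """Illustrative shape of a single run record; field names are hypothetical."""
    cloud_provider: str               # e.g. "runpod"
    inference_engine: str             # e.g. "ollama", "vllm", "sglang"
    gpu_id: str                       # e.g. "NVIDIA H200"
    llm_id: str                       # e.g. "qwen2:7b" or "Qwen/Qwen2-7B"
    server_setup_time_s: float        # time to provision the server
    llm_upload_time_s: float          # time to upload / pull the model
    model_loading_time_s: float       # time for the engine to load the model
    benchmark_execution_time_s: float # time spent running the benchmark itself
```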

Benchmark Types

1. Concurrent Benchmark

  • Purpose: Tests fixed concurrency levels
  • Range: rates 1 to 9 (concurrent request streams)
  • Description: Runs a fixed number of request streams in parallel, as sketched below
  • Usage: --rate must be set to the desired concurrency level (number of streams)
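
To make the concurrent mode concrete, here is a minimal sketch of the idea: N independent streams, each sending requests back-to-back for the duration of the test. This is not the project's actual harness (which extends GuideLLM); the endpoint, payload, and httpx client are assumptions.

```python
import asyncio
import time

import httpx  # assumed HTTP client; any async-capable client would do

BASE_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical OpenAI-compatible endpoint
PAYLOAD = {"model": "qwen2:7b", "messages": [{"role": "user", "content": "Hello"}]}


async def stream_worker(client: httpx.AsyncClient, deadline: float, results: list) -> None:
    # One "stream": send requests back-to-back until the benchmark duration elapses.
    while time.monotonic() < deadline:
        start = time.monotonic()
        resp = await client.post(BASE_URL, json=PAYLOAD, timeout=60.0)
        results.append({"ok": resp.status_code == 200, "latency": time.monotonic() - start})


async def concurrent_benchmark(rate: int, duration_s: float = 30.0) -> list:
    # `rate` fixed parallel streams, mirroring the --rate concurrency level.
    results: list = []
    deadline = time.monotonic() + duration_s
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(stream_worker(client, deadline, results) for _ in range(rate)))
    return results


# Example: asyncio.run(concurrent_benchmark(rate=3))
```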

2. Throughput Benchmark

  • Purpose: Measures the maximum processing capacity of an LLM inference engine
  • Metrics (derived as sketched below):
    • Requests per second (RPS)
    • Maximum token generation speed (tokens per second, TPS)
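
Given per-request records such as those produced by the sketch above, the throughput metrics can be derived roughly as follows. This is an illustration only, not GuideLLM's exact computation, and the field names are assumed.

```python
def throughput_metrics(results: list[dict], duration_s: float) -> dict:
    """Rough RPS/TPS derivation from per-request records (illustrative, field names assumed)."""
    successful = [r for r in results if r.get("ok")]
    output_tokens = sum(r.get("output_tokens", 0) for r in successful)
    total_tokens = output_tokens + sum(r.get("prompt_tokens", 0) for r in successful)
    return {
        "requests_per_second": len(successful) / duration_s,
        "output_tokens_per_second": output_tokens / duration_s,
        "tokens_per_second": total_tokens / duration_s,  # prompt + output tokens
    }
```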

Benchmark Process

For each model, GPU, and inference engine combination, we run a suite of benchmarks:

  1. 1 throughput benchmark (maximum capacity test)
  2. 9 concurrent benchmarks (rates 1-9)

All of these benchmarks are recorded and displayed to users for comprehensive performance analysis.
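
Expressed as data, the suite for one configuration could look like the following sketch (the structure is illustrative, not the project's internals).

```python
# One throughput run plus one concurrent run per rate (1-9), matching the example table below.
benchmark_plan = [{"benchmark_type": "throughput", "rate": None}] + [
    {"benchmark_type": "concurrent", "rate": rate} for rate in range(1, 10)
]
```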

📈 Example Benchmark Data

Here's an example of the benchmark data structure and metrics collected:

| benchmark_type | rate | max_number | warmup_number | benchmark_duration | total_requests | successful_requests | requests_per_second | request_concurrency | request_latency | prompt_token_count | output_token_count | time_to_first_token_ms | time_per_output_token_ms | inter_token_latency_ms | output_tokens_per_second | tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| concurrent | 1 | - | - | 33.31 | 13 | 12 | 0.51 | 0.99 | 1.97 | 94.69 | 183.46 | 574.77 | 7.49 | 7.53 | 92.92 | 140.39 |
| concurrent | 2 | - | - | 33.06 | 17 | 15 | 0.64 | 1.93 | 3.04 | 89.06 | 276.56 | 543.92 | 9.62 | 9.65 | 165.50 | 221.53 |
| concurrent | 3 | - | - | 32.90 | 16 | 13 | 0.62 | 2.84 | 4.57 | 82.38 | 298.53 | 1630.01 | 10.10 | 10.14 | 174.02 | 224.67 |
| concurrent | 4 | - | - | 32.67 | 20 | 16 | 0.78 | 3.91 | 5.05 | 80.65 | 265.39 | 2829.71 | 10.10 | 10.14 | 185.14 | 246.97 |
| concurrent | 5 | - | - | 33.34 | 18 | 13 | 0.83 | 4.40 | 5.30 | 73.22 | 241.93 | 3018.37 | 9.88 | 9.92 | 156.22 | 216.39 |
| concurrent | 6 | - | - | 33.01 | 18 | 12 | 0.73 | 5.12 | 6.97 | 68.39 | 259.14 | 4731.11 | 10.12 | 10.16 | 148.01 | 197.67 |
| concurrent | 7 | - | - | 33.26 | 21 | 14 | 0.92 | 5.97 | 6.52 | 68.10 | 206.00 | 4762.71 | 10.15 | 10.20 | 143.64 | 205.29 |
| concurrent | 8 | - | - | 32.87 | 21 | 13 | 0.89 | 6.73 | 7.55 | 62.76 | 219.60 | 6454.77 | 10.17 | 10.22 | 139.71 | 195.00 |
| concurrent | 9 | - | - | 32.88 | 21 | 12 | 0.83 | 7.13 | 8.54 | 58.62 | 206.57 | 8087.35 | 10.35 | 10.40 | 114.89 | 163.25 |
| throughput | - | - | - | 32.88 | 21 | 12 | 0.83 | 7.13 | 8.54 | 58.62 | 206.57 | 8087.35 | 10.35 | 10.40 | 114.89 | 163.25 |

Key Metrics Explained

  • requests_per_second: Number of requests processed per second
  • request_latency: Average response time in seconds
  • time_to_first_token_ms: Time to receive the first token (milliseconds)
  • output_tokens_per_second: Tokens generated per second
  • tokens_per_second: Total tokens (input + output) processed per second
  • request_concurrency: Average number of concurrent requests during the test
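
To analyze exported results programmatically, something like the following works, assuming the rows above are saved as a CSV file with the same column names (the file name and the pandas dependency are assumptions).

```python
import pandas as pd  # assumed analysis dependency, not required by the benchmark itself

# Hypothetical export of the rows shown above, using the same column names.
df = pd.read_csv("benchmark_results.csv")

# How latency and throughput evolve as concurrency grows.
concurrent = df[df["benchmark_type"] == "concurrent"].sort_values("rate")
print(concurrent[["rate", "request_latency", "requests_per_second",
                  "time_to_first_token_ms", "output_tokens_per_second"]])
```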

Installation

We use uv for dependency management. Sync your environment via:

uv sync

Usage

To run a benchmark, first set the required environment variables, then invoke a runner script as shown below.

Environment Variables

  • MONGODB_URL: "mongodb://<username>:<password>@<host>:<port>/<database>?options" — MongoDB connection string for the database used by the benchmark and API utilities.
  • HF_TOKEN: "123123" — Hugging Face access token, required for accessing private models.
  • RUNPOD_API_KEY: "sk_12321313" — RunPod API key for creating and managing pods.
  • LIGHTNING_USER_ID: "123123" — Lightning AI user ID.
  • LIGHTNING_API_KEY: "123123" — Lightning AI API key.
  • NGROK_AUTH_TOKEN: "123123" — ngrok authentication token for creating public tunnels.
  • SCW_ACCESS_KEY: "123123" — Scaleway access key.
  • SCW_SECRET_KEY: "123123" — Scaleway secret key.
  • SCW_DEFAULT_ORGANIZATION_ID: "123123" — Scaleway default organization ID.
  • SCW_DEFAULT_PROJECT_ID: "123123" — Scaleway default project ID.

After creating a .env file with these variables, run:

uv run --env-file=.env python <runner-file-path>.py --inference_engine ollama \
    <additional arguments accepted by the runner file>

Example:

uv run --env-file=.env python runpod_runner.py --inference_engine ollama \
  --gpu_id "NVIDIA H200" \
  --volume_in_gb 1000 \
  --container_disk_in_gb 500 \
  --llm_id "qwen2:7b" \
  --llm_parameter_size "7b" \
  --llm_common_name "Qwen2 7b" \
  --gpu_count 1

For SGLang and vLLM, pass the Hugging Face model path as llm_id:

uv run --env-file=.env python <runner-file-path>.py --inference_engine vllm \
  --gpu_id "NVIDIA H200" \
  --volume_in_gb 1000 \
  --container_disk_in_gb 500 \
  --llm_id "Qwen/Qwen2-7B" \
  --llm_parameter_size "7b" \
  --llm_common_name "Qwen2 7b" \
  --gpu_count 1

🔧 Supported Configurations

Will be added soon.

🔗 Related Links

  • Benchmark results: dria.co/inference-benchmark
