The LLM Benchmark Suite is a versatile application designed to test and compare the performance and functionality of Large Language Models (LLMs) using various test scenarios. The suite supports evaluating tool-calling capabilities, analyzing response times, and benchmarking multiple LLMs simultaneously.
- Compare multiple models side by side (e.g., Qwen, LLaMA, or other models that support tool calls).
- Supports customizable test suites with varied input sentences and tool expectations.
- Models are tested on their ability to call external tools such as:
  - `get_current_weather`: Fetches weather information.
  - `get_system_time`: Provides the current system time.
- Progress tracking for each model.
- Success/Failure rates and average response times displayed.
- Exportable logs showing detailed tool-call and response data.
- Custom test sentences with optional expected tool calls (see the illustrative example after this list).
- Adjustable no-tool probability for generating edge cases.
- Enable/Disable detailed logging for streamlined benchmarking.
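For illustration, a test case pairs an input sentence with the tool it is expected to trigger. The field names below mirror the benchmark payload shown later in this README (`sentence`, `expected_tool`); the suite's internal test-suite format may differ.

```python
# Illustrative only: field names mirror the benchmark API payload ("sentence", "expected_tool");
# the suite's internal representation of a test suite may differ.
example_test_cases = [
    {"sentence": "What's the weather like in Austin right now?", "expected_tool": "get_current_weather"},
    {"sentence": "What time is it on this machine?", "expected_tool": "get_system_time"},
    {"sentence": "Tell me a joke.", "expected_tool": "none"},  # no-tool edge case
]
```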
To run a benchmark:

- Configuration:
  - Input the base URL and model name for each LLM.
  - Optionally, configure a second model for comparison.
- Test Suite Setup:
  - Define test sentences.
  - Specify whether a tool call is expected (`none` or a specific tool).
- Start Benchmarking:
  - Run the test suite and monitor progress for each model in real time.
- Analyze Results:
  - View summary statistics (see the aggregation sketch after this list):
    - Total successes/failures.
    - Average response time.
  - Access detailed logs for each test case.
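The summary statistics are simple aggregates over the per-test results. A minimal sketch of the computation, assuming each result records a success flag and the processing time reported by the API, is shown below (illustrative, not the suite's actual code):

```python
# Illustrative aggregation of per-test results (not the suite's actual implementation).
results = [
    {"success": True,  "processing_time": 1.8},   # seconds
    {"success": True,  "processing_time": 2.1},
    {"success": False, "processing_time": 3.4},
]

successes = sum(r["success"] for r in results)
failures = len(results) - successes
avg_time = sum(r["processing_time"] for r in results) / len(results)

print(f"Successes: {successes}, Failures: {failures}, Avg response time: {avg_time:.2f}s")
```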
To run locally:

- Clone the repository: `git clone https://github.com/Teachings/llm_tools_benchmark.git`, then `cd llm_tools_benchmark`.
- Install dependencies: `pip install -r requirements.txt`
- Start the application: `uvicorn main:app --reload --port 8090`
- Access the UI at `http://localhost:8090`.
To run with Docker:

- Clone the repository: `git clone https://github.com/Teachings/llm_tools_benchmark.git`, then `cd llm_tools_benchmark`.
- Build the Docker image: `docker build -t llm-benchmark-suite .`
- Start the container:
  - For local testing, run with `--network=host`: `docker run --network=host -p 8090:8090 llm-benchmark-suite`
  - For production deployment, run without `--network=host`: `docker run -p 8090:8090 llm-benchmark-suite`
- Access the UI (for local testing) at `http://localhost:8090`.
Configuration options:

- Base URL: Endpoint of the LLM server.
- Model Name: Name of the LLM model.
- Optional secondary model for comparative benchmarking.
- Test Suite Size: Number of test sentences to generate.
- No-tool Probability: Probability that no tool call is expected for a given test case (see the sketch after this list).
- Enable Detailed Logging: Toggle for saving extended logs.
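To make the Test Suite Size and No-tool Probability settings concrete, here is an illustrative sketch (not the suite's actual generator) of how a test suite of a given size could sample expected tool calls:

```python
import random

# Illustrative sketch: shows how "Test Suite Size" and "No-tool Probability" interact
# when sampling expected tool calls; the suite's actual generator may differ.
TOOLS = ["get_current_weather", "get_system_time"]

def sample_expected_tools(test_suite_size: int, no_tool_probability: float) -> list[str]:
    expected = []
    for _ in range(test_suite_size):
        if random.random() < no_tool_probability:
            expected.append("none")              # edge case: the model should not call any tool
        else:
            expected.append(random.choice(TOOLS))
    return expected

print(sample_expected_tools(test_suite_size=10, no_tool_probability=0.3))
```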
The benchmark API:

- Accepts a JSON payload to benchmark a model:
  `{ "base_url": "http://localhost:11434", "model_name": "qwen2.5-coder-32b", "sentence": "Tell me a joke.", "expected_tool": "none" }`
- Returns:
  - Success status
  - Tool-call details
  - Model response
  - Processing time
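For scripted runs outside the UI, the endpoint can be called directly. Below is a minimal sketch using Python's `requests` library (install it separately if needed); the `/benchmark` path is an assumption for illustration, so check the FastAPI routes in `main.py` for the actual route.

```python
import requests

# NOTE: the "/benchmark" path is an assumption for illustration;
# check the FastAPI routes in main.py for the actual endpoint.
payload = {
    "base_url": "http://localhost:11434",   # endpoint of the LLM server
    "model_name": "qwen2.5-coder-32b",      # model to benchmark
    "sentence": "Tell me a joke.",          # test sentence
    "expected_tool": "none",                # or a tool name such as "get_current_weather"
}

resp = requests.post("http://localhost:8090/benchmark", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # expected to include success status, tool-call details, model response, processing time
```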
Notes for Docker deployments:

- When the suite runs inside a container, use `host.docker.internal` instead of `localhost` to reach an LLM server running on the host.
- Run the container with `--network=host` for local testing: `docker run --network=host -p 8090:8090 llm-benchmark-suite`
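For example, the base URL in the benchmark payload would then point at the Docker host rather than at `localhost` (a sketch reusing the payload fields shown above; port 11434 matches the earlier example):

```python
# Same payload as in the API example above, but reaching the host's LLM server
# from inside a container via the special host.docker.internal hostname.
payload = {
    "base_url": "http://host.docker.internal:11434",
    "model_name": "qwen2.5-coder-32b",
    "sentence": "Tell me a joke.",
    "expected_tool": "none",
}
```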