The EQUATOR Evaluator is a robust framework designed to systematically evaluate the factual accuracy and reasoning capabilities of large language models (LLMs). Unlike traditional evaluation methods, which often prioritize fluency over accuracy, EQUATOR employs a deterministic scoring system that ensures precise and unbiased assessment of LLM-generated responses.
This repository implements the methodology described in the research paper "EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. #v1.0.0-beta" (Bernard et al., 2024). By leveraging vector databases and smaller, locally hosted LLMs, the EQUATOR Evaluator bridges the gap between scalability and accuracy in automated assessments.
- EQUATOR Evaluator Framework
- Overview
- Table of Contents
- Key Features
- Why EQUATOR Evaluator?
- Methodology
- Evaluator vs. Student Matrix
- Installation
- Configuration
- IMPORTANT INSTRUCTIONS
- Usage
- Example Dataset
- Contributions
- Future Work
- Citation
- License
- Contact
- Deterministic Scoring: Assigns binary scores (100% or 0%) based solely on factual correctness.
- Vector Database Integration: Embeds open-ended questions and human-evaluated answers for semantic matching.
- Automated Evaluation: Uses smaller LLMs to provide scalable and efficient assessments.
- Bias Mitigation: Eliminates scoring biases related to linguistic fluency or persuasion.
- Cost Efficiency: Optimizes token usage, significantly reducing operational costs for evaluation.
Traditional evaluation methods, such as multiple-choice tests or human evaluations, often fail to capture the nuanced reasoning and factual accuracy required in high-stakes domains like medicine or law. The EQUATOR Evaluator addresses these limitations by:
- Focusing on Factual Correctness: Prioritizes accuracy over linguistic style.
- Reducing Human Reliance: Automates the grading process, minimizing the need for human evaluators.
- Providing Insights for Improvement: Identifies specific areas where LLMs underperform, enabling targeted enhancements in model training.
The scoring framework evaluates LLM-generated answers against a vector database of human-evaluated responses through the following steps (a minimal sketch follows this list):
- Embed Inputs: Convert questions and answers into vector embeddings using models such as `all-minilm`.
- Retrieve Closest Match: Identify the most semantically similar answer key using cosine similarity.
- Binary Scoring: Assign 100% if the student’s answer matches the answer key; otherwise, 0%.
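The following is a minimal sketch of these three steps, assuming a locally running Ollama instance with `all-minilm` pulled and the `requests`/`numpy` packages installed. The function names and structure are illustrative, not the repository's actual code:

```python
# Illustrative sketch of embed -> retrieve -> binary score (not the repo's code).
import requests
import numpy as np

OLLAMA_URL = "http://localhost:11434"  # default evaluator endpoint in config.ini


def embed(text: str) -> np.ndarray:
    """Request a vector embedding from the local Ollama instance."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "all-minilm", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return np.asarray(resp.json()["embedding"], dtype=float)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_answer_key(student_answer: str, answer_keys: list[str]) -> str:
    """Retrieve the human-evaluated answer key most similar to the student answer."""
    student_vec = embed(student_answer)
    return max(answer_keys, key=lambda key: cosine_similarity(student_vec, embed(key)))


# Binary scoring: the evaluator LLM then judges whether the student answer matches
# the retrieved key and returns 100 or 0 (that grading call is omitted here).
```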
Implemented with ChromaDB, the vector database stores embeddings of open-ended questions and their corresponding answer keys. This database serves as the single source of truth for evaluations.
A smaller LLM (e.g., LLaMA 3.2B) acts as the evaluator, ensuring strict adherence to the scoring criteria while reducing computational overhead.
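For illustration, storing and querying answer keys with ChromaDB might look like the sketch below. The collection name, metadata fields, and persistence path are assumptions, and `embed()` is the helper sketched above:

```python
# Illustrative ChromaDB usage (collection name, path, and fields are assumptions).
import chromadb

client = chromadb.PersistentClient(path="./equator_vector_db")  # hypothetical path
collection = client.get_or_create_collection(name="answer_keys")

# Store an open-ended question together with its human-evaluated answer key.
collection.add(
    ids=["q-001"],
    embeddings=[embed("A farmer has 17 sheep; all but 9 die. How many are left?").tolist()],
    documents=["9"],  # the human answer key
    metadatas=[{"category": "puzzle"}],
)

# Retrieve the closest answer key for a student's response.
result = collection.query(
    query_embeddings=[embed("There are 9 sheep left.").tolist()],
    n_results=1,
)
closest_key = result["documents"][0][0]
print(closest_key)  # -> "9"
```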
We classify LLMs as Evaluators (the "graders") and Students (the "respondents"). Below is an updated matrix that includes Groq → Ollama support.
- Ollama evaluators:
  - Ollama → OpenRouter students: 34,925 × 293 = 10,233,025
  - Ollama → Groq students: 34,925 × 14 = 488,950
  - Ollama → Ollama students: 34,925 × 34,925 = 1,219,755,625
  - Subtotal (Ollama evaluators): 10,233,025 + 488,950 + 1,219,755,625 = 1,230,477,600
- Groq evaluators:
  - Groq → OpenRouter students: 14 × 293 = 4,102
  - Groq → Ollama students: 14 × 34,925 = 488,950
  - Subtotal (Groq evaluators): 4,102 + 488,950 = 493,052
- Grand total (currently supported): 1,230,477,600 (Ollama) + 493,052 (Groq) = 1,230,970,652 ≈ 1,230,971,000
- Planned for the next release:
  - Groq → Groq students: 14 × 14 = 196
  - OpenRouter → OpenRouter students: 293 × 293 = 85,849
  - Total future combinations: 196 + 85,849 = 86,045
- Currently supported: ~1.23 billion evaluator-student pairs (1,230,970,652)
- With the next release: 1,230,970,652 + 86,045 = 1,231,056,697 ≈ 1,231,057,000 (an additional ~86,045 pairs)

These counts can be reproduced with the short script after this list.
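The following few lines of Python verify the totals, using the model counts stated in the matrix above (34,925 Ollama models, 293 OpenRouter models, 14 Groq models):

```python
# Reproduce the evaluator-student pair counts from the matrix above.
ollama, openrouter, groq = 34_925, 293, 14

ollama_eval = ollama * (openrouter + groq + ollama)   # 1,230,477,600
groq_eval = groq * (openrouter + ollama)              # 493,052
current_total = ollama_eval + groq_eval               # 1,230,970,652

next_release = groq * groq + openrouter * openrouter  # 196 + 85,849 = 86,045
print(current_total, current_total + next_release)    # 1230970652 1231056697
```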
With over 1.23 billion possible Evaluator-Student pairs currently supported, comprehensive testing involves an extensive and potentially resource-intensive process. Here's how to approach it:
- Model Importance: Focus on evaluating high-impact or frequently used models first.
- Diversity: Ensure a diverse range of model families and sizes are tested to cover different capabilities and use cases.
- Incremental Testing: Start with a subset of combinations and gradually expand.
- Utilize automated testing frameworks to handle large-scale evaluations.
- Leverage parallel processing to distribute the workload across multiple machines or instances.
- Instead of exhaustively testing all combinations, use statistical sampling to select representative evaluator-student pairs (see the sketch after this list).
- Implement continuous testing pipelines that automatically evaluate new combinations as models are added or updated.
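For instance, representative pairs can be drawn with a simple random sample. The model identifiers below are placeholders; in practice the lists would come from the providers' catalogs (OpenRouter, Groq, Ollama):

```python
# Illustrative sampling of evaluator-student pairs (model IDs are placeholders).
import itertools
import random

evaluators = ["ollama/llama3.2", "groq/llama3-70b-8192"]
students = [
    "openrouter/nousresearch/hermes-3-llama-3.1-405b",
    "groq/deepseek-r1-distill-llama-70b",
    "ollama/llama3.2",
]

all_pairs = list(itertools.product(evaluators, students))
random.seed(42)  # reproducible sample
sample = random.sample(all_pairs, k=min(3, len(all_pairs)))
for evaluator, student in sample:
    print(f"evaluate {student} with {evaluator}")
```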
Given the sheer volume of possible combinations, it's crucial to implement a strategic testing plan:
- Define Testing Objectives: Clearly outline what you aim to achieve with each test (e.g., performance benchmarks, compatibility checks).
- Allocate Resources: Ensure you have the necessary computational resources to handle large-scale testing.
- Monitor and Iterate: Continuously monitor testing outcomes and refine your strategies based on findings and evolving requirements.
By adopting a structured and prioritized approach, you can effectively manage the extensive testing landscape and ensure robust evaluation of your LLM combinations.
- Evaluator LLMs (the “grader”):
  - Ollama (local)
  - Groq
  - More evaluators planned for future releases.
- Student LLMs (the “respondent”):
  - OpenRouter (276+ models: OpenAI, Anthropic, etc.)
  - Groq
  - Ollama (local)
  - More students planned for future releases.
- Current Highlights:
  - Ollama can evaluate answers from OpenRouter, Groq, or Ollama itself.
  - Groq can evaluate answers from OpenRouter, Groq, or Ollama.
  - Ongoing development will expand these capabilities even further.
Use this chart as a quick reference for which LLM can serve as the evaluator versus which can serve as the student. We will be testing an OpenRouter to OpenRouter implementation in our next release.
```bash
git clone https://github.com/raymondbernard/equator.git
cd equator
```
- Ollama: Download Ollama and install it on your machine.
- Groq: Register and retrieve your API key from the Groq Console.
- OpenRouter: Register and retrieve your API key from OpenRouter.
- Rename `copy-to.env` to `.env` in your working directory.
- Add the necessary API keys to the `.env` file.
Example `.env` file:

```env
OPENROUTER_KEY="sk-xxx"
GROQ_API_KEY="gsk_xxx"
```
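For illustration only (the repository may load these keys differently), the values can be read in Python with the python-dotenv package:

```python
# Illustrative only: load API keys from .env (assumes `python-dotenv` is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
openrouter_key = os.getenv("OPENROUTER_KEY")
groq_key = os.getenv("GROQ_API_KEY")
assert openrouter_key and groq_key, "Add both keys to .env before running the benchmark."
```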
It is recommended to use a virtual environment to avoid conflicts with other Python packages.

Windows:

```bash
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
deactivate
```

macOS/Linux:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
deactivate
```

Or, to install the dependencies without a virtual environment:

```bash
pip install -r requirements.txt
```
We use Ollama embeddings to populate our ChromaDB vector database. Pull the embedding model:

```bash
ollama pull all-minilm
```
IMPORTANT: If the evaluator and the student both use Ollama on the same machine, run the student instance inside Docker so the two instances do not collide (the container is mapped to a separate host port). You can also run Ollama on separate remote machines, with or without Docker; to use a remote instance, update the URL and port in `config.ini`.
Note: For GPU acceleration inside Docker, install the latest NVIDIA drivers and the NVIDIA Container Toolkit (https://github.com/NVIDIA/nvidia-container-toolkit), and ensure your system recognizes your GPU.
Run the Ollama Docker container.

With GPU support:

```bash
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

CPU only (mapped to host port 11435, which `config.ini` uses for the Dockerized student instance):

```bash
docker run -d -v ollama:/root/.ollama -p 11435:11434 --name ollama ollama/ollama
```

The official Ollama Docker image is described here: https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image
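To confirm both Ollama instances are reachable before benchmarking, a quick check like the following can help. The ports match the defaults in `config.ini`; this script is illustrative and not part of the repository:

```python
# Illustrative reachability check for the evaluator (11434) and Dockerized
# student (11435) Ollama instances.
import requests

for name, port in [("evaluator", 11434), ("student (Docker)", 11435)]:
    try:
        resp = requests.get(f"http://localhost:{port}/api/tags", timeout=5)
        resp.raise_for_status()
        models = [m["name"] for m in resp.json().get("models", [])]
        print(f"{name}: reachable, models pulled: {models}")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")
```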
The `config.ini` file contains all the necessary configuration for running the EQUATOR Evaluator. Below is a breakdown of each section:
```ini
# Configuration File for Equator Vision Benchmarking

[ollama_evaluator_url]
URL = http://localhost:11434/api/chat

[ollama_evaluator_vision_url]
URL = http://localhost:11434/api/generate

[ollama_vision_student_docker_url]
URL = http://localhost:11435/api/chat

[ollama_student_docker_url]
URL = http://localhost:11435/api/chat

[BENCHMARK_NAME]
benchmark_name = Bernard

[rounds]
# Number of times each question will be posed to the models
answer_rounds = 2

[evaluator_models]
GROQ_EVALUATOR_MODEL = llama3-70b-8192
OLLAMA_EVALUATOR_MODEL = llama3.2

[vision]
# Enable or disable the Vision Database
VISION_DB = False

[parquet]
# Enable or disable Parquet storage
PARQUET = True

[keep_vector_db]
# Whether to keep the Vector Database after execution
KEEP_VECTOR_DB = False

[execution_steps]
# Steps to execute, separated by commas if multiple
EXECUTION_STEPS = ollama_to_ollama_evaluate

[student_models]
STUDENT_OPENROUTER_MODELS = nousresearch/hermes-3-llama-3.1-405b
STUDENT_GROQ_MODELS = deepseek-r1-distill-llama-70b
STUDENT_OLLAMA_MODELS = llama3.2
```
- Ollama URLs: Define the API endpoints for the Ollama evaluator and the Dockerized Ollama student instances.
- Benchmark Name: Identifier for the current benchmark run.
- Rounds: Number of times each question is posed to the models.
- Evaluator Models: Specifies the models used for evaluation.
- Vision and Parquet: Toggle features for vision-based evaluations and Parquet storage.
- Vector DB Persistence: Whether to retain the vector database after execution.
- Execution Steps: Defines the sequence of steps to execute during benchmarking.
- Student Models: Lists the models to be evaluated.
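As an illustration of how these settings map to Python (the repository's own loading code may differ), they can be read with the standard-library `configparser` module:

```python
# Illustrative read-back of config.ini with the standard library;
# the repository's own configuration loading may differ.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

evaluator_url = config["ollama_evaluator_url"]["URL"]
rounds = config.getint("rounds", "answer_rounds")
steps = [s.strip() for s in config["execution_steps"]["EXECUTION_STEPS"].split(",") if s.strip()]
keep_db = config.getboolean("keep_vector_db", "KEEP_VECTOR_DB")

print(evaluator_url, rounds, steps, keep_db)
```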
To ensure smooth execution, please run the program one step at a time. Follow these guidelines:
- Toggle Steps by Adding/Removing Comments: Modify the `EXECUTION_STEPS` list in `config.ini` by uncommenting one step at a time. After running the program, re-comment the executed step if needed and proceed to the next.

Here’s how to structure your `EXECUTION_STEPS`:
```ini
[execution_steps]
EXECUTION_STEPS = ollama_to_ollama_evaluate,
# ollama_to_groq_evaluate,
# ollama_to_openrouter_evaluate,
# groq_to_ollama_evaluate,
# groq_to_openrouter_evaluate,
# generate_statistics
```
IMPORTANT NOTE:
- Local Execution: To run everything locally, use the `ollama_to_ollama_evaluate` step. Ensure Docker is installed and your student model runs on Ollama inside a Docker container. Execute `ollama pull <student model>` within your container.
- Choose One Step: Uncomment one line in the `EXECUTION_STEPS` list.
- Run the Program: `python main.py`
- Comment the Step Again: After completion, re-comment the executed step if you plan to run additional steps.
- Proceed to Next Step: Repeat the process for subsequent steps.
You don't have to use all the steps! You can stick to a local evaluator (Ollama) and run through multiple models. Here’s a recommended order:
- Uncomment `ollama_to_ollama_evaluate` and run.
- Uncomment `ollama_to_groq_evaluate` and run.
- Uncomment `ollama_to_openrouter_evaluate` and run.
- Uncomment `groq_to_ollama_evaluate` and run.
- Uncomment `groq_to_openrouter_evaluate` and run.
- Finally, ensure only `generate_statistics` is uncommented and run it to compile results.
- One Step at a Time: Never leave multiple steps uncommented simultaneously.
- Save Progress: If something goes wrong, verify that only one step is uncommented.
- Final Step: Always finish with `generate_statistics` to summarize your results.
- Activate Your Python Environment (if using a virtual environment).
- Run the Main Script: `python main.py`
- Organized Directory: Results are saved in a directory named after the corresponding date (`YYYY-MM-DD`), containing charts and CSV files with statistics and token analytics.
- Detailed Outputs: JSON files include the following fields (an illustrative record is sketched below):
  - Question
  - Model-generated answer
  - Evaluator response for the score
  - Score
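For reference, a single per-question record might look like the following. The key names are illustrative assumptions, so inspect the generated JSON files for the exact schema:

```python
# Illustrative shape of one result record (key names are assumptions).
example_record = {
    "question": "A farmer has 17 sheep; all but 9 die. How many are left?",
    "model_answer": "9 sheep are left.",                  # the student LLM's response
    "evaluator_response": "Matches the answer key (9).",  # evaluator's justification
    "score": 100,                                         # binary: 100 or 0
}
```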
The repository includes datasets to test the reasoning capabilities of LLMs:
- Default Dataset:
  - File: `linguistic_benchmark.json`
  - Contains open-ended questions across various categories such as puzzles, spatial reasoning, and logic.
  - Ideal for quick tests or debugging.
- Customization: You can add more questions or tailor them to your domain.
Note: We maintain a QA `linguistic_benchmark.json` with over 1,000 questions. A website will be created to publish our results using this dataset.
Our research aims to maintain statistically significant and unbiased evaluation results. Publicly releasing the full dataset risks future models being trained or fine-tuned on our test items, compromising the benchmark's fairness and validity. By keeping the data private, we ensure that our comparisons remain accurate and reflective of true model performance.
Extensibility: While our core benchmark remains standardized, you can extend `linguistic_benchmark.json` to include domain-specific prompts and responses; an illustrative entry is sketched below. This allows you to evaluate AI models in specialized contexts without affecting the integrity of our primary benchmarking methodology.
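As an illustration of such an extension (the field names below are assumptions, not the repository's actual schema), a domain-specific entry could be written like this:

```python
# Illustrative only: check linguistic_benchmark.json for the actual schema.
import json

custom_entry = {
    "index": "custom-001",  # hypothetical identifier
    "category": "Domain logic",
    "question": "If every contract requires a signature and this document has none, is it a contract?",
    "human_answer": "No. A signature is a stated requirement, and this document lacks one.",
}

with open("linguistic_benchmark_custom.json", "w", encoding="utf-8") as f:
    json.dump([custom_entry], f, indent=2)
```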
- Raymond Bernard (Independent Researcher)
- Shaina Raza, Ph.D. (Vector Institute)
- Subhabrata Das, PhD (JP Morgan Chase)
- Raul Murugan (Columbia University)
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
- Expand the Vector Database: Include more diverse datasets.
- Optimize Embedding and Retrieval: Enhance performance for larger-scale deployments.
- Additional Scoring Criteria: Incorporate complex reasoning task evaluations.
If you use this framework in your research, please cite:
```bibtex
@article{bernard2024equator,
  title         = {{EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. \# v1.0.0-beta}},
  author        = {Bernard, Raymond and Raza, Shaina and Das, Subhabrata and Murugan, Rahul},
  year          = {2024},
  eprint        = {2501.00257},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  note          = {MSC classes: 68T20; ACM classes: I.2.7; I.2.6; H.3.3},
  howpublished  = {arXiv preprint arXiv:2501.00257 [cs.CL]},
  doi           = {10.48550/arXiv.2501.00257},
}
```
This project is licensed under the MIT License.
Generated with ❤️ by the EQUATOR QA Team
- James Huckle: Inspiration for our work.
- Incorporated elements from autogenai/easy-problems-that-llms-get-wrong.
- Leveraged OpenRouter.ai's unified API and OpenAI SDK for comprehensive benchmarking across over 270 models.
For any inquiries or support, please contact [email protected].