MathTutorBench is a benchmark that provides a unified framework for evaluating the open-ended pedagogical capabilities of large language model (LLM) tutors across three high-level teacher skills and seven concrete tasks.
- Automatic Evaluation: The benchmark is designed to run automatically on any new model you are developing.
- Comprehensive Metrics: The benchmark covers three high-level teacher skills and seven tasks in the domain of math tutoring.
- Teacher-Grounded Evaluation: Each task is annotated with teacher ground truths, against which model outputs are compared.
- Fast Execution Loop: Run the benchmark on different tasks quickly.
For more details on how to run your model locally using vLLM, see the vLLM documentation.
vllm serve [[model_name]]
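For example, to serve a model from the Hugging Face Hub behind an OpenAI-compatible endpoint (the model name and port below are illustrative; any vLLM-supported model works):
# Serve a model locally; vLLM exposes an OpenAI-compatible API on port 8000 by default
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000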
# Example with OpenAI API
python main.py --tasks mistake_location.yaml --provider completion_api --model_args model=gpt-4o-mini-2024-07-18,is_chat=True,api_key=<API_KEY>
# Example with vLLM model
python main.py --tasks mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True
- Required:
  - `--tasks`: Task definition file in the `configs` folder. Use a comma-separated (`,`) list to run multiple tasks sequentially (see the example after this list).
    - `problem_solving.yaml`: Task definition for problem solving.
    - `socratic_questioning.yaml`: Task definition for Socratic questioning.
    - `student_solution_generation.yaml`: Task definition for student solution generation.
    - `mistake_location.yaml`: Task definition for mistake location.
    - `mistake_correction.yaml`: Task definition for mistake correction.
    - `scaffolding_generation.yaml`: Task definition for scaffolding generation.
    - `pedagogy_following.yaml`: Task definition for pedagogy following.
    - `scaffolding_generation_hard.yaml`: Task definition for scaffolding generation (hard).
    - `pedagogy_following_hard.yaml`: Task definition for pedagogy following (hard).
  - `--provider`: API provider to use for the task.
    - `completion_api`: Use the completion API for the task. Supports any OpenAI-type API; use for OpenAI and vLLM models.
    - `gemini`: Use the Gemini API for the task.
  - `--model_args`: Model arguments to pass to the API provider.
    - `base_url`: Base URL of the API provider. Leave empty for OpenAI and Gemini.
    - `model`: Model name to use for the task. Default is the first available model.
    - `api_key`: API key to access the API. Leave empty for vLLM models.
    - `is_chat`: Whether the model is chat-based or not. Default is `False`.
    - `temperature`: Temperature for sampling. Default is `0.0`.
    - `max_tokens`: Maximum number of tokens to generate. Default is `2048`.
    - `max_retries`: Maximum number of retries for the API. Default is `3`.
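For example, the following invocation (model and argument values are illustrative) runs two tasks sequentially against a locally served vLLM model and sets the sampling options explicitly:
python main.py --tasks problem_solving.yaml,mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True,temperature=0.0,max_tokens=2048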
Set `--data_path` to the model outputs of the pedagogical ability tasks. The reward model computes win rates of the generated teacher utterances over the ground-truth teacher utterances.
python reward_models/compute_scaffolding_score.py --data_path results/generations-<specific-model>.json
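To score the outputs of several models in one pass, a simple shell loop over the generation files works; the file name pattern below just follows the example command above and may need adjusting to your actual output names:
# Compute scaffolding win rates for every generations file in the results folder
for f in results/generations-*.json; do
  python reward_models/compute_scaffolding_score.py --data_path "$f"
done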
Results are available in the `results` folder. To visualize the results, run:
python visualize.py --results_dir results/
pip install -r requirements.txt
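If you plan to serve models locally with vLLM, installing it into the same environment is enough (a sketch; check the vLLM documentation for hardware and CUDA requirements):
# Optional: vLLM for serving models locally via an OpenAI-compatible API
pip install vllm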
Model | Problem Solving | Socratic Questioning | Solution Correctness | Mistake Location | Mistake Correction | Scaffolding Win Rate | Pedagogy IF Win Rate | Scaffolding (Hard) | Pedagogy IF (Hard) |
---|---|---|---|---|---|---|---|---|---|
LLaMA3.2-3B-Instruct | 0.60 | 0.29 | 0.67 | 0.41 | 0.13 | 0.64 | 0.63 | 0.45 | 0.40 |
LLaMA3.1-8B-Instruct | 0.70 | 0.29 | 0.63 | 0.29 | 0.09 | 0.61 | 0.67 | 0.46 | 0.49 |
LLaMA3.1-70B-Instruct | 0.91 | 0.29 | 0.71 | 0.56 | 0.19 | 0.63 | 0.70 | 0.49 | 0.49 |
GPT-4o | 0.90 | 0.48 | 0.67 | 0.37 | 0.84 | 0.50 | 0.82 | 0.46 | 0.70 |
LearnLM-1.5-Pro | 0.94 | 0.32 | 0.75 | 0.57 | 0.74 | 0.64 | 0.68 | 0.66 | 0.67 |
Llemma-7B-ScienceTutor | 0.62 | 0.29 | 0.66 | 0.29 | 0.16 | 0.37 | 0.48 | 0.38 | 0.42 |
Qwen2.5-7B-SocraticLM | 0.73 | 0.32 | 0.05 | 0.39 | 0.23 | 0.39 | 0.39 | 0.28 | 0.28 |
Qwen2.5-Math-7B-Instruct | 0.88 | 0.35 | 0.43 | 0.47 | 0.49 | 0.06 | 0.07 | 0.05 | 0.05 |
To submit your model to the leaderboard, please follow the steps below:
- Open a new issue with the title `Leaderboard Submission: <Model Name>`.
- Provide the exact model name on the Hugging Face Hub and any specific code, arguments, or settings needed for the model or for the vLLM library that will be used to run it. Please copy the results from your local run of the model.
Please open a new PR and provide the task configuration in the `configs` folder and the task implementation in the `tasks` folder.
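Before opening the PR, it can help to check that the new task runs end to end through the standard entry point. A minimal sketch, assuming a hypothetical config file named my_new_task.yaml and a model served locally with vLLM:
# Look at existing task configs to mirror their structure
ls configs/
# Run the new (hypothetical) task against a locally served model
python main.py --tasks my_new_task.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True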
Please cite as:
@article{macina2025mathtutorbench,
title={MathTutorBench: A Benchmark for Measuring Open-ended\\ Pedagogical Capabilities of LLM Tutors},
author={Jakub Macina and Nico Daheim and Ido Hakimi and Manu Kapur and Iryna Gurevych and Mrinmaya Sachan},
year={2025},
eprint={2502.18940},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18940},
}