
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors


Overview

MathTutorBench is a benchmark that provides a unified framework for evaluating the open-ended pedagogical capabilities of large language model (LLM) tutors across three high-level teacher skills and seven concrete tasks.

Key Features

  • Automatic Evaluation: The benchmark runs automatically on any new model you are developing.
  • Comprehensive Metrics: The benchmark covers three high-level teacher skills and seven tasks in the domain of math tutoring.
  • Teacher-Grounded Evaluation: Each task is annotated with teacher ground truths, and model outputs are compared against them.
  • Fast execution loop: Run the benchmark on different tasks quickly.

Skills

[Figure: overview of the benchmark's three teacher skills and seven tasks]

Quick Start - Evaluate a New Model

0. Run your model locally using vllm (skip this step if you are using an API)

For more details on how to run your model locally using vllm, see the vllm documentation.

vllm serve <model_name>
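For example, to serve the model used in the vllm example below (any model name from the Hugging Face hub works):

vllm serve meta-llama/Llama-3.2-3B-Instruct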

1. Run task(s) from the benchmark

# Example with OpenAI API
python main.py --tasks mistake_location.yaml --provider completion_api --model_args model=gpt-4o-mini-2024-07-18,is_chat=True,api_key=<API_KEY>
# Example with vllm model
python main.py --tasks mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True
  • Required:
    • --tasks: Task definition file in the configs folder. Use a comma-separated list for multiple sequential tasks (see the combined example after this list).
      • problem_solving.yaml: Task definition for problem solving.
      • socratic_questioning.yaml: Task definition for Socratic questioning.
      • student_solution_generation.yaml: Task definition for student solution generation.
      • mistake_location.yaml: Task definition for mistake location.
      • mistake_correction.yaml: Task definition for mistake correction.
      • scaffolding_generation.yaml: Task definition for scaffolding generation.
      • pedagogy_following.yaml: Task definition for pedagogy following.
      • scaffolding_generation_hard.yaml: Task definition for the hard variant of scaffolding generation.
      • pedagogy_following_hard.yaml: Task definition for the hard variant of pedagogy following.
    • --provider: API provider to use for the task.
      • completion_api: Use the completion API for the task. Supports any OpenAI-compatible API; use for OpenAI and vllm models.
      • gemini: Use the Gemini API for the task.
    • --model_args: Model arguments to pass to the API provider.
      • base_url: Base URL of the API provider. Leave empty for OpenAI and Gemini.
      • model: Model name to use for the task. Default is the first available model.
      • api_key: API key used to access the API. Leave empty for vllm models.
      • is_chat: Whether the model is chat-based. Default is False.
      • temperature: Temperature for sampling. Default is 0.0.
      • max_tokens: Maximum number of tokens to generate. Default is 2048.
      • max_retries: Maximum number of retries for API calls. Default is 3.
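Putting the options together, here is a combined run of two tasks in sequence against the locally served vllm model from step 0 (the task list and sampling settings are illustrative):

python main.py --tasks problem_solving.yaml,mistake_location.yaml --provider completion_api --model_args base_url=http://localhost:8000/v1,model=meta-llama/Llama-3.2-3B-Instruct,is_chat=True,temperature=0.0,max_tokens=2048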

2. Run the reward model for the Pedagogical Ability tasks

Set --data_path to the model outputs of the Pedagogical Ability tasks. The reward model computes the win rate of the generated teacher utterances over the ground-truth teacher utterances.

python reward_models/compute_scaffolding_score.py --data_path results/generations-<specific-model>.json

3. Visualize results

Results are available in the results folder. To visualize the results, run:

python visualize.py --results_dir results/


Installation

pip install -r requirements.txt
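If you prefer an isolated environment, create and activate a virtual environment before running the pip install above (assuming a recent Python 3; requirements.txt lists the exact packages):

python -m venv .venv
source .venv/bin/activate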

Leaderboard

| Model | Problem Solving | Socratic Questioning | Solution Correctness | Mistake Location | Mistake Correction | Scaffolding Win Rate | Pedagogy IF Win Rate | Scaffolding (Hard) | Pedagogy IF (Hard) |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA3.2-3B-Instruct | 0.60 | 0.29 | 0.67 | 0.41 | 0.13 | 0.64 | 0.63 | 0.45 | 0.40 |
| LLaMA3.1-8B-Instruct | 0.70 | 0.29 | 0.63 | 0.29 | 0.09 | 0.61 | 0.67 | 0.46 | 0.49 |
| LLaMA3.1-70B-Instruct | 0.91 | 0.29 | 0.71 | 0.56 | 0.19 | 0.63 | 0.70 | 0.49 | 0.49 |
| GPT-4o | 0.90 | 0.48 | 0.67 | 0.37 | 0.84 | 0.50 | 0.82 | 0.46 | 0.70 |
| LearnLM-1.5-Pro | 0.94 | 0.32 | 0.75 | 0.57 | 0.74 | 0.64 | 0.68 | 0.66 | 0.67 |
| Llemma-7B-ScienceTutor | 0.62 | 0.29 | 0.66 | 0.29 | 0.16 | 0.37 | 0.48 | 0.38 | 0.42 |
| Qwen2.5-7B-SocraticLM | 0.73 | 0.32 | 0.05 | 0.39 | 0.23 | 0.39 | 0.39 | 0.28 | 0.28 |
| Qwen2.5-Math-7B-Instruct | 0.88 | 0.35 | 0.43 | 0.47 | 0.49 | 0.06 | 0.07 | 0.05 | 0.05 |

Submit your model to leaderboard

To submit your model to the leaderboard, please follow the steps below:

  1. Open a new issue with the title Leaderboard Submission: <Model Name>.
  2. Provide the exact model name on the Hugging Face hub, along with any specific code, arguments, or settings needed for the model or for the vllm library that will be used to run it. Please also copy the results from your local run of the model.

Adding a New Task

Please open a new PR and provide the configuration of the task in the configs folder and the task implementation in the tasks folder.

Citation

Please cite as:

@article{macina2025mathtutorbench,
      title={MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors},
      author={Jakub Macina and Nico Daheim and Ido Hakimi and Manu Kapur and Iryna Gurevych and Mrinmaya Sachan},
      year={2025},
      eprint={2502.18940},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18940},
}