VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data


Abstract

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.

Link to paper: https://arxiv.org/abs/2502.06737
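
The weighted majority voting mentioned in the abstract aggregates many sampled CoTs per question, weighting each candidate's final answer by a PRM-derived score. Below is a minimal sketch of that aggregation; the specific way of collapsing per-step scores into one weight (here, the minimum step score) is an assumption for illustration, not necessarily the exact setup used in the paper.

```python
# Hedged sketch of PRM-weighted majority voting over sampled CoTs.
# Each candidate is (final_answer, list of per-step PRM scores). Using the
# minimum step score as the weight is an illustrative choice.
from collections import defaultdict

def weighted_majority_vote(candidates):
    """candidates: list of (answer, step_scores) tuples."""
    votes = defaultdict(float)
    for answer, step_scores in candidates:
        weight = min(step_scores) if step_scores else 0.0  # collapse step scores
        votes[answer] += weight
    return max(votes, key=votes.get)

# Example: three sampled CoTs for one question with answers "A" and "B".
candidates = [
    ("A", [0.9, 0.8, 0.95]),
    ("B", [0.6, 0.4, 0.7]),
    ("B", [0.5, 0.3, 0.6]),
]
print(weighted_majority_vote(candidates))  # "A" (weight 0.8 vs. 0.7 for "B")
```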

🚀 Models

We provide a suite of VersaPRM and math PRM models trained with different configurations; a hedged scoring sketch follows the list.

  • VersaPRM – trained from Llama-PRM800K on MMLU-Pro-CoT-Train-Labeled
  • VersaPRM-Base-8B – trained from Llama-3.1-8B-Instruct on MMLU-Pro-CoT-Train-Labeled
  • VersaPRM-Aug – trained from Llama-PRM800K on MMLU-Pro-CoT-Train-Labeled with counterfactual augmentations
  • Qwen-PRM800K – Qwen model fine-tuned on the PRM800K dataset
  • Llama-PRM800K – Llama model fine-tuned on the PRM800K dataset
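
The sketch below shows one generic way to score reasoning steps with a causal-LM PRM using Hugging Face transformers. It is not the repository's official inference code: the model path, the newline step separator, and the "+"/"-" judgement tokens are placeholders that should be replaced with the conventions documented for the released checkpoints.

```python
# Hedged sketch: score each reasoning step with a causal-LM PRM.
# Model ID, step separator, and label tokens below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/VersaPRM"  # placeholder: substitute the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

question = "Which doctrine allows a court to decline to hear a case?"
steps = [
    "Step 1: The question asks about a discretionary dismissal doctrine.",
    "Step 2: Forum non conveniens lets a court decline jurisdiction.",
]

# Assume the PRM emits a '+' / '-' judgement after each step (a common PRM
# convention, not necessarily this repo's exact format).
plus_id = tokenizer.convert_tokens_to_ids("+")
minus_id = tokenizer.convert_tokens_to_ids("-")

scores = []
prefix = question + "\n"
for step in steps:
    prefix += step + "\n"
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits at step end
    p_plus = torch.softmax(logits[[plus_id, minus_id]], dim=-1)[0].item()
    scores.append(p_plus)

print(scores)  # per-step correctness probabilities in [0, 1]
```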

📊 Datasets

  • Evaluation CoT dataset – Size: 249k rows. Use case: evaluation CoTs generated by Llama-3.1-8B-Instruct, with 128 responses sampled per question. (Some questions have fewer, since we remove CoTs from which no final answer can be extracted.)
  • MMLU-Pro-CoT-Train-Labeled – Size: 84.1k rows. Use case: stepwise-labeled CoT training dataset used to fine-tune VersaPRM. The CoTs are generated by Llama-3.1-8B-Instruct, with 16 responses sampled per question. (Some questions have fewer, since we remove invalid CoTs.)

A loading sketch for these datasets follows below.
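
If the datasets are hosted on the Hugging Face Hub, loading and grouping them might look like the following. The dataset identifier and column name are placeholders; the actual repository names and schema should be taken from the released dataset cards.

```python
# Hedged sketch: load a CoT dataset with the `datasets` library and group the
# sampled responses by question. Dataset ID and column names are placeholders.
from collections import defaultdict
from datasets import load_dataset

train = load_dataset("ORG/MMLU-Pro-CoT-Train-Labeled", split="train")  # placeholder ID
print(train[0])  # inspect one stepwise-labeled CoT example

# Group sampled CoTs by question so per-question voting can be applied later.
by_question = defaultdict(list)
for row in train:
    by_question[row["question_id"]].append(row)  # placeholder column name

print(f"{len(by_question)} unique questions")
```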

Folder Structure

  • aws_batch_inference/: Scripts for performing batch inference using AWS services.
  • counterfactual_augmentation/: Scripts for creating and processing counterfactual augmentation batches.
  • evaluation/: Scripts and resources for evaluating model performance, including metric calculations and batch output merging.
  • figures/: Contains visual representations and figures used in the project.
  • model_train/: Scripts and configurations for training models, including data preparation and training scripts.
  • search_algs/: Scripts for running beam search and MCTS (TODO: add readme); a conceptual beam-search sketch follows this list.
  • synth_cot_generation/: Scripts for generating synthetic chain-of-thought (CoT) data for training and evaluation.
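
As a rough illustration of how a PRM can guide step-level search (the actual implementations live in search_algs/), here is a conceptual beam-search sketch. The generator and scorer are stubs standing in for an LLM sampler and a trained PRM; none of these function names correspond to code in this repository.

```python
# Conceptual sketch of PRM-guided step-level beam search (not the repo's code).
# `propose_next_steps` and `prm_score` are stubs; replace with real model calls.
import random

def propose_next_steps(partial_cot, k):
    """Stub: sample k candidate next reasoning steps for a partial CoT."""
    return [partial_cot + [f"step draft {random.random():.3f}"] for _ in range(k)]

def prm_score(cot):
    """Stub: return a scalar PRM score for a (partial) chain of thought."""
    return random.random()

def beam_search(question, beam_width=4, expansions=4, max_steps=6):
    beams = [[f"Question: {question}"]]
    for _ in range(max_steps):
        candidates = []
        for cot in beams:
            candidates.extend(propose_next_steps(cot, expansions))
        # Keep the highest-scoring partial chains according to the PRM.
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best chain found within the step budget

print(beam_search("Which doctrine lets a court decline jurisdiction?"))
```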