Rename to RewardBench (#54)
* attempt

* Update analysis/README.md

* readme

* up

* up

* up

* up

* up

* up

* Update README.md
natolambert authored Mar 7, 2024
1 parent b72307b commit 0f4e05f
Showing 31 changed files with 73 additions and 40 deletions.
3 changes: 1 addition & 2 deletions .flake8
@@ -1,5 +1,4 @@
[flake8]
exclude =
herm/models/openassistant.py
herm/models/starling.py
rewardbench/models/openassistant.py
extend-ignore = E203
4 changes: 2 additions & 2 deletions Dockerfile
@@ -1,4 +1,4 @@
# TODO: Update this when releasing HERM publicly
# TODO: Update this when releasing RewardBench publicly
# This dockerfile is forked from ai2/cuda11.8-cudnn8-dev-ubuntu20.04
# To get the latest id, run `beaker image pull ai2/cuda11.8-cudnn8-dev-ubuntu20.04`
# and then `docker image list`, to verify docker image is pulled
@@ -19,7 +19,7 @@ RUN pip install torch torchvision torchaudio --index-url https://download.pytorc
# RUN pip install flash-attn==2.2.2 --no-build-isolation

# TODO: enable these when training code is complete
COPY herm herm
COPY rewardbench rewardbench
COPY scripts scripts
COPY setup.py setup.py
COPY Makefile Makefile
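Because the Dockerfile now copies `rewardbench` instead of `herm`, the image has to be rebuilt for the rename to take effect. A minimal sketch of that rebuild, assuming a local Docker daemon; the `rewardbench` tag is a placeholder, not something set by this commit:

```
# Rebuild the evaluation image from the repository root after the rename.
docker build -t rewardbench .
```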
2 changes: 1 addition & 1 deletion Makefile
@@ -3,7 +3,7 @@
# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src

check_dirs := herm scripts analysis tests
check_dirs := rewardbench scripts analysis tests

style:
python -m black --line-length 119 --target-version py310 $(check_dirs) setup.py
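With `check_dirs` now pointing at `rewardbench/`, the `style` target shown above formats the renamed package. A quick sanity check after the rename might look like this (only the `style` target is visible in this diff; any other targets are assumptions and omitted):

```
# Format rewardbench, scripts, analysis, and tests with black via the Makefile.
make style
```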
37 changes: 25 additions & 12 deletions README.md
@@ -1,14 +1,27 @@
# Holistic Evaluation of Reward Models (HERM)

This will hold scripts for generating scores and uploading results.
Two primary scripts to generate results (more in `scripts/`):
<div align="center">
<h1>RewardBench: Evaluating Reward Models</h1>
<p>
<a href="https://huggingface.co/spaces/allenai/reward-bench">Leaderbord</a> 📐 |
<a href="https://huggingface.co/datasets/allenai/reward-bench">RewardBench Dataset</a> |
<a href="https://huggingface.co/datasets/allenai/preference-test-sets">Existing Test Sets</a> |
<a href="https://huggingface.co/datasets/allenai/reward-bench-results">Results</a> 📊 |
Paper (coming soon) 📝
</p>
<img src="https://github.com/allenai/reward-bench/assets/10695622/24ed272a-0844-451f-b414-fde57478703e" alt="RewardBench Logo" width="700" style="margin-left:'auto' margin-right:'auto' display:'block' "/>
</div>

---

**RewardBench** is a benchmark designed to evaluate the capabilities and safety of reward models (including those trained with Direct Preference Optimization, DPO).
The repository includes the following:
* Common inference code for a variety of reward models (Starling, PairRM, OpenAssistant, DPO, and more).
* Common dataset formatting and tests for fair reward model inference.
* Analysis and visualization tools.

The two primary scripts to generate results (more in `scripts/`):
1. `scripts/run_rm.py`: Run evaluations for reward models.
2. `scripts/run_dpo.py`: Run evaluations for direct preference optimization (DPO) models.

## Links
Dataset, space, etc coming soon.
For contributors, it can be found in this [HuggingFace org](https://huggingface.co/ai2-adapt-dev).

## Installation
Please install `torch` on your system, and then install the following requirements.
```
@@ -70,10 +83,10 @@ python scripts/run_bon.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2

```
├── README.md <- The top-level README for researchers using this project
├── analysis/ <- Directory of tools to analyze HERM results or other reward model properties
├── herm/ <- Core utils and modeling files
├── analysis/ <- Directory of tools to analyze RewardBench results or other reward model properties
├── rewardbench/ <- Core utils and modeling files
| ├── models/ ├── Standalone files for running existing reward models
| └── *.py └── HERM tools and utilities
| └── *.py └── RewardBench tools and utilities
├── scripts/ <- Scripts and configs to train and evaluate reward models
├── tests <- Unit tests
├── Dockerfile <- Build file for reproducible and scalable research at AI2
@@ -84,7 +97,7 @@ python scripts/run_bon.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2

## Maintenance

### Updating the docker image (consider removing this section when we publicly release HERM)
### Updating the docker image (consider removing this section when we publicly release RewardBench)
When updating this repo, the docker image should be rebuilt to include those changes.
For AI2 members, please update the list below with any images you use regularly.
For example, if you update `scripts/run_rm.py` and include a new package (or change a package version), you should rebuild the image and verify it still works on known models.
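The README's usage section is truncated above, so here is a hedged sketch of the two primary entry points it lists. The `--model` flag is only confirmed for `run_bon.py` in this diff, and the model names below are placeholders:

```
# Score a reward model on the benchmark (model name is a placeholder).
python scripts/run_rm.py --model=<reward-model-on-the-hf-hub>
# Score a DPO-trained model used as an implicit reward model (placeholder).
python scripts/run_dpo.py --model=<dpo-model-on-the-hf-hub>
```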
4 changes: 2 additions & 2 deletions analysis/README.md
@@ -14,13 +14,13 @@ python analysis/plot_per_subset_dist.py --output_dir=plots/whisker
```

### Get benchmark results
This prints out the HERM results in a Markdown or LaTeX table. Note that you need to pass an API token to the `HF_COLLAB_TOKEN` environment variable.
This prints out the RewardBench results in a Markdown or LaTeX table. Note that you need to pass an API token to the `HF_COLLAB_TOKEN` environment variable.
```
# Use --render_latex for LaTeX output
python analysis/get_benchmark_results.py
```

Below is a snippet of the output for the HERM - General results:
Below is a snippet of the output for the RewardBench - General results:

| model | average | alpacaeval | mt-bench | llmbar | refusals | hep |
|--------------------------------------------------|-----------|--------------|------------|----------|------------|--------|
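Putting the notes above together, a typical invocation of `get_benchmark_results.py` first exports the `HF_COLLAB_TOKEN` the script expects; the token value is a placeholder:

```
# Provide the HuggingFace API token, then print the RewardBench results table.
export HF_COLLAB_TOKEN=<your-hf-api-token>
python analysis/get_benchmark_results.py   # add --render_latex for LaTeX output
```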
2 changes: 1 addition & 1 deletion analysis/bon_to_alpacaeval.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# Script for converting HERM best of n (BoN) results into the AlpacaEval format
# Script for converting RewardBench best of n (BoN) results into the AlpacaEval format

import argparse
import os
5 changes: 4 additions & 1 deletion analysis/draw_model_histogram.py
@@ -17,7 +17,10 @@
import argparse
from pathlib import Path

from herm.visualization import draw_model_source_histogram, print_model_statistics
from rewardbench.visualization import (
draw_model_source_histogram,
print_model_statistics,
)


def get_args():
2 changes: 1 addition & 1 deletion analysis/draw_per_token_reward.py
@@ -22,7 +22,7 @@
import numpy as np
import spacy_alignments as tokenizations

from herm.visualization import draw_per_token_reward
from rewardbench.visualization import draw_per_token_reward

DEFAULT_DIRNAME = "per-token-reward"

8 changes: 4 additions & 4 deletions analysis/get_benchmark_results.py
@@ -58,7 +58,7 @@ def get_args():
return args


def get_average_over_herm(
def get_average_over_rewardbench(
df: pd.DataFrame,
subsets: List[str] = ["alpacaeval", "mt-bench", "llmbar", "refusals", "hep"],
) -> pd.DataFrame:
@@ -96,7 +96,7 @@ def main():
print(f"Downloading repository snapshots into '{LOCAL_DIR}' directory")
# Load the remote repository using the HF API
hf_evals_repo = snapshot_download(
local_dir=Path(LOCAL_DIR) / "herm",
local_dir=Path(LOCAL_DIR) / "rewardbench",
repo_id=args.hf_evals_repo,
use_auth_token=api_token,
tqdm_class=None,
@@ -107,8 +107,8 @@
hf_prefs_df = load_results(hf_evals_repo, subdir="pref-sets/", ignore_columns=args.ignore_columns)

all_results = {
"HERM - Overview": get_average_over_herm(hf_evals_df),
"HERM - Detailed": hf_evals_df,
"RewardBench - Overview": get_average_over_rewardbench(hf_evals_df),
"RewardBench - Detailed": hf_evals_df,
"Pref Sets - Overview": hf_prefs_df,
}

2 changes: 1 addition & 1 deletion analysis/get_per_token_reward.py
@@ -35,7 +35,7 @@
pipeline,
)

from herm import models
from rewardbench import models

REWARD_MODEL_CONFIG = {
"default": {
14 changes: 14 additions & 0 deletions analysis/get_subtoken_statistics.py
@@ -1,3 +1,17 @@
# Copyright 2023 AllenAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path
from typing import Any, Dict
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion herm/utils.py → rewardbench/utils.py
@@ -23,7 +23,7 @@
from huggingface_hub import HfApi
from transformers import PreTrainedTokenizer

from herm.models import REWARD_MODEL_CONFIG
from rewardbench.models import REWARD_MODEL_CONFIG

# HuggingFace Hub locations
CORE_EVAL_SET = "ai2-adapt-dev/rm-benchmark-dev"
File renamed without changes.
6 changes: 3 additions & 3 deletions scripts/configs/beaker_eval.yaml
@@ -1,8 +1,8 @@
version: v2
description: herm-eval-default
description: rewardbench-eval-default
budget: ai2/allennlp
tasks:
- name: herm-eval-default
- name: rewardbench-eval-default
image:
beaker: <image>
command: [
@@ -24,7 +24,7 @@ tasks:
- name: TRANSFORMERS_CACHE
value: ./cache/
- name: WANDB_PROJECT
value: herm
value: rewardbench
- name: WANDB_WATCH
value: false
- name: WANDB_LOG_MODEL
2 changes: 1 addition & 1 deletion scripts/run_bon.py
@@ -28,7 +28,7 @@
from tqdm import tqdm
from transformers import AutoTokenizer, pipeline

from herm import REWARD_MODEL_CONFIG, load_bon_dataset, save_to_hub
from rewardbench import REWARD_MODEL_CONFIG, load_bon_dataset, save_to_hub

# get token from HF_TOKEN env variable, but if it doesn't exist pass none
HF_TOKEN = os.getenv("HF_TOKEN", None)
2 changes: 1 addition & 1 deletion scripts/run_dpo.py
@@ -27,7 +27,7 @@
from tqdm import tqdm
from trl.trainer.utils import DPODataCollatorWithPadding

from herm import DPO_MODEL_CONFIG, DPOInference, load_eval_dataset, save_to_hub
from rewardbench import DPO_MODEL_CONFIG, DPOInference, load_eval_dataset, save_to_hub

# get token from HF_TOKEN env variable, but if it doesn't exist pass none
HF_TOKEN = os.getenv("HF_TOKEN", None)
2 changes: 1 addition & 1 deletion scripts/run_rm.py
@@ -26,7 +26,7 @@
from tqdm import tqdm
from transformers import AutoTokenizer, pipeline

from herm import REWARD_MODEL_CONFIG, load_eval_dataset, save_to_hub
from rewardbench import REWARD_MODEL_CONFIG, load_eval_dataset, save_to_hub

# get token from HF_TOKEN env variable, but if it doesn't exist pass none
HF_TOKEN = os.getenv("HF_TOKEN", None)
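As the hunks above show, `run_bon.py`, `run_dpo.py`, and `run_rm.py` all read an optional `HF_TOKEN` environment variable and fall back to `None` when it is unset. A sketch of supplying it for a run; the token and model name are placeholders, and needing the token only for gated models or Hub uploads is an assumption:

```
# Optional: the scripts fall back to None when HF_TOKEN is unset.
export HF_TOKEN=<your-hf-access-token>
python scripts/run_bon.py --model=<reward-model-on-the-hf-hub>
```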
6 changes: 3 additions & 3 deletions scripts/submit_eval_jobs.py
@@ -95,12 +95,12 @@
experiment_group = "dpo-eval"
script = "run_dpo.py"
else:
experiment_group = "herm-preference-sets"
experiment_group = "rewardbench-preference-sets"
script = "run_rm.py"
print(f"Submitting evaluation for model: {model} on {experiment_group}")
d = copy.deepcopy(d1)

name = f"herm_eval_for_{model}_on_{experiment_group}".replace("/", "-")
name = f"rewardbench_eval_for_{model}_on_{experiment_group}".replace("/", "-")
d["description"] = name
d["tasks"][0]["name"] = name

@@ -133,5 +133,5 @@
yaml.dump(d, file, default_flow_style=True)
file.close()

cmd = "beaker experiment create {} --workspace ai2/herm".format(fn)
cmd = "beaker experiment create {} --workspace ai2/rewardbench".format(fn)
subprocess.Popen(cmd, shell=True)
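The submission script writes one Beaker YAML per model and then shells out to the same `beaker experiment create` command shown on its last line. Re-submitting a generated config by hand would look roughly like this; the filename is hypothetical:

```
# Submit a previously generated experiment config to the renamed workspace.
beaker experiment create <generated-config>.yaml --workspace ai2/rewardbench
```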
4 changes: 2 additions & 2 deletions setup.py
@@ -15,14 +15,14 @@
from setuptools import find_packages, setup

setup(
name="herm",
name="rewardbench",
version="0.1.0.dev",
author="Nathan Lambert",
author_email="[email protected]",
description="Tools for evaluating reward models",
long_description=open("README.md").read(),
long_description_content_type="text/markdown",
url="https://github.com/allenai/herm",
url="https://github.com/allenai/rewardbench",
packages=find_packages(),
classifiers=[
"Programming Language :: Python :: 3",
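With the package renamed in `setup.py`, an editable install now exposes `rewardbench` rather than `herm`. A standard sketch, not part of this diff:

```
# Editable install from the repository root; `import rewardbench` should now resolve.
pip install -e .
```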
6 changes: 5 additions & 1 deletion tests/test_data.py
@@ -17,7 +17,11 @@
from fastchat.conversation import get_conv_template
from transformers import AutoTokenizer

from herm import load_eval_dataset, prepare_dialogue, prepare_dialogue_from_tokenizer
from rewardbench import (
load_eval_dataset,
prepare_dialogue,
prepare_dialogue_from_tokenizer,
)


class PrepareDialoguesTest(unittest.TestCase):
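Since the tests import from `rewardbench` and subclass `unittest.TestCase`, the suite can be run with the standard library runner to verify the rename; using `pytest` instead would also work but is an assumption, as no test runner is shown in this diff:

```
# Run the data-preparation tests against the renamed package.
python -m unittest discover -s tests
```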
