---
license: apache-2.0
datasets:
  - Skywork/Skywork-Reward-Preference-80K-v0.1
base_model:
  - Qwen/Qwen2-7B-Instruct
---

## Introduction

Con-J-Qwen2-7B (learning the generative **J**udge using self-generated **Con**trastive judgments) is a generative judge built on the Qwen2-7B-Instruct model and trained on the Skywork/Skywork-Reward-Preference-80K-v0.1 preference dataset.

Con-J-Qwen2-7B is trained from preference data. We prompt the pre-trained Qwen2-7B-Instruct model to generate positive and negative judgments, each supported by a rationale in natural language. The self-generated contrastive judgment pairs are then used to train the generative judge with Direct Preference Optimization (DPO). In this way, Con-J learns to act as a generative judge that provides accurate judgments with supporting rationales.
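As a conceptual sketch of how a pair of contrastive judgments becomes a DPO training example (the function and field names below are hypothetical; the actual construction is done by the released script `construct_dpo_dataset_for_critic_model.py`):

```python
# Conceptual sketch only: turn a pair of self-generated judgments into a DPO example.
# Function and field names are hypothetical illustrations, not the released pipeline.
def build_dpo_example(judge_prompt: str, positive_judgment: str, negative_judgment: str) -> dict:
    """Pair a judgment that agrees with the human preference label (chosen)
    with one that contradicts it (rejected)."""
    return {
        "prompt": judge_prompt,         # judge prompt: the question plus the two candidate answers
        "chosen": positive_judgment,    # rationale + verdict consistent with the preference label
        "rejected": negative_judgment,  # rationale + verdict contradicting the preference label
    }
```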

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Model/Con-J-Qwen2-7B"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "What is the range of the numeric output of a sigmoid node in a neural network?"
answer1 = "The output of a sigmoid node is bounded between -1 and 1."
answer2 = "The output of a sigmoid node is bounded between 0 and 1."

# Judge prompt (in Chinese): "As an evaluation expert, given a question and two candidate
# answers, select the one that is better in terms of coherence, accuracy, coverage, and
# overall quality. Output your judgment in JSON, where '原因' (reason) is your explanation
# and '更好的回答' (better answer) is the integer 1 or 2."
CON_J_PROMPT = """作为一个评价专家,给定一个问题和它的两个可能的回答,请选出哪一个回答在连贯性、准确性、覆盖度和上述定义的整体质量方面最为符合。请用JSON格式输出你的判断, 其中"原因"是你提供的解释,"更好的回答"是整数类型的1或2,例如{{"原因": "你的解释", "更好的回答": 1}}。以下是问题和候选回答的内容:
    \n问题:{instruction}
回答1:{output_1}
回答2:{output_2}"""
user_prompt = CON_J_PROMPT.format(instruction=question, output_1=answer1, output_2=answer2)
system_prompt = ""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Generate the judgment for the given prompt
with torch.no_grad():
    generated_ids = model.generate(inputs.input_ids, do_sample=False, max_new_tokens=2048)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# response: {"原因": "回答1中的-1是错误的,因为sigmoid函数的实际输出范围是0到1,而不是包括-1。回答2准确地描述了sigmoid函数的输出范围是0到1。",\n "更好的回答": 2}
# (Translation: "The -1 in Answer 1 is wrong, because the sigmoid's actual output range is
#  0 to 1 and does not include -1. Answer 2 correctly describes the range as 0 to 1."
#  Better answer: 2)
```
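The model returns its judgment as a JSON string. Assuming the output is well-formed JSON (greedy decoding makes this likely but does not guarantee it), it can be parsed as follows; this helper is an illustrative addition, not part of the released example:

```python
import json

# Parse the judge's JSON verdict. The keys are the Chinese field names requested
# in the prompt: "原因" (reason) and "更好的回答" (better answer, 1 or 2).
judgment = json.loads(response)
print("Rationale:", judgment["原因"])
print("Preferred answer:", judgment["更好的回答"])  # 2 for the example above
```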

## Performance

| Model | Infinity-Preference | UltraFeedback | PKU-SafeRLHF | Reward-Bench: Chat | Reward-Bench: Chat-H | Reward-Bench: Safety | Reward-Bench: Reasoning |
|---|---|---|---|---|---|---|---|
| Llama3.1-8B | 59.0 | 62.9 | 66.4 | 80.7 | 49.8 | 64.0 | 68.1 |
| Llama3.1-70B | 64.0 | 71.4 | 67.6 | 97.2 | 70.2 | 82.8 | 86.0 |
| Qwen2-7B | 59.0 | 64.5 | 67.2 | 91.3 | 44.8 | 73.6 | 69.0 |
| Qwen2.5-72B | 70.0 | 66.0 | 58.7 | 86.6 | 61.4 | 74.5 | 90.7 |
| Auto-J | 69.0 | 63.9 | 66.9 | 93.0 | 40.0 | 65.5 | 50.5 |
| Prometheus 2 | 68.0 | 63.3 | 63.0 | 85.5 | 49.1 | 77.1 | 76.5 |
| GPT-4o | 75.0 | 72.2 | 69.6 | 95.3 | 74.3 | 87.6 | 86.9 |
| Con-J (ours) | 81.0 | 73.0 | 68.4 | 91.3 | 79.6 | 88.0 | 87.1 |

## Training Scripts

The training of Con-J is based on a modified version of OpenRLHF. The training scripts are available in `Code/run_scripts/`. Training Con-J involves the following steps:

```bash
task_name="Skywork-Reward-Preference-80K-v0.1"
cd run_scripts/Qwen2/
# repeated sampling
sh vllm_inference_best_of_n.sh 8 $task_name
# hint-driven sampling
sh vllm_inference_all.sh $task_name
# dataset filtering and construction
python ../../examples/construct_dpo_dataset_for_critic_model.py --task $task_name
# contrastive training
sh train_dpo.sh $task_name
# inference and evaluation
sh vllm_inference2.sh $task_name $task_name
```

To enable Con-J training, download the base model Qwen/Qwen2-7B-Instruct and the dataset Skywork/Skywork-Reward-Preference-80K-v0.1 to the locations expected by the training scripts. The downloaded dataset can then be preprocessed by running:

```bash
python preprocess_dataset.py
```
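If the base model and dataset are not yet available locally, they can be fetched from the Hugging Face Hub before running the steps above, for example with `huggingface_hub` (a minimal sketch; the `local_dir` values are placeholders and should match the paths the run scripts expect):

```python
from huggingface_hub import snapshot_download

# Download the base model and the preference dataset.
# The local_dir paths are placeholders; point them to the locations
# expected by the training scripts.
snapshot_download(repo_id="Qwen/Qwen2-7B-Instruct",
                  local_dir="Qwen2-7B-Instruct")
snapshot_download(repo_id="Skywork/Skywork-Reward-Preference-80K-v0.1",
                  repo_type="dataset",
                  local_dir="Skywork-Reward-Preference-80K-v0.1")
```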

## Reference

Coming soon.