
Conversation

@SSusantAchary

Summary

Expands the Paper Index with 10 additional papers spanning GRPO/DPO/CPO, SFT/PEFT, and foundational systems. Entries follow the house style (title β†’ πŸ“œ link β†’ one-line summary β†’ trainer cross-link when relevant).

Added entries (by section)

Group Relative Policy Optimization

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2402.03300) β€” Introduces GRPO and shows strong math-reasoning gains.
    Used in TRL via: GRPOTrainer
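
For orientation, a minimal GRPOTrainer sketch modeled on the TRL quickstart; the model ID, dataset, and toy reward function below are illustrative placeholders, not part of the paper-index entry:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: GRPO optimizes against one or more functions that score the
# sampled completions for each prompt (here, closeness to 20 characters).
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

train_dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset
training_args = GRPOConfig(output_dir="grpo-demo")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # any causal LM on the Hub
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()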

Direct Preference Optimization

  • Statistical Rejection Sampling Improves Preference Optimization (2309.06657) — RSO for better preference pairs; complements DPO/SLiC (see the sketch after this list).
  • Nash Learning from Human Feedback (2312.00886) β€” Frames alignment as a two-player game with Nash policies.
  • Direct Language Model Alignment from Online AI Feedback (2402.04792) β€” Online AI feedback signals for direct alignment.
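
For the RSO entry, a partial sketch in the page's minimal style, mirroring the snippet discussed in the review below; the dataset and the acceptance statistic are placeholders:

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

train_dataset = load_dataset(...)  # preference pairs carrying an RSO acceptance signal

def rso_accept(example):  # replace with your statistic (gap / z-score / judge score)
    return example.get("rso_keep", True)

train_dataset = train_dataset.filter(rso_accept)
training_args = DPOConfig()
trainer = DPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()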

Contrastive Preference Optimization

  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation (2401.08417) β€” Contrastive pairs to avoid adequate-but-suboptimal outputs.
    Used in TRL via: CPOTrainer
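
A partial sketch of the CPOTrainer hookup in the same minimal style; the dataset is a placeholder, and CPO expects prompt/chosen/rejected preference pairs:

from datasets import load_dataset
from trl import CPOConfig, CPOTrainer

train_dataset = load_dataset(...)  # preference pairs: "prompt", "chosen", "rejected"
training_args = CPOConfig()
trainer = CPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()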

Supervised Fine-Tuning

  • LoRA: Low-Rank Adaptation of Large Language Models (2106.09685) β€” Parameter-efficient adapters; reference for TRL PEFT integration.
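
The TRL-side hookup is just a peft_config argument; a minimal sketch mirroring the reviewer's suggested snippet below, with LoRA hyperparameters left at PEFT defaults:

from peft import LoraConfig
from trl import SFTTrainer

trainer = SFTTrainer(
    ...,
    peft_config=LoraConfig(),  # set r, lora_alpha, target_modules per model as needed
)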

Reward / Process Supervision (Background)

  • Solving Math Word Problems with Process- and Outcome-Based Feedback (2211.14275) β€” Motivates process supervision signals.

Distillation / Post-training (Background)

  • On-Policy Distillation of Language Models (2306.13649) — Generalized Knowledge Distillation (GKD): on-policy student/teacher distillation.
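
TRL ships a GKDTrainer for this recipe; a partial sketch under the assumption that the current API takes a teacher_model argument alongside the usual trainer arguments (verify the exact signature against the GKDTrainer docs):

from trl import GKDConfig, GKDTrainer

training_args = GKDConfig()
trainer = GKDTrainer(
    ...,
    teacher_model="...",  # path or Hub ID of the teacher model (assumed keyword)
    args=training_args,
)
trainer.train()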

Foundations & Systems (Background)

  • Proximal Policy Optimization Algorithms (1707.06347) β€” PPO foundation used across RL/RLHF variants.
  • ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models (1910.02054) — Partitions optimizer states, gradients, and parameters across data-parallel workers to train at scale.
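
ZeRO reaches TRL through the DeepSpeed integration of the underlying TrainingArguments; a minimal sketch assuming a stage-2 setup launched with accelerate/deepspeed, with illustrative config values (a path to a JSON file works equally well):

from trl import SFTConfig

# Minimal DeepSpeed ZeRO config; "auto" values are resolved by the integration.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
training_args = SFTConfig(deepspeed=ds_config)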

Implementation details

  • Kept headings/emoji/spacing consistent with existing sections.
  • Added/confirmed link-refs for the new entries.
  • Avoided duplicates (scanned current page before insertion).

Scope

  • Docs only (docs/source/paper_index.md).
  • No API or code changes.

Checklist

  • Neutral one-liners (≀ ~30 words)
  • Live links (HF paper pages + trainer docs where applicable)
  • Consistent style with existing entries
  • References to the complete paper index (#4407) included

Relates to #4407.

Added new papers on mathematical reasoning, preference optimization, and model alignment.
@SSusantAchary mentioned this pull request Nov 3, 2025
@qgallouedec (Member) left a comment

Thanks for the contribution. I made a few comments; basically the idea is to align with the rest of the page.

Removed unnecessary lines and references to trainers.
Added a new section for DeepSeekMath and its GRPO setup.
Added Python code examples for various optimization techniques and updated paper references.
Added section on Parameter-Efficient Fine-Tuning (PEFT) with LoRA, including a code example for implementation.

training_args = GRPOConfig(
    loss_type="grpo",
    beta=0.0,  # GRPO commonly trains without explicit KL in released configs
@qgallouedec (Member):

In the original paper they don't use beta=0.0

@SSusantAchary (Author):

removed

Comment on lines 482 to 494
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

def rso_accept(ex):  # replace with your statistic (gap / z-score / judge score)
    return ex.get("rso_keep", True)

dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=dpo_pairs)
trainer.train()
@qgallouedec (Member):

Suggested change
-from datasets import Dataset
-from trl import DPOConfig, DPOTrainer
-from transformers import AutoModelForCausalLM, AutoTokenizer
-def rso_accept(ex):  # replace with your statistic (gap / z-score / judge score)
-    return ex.get("rso_keep", True)
-dpo_pairs = dpo_pairs.filter(rso_accept)
-model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
-args = DPOConfig(loss_type="sigmoid", beta=0.1)
-trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=dpo_pairs)
-trainer.train()
+from datasets import load_dataset
+from trl import DPOConfig, DPOTrainer
+train_dataset = load_dataset(...)
+def rso_accept(example):  # replace with your statistic (gap / z-score / judge score)
+    return example.get("rso_keep", True)
+train_dataset = train_dataset.filter(rso_accept)
+training_args = DPOConfig(loss_type="sigmoid", beta=0.1)
+trainer = DPOTrainer(
+    ...,
+    args=training_args,
+    train_dataset=train_dataset
+)
+trainer.train()

for consistency

@SSusantAchary (Author):

minimal

dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
@qgallouedec (Member):

isn't the loss supposed to be "hinged"?

@SSusantAchary (Author):

updated


augmented_pairs = dpo_pairs.add_items(new_pairs)

args = DPOConfig(loss_type="sigmoid", beta=0.1)
@qgallouedec (Member):

There's no need to pass a value if it matches the default. In other words, remove any occurrence like loss_type="sigmoid" or beta=0.1.

Comment on lines 603 to 638
# LoRA adapters with SFT (works the same for DPO/GRPO by passing peft_config to those trainers)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM on HF Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

peft_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # common modules for LLaMA/Mistral/Qwen/Gemma; adjust per model if needed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

args = SFTConfig(
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    tokenizer=tok,
    peft_config=peft_cfg,  # <- LoRA enabled
    train_dataset=...,
)
trainer.train()

@qgallouedec (Member):

Suggested change
-# LoRA adapters with SFT (works the same for DPO/GRPO by passing peft_config to those trainers)
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import LoraConfig
-from trl import SFTTrainer, SFTConfig
-model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM on HF Hub
-tok = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
-peft_cfg = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    bias="none",
-    task_type="CAUSAL_LM",
-    # common modules for LLaMA/Mistral/Qwen/Gemma; adjust per model if needed
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-)
-args = SFTConfig(
-    max_seq_length=2048,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=8,
-    learning_rate=2e-4,
-    bf16=True,
-)
-trainer = SFTTrainer(
-    model=model,
-    args=args,
-    tokenizer=tok,
-    peft_config=peft_cfg,  # <- LoRA enabled
-    train_dataset=...,
-)
-trainer.train()
+from peft import LoraConfig
+from trl import SFTTrainer
+trainer = SFTTrainer(
+    ...,
+    peft_config=LoraConfig(),
+)

the more minimal, the clearer

@SSusantAchary (Author):

updated

@SSusantAchary (Author):

@qgallouedec, please review the changes.

@SSusantAchary (Author):

@qgallouedec, would it be possible to find some time to review the updates? Thank you.
