
Conversation

@SSusantAchary

Summary

Expands the Paper Index with 10 additional papers spanning GRPO/DPO/CPO, SFT/PEFT, and foundational systems. Entries follow the house style (title β†’ πŸ“œ link β†’ one-line summary β†’ trainer cross-link when relevant).

Added entries (by section)

Group Relative Policy Optimization

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2402.03300) β€” Introduces GRPO and shows strong math-reasoning gains.
    Used in TRL via: GRPOTrainer
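
For orientation, a minimal GRPOTrainer sketch modeled on the TRL quickstart; the model ID, dataset, and toy reward function below are illustrative placeholders, not part of the paper-index entry:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: GRPO optimizes against one or more functions that score the
# sampled completions for each prompt (here, closeness to 20 characters).
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

train_dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset
training_args = GRPOConfig(output_dir="grpo-demo")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # any causal LM on the Hub
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()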

Direct Preference Optimization

  • Statistical Rejection Sampling Improves Preference Optimization (2309.06657) — RSO for better preference pairs; complements DPO/SLiC (see the sketch after this list).
  • Nash Learning from Human Feedback (2312.00886) β€” Frames alignment as a two-player game with Nash policies.
  • Direct Language Model Alignment from Online AI Feedback (2402.04792) β€” Online AI feedback signals for direct alignment.
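
For the RSO entry, a partial sketch in the page's minimal style, mirroring the snippet discussed in the review below; the dataset and the acceptance statistic are placeholders:

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

train_dataset = load_dataset(...)  # preference pairs carrying an RSO acceptance signal

def rso_accept(example):  # replace with your statistic (gap / z-score / judge score)
    return example.get("rso_keep", True)

train_dataset = train_dataset.filter(rso_accept)
training_args = DPOConfig()
trainer = DPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()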

Contrastive Preference Optimization

  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation (2401.08417) β€” Contrastive pairs to avoid adequate-but-suboptimal outputs.
    Used in TRL via: CPOTrainer
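
A partial sketch of the CPOTrainer hookup in the same minimal style; the dataset is a placeholder, and CPO expects prompt/chosen/rejected preference pairs:

from datasets import load_dataset
from trl import CPOConfig, CPOTrainer

train_dataset = load_dataset(...)  # preference pairs: "prompt", "chosen", "rejected"
training_args = CPOConfig()
trainer = CPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()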

Supervised Fine-Tuning

  • LoRA: Low-Rank Adaptation of Large Language Models (2106.09685) β€” Parameter-efficient adapters; reference for TRL PEFT integration.
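
The TRL-side hookup is just a peft_config argument; a minimal sketch mirroring the reviewer's suggested snippet below, with LoRA hyperparameters left at PEFT defaults:

from peft import LoraConfig
from trl import SFTTrainer

trainer = SFTTrainer(
    ...,
    peft_config=LoraConfig(),  # set r, lora_alpha, target_modules per model as needed
)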

Reward / Process Supervision (Background)

  • Solving Math Word Problems with Process- and Outcome-Based Feedback (2211.14275) β€” Motivates process supervision signals.

Distillation / Post-training (Background)

  • On-Policy Distillation of Language Models (2306.13649) — Generalized Knowledge Distillation (GKD): on-policy student/teacher distillation.
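
TRL ships a GKDTrainer for this recipe; a partial sketch under the assumption that the current API takes a teacher_model argument alongside the usual trainer arguments (verify the exact signature against the GKDTrainer docs):

from trl import GKDConfig, GKDTrainer

training_args = GKDConfig()
trainer = GKDTrainer(
    ...,
    teacher_model="...",  # path or Hub ID of the teacher model (assumed keyword)
    args=training_args,
)
trainer.train()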

Foundations & Systems (Background)

  • Proximal Policy Optimization Algorithms (1707.06347) β€” PPO foundation used across RL/RLHF variants.
  • ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models (1910.02054) — Partitions optimizer states, gradients, and parameters across data-parallel workers to train at scale.
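
ZeRO reaches TRL through the DeepSpeed integration of the underlying TrainingArguments; a minimal sketch assuming a stage-2 setup launched with accelerate/deepspeed, with illustrative config values (a path to a JSON file works equally well):

from trl import SFTConfig

# Minimal DeepSpeed ZeRO config; "auto" values are resolved by the integration.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
training_args = SFTConfig(deepspeed=ds_config)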

Implementation details

  • Kept headings/emoji/spacing consistent with existing sections.
  • Added/confirmed link-refs for the new entries.
  • Avoided duplicates (scanned current page before insertion).

Scope

  • Docs only (docs/source/paper_index.md).
  • No API or code changes.

Checklist

  • Neutral one-liners (≀ ~30 words)
  • Live links (HF paper pages + trainer docs where applicable)
  • Consistent style with existing entries
  • References to the complete paper index (#4407) included

Relates to #4407.

Added new papers on mathematical reasoning, preference optimization, and model alignment.
@SSusantAchary mentioned this pull request Nov 3, 2025
@qgallouedec (Member) left a comment

Thanks for the contribution. I made a few comments; basically the idea is to align with the rest of the page.

Removed unnecessary lines and references to trainers.
Added a new section for DeepSeekMath and its GRPO setup.
Added Python code examples for various optimization techniques and updated paper references.
Added section on Parameter-Efficient Fine-Tuning (PEFT) with LoRA, including a code example for implementation.

training_args = GRPOConfig(
    loss_type="grpo",
    beta=0.0,  # GRPO commonly trains without explicit KL in released configs
@qgallouedec (Member):

In the original paper they don't use beta=0.0

@SSusantAchary (Author):

removed

Comment on lines 482 to 494
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

def rso_accept(ex):  # replace with your statistic (gap / z-score / judge score)
    return ex.get("rso_keep", True)

dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=dpo_pairs)
trainer.train()
@qgallouedec (Member):

Suggested change
-from datasets import Dataset
-from trl import DPOConfig, DPOTrainer
-from transformers import AutoModelForCausalLM, AutoTokenizer
-def rso_accept(ex):  # replace with your statistic (gap / z-score / judge score)
-    return ex.get("rso_keep", True)
-dpo_pairs = dpo_pairs.filter(rso_accept)
-model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
-args = DPOConfig(loss_type="sigmoid", beta=0.1)
-trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=dpo_pairs)
-trainer.train()
+from datasets import load_dataset
+from trl import DPOConfig, DPOTrainer
+train_dataset = load_dataset(...)
+def rso_accept(example):  # replace with your statistic (gap / z-score / judge score)
+    return example.get("rso_keep", True)
+train_dataset = train_dataset.filter(rso_accept)
+training_args = DPOConfig(loss_type="sigmoid", beta=0.1)
+trainer = DPOTrainer(
+    ...,
+    args=training_args,
+    train_dataset=train_dataset
+)
+trainer.train()

for consistency

@SSusantAchary (Author):

minimal

dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
@qgallouedec (Member):

isn't the loss supposed to be "hinged"?

@SSusantAchary (Author):

updated


augmented_pairs = dpo_pairs.add_items(new_pairs)

args = DPOConfig(loss_type="sigmoid", beta=0.1)
@qgallouedec (Member):

There's no need to pass a value if it matches the default. In other words, remove any occurrence like loss_type="sigmoid" or beta=0.1.

Comment on lines 603 to 638
# LoRA adapters with SFT (works the same for DPO/GRPO by passing peft_config to those trainers)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM on HF Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

peft_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # common modules for LLaMA/Mistral/Qwen/Gemma; adjust per model if needed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

args = SFTConfig(
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    tokenizer=tok,
    peft_config=peft_cfg,  # <- LoRA enabled
    train_dataset=...,
)
trainer.train()

@qgallouedec (Member):

Suggested change
-# LoRA adapters with SFT (works the same for DPO/GRPO by passing peft_config to those trainers)
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import LoraConfig
-from trl import SFTTrainer, SFTConfig
-model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM on HF Hub
-tok = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
-peft_cfg = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    bias="none",
-    task_type="CAUSAL_LM",
-    # common modules for LLaMA/Mistral/Qwen/Gemma; adjust per model if needed
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-)
-args = SFTConfig(
-    max_seq_length=2048,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=8,
-    learning_rate=2e-4,
-    bf16=True,
-)
-trainer = SFTTrainer(
-    model=model,
-    args=args,
-    tokenizer=tok,
-    peft_config=peft_cfg,  # <- LoRA enabled
-    train_dataset=...,
-)
-trainer.train()
+from peft import LoraConfig
+from trl import SFTTrainer
+trainer = SFTTrainer(
+    ...,
+    peft_config=LoraConfig(),
+)

the more minimal, the clearer

@SSusantAchary (Author):

updated

@SSusantAchary (Author):

@qgallouedec, please review the changes.

@SSusantAchary (Author):

@qgallouedec, would it be possible to find some time to review the updates? Thank you.
