added 10 papers (+trainer cross-links) for #4407 #4441
base: main
Conversation
Added new papers on mathematical reasoning, preference optimization, and model alignment.
qgallouedec
left a comment
Thanks for the contribution! I made a few comments; the idea is basically to align with the rest of the page.
Removed unnecessary lines and references to trainers.
Added a new section for DeepSeekMath and its GRPO setup.
Added Python code examples for various optimization techniques and updated paper references.
Added section on Parameter-Efficient Fine-Tuning (PEFT) with LoRA, including a code example for implementation.
docs/source/paper_index.md
Outdated
```python
training_args = GRPOConfig(
    loss_type="grpo",
    beta=0.0,  # GRPO commonly trains without explicit KL in released configs
)
```
In the original paper they don't use beta=0.0
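For reference, a minimal sketch that keeps an explicit KL term instead (the DeepSeekMath setup reportedly uses a KL coefficient of 0.04; treat the exact value as an assumption to tune):

```python
from trl import GRPOConfig

# keep an explicit KL penalty, closer to the original GRPO recipe
training_args = GRPOConfig(beta=0.04)
```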
removed
docs/source/paper_index.md
Outdated
```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

def rso_accept(ex):  # replace with your statistic (gap / z-score / judge score)
    return ex.get("rso_keep", True)

dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=dpo_pairs)
trainer.train()
```
Suggested change:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

train_dataset = load_dataset(...)

def rso_accept(example):  # replace with your statistic (gap / z-score / judge score)
    return example.get("rso_keep", True)

train_dataset = train_dataset.filter(rso_accept)

training_args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```
for consistency
minimal
docs/source/paper_index.md
Outdated
```python
dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("..."); tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
```
isn't the loss supposed to be "hinged"?
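For context, a minimal sketch of the hinge variant that `DPOConfig` exposes (assuming the entry is meant to use the SLiC-style hinge loss rather than the default sigmoid loss):

```python
from trl import DPOConfig

# SLiC-style hinge loss instead of the default sigmoid DPO loss
training_args = DPOConfig(loss_type="hinge")
```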
updated
docs/source/paper_index.md
Outdated
```python
augmented_pairs = dpo_pairs.add_items(new_pairs)

args = DPOConfig(loss_type="sigmoid", beta=0.1)
```
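Assuming `dpo_pairs` and `new_pairs` are both `datasets.Dataset` objects, a minimal sketch of the same augmentation with `concatenate_datasets`:

```python
from datasets import concatenate_datasets

# merge the original preference pairs with the newly generated ones
augmented_pairs = concatenate_datasets([dpo_pairs, new_pairs])
```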
There's no need to pass a value if it matches the default. In other words, remove any occurrence like `loss_type="sigmoid"` or `beta=0.1`.
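A minimal sketch of that convention, leaving the defaults implicit:

```python
from trl import DPOConfig

# sigmoid loss and beta=0.1 are the defaults, so they are not restated
training_args = DPOConfig()
```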
docs/source/paper_index.md
Outdated
```python
# LoRA adapters with SFT (works the same for DPO/GRPO by passing peft_config to those trainers)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM on HF Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

peft_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # common modules for LLaMA/Mistral/Qwen/Gemma; adjust per model if needed
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)

args = SFTConfig(
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    tokenizer=tok,
    peft_config=peft_cfg,  # <- LoRA enabled
    train_dataset=...,
)
trainer.train()
```
Suggested change:

```python
from peft import LoraConfig
from trl import SFTTrainer

trainer = SFTTrainer(
    ...,
    peft_config=LoraConfig(),
)
```
the more minimal, the clearer
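As the comment in the original block noted, the same pattern carries over to the other trainers by passing `peft_config`; a minimal sketch with `DPOTrainer` (the remaining arguments are placeholders):

```python
from peft import LoraConfig
from trl import DPOTrainer

# LoRA for preference tuning: same idea, different trainer
trainer = DPOTrainer(
    ...,
    peft_config=LoraConfig(),
)
```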
updated
Removed beta parameter from GRPOConfig.
Removed epsilon_high and steps_per_generation parameters from training_args.
Added section for Direct Preference Optimization with related papers.
…type Updated the LoRA adapter configuration and training setup for clarity and efficiency.
@qgallouedec, please review the changes.
@qgallouedec, would it be possible to find some time to review the updates? Thank you.
Summary
Expands the Paper Index with 10 additional papers spanning GRPO/DPO/CPO, SFT/PEFT, and foundational systems. Entries follow the house style (title → link → one-line summary → trainer cross-link when relevant).
Added entries (by section)
- Group Relative Policy Optimization (used in TRL via GRPOTrainer)
- Direct Preference Optimization
- Contrastive Preference Optimization (used in TRL via CPOTrainer)
- Supervised Fine-Tuning
- Reward / Process Supervision (Background)
- Distillation / Post-training (Background)
- Foundations & Systems (Background)
Implementation details
Scope: docs/source/paper_index.md only.
Checklist
Relates to #4407.