Modified GRPOTrainer to accumulate gradient within a single training batch #3288
base: main
Conversation
Thanks @jarrelscy, I understand the motivation. Just for clarification: why not use gradient accumulation instead? Is it because generation would also be done on smaller batches, which would make things slower?
Hi @qgallouedec, as @JamesBowerXanda pointed out here, the quality of the loss depends on the group size. In this paper they point out that you need a large group size to approximate the expected reward normalised by the standard deviation of the reward of an output sampled from the previous policy. In GRPO, each generation is assigned an advantage relative to the other generations in its group, so if the group size is small this can lead to erratic losses. With gradient accumulation (per batch), we would still only be comparing the advantage of each generation against the other generations within that batch.
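For context, a minimal sketch of the group-relative advantage GRPO uses (illustrative only; the tensor names, shapes, and epsilon are assumptions, not the trainer's actual code). Each completion's reward is normalised against the mean and standard deviation of its own group, so a small group gives a noisy estimate of both statistics and hence noisy advantages:

```python
import torch

# Illustrative sketch, not GRPOTrainer internals.
# rewards: one scalar reward per completion, grouped by prompt.
num_prompts, num_generations = 4, 8  # assumed shapes for illustration
rewards = torch.randn(num_prompts, num_generations)

# Each completion is compared only against the other completions for the same prompt,
# so a small num_generations makes the per-group mean/std (and the advantages) erratic.
group_mean = rewards.mean(dim=1, keepdim=True)
group_std = rewards.std(dim=1, keepdim=True)
advantages = (rewards - group_mean) / (group_std + 1e-4)
```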
FYI, now you can pass a group as large as
Closing this as I believe the motivation behind this PR has been addressed by #3283 |
@qgallouedec this PR is not exactly the same as #3283 - it's akin to the comment here on #3283, which states that this functionality is not implemented in #3283.
Ok, sorry for the misjudgement; I'm reopening the PR.
Hello, any update on this? Have you merged it into the master branch?
@jiangix-paper there are some new changes on the trl main branch which I think are not compatible, namely the entropy masking implementation. I've just done a merge; feel free to try cloning and testing it.
What does this PR do?
GRPOTrainer calculates advantages and then calculates the loss per completion. Currently this is all done within a single batch, which can take a lot of memory. Just like with gradient accumulation, we can call .backward() on the loss for each completion separately. This PR does so by introducing a new parameter into GRPOConfig called num_generations_chunks, which num_generations needs to be a multiple of. Doing so causes loss.backward() to be called per num_generations_chunks completions.
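Conceptually, the change amounts to the following pattern (a simplified sketch under assumed names, not the trainer's actual code; `loss_fn`, `completions`, and `num_chunks` are placeholders): advantages are computed once over the full group, then the loss and its backward pass run chunk by chunk, so only one chunk's activations are alive at a time while gradients accumulate in the parameters.

```python
import torch

def chunked_backward(completions, advantages, num_chunks, loss_fn):
    """Sketch: call .backward() per chunk of completions instead of once for the whole group.

    `completions`, `advantages`, and `loss_fn` stand in for the tokenised completion
    tensors and the GRPO per-token loss used by the real trainer.
    """
    comp_chunks = torch.chunk(completions, num_chunks, dim=0)
    adv_chunks = torch.chunk(advantages, num_chunks, dim=0)
    for comp_chunk, adv_chunk in zip(comp_chunks, adv_chunks):
        # Scale each chunk's loss so the accumulated gradient matches a single
        # backward pass over the full batch (standard gradient-accumulation scaling).
        loss = loss_fn(comp_chunk, adv_chunk) / num_chunks
        loss.backward()  # gradients accumulate in .grad; optimizer.step() runs once afterwards
```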
Example usage:
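The original example was not preserved here; a hypothetical sketch of what usage might look like, based only on the description above (the checkpoint, reward function, dataset, and specific values are illustrative assumptions, and only num_generations_chunks comes from this PR):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny illustrative prompt dataset.
train_dataset = Dataset.from_dict(
    {"prompt": ["Write a haiku about the sea.", "Explain GRPO in one line."]}
)

config = GRPOConfig(
    output_dir="grpo-output",
    per_device_train_batch_size=16,
    num_generations=16,
    num_generations_chunks=4,  # new parameter from this PR; num_generations must be a multiple of it
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # any causal LM; this checkpoint is illustrative
    reward_funcs=lambda completions, **kwargs: [len(c) / 100 for c in completions],
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```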
Fixes #3017
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.