
🔁 Add retry mechanism for GRPO training with configurable reward thre… #2823

Conversation

@mandeep511 commented on Feb 10, 2025

What does this PR do?

This PR adds a retry mechanism to GRPO training that allows the model to make multiple attempts at answering each question until either:

  1. A satisfactory reward threshold is achieved, or
  2. The maximum number of retries is exhausted

Motivation and Distinction from num_generations

While num_generations (default=8) controls how many responses are generated in parallel within a single attempt, this retry mechanism controls how many times the entire batch of generations is re-attempted for each question.

Key differences:

  • num_generations: Multiple responses generated in parallel to compare relative performance within a group
  • max_retries_per_question: Sequential attempts at the same question, using feedback from previous attempts (see the sketch after this list)
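
To make the distinction concrete, here is a minimal sketch of the proposed retry loop. The callables generate and compute_rewards are placeholders for illustration, not the actual trainer internals; only num_generations, max_retries_per_question, and min_reward_threshold correspond to options named in this PR.

```python
# Illustrative sketch only -- generate() and compute_rewards() are placeholders,
# not the real GRPOTrainer internals.
def grpo_attempts(prompt, generate, compute_rewards,
                  num_generations=8,          # parallel completions per attempt
                  max_retries_per_question=1, # sequential attempts at the same prompt
                  min_reward_threshold=None):
    """Return the best group of completions found across sequential attempts."""
    best_completions, best_rewards = None, None
    for _ in range(max_retries_per_question):
        # One attempt = one full group of parallel generations, as in vanilla GRPO.
        completions = generate(prompt, n=num_generations)
        rewards = compute_rewards(prompt, completions)

        # Keep the attempt whose best reward is highest so far.
        if best_rewards is None or max(rewards) > max(best_rewards):
            best_completions, best_rewards = completions, rewards

        # Stop early once any completion clears the configured threshold.
        if min_reward_threshold is not None and max(rewards) >= min_reward_threshold:
            break
    return best_completions, best_rewards
```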

Motivation:
Sometimes none of the responses in a batch achieve satisfactory quality. Rather than moving on, it's beneficial to let the model try again with the same question, especially for:

  • Complex reasoning tasks where first attempts might fail
  • Questions requiring specific formats or steps
  • Cases where learning from previous attempt feedback can help

Changes Include:

  • New configuration parameters in GRPOConfig (usage sketched after this list):
    • max_retries_per_question: Maximum number of retry attempts per question (default=1)
    • min_reward_threshold: Minimum reward at which a response counts as satisfactory (default=None)
  • Modified training loop in GRPOTrainer to:
    • Track the best responses across multiple attempts
    • Stop early when the reward threshold is met
    • Log retry-related metrics
  • Added comprehensive documentation and tests for the new functionality
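
A usage sketch of the proposed configuration, modeled on the GRPO quickstart in the TRL docs. The model, dataset, and reward function below are arbitrary placeholders; max_retries_per_question and min_reward_threshold are this PR's proposed additions and are not available in released TRL.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward function (placeholder): longer completions score higher.
def reward_len(completions, **kwargs):
    return [len(c) / 100.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="grpo-retry-demo",
    num_generations=8,           # existing option: parallel completions per attempt
    max_retries_per_question=3,  # proposed here: up to 3 sequential attempts per prompt
    min_reward_threshold=0.8,    # proposed here: stop retrying once any reward >= 0.8
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```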

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?
    • Test basic retry functionality
    • Test early stopping on reward threshold
    • Test retry metrics logging

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@mandeep511 changed the title ✨ Add retry mechanism for GRPO training with configurable reward thre… 🔁 Add retry mechanism for GRPO training with configurable reward thre… on Feb 10, 2025
@qgallouedec
Member

Hi and thanks for your contribution! The idea seems quite natural. Do you have any quantitative results?
I'd like to keep the codebase simple, so for the moment I'm in favor of leaving this PR open for the community to reference, and if it's a feature in high demand, then we'll merge it.

@mandeep511
Author

Hey, I'm doing a couple of runs to test how well this performs compared to the vanilla implementation. I'll post the results here once they're done.

@willccbb

I'm not sure it makes sense to directly add features like this which are not part of the canonical algorithm, and which add significant complexity to the codebase, making further maintenance + feature compatibility more difficult.

I believe that this approach probably works, and could yield higher performance, but my opinion is that this shouldn't be the primary goal of GRPOTrainer. There are lots of such tricks that could be added, but each one makes the code harder to read and modify, and is no longer "GRPO" in the literal sense.

Mentioning my PR #2810 here because I think it's directly relevant: if people want to try out non-canonical sampling methods like MCTS, reward thresholds, reward score diversity constraints etc., or add things like tool calls or multi-step interactions they should have a way to do this without needing to modify the core Trainer. The Environment abstraction would allow users to have full control over the sampling step, and perhaps implementations of these could live in a unified place in TRL (e.g. RolloutSamplers) to support easy hot-swapping. We may also want to allow users to return rewards directly in the rollout stage. On the whole, I think it will be easier for more people to use and adapt TRL trainers if the primary code stays simple while supporting modular customization.
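
For illustration, a rough sketch of the kind of rollout abstraction described above. The interface below is hypothetical and does not reflect the actual design in #2810; it only shows how a retry policy could wrap the sampling step without touching the trainer core.

```python
# Hypothetical interface, not the API proposed in #2810 -- shown only to
# illustrate how retries could live outside the core trainer.
from abc import ABC, abstractmethod
from typing import Callable, List

class Environment(ABC):
    """Owns the rollout/sampling step so the trainer core stays generic."""

    @abstractmethod
    def generate(self, prompts: List[str], llm: Callable) -> List[str]:
        """Produce completions for a batch of prompts (single-turn, multi-step, tools, ...)."""

class RetryEnvironment(Environment):
    """Wraps another environment and resamples until a reward threshold is met."""

    def __init__(self, inner: Environment, reward_fn: Callable, threshold: float, max_retries: int = 3):
        self.inner, self.reward_fn = inner, reward_fn
        self.threshold, self.max_retries = threshold, max_retries

    def generate(self, prompts, llm):
        completions = self.inner.generate(prompts, llm)
        for _ in range(self.max_retries - 1):
            if max(self.reward_fn(completions)) >= self.threshold:
                break  # good enough -- stop resampling
            completions = self.inner.generate(prompts, llm)
        return completions
```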

@winglian
Contributor

It could be worthwhile to refactor GRPOTrainer so the class is easier to extend, rather than requiring a subclass to duplicate whole swaths of a method's code just to add retries.
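
A sketch of what that could enable. The hook name _sample_group_with_rewards and its return shape are invented here for illustration; no such method exists in GRPOTrainer today, which is exactly the refactor being suggested.

```python
# Hypothetical: assumes GRPOTrainer exposed a small overridable rollout hook,
# invented here as _sample_group_with_rewards(prompts) -> (completions, rewards).
from trl import GRPOTrainer

class RetryGRPOTrainer(GRPOTrainer):
    """Adds retries by overriding one hook instead of copying the training step."""

    def __init__(self, *args, max_retries_per_question=3, min_reward_threshold=0.8, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_retries_per_question = max_retries_per_question
        self.min_reward_threshold = min_reward_threshold

    def _sample_group_with_rewards(self, prompts):
        best_completions, best_rewards = None, None
        for _ in range(self.max_retries_per_question):
            completions, rewards = super()._sample_group_with_rewards(prompts)
            if best_rewards is None or max(rewards) > max(best_rewards):
                best_completions, best_rewards = completions, rewards
            if max(rewards) >= self.min_reward_threshold:
                break
        return best_completions, best_rewards
```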

@qgallouedec
Member

Closing the PR, see reason #2823 (comment)
