
🔁 Add retry mechanism for GRPO training with configurable reward thre… #2823

Conversation

@mandeep511 commented on Feb 10, 2025

What does this PR do?

This PR adds a retry mechanism to GRPO training that allows the model to make multiple attempts at answering each question until either:

  1. A satisfactory reward threshold is achieved, or
  2. The maximum number of retries is exhausted

Motivation and Distinction from num_generations

While num_generations (default=8) controls how many responses are generated in parallel within a single attempt, this retry mechanism controls how many times the entire batch of generations is re-attempted for each question.

Key differences:

  • num_generations: Multiple responses generated in parallel to compare relative performance within a group
  • max_retries_per_question: Sequential attempts at the same question, using feedback from previous attempts (see the sketch after this list)
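
To make the distinction concrete, here is a minimal sketch of the proposed retry loop. The callables generate and compute_rewards are placeholders for illustration, not the actual trainer internals; only num_generations, max_retries_per_question, and min_reward_threshold correspond to options named in this PR.

```python
# Illustrative sketch only -- generate() and compute_rewards() are placeholders,
# not the real GRPOTrainer internals.
def grpo_attempts(prompt, generate, compute_rewards,
                  num_generations=8,          # parallel completions per attempt
                  max_retries_per_question=1, # sequential attempts at the same prompt
                  min_reward_threshold=None):
    """Return the best group of completions found across sequential attempts."""
    best_completions, best_rewards = None, None
    for _ in range(max_retries_per_question):
        # One attempt = one full group of parallel generations, as in vanilla GRPO.
        completions = generate(prompt, n=num_generations)
        rewards = compute_rewards(prompt, completions)

        # Keep the attempt whose best reward is highest so far.
        if best_rewards is None or max(rewards) > max(best_rewards):
            best_completions, best_rewards = completions, rewards

        # Stop early once any completion clears the configured threshold.
        if min_reward_threshold is not None and max(rewards) >= min_reward_threshold:
            break
    return best_completions, best_rewards
```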

Motivation:
Sometimes none of the responses in a batch achieve satisfactory quality. Rather than moving on, it's beneficial to let the model try again with the same question, especially for:

  • Complex reasoning tasks where first attempts might fail
  • Questions requiring specific formats or steps
  • Cases where learning from previous attempt feedback can help

Changes Include:

  • New configuration parameters in GRPOConfig (usage sketched after this list):
    • max_retries_per_question: Maximum number of retry attempts per question (default=1)
    • min_reward_threshold: Minimum reward at which a response counts as satisfactory (default=None)
  • Modified training loop in GRPOTrainer to:
    • Track the best responses across multiple attempts
    • Stop early when the reward threshold is met
    • Log retry-related metrics
  • Added comprehensive documentation and tests for the new functionality
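
A usage sketch of the proposed configuration, modeled on the GRPO quickstart in the TRL docs. The model, dataset, and reward function below are arbitrary placeholders; max_retries_per_question and min_reward_threshold are this PR's proposed additions and are not available in released TRL.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward function (placeholder): longer completions score higher.
def reward_len(completions, **kwargs):
    return [len(c) / 100.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="grpo-retry-demo",
    num_generations=8,           # existing option: parallel completions per attempt
    max_retries_per_question=3,  # proposed here: up to 3 sequential attempts per prompt
    min_reward_threshold=0.8,    # proposed here: stop retrying once any reward >= 0.8
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```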

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?
    • Test basic retry functionality
    • Test early stopping on reward threshold
    • Test retry metrics logging

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@mandeep511 changed the title ✨ Add retry mechanism for GRPO training with configurable reward thre… 🔁 Add retry mechanism for GRPO training with configurable reward thre… on Feb 10, 2025
@qgallouedec
Member

Hi and thanks for your contribution! The idea seems quite natural. Do you have any quantitative results?
I'd like to keep the codebase simple, so for the moment I'm in favor of leaving this PR open for the community to reference, and if it's a feature in high demand, then we'll merge it.

@mandeep511
Author

Hey, I'm doing a couple of runs to test how well this performs compared to the vanilla implementation. I'll post the results here once they're done.

@willccbb

I'm not sure it makes sense to directly add features like this which are not part of the canonical algorithm, and which add significant complexity to the codebase, making further maintenance + feature compatibility more difficult.

I believe that this approach probably works, and could yield higher performance, but my opinion is that this shouldn't be the primary goal of GRPOTrainer. There are lots of such tricks that could be added, but each one makes the code harder to read and modify, and is no longer "GRPO" in the literal sense.

Mentioning my PR #2810 here because I think it's directly relevant: if people want to try out non-canonical sampling methods like MCTS, reward thresholds, reward score diversity constraints etc., or add things like tool calls or multi-step interactions they should have a way to do this without needing to modify the core Trainer. The Environment abstraction would allow users to have full control over the sampling step, and perhaps implementations of these could live in a unified place in TRL (e.g. RolloutSamplers) to support easy hot-swapping. We may also want to allow users to return rewards directly in the rollout stage. On the whole, I think it will be easier for more people to use and adapt TRL trainers if the primary code stays simple while supporting modular customization.
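
For illustration, a rough sketch of the kind of rollout abstraction described above. The interface below is hypothetical and does not reflect the actual design in #2810; it only shows how a retry policy could wrap the sampling step without touching the trainer core.

```python
# Hypothetical interface, not the API proposed in #2810 -- shown only to
# illustrate how retries could live outside the core trainer.
from abc import ABC, abstractmethod
from typing import Callable, List

class Environment(ABC):
    """Owns the rollout/sampling step so the trainer core stays generic."""

    @abstractmethod
    def generate(self, prompts: List[str], llm: Callable) -> List[str]:
        """Produce completions for a batch of prompts (single-turn, multi-step, tools, ...)."""

class RetryEnvironment(Environment):
    """Wraps another environment and resamples until a reward threshold is met."""

    def __init__(self, inner: Environment, reward_fn: Callable, threshold: float, max_retries: int = 3):
        self.inner, self.reward_fn = inner, reward_fn
        self.threshold, self.max_retries = threshold, max_retries

    def generate(self, prompts, llm):
        completions = self.inner.generate(prompts, llm)
        for _ in range(self.max_retries - 1):
            if max(self.reward_fn(completions)) >= self.threshold:
                break  # good enough -- stop resampling
            completions = self.inner.generate(prompts, llm)
        return completions
```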

@winglian
Contributor

It could be worthwhile to refactor GRPOTrainer so the class is easier to extend, rather than requiring a subclass to duplicate whole swaths of a method's code just to add retries.
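
A sketch of what that could enable. The hook name _sample_group_with_rewards and its return shape are invented here for illustration; no such method exists in GRPOTrainer today, which is exactly the refactor being suggested.

```python
# Hypothetical: assumes GRPOTrainer exposed a small overridable rollout hook,
# invented here as _sample_group_with_rewards(prompts) -> (completions, rewards).
from trl import GRPOTrainer

class RetryGRPOTrainer(GRPOTrainer):
    """Adds retries by overriding one hook instead of copying the training step."""

    def __init__(self, *args, max_retries_per_question=3, min_reward_threshold=0.8, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_retries_per_question = max_retries_per_question
        self.min_reward_threshold = min_reward_threshold

    def _sample_group_with_rewards(self, prompts):
        best_completions, best_rewards = None, None
        for _ in range(self.max_retries_per_question):
            completions, rewards = super()._sample_group_with_rewards(prompts)
            if best_rewards is None or max(rewards) > max(best_rewards):
                best_completions, best_rewards = completions, rewards
            if max(rewards) >= self.min_reward_threshold:
                break
        return best_completions, best_rewards
```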

@qgallouedec
Member

Closing the PR, see reason #2823 (comment)
