
Hybrid Group Relative Policy Optimization (Hybrid GRPO): A Multi-Sample Approach to Reinforcement Learning #275

Open
Soham4001A wants to merge 8 commits into master

Conversation

Soham4001A

Description

This update introduces Hybrid Group Relative Policy Optimization (Hybrid GRPO), a reinforcement learning framework that extends Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) by incorporating empirical multi-sample action evaluation while maintaining the stability of value-function-based learning.

Unlike DeepSeek’s GRPO, which eliminates the value function in favor of purely empirical reward estimation, Hybrid GRPO introduces structured advantage computation, balancing empirical action sampling with bootstrapped value estimation. This approach enhances sample efficiency, improves learning stability, and mitigates variance amplification observed in purely empirical methods.
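To make the contrast concrete, here is a minimal PyTorch sketch of the hybrid advantage idea described above. The names (`hybrid_advantage`, `alpha`, the one-step TD term) are illustrative assumptions, not the interface added by this PR: each of the K actions sampled at a state receives a group-relative, mean/std-normalized empirical advantage, which is then blended with a bootstrapped advantage computed against the critic's value estimate.

```python
import torch

def hybrid_advantage(rewards, next_values, value, gamma=0.99, alpha=0.5, eps=1e-8):
    """Blend a group-relative empirical advantage with a bootstrapped TD advantage.

    rewards, next_values: (K,) tensors for the K actions sampled at one state.
    value: critic estimate V(s) for that state.
    alpha: weight on the empirical (group-relative) term.
    """
    # Empirical term: how each sampled action compares to its own group.
    group_adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)
    # Bootstrapped term: one-step TD advantage against the critic baseline.
    td_adv = rewards + gamma * next_values - value
    return alpha * group_adv + (1.0 - alpha) * td_adv


# Example: 4 actions sampled at a single state.
rewards = torch.tensor([1.0, 0.5, 0.8, 0.2])
next_values = torch.tensor([0.9, 0.7, 0.85, 0.4])
print(hybrid_advantage(rewards, next_values, value=torch.tensor(0.6)))
```

In the purely empirical limit (`alpha = 1`) the critic term drops out, recovering GRPO-style estimation; smaller `alpha` leans on the value function, which is the variance-reduction argument made above.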

Key Features:
• Empirical Multi-Sample Action Evaluation: Hybrid GRPO samples multiple actions per state while leveraging a value function, extracting additional training data.
• Structured Advantage Computation: Integrates bootstrapped value estimation with empirical sampling for reduced variance and improved policy stability.
• Adaptive Reward Scaling: Applies tanh-based normalization to keep reward magnitudes bounded and stabilize learning (see the sketch after this list).
• Entropy-Regularized Sampling & Hierarchical Multi-Step Sub-Sampling: Improves sample efficiency and convergence.
• Scalability Across Domains: Applicable to LLMs, autonomous robotics, financial modeling, and AI-driven control systems.
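As referenced in the Adaptive Reward Scaling item, here is a small sketch of tanh-based reward normalization; the function name and the `scale` parameter are assumptions for illustration rather than the PR's actual settings.

```python
import torch

def scale_rewards(rewards: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Squash raw rewards into (-scale, scale) via tanh; a larger `scale`
    preserves more of the raw reward's dynamic range."""
    return scale * torch.tanh(rewards / scale)

# Extreme rewards are compressed; moderate ones pass through nearly unchanged.
print(scale_rewards(torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])))
```

Because tanh is roughly linear near zero and saturates for large inputs, typical rewards pass through almost unchanged while outliers are compressed, keeping the group-relative advantages from being dominated by a single extreme sample.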

Experimental validation in reinforcement learning environments shows faster convergence, more stable policies, and better sample efficiency than existing methods. This update provides a scalable reinforcement learning framework that bridges the gap between empirical sampling and value-function-based optimization.

For more details, refer to our published paper: Hybrid GRPO: A Multi-Sample Approach to Enhancing Policy Optimization.

Context

  • [x] I have raised an issue to propose this change (Related Issue)

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (update in the documentation)

Checklist:

  • [x] I've read the CONTRIBUTION guide (required)
  • [x] The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • [ ] I have updated the tests accordingly (required for a bug fix or a new feature).
  • [x] I have included an example of using the feature (required for new features).
  • [ ] I have included baseline results (required for new training algorithms or training-related features).
  • [ ] I have updated the documentation accordingly.
  • [x] I have updated the changelog accordingly (required).
  • [x] I have reformatted the code using make format (required)
  • [x] I have checked the codestyle using make check-codestyle and make lint (required)
  • [x] I have ensured make pytest and make type both pass. (required)

Note: we are using a maximum length of 127 characters per line

@ReadyPlayerEmma

@Soham4001A As an external interested observer, do you have any evaluation results showing improved convergence?

@axelsnoski

@Soham4001A As an external interested observer, do you have any evaluation results showing improved convergence?

Also interested to know.
