Hybrid Group Relative Policy Optimization (Hybrid GRPO): A Multi-Sample Approach to Reinforcement Learning #275
Description
This update introduces Hybrid Group Relative Policy Optimization (Hybrid GRPO), a novel reinforcement learning framework that extends Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) by incorporating empirical multi-sample action evaluation while maintaining the stability of value function-based learning.
Unlike DeepSeek’s GRPO, which eliminates the value function in favor of purely empirical reward estimation, Hybrid GRPO introduces structured advantage computation, balancing empirical action sampling with bootstrapped value estimation. This approach enhances sample efficiency, improves learning stability, and mitigates variance amplification observed in purely empirical methods.
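The core mechanism can be sketched as follows. This is a minimal illustration of one plausible reading of the description above, not the implementation shipped in this PR: the function name `hybrid_advantages`, the one-step bootstrapped targets, and the group-relative normalization are assumptions made for clarity.

```python
import torch

def hybrid_advantages(rewards, next_values, value, gamma=0.99, eps=1e-8):
    """Sketch of a hybrid advantage: bootstrapped TD targets for K sampled
    actions at one state, followed by GRPO-style group-relative normalization.

    rewards:     (K,) empirical rewards for the K actions sampled at the state
    next_values: (K,) critic estimates V(s'_k) for each resulting next state
    value:       critic estimate V(s) for the current state (scalar tensor)
    """
    targets = rewards + gamma * next_values   # bootstrapped one-step targets
    td_advantages = targets - value           # value-function baseline (PPO-style)
    # Group-relative normalization over the K samples (GRPO-style)
    return (td_advantages - td_advantages.mean()) / (td_advantages.std() + eps)
```

With, say, K = 8 actions sampled per state, this yields eight normalized advantages per state visit instead of one, which is where the additional training signal comes from.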
Key Features:
• Empirical Multi-Sample Action Evaluation: Hybrid GRPO samples multiple actions per state while retaining a learned value function, extracting additional training signal from each state visited.
• Structured Advantage Computation: Integrates bootstrapped value estimation with empirical sampling for reduced variance and improved policy stability.
• Adaptive Reward Scaling: Implements tanh-based normalization to stabilize learning (see the sketch after this list).
• Entropy-Regularized Sampling & Hierarchical Multi-Step Sub-Sampling: Reduces sample inefficiency and improves convergence.
• Scalability Across Domains: Applicable to LLMs, autonomous robotics, financial modeling, and AI-driven control systems.
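As a rough illustration of the tanh-based reward scaling listed above, the snippet below squashes raw rewards into a bounded range before they enter the advantage computation. The function name and the `scale` parameter are illustrative assumptions, not taken from the paper or this PR.

```python
import torch

def scale_rewards(rewards, scale=1.0):
    # Squash raw rewards into (-scale, scale) to bound advantage targets;
    # `scale` is a hypothetical knob, not a documented hyperparameter.
    return scale * torch.tanh(rewards / scale)
```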
Experimental validation in reinforcement learning environments demonstrates superior convergence speed, policy stability, and sample efficiency compared to existing methods. This update provides a scalable reinforcement learning framework, bridging the gap between empirical sampling and value-function-based optimization.
For more details, refer to our published paper: Hybrid GRPO: A Multi-Sample Approach to Enhancing Policy Optimization.
Context
Types of changes
Checklist:
- I have reformatted the code using `make format` (required)
- I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- I have ensured `make pytest` and `make type` both pass. (required)

Note: we are using a maximum length of 127 characters per line