Hybrid Group Relative Policy Optimization (Hybrid GRPO): A Multi-Sample Approach to Reinforcement Learning #275
Description
This update introduces Hybrid Group Relative Policy Optimization (Hybrid GRPO), a novel reinforcement learning framework that extends Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) by incorporating empirical multi-sample action evaluation while maintaining the stability of value function-based learning.
Unlike DeepSeek’s GRPO, which eliminates the value function in favor of purely empirical reward estimation, Hybrid GRPO introduces structured advantage computation, balancing empirical action sampling with bootstrapped value estimation. This approach enhances sample efficiency, improves learning stability, and mitigates variance amplification observed in purely empirical methods.
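The core mechanism can be sketched as follows. This is a minimal illustration of one plausible reading of the description above, not the implementation shipped in this PR: the function name `hybrid_advantages`, the one-step bootstrapped targets, and the group-relative normalization are assumptions made for clarity.

```python
import torch

def hybrid_advantages(rewards, next_values, value, gamma=0.99, eps=1e-8):
    """Sketch of a hybrid advantage: bootstrapped TD targets for K sampled
    actions at one state, followed by GRPO-style group-relative normalization.

    rewards:     (K,) empirical rewards for the K actions sampled at the state
    next_values: (K,) critic estimates V(s'_k) for each resulting next state
    value:       critic estimate V(s) for the current state (scalar tensor)
    """
    targets = rewards + gamma * next_values   # bootstrapped one-step targets
    td_advantages = targets - value           # value-function baseline (PPO-style)
    # Group-relative normalization over the K samples (GRPO-style)
    return (td_advantages - td_advantages.mean()) / (td_advantages.std() + eps)
```

With, say, K = 8 actions sampled per state, this yields eight normalized advantages per state visit instead of one, which is where the additional training signal comes from.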
Key Features:
• Empirical Multi-Sample Action Evaluation: Hybrid GRPO samples multiple actions per state while retaining a learned value function, extracting additional training signal from each state visited.
• Structured Advantage Computation: Integrates bootstrapped value estimation with empirical sampling for reduced variance and improved policy stability.
• Adaptive Reward Scaling: Implements tanh-based normalization to stabilize learning (see the sketch after this list).
• Entropy-Regularized Sampling & Hierarchical Multi-Step Sub-Sampling: Reduces sample inefficiency and improves convergence.
• Scalability Across Domains: Applicable to LLMs, autonomous robotics, financial modeling, and AI-driven control systems.
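As a rough illustration of the tanh-based reward scaling listed above, the snippet below squashes raw rewards into a bounded range before they enter the advantage computation. The function name and the `scale` parameter are illustrative assumptions, not taken from the paper or this PR.

```python
import torch

def scale_rewards(rewards, scale=1.0):
    # Squash raw rewards into (-scale, scale) to bound advantage targets;
    # `scale` is a hypothetical knob, not a documented hyperparameter.
    return scale * torch.tanh(rewards / scale)
```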
Experimental validation in reinforcement learning environments demonstrates superior convergence speed, policy stability, and sample efficiency compared to existing methods. This update provides a scalable reinforcement learning framework, bridging the gap between empirical sampling and value-function-based optimization.
For more details, refer to our published paper: Hybrid GRPO: A Multi-Sample Approach to Enhancing Policy Optimization.
Context
Types of changes
Checklist:
- I have reformatted the code using `make format` (required)
- I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- I have ensured `make pytest` and `make type` both pass. (required)

Note: we are using a maximum length of 127 characters per line