🚀 Feature
GRPO (Generalized Policy Reward Optimization) is a new reinforcement learning algorithm designed to enhance Proximal Policy Optimization (PPO) by introducing sub-step sampling per time step and customizable reward scaling functions. The method is inspired by DeepSeek’s Group Relative Policy Optimization, which has demonstrated improved policy learning efficiency by refining how rewards and actions are evaluated during training.
The core innovation in GRPO lies in its ability to sample multiple actions within each time step, allowing for richer gradient estimations and more stable training in environments with high reward variance or sparse rewards. Unlike traditional PPO, which updates policies based on a single action per step, GRPO dynamically refines policy updates by leveraging multiple sub-samples, enabling a more robust understanding of optimal action sequences. Additionally, GRPO provides users with the ability to inject their own reward scaling functions, making it highly adaptable across robotics, finance, and multi-agent systems where reward structures can vary significantly.
By integrating these two enhancements, GRPO maintains the computational efficiency of PPO while offering greater flexibility and improved reward modeling, making it a valuable addition to the SB3-contrib ecosystem.
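To make the sub-step sampling idea concrete, here is a minimal, self-contained sketch (not the proposed implementation) of how the rewards of K actions sampled at the same state could be turned into group-relative advantages, in the spirit of DeepSeek’s group-based estimator; all names are illustrative.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Turn the rewards of K sub-samples taken at one state into advantages.

    Each sub-sample is scored relative to the group mean and scaled by the
    group standard deviation, so the policy gradient is estimated from K
    comparisons per time step instead of a single rollout action.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative only: rewards of four candidate actions sampled at one time step.
print(group_relative_advantages(np.array([0.1, 0.4, -0.2, 0.3])))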
Motivation
DeepSeek’s work on Group Relative Policy Optimization has demonstrated how sampling multiple actions within a single time step can significantly improve the training process for reinforcement learning agents. Traditional PPO, while effective, is often constrained by limited reward representation due to its single-action-per-step approach. This limitation can cause instability in environments where the reward function is highly non-linear, sparse, or noisy.
In complex applications such as robotics control, stock trading, and autonomous navigation, reward structures are rarely straightforward. Allowing users to define their own reward scaling functions enhances model interpretability and ensures that policy learning aligns with real-world objectives. With this flexibility, GRPO is well-positioned to outperform PPO in environments requiring adaptive reward normalization while maintaining the same training framework.
Furthermore, many users of Stable-Baselines3 have expressed a need for more customizable reinforcement learning algorithms, particularly in cases where standardized reward processing does not align with task-specific goals. GRPO addresses this demand by providing a plug-and-play solution that is both compatible with existing environments and extensible for future use cases.
Pitch
The introduction of GRPO will provide reinforcement learning practitioners with a powerful alternative to PPO, particularly in high-variance or custom-reward environments. This proposal includes two major additions:
Sub-step Sampling Per Time Step – Instead of relying on a single action per step, GRPO enables multiple samples within a time step, resulting in a richer policy update process that improves convergence in unstable environments.
Custom Reward Scaling Functions – Users can define their own reward transformation functions, ensuring that models can adapt to specific domain requirements without the need for extensive modifications to the algorithm itself (a sketch of such a function follows this list).
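As an illustration of the second point, a user-supplied reward transform could be as small as a squashing function applied to raw rewards before the policy update. This is a minimal sketch; the function name and the idea of passing it into the algorithm are assumptions about the proposed hook, not an existing SB3 API.

import numpy as np

def tanh_reward_scale(rewards: np.ndarray) -> np.ndarray:
    """Example user-defined reward transform (hypothetical hook).

    Squashes raw rewards into (-1, 1) to tame spiky or heavy-tailed
    reward signals before they enter the policy update.
    """
    return np.tanh(rewards)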
By maintaining the same API structure as PPO, GRPO will be easy to integrate into existing SB3 workflows. Users will be able to seamlessly switch from PPO to GRPO while benefiting from enhanced reward modeling and improved sample efficiency.
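Assuming the API does mirror PPO as described, switching could look roughly like the sketch below. The GRPO class and its samples_per_step and reward_scale_fn arguments are hypothetical placeholders for the proposed interface; nothing here exists in sb3-contrib yet.

import gymnasium as gym
import numpy as np
from sb3_contrib import GRPO  # hypothetical import; GRPO is not in sb3-contrib yet

env = gym.make("CartPole-v1")

model = GRPO(
    "MlpPolicy",
    env,
    samples_per_step=4,                    # proposed sub-step sampling knob (assumed name)
    reward_scale_fn=lambda r: np.tanh(r),  # proposed custom reward scaling hook (assumed name)
    verbose=1,
)
model.learn(total_timesteps=10_000)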
Alternatives
I am open to alternatives for how to improve GRPO, but I currently have none proposed.
Additional context
I already have an implementation of this algorithm in a forked repository and will be creating a pull request by EOD 1/29/2025.
Checklist
I have checked that there is no similar issue in the repo
If I'm requesting a new feature, I have proposed alternatives
Hello,
I think this issue should be more for SB3 contrib (please check the contributing guide there; it is slightly different).

> I have an implementation of this algorithm already ready in a forked repository

More than the hype, it would be nice to have some results on standard benchmarks first.