GPRO - Feature Addition #272
base: master
Conversation
Hi @Soham4001A, I tried the example you provided by checking out your branch, and it throws an error.
Update: I realised that
Is there an update on the timeline for this?
I'm not sure what the proper direction might be. I may need to rewrite the description of what is happening behind the scenes too. This new implementation is technically Hybrid GRPO, not standard GRPO. In this method there are two key differences: (1) the value estimate V(s) is still used to compute the advantage A(s). As for having to pass the reward function again, I made that a passable parameter because I forget exactly what issue I was having; I could look into it again, or you could elaborate on what you mean. Here is the link to the paper, by the way: https://arxiv.org/abs/2502.01652
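To make the first difference concrete, here is a minimal sketch of the advantage computation described above. Standard GRPO baselines each sampled reward against the group mean; the hybrid variant keeps the critic's value estimate V(s) as the baseline. The function name, shapes, and example numbers are illustrative only and do not come from this PR's code.

```python
import numpy as np

def hybrid_grpo_advantages(rewards, value_estimate):
    """Per-sample advantages for the sub-step samples at one state.

    Hybrid GRPO keeps the critic's baseline: A_i = r_i - V(s),
    whereas standard GRPO would use A_i = r_i - mean(rewards).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return rewards - value_estimate

# e.g. 4 sub-step reward samples at one state, with V(s) = 0.5
advs = hybrid_grpo_advantages([1.0, 0.2, 0.7, 0.1], 0.5)
# -> array([ 0.5, -0.3,  0.2, -0.4])
```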
Basically this PR throws an error if
The example you've provided also passes
I figured out a hacky work-around with this; however, I saw that it performed worse than
Description
This PR introduces Generalized Policy Reward Optimization (GRPO) as a new feature in stable-baselines3-contrib. GRPO extends Proximal Policy Optimization (PPO) by incorporating:
• Sub-step sampling per macro step, allowing multiple forward passes before environment transitions.
• Customizable reward scaling, enabling users to pass their own scaling functions or use the default tanh-based normalization.
• Better adaptability in reinforcement learning (RL) tasks, particularly for tracking and dynamic environments.
GRPO allows agents to explore action spaces more efficiently and refine their policy updates through multiple evaluations per time step.
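The sub-step sampling and reward-scaling behaviour described above can be sketched as follows. The policy, reward model, and function names here are toy stand-ins, not the PR's actual API; only the control flow (multiple forward passes and scaled reward evaluations at one state before the environment transition) reflects the description.

```python
import math
import random

def default_scale(reward):
    """Default tanh-based normalization, squashing rewards into (-1, 1)."""
    return math.tanh(reward)

def toy_policy(state):
    # Stand-in stochastic policy: each call is one "forward pass".
    return random.gauss(state, 1.0)

def toy_reward(state, action):
    # Stand-in reward model used to score sub-step samples.
    return -(action - state) ** 2

def macro_step(state, n_substeps=4, scale_fn=default_scale):
    """Sample several candidate actions at one state before committing
    to an environment transition, keeping the best-scoring one."""
    samples = []
    for _ in range(n_substeps):
        action = toy_policy(state)  # forward pass only, no env.step yet
        samples.append((action, scale_fn(toy_reward(state, action))))
    best_action, _ = max(samples, key=lambda s: s[1])
    return best_action, samples

best, samples = macro_step(0.0)
```

A user-supplied scaling function would replace `default_scale` via the `scale_fn` parameter, mirroring the customizable-scaling bullet above.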
Context
DLR-RM/stable-baselines3#2076
Types of changes
Checklist:
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass. (required)