GRPO - Feature Addition #272

Open
wants to merge 6 commits into master

Conversation

@Soham4001A commented Jan 29, 2025

Description

This PR introduces Group Relative Policy Optimization (GRPO) as a new feature in stable-baselines3-contrib. GRPO extends Proximal Policy Optimization (PPO) by incorporating:
• Sub-step sampling per macro step, allowing multiple forward passes before environment transitions.
• Customizable reward scaling, enabling users to pass their own scaling functions or use the default tanh-based normalization.
• Better adaptability in reinforcement learning (RL) tasks, particularly for tracking and dynamic environments.

GRPO allows agents to explore action spaces more efficiently and refine their policy updates through multiple evaluations per time step.
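For reference, here is a minimal usage sketch of how the feature is intended to be invoked. The import path and parameter names other than `reward_function` (e.g. `samples_per_time_step`, `reward_scaling`) are illustrative assumptions rather than the confirmed API, and the analytic Pendulum reward is re-implemented here only because the environment does not expose it:

```python
# Minimal usage sketch (assumes this branch is installed).
# NOTE: the import path and the `samples_per_time_step` / `reward_scaling`
# parameter names are assumptions for illustration; only `reward_function`
# is confirmed by the discussion below.
import gymnasium as gym
import numpy as np
from sb3_contrib import GRPO  # hypothetical import exposed by this PR


def pendulum_reward(obs: np.ndarray, action: np.ndarray) -> float:
    """Analytic Pendulum-v1 reward, re-implemented because the env does not expose it."""
    theta = np.arctan2(obs[1], obs[0])  # recover angle from [cos(theta), sin(theta), theta_dot]
    theta_dot = obs[2]
    torque = float(action[0])
    return -(theta**2 + 0.1 * theta_dot**2 + 0.001 * torque**2)


env = gym.make("Pendulum-v1")
model = GRPO(
    "MlpPolicy",
    env,
    n_steps=2048,
    samples_per_time_step=4,          # sub-step samples per macro step
    reward_function=pendulum_reward,  # required for recomputing sub-step rewards
    reward_scaling=np.tanh,           # default tanh-based normalization, or a custom callable
    verbose=1,
)
model.learn(total_timesteps=10_000)
```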

Context

DLR-RM/stable-baselines3#2076

  • [x] I have raised an issue to propose this change (required)

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (update in the documentation)

Checklist:

  • [x] I've read the CONTRIBUTION guide (required)
  • [x] The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • [x] I have updated the tests accordingly (required for a bug fix or a new feature).
  • [x] I have included an example of using the feature (required for new features).
  • [x] I have included baseline results (required for new training algorithms or training-related features).
  • [ ] I have updated the documentation accordingly.
  • [x] I have updated the changelog accordingly (required).
  • [x] I have reformatted the code using make format (required)
  • [x] I have checked the codestyle using make check-codestyle and make lint (required)
  • [x] I have ensured make pytest and make type both pass. (required)

@KShivendu commented Mar 1, 2025

Hi @Soham4001A, I tried the example you provided by checking out your branch, and it throws an error:

Traceback (most recent call last):
  File "/home/user/projects/rl/grpo.py", line 21, in <module>
    model.learn(total_timesteps=10_000)
  File "/home/user/.cache/pypoetry/virtualenvs/rl-fSO1OvS5-py3.10/lib/python3.10/site-packages/stable_baselines3/ppo/ppo.py", line 311, in learn
    return super().learn(
  File "/home/user/.cache/pypoetry/virtualenvs/rl-fSO1OvS5-py3.10/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/user/.cache/pypoetry/virtualenvs/rl-fSO1OvS5-py3.10/lib/python3.10/site-packages/sb3_contrib/grpo/grpo.py", line 223, in collect_rollouts
    raise TypeError("Your reward function must be passed for recomputing rewards ")
TypeError: Your reward function must be passed for recomputing rewards 

Update: I realised that the env doesn't expose a reward function (it's part of env.step(), which also modifies the state when called). We should figure out some way to make this easier for end users; there's no point in them rewriting the default reward function. We might have to open PRs in gymnasium to refactor existing environments so they expose a reward function and use it inside env.step(). A toy sketch of what that could look like follows.
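For illustration, here is a toy sketch of the kind of refactor I mean (the `reward` method is hypothetical, not an existing Gymnasium API):

```python
# Toy sketch of an env that exposes its reward as a standalone method and
# reuses it inside step(); algorithms like GRPO could then call env.reward()
# to re-evaluate sub-step samples. Hypothetical API, not part of gymnasium.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ExposedRewardEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.state = np.zeros(1, dtype=np.float32)

    def reward(self, obs: np.ndarray, action: np.ndarray) -> float:
        # Standalone reward: penalize distance between state and action.
        return -float(abs(obs[0] - action[0]))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-1.0, 1.0, size=1).astype(np.float32)
        return self.state.copy(), {}

    def step(self, action):
        r = self.reward(self.state, action)  # the same function an algorithm could reuse
        self.state = np.clip(self.state + 0.1 * action, -1.0, 1.0).astype(np.float32)
        return self.state.copy(), r, False, False, {}
```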

@zaksemenov

Is there an update on the timeline for this?

@Soham4001A (Author)

I'm not sure what the proper direction should be. I may also need to rewrite the description of what is happening behind the scenes: this new implementation is technically Hybrid GRPO, not standard GRPO.

In this method, there are two key differences:

(1) The value function V(s) is still used to compute the advantage A(s).
(2) Instead of keeping only the top sampled reward, all sub-step samples are saved in the buffer. This means that for every learning update, the rollout buffer is extended by however many samples the user specifies. It doesn't break the library, but it can be confusing: when you specify n_steps for learning updates, the rollout buffer actually holds n_steps * num_samples_per_time_step transitions, even though the simulation environment only advances n_steps (see the short example below).
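To make the buffer growth concrete, a short worked example using the hyper-parameter names above:

```python
# Worked example of the buffer-size behaviour described in (2).
n_steps = 2048                 # environment transitions requested per update
num_samples_per_time_step = 4  # sub-step action samples per macro step
rollout_buffer_size = n_steps * num_samples_per_time_step
print(rollout_buffer_size)     # 8192 stored transitions, though the env only advances 2048 steps
```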

As for having to pass the reward function again: I made it a passable parameter, but I forget exactly what issue I was working around at the time. I can take another look, or you could elaborate on what you mean.

Here is the link to the paper btw - https://arxiv.org/abs/2502.01652

@KShivendu commented Mar 29, 2025

> I can take another look, or you could elaborate on what you mean

Basically, this PR throws an error if reward_function=None is passed.

The example you've provided also passes reward_function=None and hence crashes. I don't want to redefine the reward function myself since it's already part of the env.step() code.

I figured out a hacky work-around for this; however, it performed worse than PPO. What exactly did you do locally that made it perform better than PPO in your runs?
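(For reference, one possible shape such a work-around could take, purely illustrative and not necessarily what was actually implemented: reuse the env's own reward logic by stepping a deep copy of the environment.)

```python
# Illustrative sketch only: build a reward_function by stepping a deep copy of
# the env so its built-in reward logic is reused. This is expensive, only works
# for deep-copyable envs, and ignores `obs` (the copy must already be in the
# corresponding state), which is exactly why it is hacky.
import copy


def make_reward_function(env):
    def reward_fn(obs, action):
        env_copy = copy.deepcopy(env)
        _, reward, _, _, _ = env_copy.step(action)
        return float(reward)

    return reward_fn
```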
