GRPO - Feature Addition #272

Open
wants to merge 6 commits into master

Conversation

@Soham4001A commented Jan 29, 2025

Description

This PR introduces Group Relative Policy Optimization (GRPO) as a new feature in stable-baselines3-contrib. GRPO extends Proximal Policy Optimization (PPO) by incorporating:
• Sub-step sampling per macro step, allowing multiple forward passes before environment transitions.
• Customizable reward scaling, enabling users to pass their own scaling functions or use the default tanh-based normalization.
• Better adaptability in reinforcement learning (RL) tasks, particularly for tracking and dynamic environments.

GRPO allows agents to explore action spaces more efficiently and refine their policy updates through multiple evaluations per time step.
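For reference, here is a minimal usage sketch of how the feature is intended to be invoked. The import path and parameter names other than `reward_function` (e.g. `samples_per_time_step`, `reward_scaling`) are illustrative assumptions rather than the confirmed API, and the analytic Pendulum reward is re-implemented here only because the environment does not expose it:

```python
# Minimal usage sketch (assumes this branch is installed).
# NOTE: the import path and the `samples_per_time_step` / `reward_scaling`
# parameter names are assumptions for illustration; only `reward_function`
# is confirmed by the discussion below.
import gymnasium as gym
import numpy as np
from sb3_contrib import GRPO  # hypothetical import exposed by this PR


def pendulum_reward(obs: np.ndarray, action: np.ndarray) -> float:
    """Analytic Pendulum-v1 reward, re-implemented because the env does not expose it."""
    theta = np.arctan2(obs[1], obs[0])  # recover angle from [cos(theta), sin(theta), theta_dot]
    theta_dot = obs[2]
    torque = float(action[0])
    return -(theta**2 + 0.1 * theta_dot**2 + 0.001 * torque**2)


env = gym.make("Pendulum-v1")
model = GRPO(
    "MlpPolicy",
    env,
    n_steps=2048,
    samples_per_time_step=4,          # sub-step samples per macro step
    reward_function=pendulum_reward,  # required for recomputing sub-step rewards
    reward_scaling=np.tanh,           # default tanh-based normalization, or a custom callable
    verbose=1,
)
model.learn(total_timesteps=10_000)
```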

Context

DLR-RM/stable-baselines3#2076

  • [x] I have raised an issue to propose this change (required)

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (update in the documentation)

Checklist:

  • [x] I've read the CONTRIBUTION guide (required)
  • [x] The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • [x] I have updated the tests accordingly (required for a bug fix or a new feature).
  • [x] I have included an example of using the feature (required for new features).
  • [x] I have included baseline results (required for new training algorithms or training-related features).
  • [ ] I have updated the documentation accordingly.
  • [x] I have updated the changelog accordingly (required).
  • [x] I have reformatted the code using make format (required)
  • [x] I have checked the codestyle using make check-codestyle and make lint (required)
  • [x] I have ensured make pytest and make type both pass. (required)

@KShivendu commented Mar 1, 2025

Hi @Soham4001A, I tried the example you provided by checking out your branch, and it throws an error:

Traceback (most recent call last):
  File "/home/user/projects/rl/grpo.py", line 21, in <module>
    model.learn(total_timesteps=10_000)
  File "/home/user/.cache/pypoetry/virtualenvs/rl-fSO1OvS5-py3.10/lib/python3.10/site-packages/stable_baselines3/ppo/ppo.py", line 311, in learn
    return super().learn(
  File "/home/user/.cache/pypoetry/virtualenvs/rl-fSO1OvS5-py3.10/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/user/.cache/pypoetry/virtualenvs/rl-fSO1OvS5-py3.10/lib/python3.10/site-packages/sb3_contrib/grpo/grpo.py", line 223, in collect_rollouts
    raise TypeError("Your reward function must be passed for recomputing rewards ")
TypeError: Your reward function must be passed for recomputing rewards 

Update: I realised that the env doesn't expose a reward function (it's part of env.step(), which also modifies the state when called). We should figure out some way to make this easier for end users; there's no point in them rewriting the default reward function. We might have to open PRs in gymnasium to refactor existing environments so they expose a reward function and use it inside env.step(). A toy sketch of what that could look like follows.
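For illustration, here is a toy sketch of the kind of refactor I mean (the `reward` method is hypothetical, not an existing Gymnasium API):

```python
# Toy sketch of an env that exposes its reward as a standalone method and
# reuses it inside step(); algorithms like GRPO could then call env.reward()
# to re-evaluate sub-step samples. Hypothetical API, not part of gymnasium.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ExposedRewardEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.state = np.zeros(1, dtype=np.float32)

    def reward(self, obs: np.ndarray, action: np.ndarray) -> float:
        # Standalone reward: penalize distance between state and action.
        return -float(abs(obs[0] - action[0]))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-1.0, 1.0, size=1).astype(np.float32)
        return self.state.copy(), {}

    def step(self, action):
        r = self.reward(self.state, action)  # the same function an algorithm could reuse
        self.state = np.clip(self.state + 0.1 * action, -1.0, 1.0).astype(np.float32)
        return self.state.copy(), r, False, False, {}
```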

@zaksemenov

Is there an update on the timeline for this?

@Soham4001A (Author)

I'm not sure what the proper direction should be. I may also need to rewrite the description of what is happening behind the scenes: this new implementation is technically Hybrid GRPO, not standard GRPO.

In this method, there are two key differences:

(1) The value function V(s) is still used to compute the advantage A(s).
(2) Instead of keeping only the top sampled reward, all sub-step samples are saved in the buffer. This means that for every learning update, the rollout buffer is extended by however many samples the user specifies. It doesn't break the library, but it can be confusing: when you specify n_steps for learning updates, the rollout buffer actually holds n_steps * num_samples_per_time_step transitions, even though the simulation environment only advances n_steps (see the short example below).
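To make the buffer growth concrete, a short worked example using the hyper-parameter names above:

```python
# Worked example of the buffer-size behaviour described in (2).
n_steps = 2048                 # environment transitions requested per update
num_samples_per_time_step = 4  # sub-step action samples per macro step
rollout_buffer_size = n_steps * num_samples_per_time_step
print(rollout_buffer_size)     # 8192 stored transitions, though the env only advances 2048 steps
```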

As for having to pass the reward function again: I made it a passable parameter, but I forget exactly what issue I was working around at the time. I can take another look, or you could elaborate on what you mean.

Here is the link to the paper btw - https://arxiv.org/abs/2502.01652

@KShivendu commented Mar 29, 2025

> I can take another look, or you could elaborate on what you mean

Basically, this PR throws an error if reward_function=None is passed.

The example you've provided also passes reward_function=None and hence crashes. I don't want to redefine the reward function myself since it's already part of the env.step() code.

I figured out a hacky work-around for this; however, it performed worse than PPO. What exactly did you do locally that made it perform better than PPO in your runs?
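(For reference, one possible shape such a work-around could take, purely illustrative and not necessarily what was actually implemented: reuse the env's own reward logic by stepping a deep copy of the environment.)

```python
# Illustrative sketch only: build a reward_function by stepping a deep copy of
# the env so its built-in reward logic is reused. This is expensive, only works
# for deep-copyable envs, and ignores `obs` (the copy must already be in the
# corresponding state), which is exactly why it is hacky.
import copy


def make_reward_function(env):
    def reward_fn(obs, action):
        env_copy = copy.deepcopy(env)
        _, reward, _, _, _ = env_copy.step(action)
        return float(reward)

    return reward_fn
```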
