
[Feature Request] Group Relative Proximity Optimization (GRPO) #273

Open
Soham4001A opened this issue Jan 29, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@Soham4001A

🚀 Feature

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm designed to enhance Proximal Policy Optimization (PPO) by introducing sub-step sampling per time step and customizable reward scaling functions. The proposed method is inspired by DeepSeek’s GRPO, which has demonstrated improved policy learning efficiency by refining how rewards and actions are evaluated during training.

The core innovation in GRPO lies in its ability to sample multiple actions within each time step, allowing for richer gradient estimations and more stable training in environments with high reward variance or sparse rewards. Unlike traditional PPO, which updates policies based on a single action per step, GRPO dynamically refines policy updates by leveraging multiple sub-samples, enabling a more robust understanding of optimal action sequences. Additionally, GRPO provides users with the ability to inject their own reward scaling functions, making it highly adaptable across robotics, finance, and multi-agent systems where reward structures can vary significantly.

By integrating these two enhancements, GRPO maintains the computational efficiency of PPO while offering greater flexibility and improved reward modeling, making it a valuable addition to the SB3-contrib ecosystem.
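
For illustration only, here is a minimal sketch of the multi-sample idea at a single time step; the scale_reward function and the group-normalized advantage below are assumptions made for this example, not the proposed implementation:

import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Score each sampled action relative to the group drawn at the same time step.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def scale_reward(r: np.ndarray) -> np.ndarray:
    # Example of a user-defined reward scaling function.
    return np.tanh(r)

# At one time step: evaluate several candidate actions for the same observation,
# apply the user-supplied scaling function, and compare the actions within the group.
raw_rewards = np.array([0.2, 1.0, -0.5, 0.7])  # rewards for 4 sampled actions (illustrative values)
advantages = group_advantages(scale_reward(raw_rewards))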

Motivation

DeepSeek’s work on GRPO has demonstrated how sampling multiple actions within a single time step can significantly improve the training process for reinforcement learning agents. Traditional PPO, while effective, is often constrained by limited reward representation due to its single-action-per-step approach. This limitation can cause instability in environments where the reward function is highly non-linear, sparse, or noisy.

In complex applications such as robotics control, stock trading, and autonomous navigation, reward structures are rarely straightforward. Allowing users to define their own reward scaling functions enhances model interpretability and ensures that policy learning aligns with real-world objectives. With this flexibility, GRPO is well-positioned to outperform PPO in environments requiring adaptive reward normalization while maintaining the same training framework.

Furthermore, many users of Stable-Baselines3 have expressed a need for more customizable reinforcement learning algorithms, particularly in cases where standardized reward processing does not align with task-specific goals. GRPO addresses this demand by providing a plug-and-play solution that is both compatible with existing environments and extensible for future use cases.

Pitch

The introduction of GRPO will provide reinforcement learning practitioners with a powerful alternative to PPO, particularly in high-variance or custom-reward environments. This proposal includes two major additions:

Sub-step Sampling Per Time Step – Instead of relying on a single action per step, GRPO enables multiple samples within a time step, resulting in a richer policy update process that improves convergence in unstable environments.

Custom Reward Scaling Functions – Users can define their own reward transformation functions, ensuring that models can adapt to specific domain requirements without the need for extensive modifications to the algorithm itself.

By maintaining the same API structure as PPO, GRPO will be easy to integrate into existing SB3 workflows. Users will be able to seamlessly switch from PPO to GRPO while benefiting from enhanced reward modeling and improved sample efficiency.
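
As a purely hypothetical usage sketch of what such an API could look like (the sb3_contrib.GRPO import and the samples_per_time_step / reward_function arguments are assumptions based on this proposal, not an existing interface):

import numpy as np
from sb3_contrib import GRPO  # hypothetical: GRPO is not part of the released sb3-contrib API

def scale_reward(obs, action):
    # User-defined reward scaling hook; here a trivial clipping example.
    return float(np.clip(action, -1.0, 1.0).sum())

model = GRPO(
    "MlpPolicy",
    "Pendulum-v1",
    samples_per_time_step=4,       # number of sub-samples drawn per environment step
    reward_function=scale_reward,  # optional custom reward scaling function
    verbose=1,
)
model.learn(total_timesteps=10_000)

The intent is that switching from PPO would only require changing the class name and, optionally, the two extra arguments.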

Alternatives

I am open to alternatives for how to improve GRPO, but I currently have nothing else to propose.

Additional context

I already have an implementation of this algorithm in a forked repository.

A PR is up to integrate it into a feature branch: #272

Benchmarks can be seen in this simulation I developed through personal testing, and in these LinkedIn posts:
https://github.com/Soham4001A/RL_Tracking
https://www.linkedin.com/feed/update/urn:li:activity:7290081519207858177/

The original GRPO documentation can be found in DeepSeek's paper:
https://arxiv.org/abs/2501.12948

Checklist

  • I have checked that there is no similar issue in the repo
  • If I'm requesting a new feature, I have proposed alternatives
Soham4001A added the enhancement label on Jan 29, 2025
Soham4001A changed the title from "[Feature Request] request title" to "[Feature Request] Group Relative Proximity Optimization (GRPO)" on Jan 29, 2025
@araffin
Member

araffin commented Jan 30, 2025

To continue the discussion from the SB3 issue, I had a closer look at GRPO.
The algorithm was actually described in https://arxiv.org/abs/2402.03300 and seems to be specific to LLM training (which SB3 doesn't really support).
Notably, it replaces the value function by averaging rewards for the same observation, which requires access to a reward model and is not standard for most RL envs (in the sense that only a subset of environments expose a compute_reward(obs, action) method that doesn't depend on the internal state of the env).

Some other clarifications:

  • by standard benchmark, I meant benchmarks like MuJoCo/PyBullet/Atari, where we can compare to previous results, and I also meant quantitative results (you can have a look at what was done for Recurrent PPO in #53 or TQC in #4)
  • by feature branch, I meant that the branch of your fork should not be master/main, but you should point to SB3-contrib master (and be up to date with it)
  • your current implementation doesn't seem to follow https://arxiv.org/abs/2402.03300, because you are still using a value function and you are only resampling actions while using the same reward
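
For context, a minimal sketch of the group-relative advantage described in that paper, which drops the learned value function and uses statistics over a group of actions sampled for the same observation as the baseline (this assumes access to something like a compute_reward(obs, action) method or a reward model):

import numpy as np

def grpo_advantages(obs, actions, compute_reward, eps: float = 1e-8) -> np.ndarray:
    # No critic: the baseline is the mean reward of the group sampled for this observation.
    rewards = np.array([compute_reward(obs, a) for a in actions])
    return (rewards - rewards.mean()) / (rewards.std() + eps)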

@Soham4001A
Author

You are right to question my implementation. After going back and making some edits, I discovered that my implementation was actually different from the one in the DeepSeek paper, so I wrote up a quick paper documenting my new "Hybrid GRPO" approach. This paper should be available to cite on arXiv in the upcoming days, as it is still processing, but it can be found here for now: Paper

I did go ahead and create a branch and a new PR, which can be found here: Hybrid GRPO PR

In terms of the standard benchmarking, I am unsure how to do it, as this is my first pull request into an open-source library. I initially made this out of curiosity, as I was quite surprised by the method in the DeepSeek paper and ended up creating a derivative of it. I usually do not make open-source contributions, since I keep that work private at my company for compensation reasons, but I have been shifting toward publishing more research and want to contribute more freely.

I hope you understand; I am sorry for being new to this process, but I would appreciate some help and guidelines!

@KShivendu

KShivendu commented Mar 29, 2025

which requires access to a reward model and is not standard for most RL envs (in the sense that only a subset of environments expose a compute_reward(obs, action) method

Although it won't be the most memory-optimal solution, it's possible to compute rewards for different actions at the same step by cloning the env object. We can't modify all possible environments, but this at least allows it to work with all existing environments.

# Sketch of a modified rollout collection loop (inside the algorithm's collect_rollouts).
# Assumes the following imports at module level:
from copy import deepcopy

import numpy as np
import torch as th

# For each of n_rollout_steps:
self.num_timesteps += env.num_envs

sub_actions = []
sub_values = []
sub_log_probs = []
sampled_rewards = []

obs_tensor = th.as_tensor(obs, device=self.device, dtype=th.float32)

# Sample several candidate actions for the same observation.
for _ in range(self.samples_per_time_step):
    with th.no_grad():
        actions, values, log_probs = self.policy.forward(obs_tensor)

    sub_actions.append(actions.cpu().numpy())
    sub_values.append(values)
    sub_log_probs.append(log_probs)

final_action = sub_actions[-1]  # The last sampled action is used for stepping
original_env = deepcopy(env)    # Snapshot the env before stepping so rewards can be recomputed
new_obs, rewards, dones, infos = env.step(final_action)

# Recompute a reward for every sampled action, either via a user-supplied
# reward function or by stepping a throwaway copy of the snapshot env.
for i in range(self.samples_per_time_step):
    if self.reward_function is not None:
        new_reward = self.reward_function(obs, sub_actions[i])
    else:
        temp_env = deepcopy(original_env)  # temporary hack since we can't modify existing environments; a little expensive memory-wise
        _, new_reward, _, _ = temp_env.step(sub_actions[i])

    sampled_rewards.append(new_reward)

sampled_rewards = np.array(sampled_rewards)

# Store the recomputed rewards and actions in the rollout buffer
for i in range(self.samples_per_time_step):
    rollout_buffer.add(
        obs,
        sub_actions[i],
        sampled_rewards[i],
        dones,
        sub_values[i],
        sub_log_probs[i],
    )

This only affects grpo.py so I guess it's okay? @araffin @Soham4001A if you agree, I can help with the PR and benchmarks :)

@Soham4001A
Author

So I just realized something. This PR actually has a bug: the sampling occurs, but only the top reward is being saved. That is how GRPO works, but it is not the intention of Hybrid GRPO. I just implemented a fix that modifies the buffers and saves all of the additional samples, in line with my paper on 'Hybrid GRPO', and tested it on my sandbox sim to make sure it was actually working.

In this fix, for the environment’s state update and rollout buffer management, I modified the buffer configuration to ensure that, although multiple samples are used for computing the advantage, only one representative transition (typically the final sample) is stored per macro step. The buffer size is adjusted in the model setup to reflect the effective number of macro steps (i.e. n_steps divided by samples_per_time_step), thereby preserving a coherent simulation history. This way, the agent’s observed environment transitions remain consistent with the actual simulation time steps, even though the loss calculations benefit from the additional samples.

Additionally, I implemented safe reset handling by creating a helper function that ensures env.reset() always returns a consistent tuple, preventing multiple or redundant resets that might disrupt the simulation. A reward function wrapper was also added to adjust the environment’s step count during reward calculation, ensuring that the advantage estimation aligns with the effective macro steps. Overall, the modifications combine the stability and bootstrapped advantage estimation of PPO with the enhanced data efficiency of GRPO, all while preserving the temporal coherence of the environment’s state updates.
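
For illustration, a minimal sketch of the buffer sizing and safe-reset handling described above; the class and method names here are placeholders, not the code in the linked commit:

import gymnasium as gym

class HybridGRPOSketch:
    # Illustrative fragment only; mirrors the description above, not the actual PR.

    def __init__(self, env: gym.Env, n_steps: int = 2048, samples_per_time_step: int = 4):
        self.env = env
        self.samples_per_time_step = samples_per_time_step
        # Only one representative transition is stored per macro step, so the
        # effective buffer length is n_steps divided by samples_per_time_step.
        self.buffer_size = n_steps // samples_per_time_step

    def _safe_reset(self):
        # Ensure reset() always yields a consistent (obs, info) tuple,
        # whether the env follows the old or the new Gym API.
        out = self.env.reset()
        return out if isinstance(out, tuple) else (out, {})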

Here is the latest commit - Soham4001A@306ae63

@KShivendu if you want to take over this, I would be glad to hand it off and really appreciate it. I am incredibly busy with other commitments and in the middle of another novel study. Actually, if anyone is curious, I am looking for co-authors to help me with it. It is based around transformer architecture & reducing FLOPs.
