GRPO Environments for custom multi-step rollouts (vLLM-only) #2810

Closed
willccbb wants to merge 33 commits

Conversation


@willccbb willccbb commented Feb 9, 2025

What does this PR do?

Adds a protocol under trl/environments for an Environment object which wraps vLLM's .generate(...) to allow for custom rollout logic, and an optional env field to the Trainer for passing such an object.

A simple example usage is included below, others in this repo: willccbb/verifiers

Given the breadth of different agentic tasks people are interested in, I think an implementation of multi-step behavior should be as open-ended and customizable as possible, rather than having everything flow through explicit tool use or a predefined format. Here, the only requirement for an Environment is that it mirrors the behavior of calling llm.generate() + extracting token_ids. I have found that it's more practical to pass message dicts to Environments rather than preformatted text, hence the addition of gather_object(prompts).

I agree with Quentin's comment here that while masking of tool outputs/environment responses/messages from other users/agents may be desirable in some cases, it is probably not necessary for initial experimentation. Future iterations could perhaps extend the Environment definition to allow masking such outputs.
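
As a purely hypothetical sketch of what that extension might look like (not part of this PR), generate() could return a per-token mask alongside the completion ids:

from typing import Any, Dict, List, Protocol, Sequence, Tuple

class MaskedEnvironment(Protocol):
    # Hypothetical extension: mask entries of 0 mark tokens (e.g. tool or
    # environment outputs) that the trainer should exclude from the loss.
    def generate(
        self,
        prompts: List[List[Dict[str, Any]]],
        llm: Any,
        sampling_params: Any,
    ) -> Tuple[List[Sequence[int]], List[Sequence[int]]]:
        ...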

Usage looks like:

doublecheck_env = DoubleCheckEnv() # assistant message -> hardcoded user "Are you sure?" message -> second assistant message
...
trainer = GRPOTrainer(
    model=model_name,
    processing_class=tokenizer,
    reward_funcs=reward_funcs,
    env=doublecheck_env, # Optional, defaults to `None`
    args=training_args,
    train_dataset=dataset
)
trainer.train()

Example implementation of such an environment:

from typing import List, Callable, Dict, Any, Sequence, Tuple
from vllm import LLM, SamplingParams, RequestOutput

class DoubleCheckEnv:

    def step(self,
             states: List[Dict[str, Any]],
             llm: LLM,
             sampling_params: SamplingParams) -> Tuple[List[Dict[str, Any]], List[RequestOutput]]:
        
        outputs = llm.chat([state["messages"] for state in states], sampling_params=sampling_params) # type: ignore
        for i, state in enumerate(states):
            state["messages"].append({'role': 'assistant', 'content': outputs[i].outputs[0].text})
            state["messages"].append({'role': 'user', 'content': 'Are you sure?'})
            state["prompt_tokens"] = len(outputs[i].prompt_token_ids)

        outputs = llm.chat([state["messages"] for state in states], sampling_params=sampling_params) # type: ignore

        for i, state in enumerate(states):
            state["messages"].append({'role': 'assistant', 'content': outputs[i].outputs[0].text})
            state["completed"] = True
        return states, outputs

    def generate(self, prompts: List[List[Dict[str, Any]]], llm: LLM, sampling_params: SamplingParams) -> List[Sequence[int]]:
        all_completed = False
        states = [{"messages": m, "completed": False, "prompt_tokens": -1} for m in prompts]
        outputs = [None] * len(prompts)
        while not all_completed:
            states, outputs = self.step(states, llm, sampling_params)
            all_completed = all(state["completed"] for state in states)
        all_ids = [list(output.prompt_token_ids) + list(output.outputs[0].token_ids) for output in outputs]
        completion_ids = [all_ids[i][states[i]["prompt_tokens"]:] for i in range(len(outputs))]
        return completion_ids

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

    enable_prefix_caching=True,
    max_model_len=self.args.vllm_max_model_len,
)
self.sampling_params = SamplingParams(
    temperature=args.temperature,
    max_tokens=self.max_completion_length,
    skip_special_tokens=False,

FYI - on vLLM we have another parameter, spaces_between_special_tokens (set to True by default), that adds spaces between special tokens in the output when skip_special_tokens=False.
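
For reference, a minimal sketch of how those two flags combine on the vLLM side (values here are arbitrary, not the trainer's defaults):

from vllm import SamplingParams

# When skip_special_tokens=False, vLLM also honors
# spaces_between_special_tokens (True by default), which inserts spaces
# between special tokens in the decoded text; set it to False if those
# extra spaces are unwanted.
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    skip_special_tokens=False,
    spaces_between_special_tokens=False,
)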

@willccbb (Author)

Gotcha, thanks for the heads up. Will do some testing and make sure everything looks OK at the token level; maybe we don't need that line.

@willccbb (Author)

Reverted that line. If users want different SamplingParams settings they can specify that at the Environment level anyway.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member) left a comment

Would it be possible to wrap the generate method instead? It would allow keeping GRPO as is.

@willccbb (Author)

I'm not quite sure what you mean by wrapping the generate method here. Many parts of the codebase touch the LLM object, and wrapping it in another object would require changing the access in each place. Overriding generate entails a high amount of complexity on the user end, as most applications will want to use the true generate (or chat, which calls generate) method. Adding an if/else Environment route like in this PR is the simplest approach I could think of that allows users to directly use the LLM object as-is in their rollouts, while still allowing the Trainer to reference the LLM normally throughout. Note that this enables many requested features (tool use, sampling strategies, agentic interactions) to be encapsulated within Environments, avoiding further complexity down the road.

If you have something specific in mind that you could illustrate with a short snippet I'm happy to try.

@qgallouedec (Member)

Actually it can be pretty straightforward and simple:

def wrapper_decorator(generate_func):
    def generate_wrapper(*args, **kwargs):
        ...  # stuff before
        result = generate_func(*args, **kwargs)
        ...  # stuff after
        return result
    return generate_wrapper

trainer.llm.model.generate = wrapper_decorator(trainer.llm.model.generate)

@accupham

Seems like the wrapper idea could be implemented internally. The user interface could remain the same with the environments idea.

Also, I think there might be a bug in the DoubleCheckEnv.generate example. I think this is the correct implementation:

# original
completion_ids = [all_ids[states[i]["prompt_tokens"]:] for i, output in enumerate(outputs)]
# fixed
completion_ids = [all_ids[i][states[i]["prompt_tokens"]:] for i in range(len(prompts))]


Docstrings to make implementation easier for end users:

from typing import Any, Dict, List, Protocol

class Environment(Protocol):
    """
    A protocol describing the minimal interface needed for integration 
    with the trainer. Your environment can run any multi-step logic, 
    but must ultimately return token sequences akin to a typical 
    vllm.LLM's generate() output. https://docs.vllm.ai/en/stable/api/offline_inference/llm.html
    """
    def generate(
        self,
        prompts: List[List[Dict[str, Any]]],
        llm: Any,
        sampling_params: Any
    ) -> List[Any]:
        ...


Examples are the best way to learn how to use libraries... I took your example and added illustrative comments to make it easier to understand what is going on and how a user might implement their own.

from typing import Any, Dict, List, Sequence, Tuple
from vllm import LLM, RequestOutput, SamplingParams

class DoubleCheckEnv:
    """
    Example Environment that:
      1) Sends an initial user prompt to the LLM.
      2) Appends the assistant's reply and a follow-up user query: "Are you sure?".
      3) Sends everything again to the LLM for a final response.
      4) Returns just the completion tokens for each prompt.
    """

    def step(
        self,
        states: List[Dict[str, Any]],
        llm: LLM,
        sampling_params: SamplingParams
    ) -> Tuple[List[Dict[str, Any]], List[RequestOutput]]:
        # First LLM call for each state's messages
        outputs = llm.chat([s["messages"] for s in states], sampling_params=sampling_params)
        for i, state in enumerate(states):
            state["messages"].append({
                "role": "assistant", 
                "content": outputs[i].outputs[0].text
            })
            state["messages"].append({
                "role": "user", 
                "content": "Are you sure?"
            })
            # Track prompt_tokens to later slice out the completion part
            state["prompt_tokens"] = len(outputs[i].prompt_token_ids)

        # Second LLM call after "Are you sure?" is appended
        outputs = llm.chat([s["messages"] for s in states], sampling_params=sampling_params)
        for i, state in enumerate(states):
            state["messages"].append({
                "role": "assistant", 
                "content": outputs[i].outputs[0].text
            })
            state["completed"] = True

        return states, outputs

    def generate(
        self,
        prompts: List[List[Dict[str, Any]]],
        llm: LLM,
        sampling_params: SamplingParams
    ) -> List[Sequence[int]]:
        # Setup conversation states
        states = [{"messages": p, "completed": False, "prompt_tokens": -1} for p in prompts]
        outputs = [None] * len(prompts)

        # Keep stepping until each conversation is marked complete
        while not all(s["completed"] for s in states):
            states, outputs = self.step(states, llm, sampling_params)

        # Gather prompt+completion IDs, then slice out the prompt portion
        all_ids = [
            list(o.prompt_token_ids) + list(o.outputs[0].token_ids) 
            for o in outputs
        ]
        completion_ids = [
            all_ids[i][states[i]["prompt_tokens"]:] 
            for i in range(len(prompts))
        ]
        return completion_ids

@willccbb (Author)

Thank you!! Will add that + docstring to the PR

@willccbb (Author)

@xiangjjj One complication is that many base model tokenizers have pad_token_id = eos_token_id, so when padding a batch, the "last EOS token" will be the last token in the pad sequence. Trying out a couple workarounds.

@xiangjjj

Ah, I see! That is tricky. Thanks for this!

@willccbb (Author)

Simplest solution I think is to move the masking logic into the respective vllm/transformers generate routes. vLLM now masks based on completion_ids length rather than the position of the first EOS token.
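
As a rough illustration of the length-based approach (a sketch only, not the PR's actual code; names and values are assumptions):

import torch

# Build the completion mask from each completion's true (pre-padding) length
# instead of searching for the first EOS token, which breaks when
# pad_token_id == eos_token_id.
completion_ids = [[512, 88, 2], [90, 41, 77, 13, 2]]  # per-prompt token ids from the env
pad_token_id = 2  # same id as EOS for many base-model tokenizers

lengths = [len(ids) for ids in completion_ids]
max_len = max(lengths)
padded_ids = torch.full((len(completion_ids), max_len), pad_token_id, dtype=torch.long)
completion_mask = torch.zeros((len(completion_ids), max_len), dtype=torch.long)
for i, ids in enumerate(completion_ids):
    padded_ids[i, : len(ids)] = torch.tensor(ids)
    completion_mask[i, : len(ids)] = 1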

@xiangjjj

Simplest solution I think is to move the masking logic into the respective vllm/transformers generate routes. vLLM now masks based on completion_ids length rather than the position of the first EOS token.

Sure, it makes sense and should resolve the masking issue! Thanks for fixing this!

outputs = self.llm.generate(all_prompts_text, sampling_params=self.sampling_params, use_tqdm=False)
completion_ids = [out.token_ids for completions in outputs for out in completions.outputs]
if self.env is not None:
    completion_ids = self.env.generate(


I was wondering if we could enhance the output structure to include additional metadata alongside the current completion_ids. In multi-step rollouts, having step-wise details available directly would be really useful for determining the final rewards. Currently, parsing completion_ids to reconstruct this information feels a bit cumbersome. Would it be possible to return a dict that encapsulates both the tokens and the extra metadata?

@willccbb (Author)

I'll defer to @qgallouedec on that one, could be added with pretty minimal changes (happy to do so), but there are also easy enough workarounds for what you're describing (computing rewards at generation time and caching them in a data structure accessed by your reward functions). My first priority with this PR is getting basic support for Environments enabled.
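
For what it's worth, a minimal sketch of that caching workaround (all names here are made up for illustration; this is not code from the PR or from verifiers):

# An environment that caches per-rollout metadata at generation time, plus a
# reward function (a closure over the env) that reads it back.
class CachingEnv:
    def __init__(self):
        self.metadata = []  # one entry per prompt in the last generated batch

    def generate(self, prompts, llm, sampling_params):
        self.metadata = []
        completion_ids = []
        for messages in prompts:
            output = llm.chat([messages], sampling_params=sampling_params)[0]
            self.metadata.append({"num_steps": 1})  # stash whatever the rewards need
            completion_ids.append(list(output.outputs[0].token_ids))
        return completion_ids

env = CachingEnv()

def cached_reward(completions, **kwargs):
    # Reads the metadata cached by env.generate() for the same batch.
    return [float(env.metadata[i]["num_steps"]) for i in range(len(completions))]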


Do you have suggestions on how to cache the information? I'm new to trl and have not done that before.


@qgallouedec @willccbb Regarding logits_to_keep, what do you think is the best strategy to filter out tool observation tokens? My concern is that when we invoke a web search tool—which might return thousands of tokens—those tokens could overwhelm the policy tokens during gradient updates. If this extraneous information turns out to be noisy, it might adversely affect the policy gradient learning. We can leave this out of the current PR, but I’d appreciate your thoughts on ideas/directions to resolve this as I'm new to trl.

@willccbb (Author)

For now I'd suggest having deterministic processing of tool call results to avoid excessive outputs; those will cause problems whether or not we mask tokens. In some experiments on multi-step code tool use, I found limiting allowed printouts to 500 chars per step worked reasonably well.

It's not clear to me that naive masking of tool outputs "makes sense" for GRPO algorithmically, and for now it is probably fine to treat tool call results as just part of the LLM output. Especially if you are letting your model "reason" for many tokens per step, it should not be a major problem, I think.
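
For example, a tiny sketch of that kind of deterministic capping (the 500-character limit is the number quoted above; everything else is illustrative):

MAX_TOOL_OUTPUT_CHARS = 500  # per-step printout cap

def format_tool_output(raw_output: str) -> str:
    # Deterministically truncate tool/interpreter printouts before appending
    # them to the conversation, so a single tool call can't flood the context.
    if len(raw_output) > MAX_TOOL_OUTPUT_CHARS:
        return raw_output[:MAX_TOOL_OUTPUT_CHARS] + "... (truncated)"
    return raw_output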


@xiangjjj xiangjjj Feb 15, 2025


I think that applying policy gradient directly to tool call outputs is not theoretically sound—especially in outcome reward methods. Note that I'm not suggesting "naive masking of tool outputs." I propose that we do not compute the log-likelihood of the tool observation tokens together with their policy gradient and KL loss functions. While one might argue this is an empirical question depending on the setup of the system, I’m interested in learning about what kind of design in grpo would enable more flexible loss masking.


I’m fine with keeping this PR as is, but I’m interested in exploring a design that allows for more flexible loss masking in the future.

@willccbb (Author)

Yeah, I see what you mean. I think this is a very interesting research question, but hard to say definitively right now. Including them as normal response tokens is essentially forcing the model to "model" the tool calls directly, which feels reasonable to me.

Papers exploring this issue are pretty recent.

If interest in multi-step RL continues to grow, I would imagine having more specialized trainers in the future could make sense. For now this is just a way to get something that runs using GRPO, whether or not it is the most principled.


Does that mean in the current implementation we're still calculating KLD and policy gradient on environment responses? I thought we're not doing that based on my understanding of DoubleCheckEnv but now I'm confused...

@vladrad

vladrad commented Feb 17, 2025

All, I would love to help out with this as I am working on it myself.

@vladrad

vladrad commented Feb 17, 2025

@willccbb I reached out via the email on your profile! Let me know if you want to collab on this, as I am continuing to work on it.

@qgallouedec (Member)

I still don't understand why wrapping would limit what you can do. For example for the double call:

def wrapper_decorator(generate_func):
    def generate_wrapper(*args, **kwargs):
        ...  # stuff before
        result = generate_func(*args, **kwargs)
        result = generate_func(*args, **kwargs)
        ...  # stuff after
        return result
    return generate_wrapper

trainer.llm.model.generate = wrapper_decorator(trainer.llm.model.generate)

Taking the env paradigm, I think it should work as is with the main branch with something like:

env = MyEnv(...)

def wrapper_decorator(generate_func):
    def generate_wrapper(*args, **kwargs):
        prompts = args[0]
        return env.generate(prompts, self, *args, **kwargs)
    return generate_wrapper

trainer.llm.model.generate = wrapper_decorator(trainer.llm.model.generate)

I might be missing something though

@vladrad

vladrad commented Feb 17, 2025

@qgallouedec
I tried something similar before without success, but I'll give it another go.

I've been experimenting with combining chat and completion responses during training. The idea is to score each response based on its format and content. If a mistake is detected, another LLM—one that provides the correct answer—is consulted. This secondary LLM offers a brief hint to guide the correction, and then the response is re-generated.

For example, this dataset snippet:

<think>
Oh, the user wants me to call a tool.
</think>
<answer>
I am going to call X...
</answer>
<tool>
<function name=read_file>...</function>
</tool>

If a mistake is found (say, the tool call should be within the <answer> tags), I go back with the hint and try to get a completion; the correction would look like this:

<think>
Oh, I forgot the format requires tool calls to be within the answer tags. This will provide the correct format the user requested.
</think>
<answer>
I am going to call X...
<tool>
<function name=read_file>...</function>
</tool>
</answer>

My goal is to see if I can get auto-correction via hints, scored on top of it.

I have made a really hacky, overfitted solution where I was training in epoch runs like this, creating the dataset and then going back for rounds 2, 3, 4 of GRPO training... slowly guiding it to the right answer. Now I think I need to work on an actual solution, which is how I ended up here. Happy to help/code things up and test.

Thanks all!

@willccbb (Author)

@qgallouedec The biggest problem is that LLM.chat relies on LLM.generate, and many (most?) multi-step interaction protocols will want to use LLM.chat. If we override generate as you propose, any call to chat inside of our wrapper will result in recursive blowup. We also have to keep all of our logic contained within a single wrapper function, and we can't easily maintain global state within/across rollouts (for things like precomputing/caching rewards to be retrieved by reward functions, which can be objects with access to the Env state).

It also is just much nicer to be able to have access to the SamplingParams and LLM objects directly, as this is how people typically develop agent applications on top of vLLM. The added complexity to the trainer by allowing an Env object is pretty minor, but it unlocks quite a bit from the user perspective. Other libraries which have already built these kinds of environments (TextArena, reasoning-gym, etc.) are way easier to adapt if we can just "use the model like a normal LLM" rather than having to rewrite all of the chat parsing logic again for every application.
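
To make the recursion point concrete, a purely illustrative sketch (not proposed code):

# LLM.chat formats the conversation and then calls LLM.generate internally,
# so a wrapper installed over generate whose rollout logic itself calls
# llm.chat(...) re-enters the wrapper on every step:
# chat -> generate (wrapped) -> rollout -> chat -> ...
def wrapper_decorator(generate_func, llm):
    def generate_wrapper(*args, **kwargs):
        # any llm.chat(...) call placed here would land back in this wrapper
        return generate_func(*args, **kwargs)
    return generate_wrapper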

@amitlevy

Any update? The PR is falling behind the GRPO changes on main

@willccbb (Author)

Working on a refactor now which I think should allow directly using the main TRL branch. Approach is to extend GRPOTrainer to a class GRPOEnvTrainer which only needs to override the _generate_and_score_completions function, which I think is already a reasonable encapsulation of the minimum logic needed to implement custom rollout strategies. Once that's tested + pushed to the verifiers repo I'll probably just close this PR.
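
A rough sketch of what that subclass could look like (the method name comes from the comment above; the signature and everything else are assumptions, not TRL's or verifiers' actual code):

from trl import GRPOTrainer

class GRPOEnvTrainer(GRPOTrainer):
    def __init__(self, *args, env=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.env = env

    def _generate_and_score_completions(self, inputs):
        # Custom multi-step rollout logic via self.env would go here; fall
        # back to the stock GRPO path when no environment is provided.
        if self.env is None:
            return super()._generate_and_score_completions(inputs)
        raise NotImplementedError("environment rollout sketch goes here")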

@willccbb (Author)

Working now on the dev branch of verifiers, will clean up some things + merge to main shortly. Closing this PR, will maybe revisit later, but overloading the trainer seems to be the best method for supporting these kinds of features for now.

@willccbb willccbb closed this Feb 26, 2025