[Project] Training Agents with GRPO #2723

August-murr · 2025-01-31T19:47:07Z

Let's discuss how to train agents using GRPO.

Here, I will link sub-issues related to various problems, features, or questions that need resolution for implementing this idea.

accupham · 2025-01-31T21:38:22Z

So hear me out. I think the current style of function/tool calling as popularized by OpenAI does not lend itself to the expressiveness and fluidity of thought that R1-like models capitalize on and excel at. The feedback loop is too long.

Instead, I think the best way is to express function calls as mixed code and language. Let the user decide the calling convention, and inject the results inline.

Here's an example of what I mean. This is QwQ hooked up directly to a Python interpreter to do neurosymbolic reasoning. Every line it writes is fed directly into the REPL, and the result is fed back to the LLM immediately. I told it to express its thoughts as Python comments.

Python 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> # Okay, let's tackle this step by step. The problem says:
>>> # If the equations x^2 - a = p * y and x^2 - b = p * y have integer solutions,
>>> # then x^2 - ab = p * y also has integer solutions. p is prime, and a, b are not divisible by p.
>>>
>>> from sympy import symbols, Eq, solveset, S
>>>
>>> # Let's define the symbols we need. I'll start with x, y, p, a, and b.
>>> x, y, p, a, b = symbols('x y p a b', integer=True)
>>>
>>> # Alright, I'll write out the first two equations.
>>> eq1 = Eq(x**2 - a, p * y)  # This is x^2 - a = p * y
>>> eq2 = Eq(x**2 - b, p * y)  # This is x^2 - b = p * y
>>>
>>> # The claim is that if these two equations have solutions,
>>> # then x^2 - ab = p * y should also have a solution. I'll write that equation too.
>>> eq3 = Eq(x**2 - a * b, p * y)  # This is x^2 - ab = p * y
>>>
>>> # Okay, let's check if we can find integer solutions for the first two equations.
>>> # I'm going to use `solveset` to see if x has solutions in the integers for eq1 and eq2.
>>> sol_x1 = solveset(eq1, x, domain=S.Integers)  # Solve eq1 for x
>>> sol_x1  # Let's see what we get here.
ConditionSet(x, Eq(x**2 - a, p*y), Integers)
>>>
>>> # Hmm, that's a generic solution for x in terms of the parameters.
>>> # Makes sense because we don't have concrete values for p, a, or y yet.
>>>
>>> sol_x2 = solveset(eq2, x, domain=S.Integers)  # Solve eq2 for x
>>> sol_x2  # Alright, let me look at this one too.
>>> # >>> ConditionSet(x, Eq(x**2 - b, p*y), Integers)
>>>
>>> # Okay, so I have the solution sets for x in eq1 and eq2.
>>> # Now, let's think about what needs to happen for eq3 to have integer solutions.
>>>
>>> # eq3 implies x^2 ≡ ab (mod p) when reduced modulo p.
>>> # If a and b are quadratic residues modulo p, then their product ab should also be.
>>> # I know there's a property about the product of quadratic residues modulo a prime.
>>>
>>> # But let me explicitly check if solving eq3 gives integer solutions.
>>> sol_x3 = solveset(eq3, x, domain=S.Integers)  # Solve eq3 for x
>>> sol_x3  # Does this give us something valid?
ConditionSet(x, Eq(x**2 - a*b, p*y), Integers)
>>>
>>> # Hmm, this matches what I expected! x^2 ≡ ab (mod p) implies solutions exist
>>> # as long as ab is a quadratic residue mod p.
>>>
>>> # To confirm, I need to remember why a and b being residues implies ab is a residue.
>>> # Ah, right! The Legendre symbol tells us (a/p) * (b/p) = (ab/p),
>>> # and both (a/p) and (b/p) are 1 because a and b are residues.
>>>
>>> # So, if eq1 and eq2 have solutions, a and b are residues mod p,
>>> # and their product ab is also a residue mod p. That ensures eq3 has solutions too.
>>>
>>> # Final confirmation: we have integer solutions for x in eq3 if integer solutions exist for eq1 and eq2.
>>> # That settles it!
>>>
>>> sol_x1, sol_x2, sol_x3  # Let's print all these solutions for the final check.
(ConditionSet(x, Eq(x**2 - a, p*y), Integers),
ConditionSet(x, Eq(x**2 - b, p*y), Integers),
ConditionSet(x, Eq(x**2 - a*b, p*y), Integers))
>>> # Solutions are correct.
>>> submit_final_answer(sol_x1, sol_x2, sol_x3)

The fast feedback loop affords us a few crucial things:

The model can recover from errors (syntax errors, failed imports, unexpected results) in real time.
It can learn from its mistakes and correct course mid-trajectory.
Each line execution can be rewarded or penalized individually (e.g., -0.3 for a syntax error).
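
As a rough illustration of the per-line reward idea, here is a minimal sketch that scores one REPL line at a time. Only the -0.3 syntax-error penalty comes from the comment above; `score_line`, `RUNTIME_PENALTY`, and `OK_REWARD` are hypothetical names and values chosen for the example. It handles single-line statements only, so a multi-line `def` would be scored as a syntax error.

```python
import contextlib
import io

SYNTAX_PENALTY = -0.3   # value suggested in the comment above
RUNTIME_PENALTY = -0.1  # hypothetical value, for illustration only
OK_REWARD = 0.0         # hypothetical value, for illustration only


def score_line(source: str, namespace: dict) -> float:
    """Execute one REPL line in `namespace` and return a step reward."""
    stripped = source.strip()
    # Blank lines and pure comments ("thoughts") are neutral.
    if not stripped or stripped.startswith("#"):
        return OK_REWARD
    try:
        compiled = compile(source, "<repl>", "single")
    except SyntaxError:
        return SYNTAX_PENALTY
    try:
        # Discard anything the line prints; only the reward matters here.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(compiled, namespace)
    except Exception:
        return RUNTIME_PENALTY
    return OK_REWARD


namespace = {}
lines = ["# thinking out loud", "x = 2 + 2", "y = x +* 1", "undefined_name", "x"]
rewards = [score_line(line, namespace) for line in lines]
```

Because the namespace persists across calls, later lines can use variables defined by earlier ones, which matches the stateful REPL in the transcript.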

Another example of more traditional agentic function calling:

def weather(city: str) -> str:
    return "The weather in Seattle is 40 deg F and sunny all day."

User: Should I pack an umbrella today?
Assistant: <think>Ok, so the user wants to know if it's going to rain or not. I should look up the weather using the `weather()` tool. But wait, I don't know where the user is currently located. I should ask before using the tool. </think><answer>What city are you currently in?</answer>
User: Seattle
Assistant: <think>I'll look up the weather. <fn>weather("Seattle")</fn>
<fn_error>
WeatherLookupError: Must specify a two-letter state.
</fn_error>
<fn>weather("Seattle, WA")</fn>
<fn_results>
The weather in Seattle is 40 deg F and sunny all day.
</fn_results>
Ah, so according to the results, it's unlikely the user will need an umbrella because it will not rain.
<answer>
The weather is nice and sunny today, no need to pack an umbrella today. Can I assist with anything else?
</answer>
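
One way the inline injection in this transcript could be wired up is sketched below, under several assumptions: the `<fn>`, `<fn_results>`, and `<fn_error>` tags are taken from the example, while `TOOLS`, `run_call`, `inject_observation`, and the state check inside `weather` are hypothetical and exist only to reproduce the error shown above. Arguments are parsed with `ast.literal_eval` rather than `eval`, so only literal arguments are accepted.

```python
import ast
import re
from typing import Optional

FN_CALL = re.compile(r"<fn>(.*?)</fn>", re.DOTALL)


class WeatherLookupError(Exception):
    pass


def weather(city: str) -> str:
    # Hypothetical validation, added to reproduce the error in the transcript.
    if "," not in city:
        raise WeatherLookupError("Must specify a two-letter state.")
    return "The weather in Seattle is 40 deg F and sunny all day."


TOOLS = {"weather": weather}  # hypothetical registry of callable tools


def run_call(expr: str) -> str:
    """Parse `name(literal, ...)` without eval and format the observation."""
    try:
        node = ast.parse(expr.strip(), mode="eval").body
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("expected a call like name(args)")
        func = TOOLS[node.func.id]
        args = [ast.literal_eval(arg) for arg in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        result = func(*args, **kwargs)
    except Exception as exc:
        return f"<fn_error>\n{type(exc).__name__}: {exc}\n</fn_error>"
    return f"<fn_results>\n{result}\n</fn_results>"


def inject_observation(generated: str) -> Optional[str]:
    """Return the block to append if `generated` ends with a tool call."""
    calls = FN_CALL.findall(generated)
    if not calls or not generated.rstrip().endswith("</fn>"):
        return None
    return run_call(calls[-1])


first = inject_observation('<think>I\'ll look up the weather. <fn>weather("Seattle")</fn>')
second = inject_observation('<fn>weather("Seattle, WA")</fn>')
```

In a rollout loop, generation would stop at `</fn>`, the returned block would be appended to the context, and sampling would resume from there.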

accupham · 2025-01-31T21:52:49Z

My opinion is to standardize around vLLM's LLM API. We should pass in a user-defined RolloutSampler class, which takes a vLLM LLM instance, and let the user figure out how to do function calling during rollout sampling. If they want to do it the standard way, they can use the LLM.chat() API with tools and call it the traditional way. If they want something more interactive and real-time, they can stream tokens, detect Llama-3.2-style function calls with some sort of regex hook, and inject the results into the output stream.
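
To make the proposal concrete, here is one possible shape for such an interface. This is a sketch, not an existing TRL or vLLM API: `RolloutSampler`, `Rollout`, `SingleTurnSampler`, and `EchoLLM` are all hypothetical names. The only assumption about the engine is that it exposes a `generate(prompts, sampling_params)` method whose results carry `.outputs[0].text`, which is how vLLM's `LLM.generate` output is shaped; `EchoLLM` mimics that shape so the example runs without a GPU.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, List


@dataclass
class Rollout:
    prompt: str
    completion: str


class RolloutSampler(ABC):
    """Hypothetical interface: a trainer would call `sample` once per batch."""

    @abstractmethod
    def sample(self, llm: Any, prompts: List[str]) -> List[Rollout]:
        ...


class SingleTurnSampler(RolloutSampler):
    """Plain one-shot generation; tool-using samplers would loop here instead."""

    def __init__(self, sampling_params: Any = None):
        self.sampling_params = sampling_params

    def sample(self, llm: Any, prompts: List[str]) -> List[Rollout]:
        outputs = llm.generate(prompts, self.sampling_params)
        return [Rollout(p, o.outputs[0].text) for p, o in zip(prompts, outputs)]


class EchoLLM:
    """Stand-in engine that only mimics the output shape of a real one."""

    def generate(self, prompts, sampling_params=None):
        return [
            SimpleNamespace(outputs=[SimpleNamespace(text=p.upper())])
            for p in prompts
        ]


rollouts = SingleTurnSampler().sample(EchoLLM(), ["hello", "world"])
```

A multi-step or tool-calling sampler would subclass `RolloutSampler` and keep calling the engine inside `sample` until the episode ends, which leaves the calling convention entirely up to the user.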

August-murr · 2025-02-01T05:58:54Z

@accupham I think it's better to take things step by step.
Let's build a minimal prototype that works, and then we can focus on different ideas and ways to improve it.

xiangjjj · 2025-02-09T04:26:43Z

Any considerations for the observation tokens from tool use? I don't think we should compute KL for those tokens.
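
A minimal sketch of what excluding observation tokens could look like, assuming per-token log-probabilities from the policy and the reference model plus a 0/1 mask that marks model-generated tokens. The function name `masked_mean_kl` and the mask layout are hypothetical; the per-token estimator `exp(d) - d - 1` with `d = ref_logp - policy_logp` is the form commonly used for the GRPO KL term, assumed here rather than taken from this thread.

```python
import math
from typing import Sequence


def masked_mean_kl(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    mask: Sequence[int],
) -> float:
    """Average the per-token KL estimate over tokens where mask == 1."""
    total = 0.0
    count = 0
    for policy_lp, ref_lp, keep in zip(policy_logps, ref_logps, mask):
        if not keep:
            continue  # tool output (observation) token: contributes nothing
        delta = ref_lp - policy_lp
        total += math.exp(delta) - delta - 1.0
        count += 1
    return total / count if count else 0.0


# Token 2 is injected tool output, where the policy and reference disagree.
policy_logps = [-0.5, -1.0, -2.0, -0.7]
ref_logps = [-0.5, -1.0, -0.1, -0.7]
generated_mask = [1, 1, 0, 1]

kl_masked = masked_mean_kl(policy_logps, ref_logps, generated_mask)
kl_unmasked = masked_mean_kl(policy_logps, ref_logps, [1, 1, 1, 1])
```

The same mask could also be applied to the policy-gradient term, so that the model is never trained to predict text it did not generate.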

willccbb · 2025-02-11T18:49:39Z

This PR (#2810) addresses @accupham's suggestion to allow user-defined rollout logic that wraps vLLM. I'd be curious to hear whether this is sufficient for what people have in mind for now.

The protocol here could potentially be extended to allow user-defined masks (for tool calls), and to compute rewards at this stage as well.

jlia0 · 2025-02-12T16:01:52Z

This is exactly what I have been thinking about and tinkering with as well. How did you get QwQ to do "neurosymbolic reasoning" / "inline function calls" like in the example?

accupham · 2025-02-12T16:32:31Z

The system prompt was quite simple:

You are now operating as a stateful Python REPL environment. You can use it as memory buffer and scratch pad as a goal-seeking agent.

Then you set the prefill to the default python REPL intro text:

Python 3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

At this point we just run a while-loop with the stop token set to "\n". Feed the LLM's text into a REPL or some other stateful code-execution environment, then append the result to the end of the prefill, followed by ">>>". Feed that prefill into another LLM call and continue the completion from there.

I think a Jupyter-notebook-like environment might be more appropriate next time, since it is easier to sandbox.
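
The loop described above (prefill, stop at a newline, execute, append, repeat) can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used: `complete` stands in for an LLM call that returns one line given the current prefill (or `None` when the model is done), and `CapturingInterpreter` and `repl_rollout` are hypothetical names. It uses the standard library's `code.InteractiveInterpreter`, runs in-process without any sandboxing, and does not handle multi-line statements.

```python
import code
import contextlib
import io
from typing import Callable, Optional

BANNER = 'Type "help", "copyright", "credits" or "license" for more information.\n'
PROMPT = ">>> "


class CapturingInterpreter(code.InteractiveInterpreter):
    """Stateful interpreter that returns output and tracebacks as a string."""

    def __init__(self) -> None:
        super().__init__()
        self._buffer = io.StringIO()

    def write(self, data: str) -> None:
        # Tracebacks and syntax errors are reported through this method.
        self._buffer.write(data)

    def run_line(self, line: str) -> str:
        self._buffer = io.StringIO()
        with contextlib.redirect_stdout(self._buffer), \
                contextlib.redirect_stderr(self._buffer):
            self.runsource(line)
        return self._buffer.getvalue()


def repl_rollout(complete: Callable[[str], Optional[str]], max_steps: int = 20) -> str:
    """Alternate between one model-written line and one REPL result."""
    interpreter = CapturingInterpreter()
    transcript = BANNER + PROMPT
    for _ in range(max_steps):
        line = complete(transcript)  # the LLM call, stopped at the first "\n"
        if line is None:
            break
        line = line.split("\n", 1)[0]
        transcript += line + "\n" + interpreter.run_line(line) + PROMPT
    return transcript


# A scripted "model" so the sketch runs without an LLM.
scripted = iter(["# compute the answer", "x = 6 * 7", "x", "1 / 0"])
final_transcript = repl_rollout(lambda prefill: next(scripted, None))
```

Because errors are written back into the transcript like any other output, the model sees its own tracebacks on the next call and can react to them.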

August-murr added the 🏋 GRPO Related to GRPO label Jan 31, 2025

August-murr mentioned this issue Jan 31, 2025

GRPO for RL on agent trajectories #2715

Open

August-murr mentioned this issue Feb 9, 2025

[Question] Proper data format for GRPO Agent Training #2809

Open

willccbb mentioned this issue Feb 9, 2025

GRPO Environments for custom multi-step rollouts (vLLM-only) #2810

Open

5 tasks

August-murr added the ✨ enhancement New feature or request label Feb 10, 2025

August-murr pinned this issue Feb 10, 2025

August-murr changed the title ~~Training Agents with GRPO~~ [Project] Training Agents with GRPO Feb 10, 2025
