[Project] Training Agents with GRPO #2723

Open
August-murr opened this issue Jan 31, 2025 · 7 comments
Labels
✨ enhancement New feature or request 🏋 GRPO Related to GRPO

Comments

@August-murr
Collaborator

August-murr commented Jan 31, 2025

Let's discuss how to train agents using GRPO.

Here, I will link sub-issues related to various problems, features, or questions that need resolution for implementing this idea.

@accupham

accupham commented Jan 31, 2025

So hear me out. I think the current style of function/tool calling as popularized by OpenAI does not lend itself to the expressiveness and fluidity of thought that R1-like models capitalize on and excel at. The feedback loop is too long.

Instead, I think the best way is to express function calls as mixed code and language. Let the user decide the calling convention, and inject the results inline.

Here's an example of what I mean. This is QwQ hooked up directly to a Python interpreter to do neurosymbolic reasoning. Every line it writes is fed directly into the REPL, and the results are fed back immediately to the LLM. I told it to express thoughts as Python comments.

Python 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> # Okay, let's tackle this step by step. The problem says:
>>> # If the equations x^2 - a = p * y and x^2 - b = p * y have integer solutions,
>>> # then x^2 - ab = p * y also has integer solutions. p is prime, and a, b are not divisible by p.
>>>
>>> from sympy import symbols, Eq, solveset, S
>>>
>>> # Let's define the symbols we need. I'll start with x, y, p, a, and b.
>>> x, y, p, a, b = symbols('x y p a b', integer=True)
>>>
>>> # Alright, I'll write out the first two equations.
>>> eq1 = Eq(x**2 - a, p * y)  # This is x^2 - a = p * y
>>> eq2 = Eq(x**2 - b, p * y)  # This is x^2 - b = p * y
>>>
>>> # The claim is that if these two equations have solutions,
>>> # then x^2 - ab = p * y should also have a solution. I'll write that equation too.
>>> eq3 = Eq(x**2 - a * b, p * y)  # This is x^2 - ab = p * y
>>>
>>> # Okay, let's check if we can find integer solutions for the first two equations.
>>> # I'm going to use `solveset` to see if x has solutions in the integers for eq1 and eq2.
>>> sol_x1 = solveset(eq1, x, domain=S.Integers)  # Solve eq1 for x
>>> sol_x1  # Let's see what we get here.
ConditionSet(x, Eq(x**2 - a, p*y), Integers)
>>>
>>> # Hmm, that's a generic solution for x in terms of the parameters.
>>> # Makes sense because we don't have concrete values for p, a, or y yet.
>>>
>>> sol_x2 = solveset(eq2, x, domain=S.Integers)  # Solve eq2 for x
>>> sol_x2  # Alright, let me look at this one too.
ConditionSet(x, Eq(x**2 - b, p*y), Integers)
>>>
>>> # Okay, so I have the solution sets for x in eq1 and eq2.
>>> # Now, let's think about what needs to happen for eq3 to have integer solutions.
>>>
>>> # eq3 implies x^2 ≡ ab (mod p) when reduced modulo p.
>>> # If a and b are quadratic residues modulo p, then their product ab should also be.
>>> # I know there's a property about the product of quadratic residues modulo a prime.
>>>
>>> # But let me explicitly check if solving eq3 gives integer solutions.
>>> sol_x3 = solveset(eq3, x, domain=S.Integers)  # Solve eq3 for x
>>> sol_x3  # Does this give us something valid?
ConditionSet(x, Eq(x**2 - a*b, p*y), Integers)
>>>
>>> # Hmm, this matches what I expected! x^2 ≡ ab (mod p) implies solutions exist
>>> # as long as ab is a quadratic residue mod p.
>>>
>>> # To confirm, I need to remember why a and b being residues implies ab is a residue.
>>> # Ah, right! The Legendre symbol tells us (a/p) * (b/p) = (ab/p),
>>> # and both (a/p) and (b/p) are 1 because a and b are residues.
>>>
>>> # So, if eq1 and eq2 have solutions, a and b are residues mod p,
>>> # and their product ab is also a residue mod p. That ensures eq3 has solutions too.
>>>
>>> # Final confirmation: we have integer solutions for x in eq3 if integer solutions exist for eq1 and eq2.
>>> # That settles it!
>>>
>>> sol_x1, sol_x2, sol_x3  # Let's print all these solutions for the final check.
(ConditionSet(x, Eq(x**2 - a, p*y), Integers),
ConditionSet(x, Eq(x**2 - b, p*y), Integers),
ConditionSet(x, Eq(x**2 - a*b, p*y), Integers))
>>> # Solutions are correct.
>>> submit_final_answer(sol_x1, sol_x2, sol_x3)

The fast feedback loop affords us a few crucial things:

  • The model can recover from errors (syntax errors, failed imports, unexpected results) in real time
  • It can learn from its mistakes and correct course mid-trajectory
  • Each line execution can be used as a reward or punishment (e.g., -0.3 for a syntax error)
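A minimal sketch of what per-line reward shaping could look like, assuming each generated line is executed in a shared namespace. Only the -0.3 syntax-error penalty comes from the comment above; the function names, the runtime-error penalty, and the success bonus are hypothetical values chosen for illustration.

```python
import ast


def line_reward(line: str, namespace: dict) -> float:
    """Score one REPL line by how its execution went."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return 0.0  # blank lines and comment-only "thoughts" are neutral
    try:
        tree = ast.parse(stripped)
    except SyntaxError:
        return -0.3  # penalty value suggested in the comment above
    try:
        # NOTE: exec is NOT sandboxed; a real setup would isolate this.
        exec(compile(tree, "<rollout>", "exec"), namespace)
    except Exception:
        return -0.1  # hypothetical, milder penalty for runtime errors
    return 0.05  # hypothetical small bonus for a clean execution


def trajectory_rewards(lines: list) -> list:
    """Run lines in order in one shared namespace, one reward per line."""
    namespace = {}
    return [line_reward(line, namespace) for line in lines]
```

Because the namespace is shared, a later line can be rewarded for using a variable that an earlier line defined, which mirrors the stateful REPL in the example.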

Another example of more traditional agentic function calling:

def weather(city: str) -> str:
    return "The weather in Seattle is 40 deg F and sunny all day."
User: Should I pack an umbrella today?
Assistant: <think>Ok, so the user wants to know if it's going to rain or not. I should look up the weather using the `weather()` tool. But wait, I don't know where the user is currently located. I should ask before using the tool. </think><answer>What city are you currently in?</answer>
User: Seattle
Assistant: <think>I'll look up the weather. <fn>weather("Seattle")</fn>
<fn_error>
WeatherLookupError: Must specify a two-letter state.
</fn_error>
<fn>weather("Seattle, WA")</fn>
<fn_results>
The weather in Seattle is 40 deg F and sunny all day.
</fn_results>
Ah, so according to the results, it's unlikely the user will need an umbrella because it will not rain.</think>
<answer>
The weather is nice and sunny today, no need to pack an umbrella today. Can I assist with anything else?
</answer>
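The inline calling convention in this transcript could be wired up roughly as follows. This is a sketch under assumptions: the `<fn>`, `<fn_results>`, and `<fn_error>` tags and the `WeatherLookupError` message are taken from the transcript, while `TOOLS`, `FN_CALL`, `run_pending_call`, and the state-suffix check inside `weather` are hypothetical and only handle a single string argument.

```python
import re


class WeatherLookupError(Exception):
    """Raised by the stub tool when the city has no state suffix."""


def weather(city: str) -> str:
    # Stub tool: requires a trailing ", XX" two-letter state.
    if not re.search(r",\s*[A-Z]{2}$", city):
        raise WeatherLookupError("Must specify a two-letter state.")
    return "The weather in Seattle is 40 deg F and sunny all day."


TOOLS = {"weather": weather}
# Matches a call such as <fn>weather("Seattle, WA")</fn> at the end of the text.
FN_CALL = re.compile(r'<fn>(\w+)\("([^"]*)"\)</fn>\s*$')


def run_pending_call(text: str) -> str:
    """If text ends with a <fn> call, run it and append the result block."""
    match = FN_CALL.search(text)
    if match is None:
        return text
    name, arg = match.groups()
    tool = TOOLS.get(name)
    try:
        if tool is None:
            raise NameError(f"unknown tool {name!r}")
        body, tag = tool(arg), "fn_results"
    except Exception as exc:
        body, tag = f"{type(exc).__name__}: {exc}", "fn_error"
    return f"{text.rstrip()}\n<{tag}>\n{body}\n</{tag}>\n"
```

Generation would stop at `</fn>`, pass the text through `run_pending_call`, and resume from the returned string, so errors and results both land inline in the model's context.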

@accupham

My opinion is to standardize around vLLM's LLM API. We should pass in a user-defined RolloutSampler class, which takes in a vLLM LLM instance, and let the user figure out how to do function calling during rollout sampling. If they want to do it the standard way, they could use the LLM.chat() API with tools and call it the traditional way. If they wanted to do something more interactive and real-time, they could stream tokens, detect llama-3.2 style function calls with some sort of regex hook, and inject the results into the output stream.
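One possible shape for such an interface, as a sketch only. Nothing here is vLLM's or TRL's actual API: `Rollout`, the `sample` signature, and `SingleShotSampler` are hypothetical, and a plain `generate` callable stands in for the vLLM LLM object so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class Rollout:
    prompt: str
    completion: str
    # (text, from_model) pairs; from_model is False for injected tool output.
    segments: list = field(default_factory=list)


class RolloutSampler(Protocol):
    """Hypothetical contract: the trainer only needs `sample`."""

    def sample(self, generate: Callable[[str], str], prompts: list) -> list:
        ...


class SingleShotSampler:
    """Trivial sampler: one generation per prompt, no tool calls."""

    def sample(self, generate: Callable[[str], str], prompts: list) -> list:
        rollouts = []
        for prompt in prompts:
            text = generate(prompt)
            rollouts.append(Rollout(prompt, text, [(text, True)]))
        return rollouts
```

A tool-calling sampler would implement the same `sample` method but alternate between generation and tool execution, marking injected spans with `from_model=False` so they can be excluded from the loss later.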

@August-murr
Collaborator Author

@accupham I think it's better to take things step by step.
Let's build a minimal prototype that works, and then we can focus on different ideas and ways to improve it.

@xiangjjj

xiangjjj commented Feb 9, 2025

Any considerations for the observation tokens from tool use? I don't think we should compute KL for those tokens.
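To make the masking idea concrete, here is a small sketch that averages a per-token quantity (such as a KL estimate) over model-generated tokens only. The function name, the mask convention (1 = model token, 0 = tool observation), and the numbers are assumptions for illustration, not TRL's implementation.

```python
def masked_mean(per_token_values: list, model_token_mask: list) -> float:
    """Average values over positions where the mask is truthy."""
    kept = [v for v, m in zip(per_token_values, model_token_mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Tokens 2 and 3 stand for injected tool output; their large values are
# ignored because the policy did not choose those tokens.
per_token_kl = [0.25, 0.5, 4.0, 4.0, 0.75]
model_token_mask = [1, 1, 0, 0, 1]
kl_term = masked_mean(per_token_kl, model_token_mask)
```

The same mask would apply to the policy-gradient term, since observation tokens carry no gradient signal about the model's own decisions.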

@August-murr August-murr added the ✨ enhancement New feature or request label Feb 10, 2025
@August-murr August-murr pinned this issue Feb 10, 2025
@August-murr August-murr changed the title Training Agents with GRPO [Project] Training Agents with GRPO Feb 10, 2025
@willccbb

This PR (#2810) addresses @accupham's suggestion to allow user-defined rollout logic which wraps vLLM. I'd be curious to hear whether this is sufficient for what people have in mind for now.

The protocol here could potentially be extended to allow user-defined masks (for tool calls), and to let rewards be computed at this stage too.

@jlia0

jlia0 commented Feb 12, 2025

(Quoted: @accupham's Jan 31 comment above, from "So hear me out" through the weather tool-calling example.)

This is exactly what I have been thinking about and tinkering with as well. How did you get QwQ to do "neurosymbolic reasoning" / "inline function calls" like in the example?

@accupham

The system prompt was quite simple:

You are now operating as a stateful Python REPL environment. You can use it as memory buffer and scratch pad as a goal-seeking agent.

Then you set the prefill to the default python REPL intro text:

Python 3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

At this point we just run a while-loop with the stop token set to "\n". Feed each line the LLM generates into a REPL or some other stateful code execution environment, then append the line, its output, and a fresh ">>>" prompt to the end of the prefill. Feed that prefill into another LLM call and continue the completion from there.
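The loop described here could be sketched with the standard library's `code.InteractiveConsole`. Assumptions to note: `complete` is a hypothetical stand-in for the LLM call (it should return one line, as if the stop token were "\n"), `max_steps` and the `submit_final_answer` stop check are illustrative, and the "... " continuation prompt for multi-line statements is an addition not mentioned in the thread.

```python
import code
import contextlib
import io


def run_line(console: code.InteractiveConsole, line: str):
    """Feed one line to the console; return (captured output, needs_more)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer), contextlib.redirect_stderr(buffer):
        more = console.push(line)
    return buffer.getvalue(), more


def repl_rollout(complete, prefill=">>> ", max_steps=20,
                 stop_marker="submit_final_answer"):
    """Alternate between one-line completions and REPL execution."""
    console = code.InteractiveConsole()  # stateful, but NOT sandboxed
    for _ in range(max_steps):
        line = complete(prefill).rstrip("\n")
        if line.strip().startswith(stop_marker):
            return prefill + line + "\n"
        output, more = run_line(console, line)
        # Append the line, whatever the REPL printed, and the next prompt.
        prefill += line + "\n" + output + ("... " if more else ">>> ")
    return prefill
```

Usage with a scripted stand-in for the model:

```python
script = iter(["x = 20 + 1", "x * 2", "submit_final_answer(x)"])
transcript = repl_rollout(lambda prefill: next(script) + "\n")
print(transcript)
```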


I think a Jupyter-notebook-like environment might be more appropriate next time, since it is easier to sandbox.


5 participants