[Project] Training Agents with GRPO #2723
Comments
So hear me out. I think the current style of function/tool calling, as popularized by OpenAI, does not lend itself to the expressiveness and fluidity of thought that R1-like models capitalize on and excel at. The feedback loop is too long. Instead, I think the best way is to express function calls as mixed code and language: let the user decide the calling convention, and inject the results inline.
Here's an example of what I mean. This is QwQ hooked up directly to a Python interpreter to do neurosymbolic reasoning. Every line the model enters is fed directly into the REPL, and the results are fed back immediately to the LLM. I told it to express its thoughts as Python comments.
Python 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> # Okay, let's tackle this step by step. The problem says:
>>> # If the equations x^2 - a = p * y and x^2 - b = p * y have integer solutions,
>>> # then x^2 - ab = p * y also has integer solutions. p is prime, and a, b are not divisible by p.
>>>
>>> from sympy import symbols, Eq, solveset, S
>>>
>>> # Let's define the symbols we need. I'll start with x, y, p, a, and b.
>>> x, y, p, a, b = symbols('x y p a b', integer=True)
>>>
>>> # Alright, I'll write out the first two equations.
>>> eq1 = Eq(x**2 - a, p * y) # This is x^2 - a = p * y
>>> eq2 = Eq(x**2 - b, p * y) # This is x^2 - b = p * y
>>>
>>> # The claim is that if these two equations have solutions,
>>> # then x^2 - ab = p * y should also have a solution. I'll write that equation too.
>>> eq3 = Eq(x**2 - a * b, p * y) # This is x^2 - ab = p * y
>>>
>>> # Okay, let's check if we can find integer solutions for the first two equations.
>>> # I'm going to use `solveset` to see if x has solutions in the integers for eq1 and eq2.
>>> sol_x1 = solveset(eq1, x, domain=S.Integers) # Solve eq1 for x
>>> sol_x1 # Let's see what we get here.
ConditionSet(x, Eq(x**2 - a, p*y), Integers)
>>>
>>> # Hmm, that's a generic solution for x in terms of the parameters.
>>> # Makes sense because we don't have concrete values for p, a, or y yet.
>>>
>>> sol_x2 = solveset(eq2, x, domain=S.Integers) # Solve eq2 for x
>>> sol_x2 # Alright, let me look at this one too.
>>> # >>> ConditionSet(x, Eq(x**2 - b, p*y), Integers)
>>>
>>> # Okay, so I have the solution sets for x in eq1 and eq2.
>>> # Now, let's think about what needs to happen for eq3 to have integer solutions.
>>>
>>> # eq3 implies x^2 ≡ ab (mod p) when reduced modulo p.
>>> # If a and b are quadratic residues modulo p, then their product ab should also be.
>>> # I know there's a property about the product of quadratic residues modulo a prime.
>>>
>>> # But let me explicitly check if solving eq3 gives integer solutions.
>>> sol_x3 = solveset(eq3, x, domain=S.Integers) # Solve eq3 for x
>>> sol_x3 # Does this give us something valid?
ConditionSet(x, Eq(x**2 - a*b, p*y), Integers)
>>>
>>> # Hmm, this matches what I expected! x^2 ≡ ab (mod p) implies solutions exist
>>> # as long as ab is a quadratic residue mod p.
>>>
>>> # To confirm, I need to remember why a and b being residues implies ab is a residue.
>>> # Ah, right! The Legendre symbol tells us (a/p) * (b/p) = (ab/p),
>>> # and both (a/p) and (b/p) are 1 because a and b are residues.
>>>
>>> # So, if eq1 and eq2 have solutions, a and b are residues mod p,
>>> # and their product ab is also a residue mod p. That ensures eq3 has solutions too.
>>>
>>> # Final confirmation: we have integer solutions for x in eq3 if integer solutions exist for eq1 and eq2.
>>> # That settles it!
>>>
>>> sol_x1, sol_x2, sol_x3 # Let's print all these solutions for the final check.
(ConditionSet(x, Eq(x**2 - a, p*y), Integers),
ConditionSet(x, Eq(x**2 - b, p*y), Integers),
ConditionSet(x, Eq(x**2 - a*b, p*y), Integers))
>>> # Solutions are correct.
>>> submit_final_answer(sol_x1, sol_x2, sol_x3)
The fast feedback loop affords us a few crucial things:
Another example of more traditional agentic function calling:
def weather(city: str) -> str:
    return "The weather in Seattle is 40 deg F and sunny all day."
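The rest of this example did not survive in the thread, so the following is only a minimal sketch of how such a tool might be exposed under the inline convention described earlier. It assumes the interpreter namespace doubles as the tool registry. `run_line` and the transcript handling are hypothetical helpers, not part of the original setup.

```python
import contextlib
import io


def weather(city: str) -> str:
    # Stub tool from the example above; a real tool would query a service.
    return "The weather in Seattle is 40 deg F and sunny all day."


def run_line(line: str, namespace: dict) -> str:
    """Run one model-written line and return the text a REPL would show.

    Hypothetical helper: expressions echo their repr, statements print nothing
    unless they call print() themselves. Runtime errors are not caught here.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            compiled = compile(line, "<llm>", "eval")
        except SyntaxError:
            # Not an expression (assignment, import, comment): run as a statement.
            exec(compile(line, "<llm>", "exec"), namespace)
        else:
            result = eval(compiled, namespace)
            if result is not None:
                print(repr(result))
    return buffer.getvalue()


# The namespace is the calling convention: whatever is in it, the model can call.
namespace = {"weather": weather}
transcript = ">>> # The user asked about Seattle, so I will call the tool.\n"
line = 'weather("Seattle")'
transcript += f">>> {line}\n{run_line(line, namespace)}>>> "
print(transcript)
```

The tool result lands in the transcript right after the call, so the model can keep writing from the next `>>> ` prompt without a separate tool-message round trip.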
My opinion is to standardize around vLLM's LLM API. We should pass in a user defined
@accupham I think it's better to take things step by step.
Any considerations for the observation tokens from tool use? I don't think we should compute KL for those tokens.
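To make the masking idea concrete, here is a minimal sketch that assumes a per-token KL estimate is already available for each completion. Plain Python lists stand in for tensors, and `masked_mean`, `per_token_kl`, and `completion_mask` are illustrative names with made-up numbers, not an existing trainer API.

```python
from typing import Sequence


def masked_mean(per_token_values: Sequence[float], mask: Sequence[int]) -> float:
    """Average the values at positions where mask == 1."""
    if len(per_token_values) != len(mask):
        raise ValueError("values and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing was generated by the policy in this sequence
    return sum(v * m for v, m in zip(per_token_values, mask)) / kept


# Hypothetical per-token KL estimates for one six-token completion.
per_token_kl = [0.10, 0.20, 0.90, 0.70, 0.30, 0.40]
# 1 = token sampled from the policy, 0 = observation token injected by the tool.
completion_mask = [1, 1, 0, 0, 1, 1]

masked_kl = masked_mean(per_token_kl, completion_mask)
unmasked_kl = masked_mean(per_token_kl, [1] * len(per_token_kl))
print(f"masked: {masked_kl:.4f}  unmasked: {unmasked_kl:.4f}")
```

With the mask applied, the two observation tokens no longer contribute to the penalty, and the average is taken over the four policy tokens only.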
This PR (#2810) addresses @accupham's suggestion to allow user-defined rollout logic that wraps vLLM. I'd be curious to hear whether this is sufficient for what people have in mind for now. The protocol here could potentially be extended to allow user-defined masks (for tool calls), and to let rewards be computed at this stage as well.
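As a rough illustration of what such an extended protocol might look like, here is a sketch of a rollout callable that returns token ids together with a tool-call mask and an optional reward. Every name here (`Rollout`, `RolloutFn`, `toy_rollout`, `collect`) is hypothetical and is not the interface from the PR. The token ids are made up, and no vLLM call is shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Rollout:
    """Hypothetical container for one sampled completion."""
    completion_ids: List[int]
    completion_mask: List[int]      # 1 = policy token, 0 = tool-output token
    reward: Optional[float] = None  # set only if the rollout scores itself


class RolloutFn(Protocol):
    """Hypothetical user-defined rollout: prompts in, rollouts out."""
    def __call__(self, prompts: List[str]) -> List[Rollout]: ...


def toy_rollout(prompts: List[str]) -> List[Rollout]:
    # Stand-in for user logic that would wrap a generation engine and a tool.
    rollouts = []
    for _ in prompts:
        model_ids = [11, 12]         # tokens the policy wrote before the tool call
        tool_ids = [901, 902, 903]   # tokens injected from the tool result
        answer_ids = [13]            # tokens the policy wrote afterwards
        ids = model_ids + tool_ids + answer_ids
        mask = [1] * len(model_ids) + [0] * len(tool_ids) + [1] * len(answer_ids)
        rollouts.append(Rollout(ids, mask, reward=1.0))
    return rollouts


def collect(rollout_fn: RolloutFn, prompts: List[str]) -> List[Rollout]:
    """Call the user rollout and check that ids and masks line up."""
    rollouts = rollout_fn(prompts)
    for rollout in rollouts:
        if len(rollout.completion_ids) != len(rollout.completion_mask):
            raise ValueError("mask length must match completion length")
    return rollouts


batch = collect(toy_rollout, ["prompt A", "prompt B"])
print(batch[0])
```

Keeping the mask and reward on the rollout object means the trainer would not need to know how tool calls were interleaved, only which positions to count.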
This is exactly what I have been thinking about and tinkering with as well. How did you get QwQ to do "neurosymbolic reasoning" / "inline function call" like in the example?
The system prompt was quite simple:
Then you set the prefill to the default python REPL intro text:
At this point we just run a while-loop with the stop token set to "\n". Feed the LLM's text into a REPL or some other stateful code execution environment, and concatenate the results to the end of the prefill, plus ">>>". Feed that prefill into another LLM call and continue the completion from there. I think a Jupyter-notebook-like environment might be more appropriate next time, since it is easier to sandbox.
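The loop described above can be sketched roughly as follows. This is an approximation under stated assumptions, not the original script: `complete_line` is a hypothetical stand-in for the LLM call (it should return text up to the "\n" stop token), the standard-library `code.InteractiveInterpreter` plays the role of the REPL, only single-line statements are handled, and there is no sandboxing.

```python
import code
import contextlib
import io
from typing import Callable

# Banner text copied from the transcript earlier in the thread.
BANNER = (
    "Python 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] on darwin\n"
    'Type "help", "copyright", "credits" or "license" for more information.\n'
)


def repl_rollout(complete_line: Callable[[str], str], max_steps: int = 20) -> str:
    """Alternate between one model-written line and one REPL evaluation."""
    interpreter = code.InteractiveInterpreter()  # keeps state between lines
    prefill = BANNER + ">>> "
    for _ in range(max_steps):
        line = complete_line(prefill)  # hypothetical LLM call, stops at "\n"
        prefill += line + "\n"
        if line.startswith("submit_final_answer"):
            break  # assumed convention: this call ends the episode
        output = io.StringIO()
        with contextlib.redirect_stdout(output), contextlib.redirect_stderr(output):
            interpreter.runsource(line)  # echoes expression values and tracebacks
        # Concatenate the results to the prefill, plus the next prompt marker.
        prefill += output.getvalue() + ">>> "
    return prefill


# Usage with a scripted stand-in for the model.
scripted = iter(["# add two numbers", "x = 20 + 22", "x", "submit_final_answer(x)"])
print(repl_rollout(lambda _prefill: next(scripted)))
```

Because the interpreter object persists across iterations, variables defined on one line are visible on the next, which is what lets the model build up state step by step.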
Let's discuss how to train agents using GRPO.
Here, I will link sub-issues related to various problems, features, or questions that need resolution for implementing this idea.