Fix Tool Calling #81
Conversation
Force-pushed from c02e6fd to 5c7a0dd

linted and rebased ^
Thanks v much JP! Quick couple of questions:
# Append the tool calls as an assistant response
request_params["messages"].append(
{
"role": "assistant",
"tool_calls": message.tool_calls,
}
)
# Append the tool message
request_params["messages"].append(
{
"role": "tool",
"tool_call_id": tool_call.id,
"content": str(result),
}
)
In general, could you please give a couple of examples of how responses have changed from using our current code vs this new code? We explicitly stuck to one tool call at a time previously, as calling multiple functions at a time often led to suboptimal results because different tool calls weren't aware of what previous calls did. Want to make sure that this change leads to useful new functionality and doesn't cause unexpected regressions!
Thanks for the review boss!
Replied to 1 inline.
For 2: we previously had tool_choice as auto by default, and that is still the same; we didn't change the defaults. The PR description about tool_choice was more about my own learning after AI edits removed it and my tests kept failing 🤦🏼
In general, could you please give a couple of examples of how responses have changed from using our current code vs this new code?
Sure thing! 1 example is one of the test cases, which I've modified to the following question: "Square 1 has length 31283 and width 2323. square 2 has length 1493 and width 951. What is their combined area? Always use the tools provided for all calculation, even simple calculations. Return only the final answer, nothing else."
We can now do 2 numprod's after the 1st llm call (the two calls are independent of each other), followed by 1 numsum after the 2nd llm call, and a final 3rd llm call to synthesize the final answer (you can verify this in the tests for anthropic - the mocking was quite involved so I only did it for anthropic unfortunately). This is in contrast to the past, where there would be an additional llm call between the 2 numprod's, because of the constraint of running at most 1 tool per llm call. In other words, multiple tool calls make things faster and more token-efficient where it makes sense.
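To make the round structure concrete, here's a rough Python sketch of the arithmetic the tools end up doing, using stand-in implementations of numprod and numsum (the real test tools may be wired differently):

def numprod(a: int, b: int) -> int:
    # Stand-in for the numprod test tool
    return a * b

def numsum(a: int, b: int) -> int:
    # Stand-in for the numsum test tool
    return a + b

# LLM call 1 -> two independent tool calls, now executed in the same round
area1 = numprod(31283, 2323)  # 72670409
area2 = numprod(1493, 951)    # 1419843

# LLM call 2 -> one tool call that combines both results
combined = numsum(area1, area2)  # 74090252

# LLM call 3 -> synthesizes the final answer, "74090252"; previously an extra
# LLM call sat between the two numprod calls because only one tool ran per round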
While this seems a bit of a toy-case, the end-game I'm aiming for is some sort of "LLM memory" tool, where the LLM can insert/update/delete various facts / findings / state along the way in a more unstructured manner. 1 way I like to think of this memory is that it is a data structure that contains the "latest" and "most concise" representation of key information, vs a growing list (chat history). We'll likely need both down the line as they both have different strengths. To facilitate memory insert / update / delete operations, the ideal case is to have that as a 2nd tool call in addition to a normal tool call e.g. text2sql. Besides storing findings and additional context from tool calls, the memory can also keep track of various parameters like function call budgets (e.g. you can only ask the user 3 questions / make n requests to an expensive API).
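Purely as an illustration of the idea (none of this exists in the library yet, and every name below is hypothetical), such a memory tool could be declared alongside a normal tool in OpenAI's function-tool format, so the model can emit both in one round:

# Hypothetical sketch only: a "memory" tool declared next to a normal tool,
# so the model can call both in a single round. No such tool exists yet.
memory_tool = {
    "type": "function",
    "function": {
        "name": "update_memory",
        "description": "Insert, update, or delete a fact in the running memory store.",
        "parameters": {
            "type": "object",
            "properties": {
                "operation": {"type": "string", "enum": ["insert", "update", "delete"]},
                "key": {"type": "string", "description": "Name of the fact or counter, e.g. a question budget."},
                "value": {"type": "string", "description": "New value; omitted for deletes."},
            },
            "required": ["operation", "key"],
        },
    },
}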
Having a memory is also an enabler that allows us to implement concurrent asynchronous user inputs - we get the LLM to queue a request to the user via a tool call, and it can then poll the "memory" via another tool call to see whether the user has answered the question. If we only allow 1 tool call per round, the model would either be constantly polling the memory, or doing other tasks and potentially forgetting to check on the memory. If we allow 2 tool calls per round, the model can run a tool (as it normally would) while additionally polling the memory to see if the user has returned with a response. If there is a response, the next llm response can process it further. If not, this just stays as an additional concurrent call per round of messages.
Where things currently stand (no memory implementation yet), this PR has no material effect, since we typically only give the LLM 1 or 2 tools and explicitly instruct it to use only one of them. With multiple tool calling, we can go on to implement "memory" access without worrying about normal tool calls getting affected.
Last but not least, imho constraining the model to use 1 tool per round is a weak form of scaffolding. Given how we're trying to build for the future, giving the model the chance to use multiple tools at a go seems to be where we could potentially find scale (breadth of analyses per round), besides depth, which we are already actively exploring. As models learn to schedule and work concurrently better, multiple / concurrent tool use is only going to get more reliable / efficient. While models might not be there yet, having the code ready to test when they are would be helpful.
# Append the tool calls as an assistant response
request_params["messages"].append(
    {
        "role": "assistant",
        "tool_calls": message.tool_calls,
    }
)

# Append error message to request_params and retry
request_params["messages"].append(
    {
        "role": "assistant",
        "content": str(e),
    }
)

# Append the tool result
request_params["messages"].append(
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result),
    }
)
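For context, here is a simplified sketch of how the three snippets above fit together; execute_tool is a hypothetical stand-in for the library's own dispatch, and the real _process_*_response functions handle more provider-specific detail:

# Simplified sketch; execute_tool is a hypothetical stand-in.
# 1. Echo the model's tool calls back as an assistant message
request_params["messages"].append(
    {"role": "assistant", "tool_calls": message.tool_calls}
)
try:
    # 2. Run every tool the model asked for, not just the first
    for tool_call in message.tool_calls:
        result = execute_tool(tool_call)
        # Append each tool result, keyed to the originating tool call
        request_params["messages"].append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            }
        )
except Exception as e:
    # 3. On failure, append the error so the model can self-correct, then retry
    request_params["messages"].append(
        {"role": "assistant", "content": str(e)}
    )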
Question 1
For reference, this is the old code you shared:
# Append the tool calls as an assistant response
request_params["messages"].append(
{
"role": "assistant",
"tool_calls": message.tool_calls,
}
)
# Append the tool message
request_params["messages"].append(
{
"role": "tool",
"tool_call_id": tool_call.id,
"content": str(result),
}
)
The only thing that changed here is that we changed the role from assistant to tool in the 2nd part, based on OpenAI's documentation. From my tests, the responses we were getting before/after the change weren't significantly different.
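For reference, this matches the shape OpenAI's chat completions API documents for a tool round trip: the assistant message carries the tool_calls, and each result comes back as a role "tool" message keyed by tool_call_id (the values below are illustrative):

messages = [
    {"role": "user", "content": "What is 1493 * 951?"},
    # assistant turn echoing the tool call(s) the model made
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {"name": "numprod", "arguments": '{"a": 1493, "b": 951}'},
            }
        ],
    },
    # one "tool" message per tool call, keyed by tool_call_id
    {"role": "tool", "tool_call_id": "call_abc123", "content": "1419843"},
]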
I think it's the same thing JP :) I think you're looking at just a specific part where we were logging errors – which was useful for self-correction.
Thanks for the note JP! Closing this for the following reasons:
This does not just let the model use multiple tools at a go – it also lets it use the same tool more than once in the same call. The problem with this is that parallel tool calls are not context dependent. That is, they do not know the results of the previous tool calls that were made. For the kinds of tasks we are using this library for (data analysis), context awareness between tool calls is critical (and is what makes the analyses in the Oracle so good). The only advantage that parallel tool calls give us is better latency and token efficiency – but at the expense of potentially high-quality analyses. We have experimented with parallel tool calls previously and they resulted in significantly worse analyses for complex problems (in fact, that was the first iteration – because OAI does parallel tool calls by default). So closing this for now.
Motivation
Previously, we were only running the first tool when tool calls were enabled, which resulted in wasted round trips with the LLM if the model requested multiple tool calls that we could have run locally. DEF-757
Changes
- Updated the _process_{providers}_response functions to run all indicated tools instead of just the first.
- Content (if > 1 tool is indicated) is returned with the parts attribute containing a list of each tool's execution results as Part.from_function_response objects (see the sketch after this list).
- tool_choice="auto" must stay set after a round of tool use; if not, the model will keep returning function calls instead of synthesizing the results from the function execution.
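A minimal sketch of the Content/parts packing described in the list above, assuming the Vertex AI SDK's generative_models classes; run_tool and function_calls are hypothetical stand-ins for the library's own plumbing:

from vertexai.generative_models import Content, Part

# Pack every tool's execution result into one Content via its parts attribute
parts = [
    Part.from_function_response(
        name=fc.name,                       # name of the function the model called
        response={"result": run_tool(fc)},  # each tool's execution result
    )
    for fc in function_calls  # > 1 tool indicated by the model
]
tool_response_content = Content(parts=parts)  # appended to the conversation for the next LLM call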
Tests
Updated a whole bunch of tests. They should all pass now, although from time to time I do get random API failures from the weather API.
Misc
I tried to refactor the common parts of all 3 providers' _process_*_response functions, but failed because I hadn't appreciated the nuances of the differences between each of them, and ended up wasting a lot of time trying to bend the tests. I ended up scrapping the whole refactoring and concentrating on just getting the updated functionality to work, 1 provider at a time, always running and updating the tests first before moving on to the next. This made the fixing a lot more manageable and less prone to hallucinations or failure sinks.