Fix Tool Calling #81

Closed · wants to merge 1 commit
Conversation

@wongjingping (Collaborator) commented Mar 3, 2025

Motivation

Previously, when tool calls were enabled, we only ran the first tool the model requested. This wasted round trips with the LLM whenever a response contained multiple tool calls that we could have run locally. DEF-757

Changes

  • We refactored the 3 _process_{providers}_response functions to run every tool the model indicates instead of just the first (see the sketch after this list).
  • We also updated the follow-on API calls' messages and parameters based on each provider's function-calling documentation:
    • OpenAI
      • the tool-result message's "role" should be "tool", not "assistant" (docs)
    • Anthropic
      • set "is_error" when we get an exception during tool execution (docs)
    • Gemini
      • when more than 1 tool is indicated, the tool execution results should be returned in a single Content whose parts attribute is a list of Part.from_function_response objects, one per tool's result.
  • We also shifted providers over to an enum to make switching between providers more robust.
  • Learnt that it is very important to set tool_choice="auto" after a round of tool use; otherwise the model keeps returning function calls instead of synthesizing the results of the function execution.
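
To make the new flow concrete, here is a minimal sketch of the per-response loop in OpenAI's message format. It is an illustration rather than the actual PR code: run_all_tool_calls and tools_by_name are hypothetical names, and error handling is simplified.

# Minimal illustrative sketch (not the actual implementation in this PR):
# run every tool call in the model's response, then append the results
# before the follow-on API call. `run_all_tool_calls` and `tools_by_name`
# are hypothetical names.
import json

def run_all_tool_calls(message, request_params, tools_by_name):
    # Echo the assistant turn that requested the tools.
    request_params["messages"].append(
        {"role": "assistant", "tool_calls": message.tool_calls}
    )
    # Run every indicated tool, not just the first one.
    for tool_call in message.tool_calls:
        func = tools_by_name[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        try:
            result = func(**args)
        except Exception as e:
            result = f"error: {e}"
        # Per OpenAI's docs, each result goes back as a "tool" message
        # keyed by the tool_call_id it answers.
        request_params["messages"].append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            }
        )
    # Let the model decide whether to call more tools or answer.
    request_params["tool_choice"] = "auto"

Anthropic and Gemini follow the same loop but package the results differently: Anthropic as tool_result content blocks (with is_error set on failure), and Gemini as a single Content whose parts are Part.from_function_response entries.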

Tests

Updated a whole bunch of tests. They should all pass now, although from time to time I get random API failures from the weather API.

Misc

I tried to refactor the common parts of all 3 providers' _process_*_response functions, but failed because I hadn't appreciated the nuances of the differences between them, and ended up wasting a lot of time trying to bend the tests. I ended up scrapping the whole refactoring and concentrated on getting the updated functionality to work one provider at a time, always running and updating the tests before moving on to the next. This made the fixing a lot more manageable and less prone to hallucinations or failure sinks.

@wongjingping (Collaborator, Author)

linted and rebased ^

@rishsriv (Member) commented Mar 3, 2025

Thanks v much JP!

Quick couple of questions:

  1. I didn't quite understand the OpenAI change. We currently have something like this, which is what OAI seems to expect. What am I missing here?
# Append the tool calls as an assistant response
request_params["messages"].append(
    {
        "role": "assistant",
        "tool_calls": message.tool_calls,
    }
)

# Append the tool message
request_params["messages"].append(
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result),
    }
)
  2. Previously, we had tool_choice as auto by default, unless explicitly changed by the user. When tools are provided, all providers keep tool_choice as auto (unless we explicitly change it). I couldn't quite understand what the change here is.

In general, could you please give a couple of examples of how responses have changed from using our current code vs this new code? We explicitly stuck to one tool call at a time previously, because calling multiple functions at a time often led to suboptimal results: different tool calls weren't aware of what previous calls did.

Want to make sure that this change leads to useful new functionality and does not cause unexpected regressions!

@wongjingping (Collaborator, Author) left a comment

Thanks for the review boss!

Replied to question 1 inline.

For 2:

Previously, we had tool_choice as auto by default

That is still the same; we didn't change the defaults. The PR description about tool_choice was more about my own learning after AI edits removed it and my tests kept failing 🤦🏼

In general, could you please give a couple of examples of how responses have changed from using our current code vs this new code?

Sure thing! One example is one of the test cases, which I've modified to use the following question: "Square 1 has length 31283 and width 2323. square 2 has length 1493 and width 951. What is their combined area? Always use the tools provided for all calculation, even simple calculations. Return only the final answer, nothing else."

We can now run 2 numprod's after the 1st LLM call (they are independent of each other), followed by 1 numsum after the 2nd LLM call, and a final 3rd LLM call to synthesize the final answer (you can verify this in the tests for Anthropic - the mocking was quite involved, so I only did it for Anthropic unfortunately). This is in contrast to the past, where there would be an additional LLM call between the 2 numprod's because of the constraint of running at most 1 tool per LLM call. i.e. multiple tool calls make things faster and more token-efficient where it makes sense.
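
For illustration, this is roughly what the Anthropic message flow looks like for that test; a hedged sketch where the IDs and input keys are made up, only the shape matters:

# Illustrative sketch of the Anthropic message flow (IDs and input keys are
# hypothetical). One assistant turn requests both numprod calls, and one
# user turn returns both results.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "tool_use", "id": "toolu_1", "name": "numprod",
         "input": {"a": 31283, "b": 2323}},
        {"type": "tool_use", "id": "toolu_2", "name": "numprod",
         "input": {"a": 1493, "b": 951}},
    ],
}
tool_results_turn = {
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": "72670409"},
        {"type": "tool_result", "tool_use_id": "toolu_2", "content": "1419843"},
    ],
}
# 2nd LLM call -> a single numsum tool_use over the two areas;
# 3rd LLM call -> synthesizes the final answer.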

While this seems like a bit of a toy case, the end-game I'm aiming for is some sort of "LLM memory" tool, where the LLM can insert/update/delete various facts / findings / state along the way in a more unstructured manner. One way I like to think of this memory is as a data structure that contains the "latest" and "most concise" representation of key information, vs a growing list (chat history). We'll likely need both down the line as they have different strengths. To facilitate memory insert / update / delete operations, the ideal case is to have that as a 2nd tool call in addition to a normal tool call, e.g. text2sql. Besides storing findings and additional context from tool calls, the memory can also keep track of various parameters like function call budgets (e.g. you can only ask the user 3 questions / make n requests to an expensive API).
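
Purely as a sketch of the idea (nothing in this PR implements it), such a memory tool could look something like the following; every name and field here is hypothetical:

# Hypothetical memory tool -- not part of this PR. The intent is that the
# model could call this in the same round as a normal tool (e.g. text2sql)
# once multiple tool calls per round are supported.
memory: dict[str, str] = {}

def memory_op(action: str, key: str, value: str = "") -> str:
    """Insert/update/delete/read a fact in a flat key-value memory."""
    if action in ("insert", "update"):
        memory[key] = value
        return f"stored {key}"
    if action == "delete":
        memory.pop(key, None)
        return f"deleted {key}"
    return memory.get(key, "")  # action == "read"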

Having a memory is also an enabler for concurrent asynchronous user inputs - we get the LLM to queue a request to the user via a tool call, and it can then poll the memory (via another tool call) to see whether the user has answered the question. If we only allow 1 tool call per round, the model would either be constantly polling the memory, or doing other tasks and potentially forgetting to check on the memory. If we allow 2 tool calls per round, the model can run a tool (as it normally would) while additionally polling the memory to see if the user has returned with a response. If there is a response, the next LLM call can process it further; if not, this just stays as an additional concurrent call per round of messages.

Where things are currently (no memory implementation yet), this PR has no material effect, since we typically only give the LLM 1 or 2 tools and explicitly instruct it to use only one of them. With multiple tool calling in place, we can go on to implement "memory" access without worrying about normal tool calls getting affected.

Last but not least, imho constraining the model to 1 tool per round is a weak form of scaffolding. Given how we're trying to build for the future, giving the model the chance to use multiple tools at a go seems to be where we could potentially find scale (breadth of analyses per round), besides the depth we are already actively exploring. As models learn to schedule and work concurrently better, multiple / concurrent tool use is only going to get more reliable and efficient. While models might not be there yet, having the code ready to test when they are would be helpful.

Comment on lines +768 to +783
# Append the tool calls as an assistant response
request_params["messages"].append(
    {
        "role": "assistant",
        "tool_calls": message.tool_calls,
    }
)

# Append error message to request_params and retry
request_params["messages"].append(
    {
        "role": "assistant",
        "content": str(e),
    }
)
# Append the tool result
request_params["messages"].append(
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result),
    }
)
@wongjingping (Collaborator, Author)

Question 1
For reference, this is the old code you shared:

# Append the tool calls as an assistant response
request_params["messages"].append(
    {
        "role": "assistant",
        "tool_calls": message.tool_calls,
    }
)

# Append the tool message
request_params["messages"].append(
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result),
    }
)

The only change here is that the role in the 2nd message went from assistant to tool, based on OpenAI's documentation. From my tests, the responses we were getting before/after the change weren't significantly different.

@rishsriv (Member)

I think it's the same thing JP :) I think you're looking at just a specific part where we were logging errors – which was useful for self-correction.

@rishsriv (Member) commented Mar 3, 2025

Thanks for the note JP! Closing this for the following reasons:

giving the model the chance to use multiple tools at a go seems to be where we could potentially find scale (breadth of analyses per round), besides depth which we are already actively exploring.

This does not just let the model use multiple tools at a go – it also lets it use the same tool multiple times in the same call. The problem with this is that parallel tool calls are not context dependent. That is, they do not know the results of the previous tool calls that were made.

For the kinds of tasks we are using this library for (data analysis), context awareness between tool calls is critical (and is what makes the analyses in the Oracle so good). The only advantage that parallel tool calls give us is better latency and token efficiency – but at the expense of potentially high quality analyses.

We have experimented with parallel tool calls previously and they resulted in significantly worse analyses for complex problems (in fact, that was the first iteration – because OAI does parallel tool calls by default). So closing this for now.

@rishsriv rishsriv closed this Mar 3, 2025