
tool-call: fix DeepSeek R1 Qwen distills #11607

Open: wants to merge 55 commits into master
Conversation


@ochafik ochafik commented Feb 3, 2025

Fixes tool-call support for DeepSeek-R1-Distill-Qwen-7B & 32B (follow-up to #9639), and adds `<think>thoughts</think>` parsing.
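The `<think>…</think>` handling can be sketched roughly as follows. This is a simplified Python illustration of the idea, not the actual parser in this PR (which also has to cope with streaming and grammar-constrained output):

```python
import re

def split_reasoning(text: str):
    """Split a leading <think>...</think> block from a completion.

    Returns (thoughts, content); thoughts is None when no think block is present.
    """
    m = re.match(r"\s*<think>(.*?)</think>\s*", text, re.DOTALL)
    if m is None:
        return None, text
    return m.group(1).strip(), text[m.end():]

thoughts, content = split_reasoning("<think>Need the weather tool.</think>It is sunny.")
```

In this sketch `thoughts` would feed the new API field and `content` the regular message content.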

(Split off Minja changes in #11641, will declutter diff once merged)

  • Had to work around the official template:
    • It doesn't describe the available tools, and the backfill done by Minja wasn't phrased well enough (for the 7B model), so I've added autogenerated tool call examples to minja's revamped "polyfill" behaviour (using a delta template eval).
      sync: minja #11641
    • It ignores message.tool_calls if message.content is not null; updated / tested the server output accordingly (better OpenAI compliance)
    • After a tool result, it leaves the prompt hanging on a <|tool▁outputs▁end|> instead of ending with <|end▁of▁sentence|><|Assistant|>.
      • Hacked a workaround so the default template now works well with this branch
      • Added / documented better template (models/templates/llama-cpp-deepseek-r1.jinja)
  • Both the 7B & 32B models seem to take liberties with their tool call start tag, so I'm accepting variations of the syntax (which then triggers the lazy grammar / full compliance)
  • I've added a "thoughts" field for <think> content to the API similar to the "tool_plan" output of tool-call: support Command R7B (+ return tool_plan "thoughts" in API) #11585
    • Note: thoughts from previous messages are explicitly stripped from the prompt by the template, and we don't try to force them back in (the template wouldn't render tool calls if there was any content anyway, and while stripping thoughts locally resets the KV cache, losing the hot tokens of the tool calls, it saves a lot of tokens over the course of a long chat)
  • Added the Q4_K_M quant to some (but not all) slow server tests (had to tell it not to overthink, which is a bit... ironic).
  • Added slow tool result server tests (checking models make some use of tool call results, which some struggle a bit with)
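The lenient start-tag matching mentioned above could look something like this sketch. The variant spellings accepted here (ASCII underscores or spaces instead of the "▁" glyphs, optional pipes, singular "call") are illustrative assumptions, not the exact set this PR's grammar accepts:

```python
import re

# Hypothetical lenient pattern for a DeepSeek-style tool-call opening marker.
TOOL_CALL_START = re.compile(r"<(?:\||｜)?tool[▁_ ]?calls?[▁_ ]?begin(?:\||｜)?>")

def finds_tool_call(text: str) -> bool:
    """Return True if a (possibly malformed) tool-call opening tag appears."""
    return TOOL_CALL_START.search(text) is not None
```

Once such a tag is seen, the lazy grammar can kick in and constrain the rest of the call to the strict syntax.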
```shell
# Please try with and w/o the chat template override and report back on your results :-)
llama-server --jinja -fa -hf bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M \
  --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
llama-server --jinja -fa -hf bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L \
  --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
```

TODOs:

Possible follow ups

@github-actions github-actions bot added testing Everything test related examples python python script changes server labels Feb 3, 2025
@ochafik ochafik mentioned this pull request Feb 4, 2025
@ochafik ochafik changed the title tool-call: fix DeepSeek R1 Qwen distill (WIP) tool-call: fix DeepSeek R1 Qwen distill Feb 4, 2025
@ochafik ochafik marked this pull request as ready for review February 4, 2025 04:57
@ochafik ochafik requested a review from ngxson as a code owner February 4, 2025 04:57
@ochafik ochafik changed the title tool-call: fix DeepSeek R1 Qwen distill tool-call: fix DeepSeek R1 Qwen distills Feb 4, 2025