Make Mistral community chat templates optional #15420
Conversation
We cannot ask llama.cpp users to use a different tokenizer and chat template formatting tool for every model. Consider putting this effort into improving the community template instead.
Absolutely agree, releasing GGUFs that do not work unless you use … Bare minimum is that …
Mistral employee and open-source contributor of many years here. Just want to chime in to give some more context. The reasons for this PR are as follows:

1.) Freedom for the user to pick

2.) Correctness

It's totally normal that mistakes are made during releases, but chat templates are arguably hard to test and very often need to be corrected a couple of days later. A well-tested library with Pydantic objects is safer here. For example, right now the v13 chat template defaults to an unsloth devstral version (here). This community chat template is a) incorrect for our latest reasoning model (here), as we need think chunks for reasoning; the chat template would silently be loaded and give silent errors, and b) it'll be wrong for future v13 releases where we might want to add more features. If llama.cpp wants to force every model provider to use chat templates, it'll be hard for us to release ggml weights, and users will have to rely on untested community versions which will often only be corrected a couple of days after the release.

3.) Dependency injection

4.) Flexibility

Very curious to hear your thoughts on this, and a gentle ask to give this PR a second look. As said in the beginning, we would like to give users two options instead of one, not replace chat templates.
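For context on what the mistral-common path looks like in practice, here is a minimal sketch based on the library's documented usage (the tokenizer version and message content are illustrative placeholders): chat requests are built as validated Pydantic objects and encoded by a versioned tokenizer rather than rendered through a Jinja template.

```python
# Minimal sketch of the mistral-common flow, per the library's documented usage;
# the tokenizer version and message content below are illustrative placeholders.
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load a versioned, tested tokenizer (v3 used here as an example).
tokenizer = MistralTokenizer.v3()

# The request is a validated Pydantic object, not a string rendered by Jinja.
request = ChatCompletionRequest(messages=[UserMessage(content="Hello, who are you?")])

tokenized = tokenizer.encode_chat_completion(request)
print(tokenized.text)    # the formatted prompt
print(tokenized.tokens)  # the token ids to feed to the model
```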
The problem is that HF will be full of GGUFs long before a chat template exists this way, and they will simply not work without …
..and the Mistral chat format is at what version now? :)
Of course not, support for chat templates is in fact quite new. However, for ease of use, we should be able to require an implementation of the chat format before conversion of any new model that does not have a chat template.
Hi @patrickvonplaten, I will try to address two main points of your comment.
This PR does not give anybody more freedom or flexibility. What this PR does is remove the community template from the GGUF file unless an obscure flag that nobody is going to know about is used. By doing so, you are leaving users without the option to use either the community or the llama.cpp template. Users always have the option of submitting and receiving tokens for inference with llama.cpp, without involving the tokenizer or the chat formatter, regardless of whether the GGUF file contains a chat template. Likewise, users have the option of using OpenAI Harmony if they wish, but they also have the built-in chat templates if they do not want to deal with that.
Correctness and ease of implementation cannot come at the expense of usability. I think there is a fundamental misunderstanding about the way people use llama.cpp and the users that it serves. For most of our users, telling them that they need to create some Python program to be able to use llama.cpp with this model is effectively the same as telling them that they cannot use llama.cpp with this model. I don't think this is beneficial to either the llama.cpp community or Mistral.
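To make the two options described above concrete, here is a rough sketch against llama-server's HTTP API (endpoint and field names as documented for the server; the address, token ids, and sampling settings are placeholders): the OpenAI-compatible endpoint uses the GGUF's embedded chat template, while the raw completion endpoint accepts pre-tokenized input, so an external formatter such as mistral-common can be used without any template in the file.

```python
# Rough sketch of the two inference paths against llama-server's HTTP API;
# the server address, token ids, and sampling settings are placeholders.
import requests

SERVER = "http://localhost:8080"  # assumed llama-server address

# Option 1: let the server apply the GGUF's embedded chat template via the
# OpenAI-compatible chat endpoint.
chat = requests.post(f"{SERVER}/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "Hello!"}],
}).json()

# Option 2: bypass the template entirely by sending token ids produced by an
# external formatter (e.g. mistral-common) to the raw completion endpoint,
# which accepts an array of token ids as the prompt.
tokens = [1, 3, 1001, 2]  # placeholder ids from an external tokenizer
raw = requests.post(f"{SERVER}/completion", json={
    "prompt": tokens,
    "n_predict": 64,
}).json()
```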
Answering @CISC
To give you some context here: versioning for our tokenizers is not due to errors in previous formats but to how we handle tokenization across releases. The process is not the same for every model (whether to use system prompts or not, for example). This is why, depending on the model, we instantiate v1, v3, v7, v11, or v13! This does not mean v1 was fixed with v2; it's just that our models handle chat completion requests differently. The nice thing about it is that every version is tested, which is not the case for chat templates.
Answering @slaren
It's not entirely true. This happens only for the Mistral format, which is used if users already passed the …
We're open to suggestions on how we could improve this REST API and its usage. However, as you rightfully explained before, users already had the possibility of not using llama.cpp's tokenizers, so we'd like to make use of it.
Additional points
This PR does not want to get rid of chat templates; we just want users to be informed of the risks and to be able to convert their models even if the chat template does not already exist. If it helps, we could also change the behavior by requiring the user to explicitly declare that they do not want a chat template rather than that they want one. This way, the error raised when the chat template does not exist can be skipped.
Sorry, I was just being snarky. :) The thing is, though, that the constant revision of your chat format (quite often by just changing pre/post-pended spaces/newlines) is the main reason community chat templates have required so many fixes, hence the snark.
At the moment, you cannot create a GGUF file without a chat template - even if you want to convert it from a mistral-common tokenizer - or maybe I'm misunderstanding something?
Ok, I understand. That's a good point, and if that's the vision of llama.cpp then that makes sense to me. mistral_common + llama.cpp server is indeed less user-friendly. Just note that this is not necessarily beneficial for us if it comes at the expense of correctness. We value correctness clearly more than user-friendliness. => So if you feel very strongly about using chat templates, then let's maybe just revert the default and add a …
Hmm, ok, not sure what point you're trying to make here 😅 When we go from v1 -> v2, it's also not really a bug fix; it's just that a new, different capability is added. Also, just a "v13" mistral tokenizer type doesn't fully capture the full capabilities of the tokenizer. You can have a "v13" tokenizer with different configurations: https://github.com/mistralai/mistral-common/blob/38ab6d4b15e67a8d284c1d98253f989e52819320/src/mistral_common/tokens/tokenizers/tekken.py#L78 for image, fim, audio, .... But anyway, probably not necessary to fight over this. If you don't want it then that's it - you're the maintainers in the end. If we could have a …
For my part, an opt-in option to skip the chat template would be acceptable. It doesn't even have to be specific to Mistral; it can be a generic option to disable exporting of the chat template, although I understand that may require more changes, and it is not strictly necessary.
It doesn't have to be a jinja chat template. The reason we force something to be defined with the …
Just that the "problem" with jinja chat templates is not as systemic as you may think. I realize that the variations mainly come down to tokenization differences, but you sort of set yourself up for the current situation by not addressing this early on, leaving the community to pick up the slack.
Let's call it friendly banter. :)
I'm fine with that; it will at least prevent broken drive-by GGUFs.
I updated the PR accordingly to keep chat templates by default but allow disabling them with an arg.
ggml-org#15420:
* Make Mistral community chat templates optional
* Change the flag arg to disable instead of enable community chat templates
* Improve error message
* Improve help message
* Tone down the logger messages
Thanks.
Unrelated: the dependencies changed (for simplicity) in #14737; you have to update your venv from …
I understand, but I never had to …
Just lucky then, …
Guess Mistral wanted to spoil the fun. I spin up …
Hi, we'd like to make the community chat templates optional for the Mistral format, following its support added in #14737.
This is mainly due to two issues:
… mistral-common to handle tokenization and detokenization.
As jinja templates can be overridden by llama.cpp when serving, this doesn't prevent users from either converting again when a template is released or simply passing the template to the CLI.
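As a rough illustration of the override path mentioned above, assuming llama-server is started with a custom template, e.g. `llama-server -m model.gguf --jinja --chat-template-file my_template.jinja` (flag names as currently documented for the server), and that the /apply-template endpoint is available, the rendered prompt can be previewed directly; the address and messages below are placeholders.

```python
# Rough sketch, assuming llama-server was launched with an overriding template
# (e.g. --jinja --chat-template-file my_template.jinja) and that the
# /apply-template endpoint is available; the address and messages are placeholders.
import requests

resp = requests.post("http://localhost:8080/apply-template", json={
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
})
print(resp.json()["prompt"])  # the prompt as rendered by the overriding template
```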
Happy to hear your thoughts about it and to address the different issues you might have.