
Make Mistral community chat templates optional #15420


Merged: 5 commits into ggml-org:master, Aug 21, 2025

Conversation

juliendenize
Contributor

Hi, we'd like to make the community chat templates optional for the Mistral format, whose support was added in #14737.

This is mainly due to two issues:

  1. Community chat templates can contain errors in the first days/weeks after a release. Making this clear to users and proposing an alternative would be a good addition. Indeed, at Mistral we officially recommend using mistral-common to handle tokenization and detokenization (see the sketch after this list).
  2. For future releases, the tokenization version might not be supported by llama.cpp on day 0. Making the chat templates optional allows the community to convert the models without hitting an error.
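
For concreteness, here is a minimal sketch of the tokenization flow we recommend, assuming a recent mistral_common release (the v3 tokenizer is chosen purely for illustration):

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Pick the tokenizer matching the target model (v3 here, just as an example).
tokenizer = MistralTokenizer.v3()

# Build an OpenAI-style request and let mistral-common produce the token ids.
request = ChatCompletionRequest(messages=[UserMessage(content="Hello, how are you?")])
tokenized = tokenizer.encode_chat_completion(request)

print(tokenized.tokens)  # token ids, ready to be fed to the model
print(tokenized.text)    # debug view of the encoded prompt
```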

As jinja templates can be overridden by llama.cpp when serving, this doesn't prevent users from either converting again once a template is released or simply passing the template to the CLI.

Happy to hear your thoughts about it and to address any issues you might have.

@github-actions bot added the python (python script changes) label, Aug 19, 2025
@slaren (Member) left a comment


We cannot ask llama.cpp users to use a different tokenizer and chat template formatting tool for every model. Consider putting this effort into improving the community template instead.

@CISC
Collaborator

CISC commented Aug 19, 2025

We cannot ask llama.cpp users to use a different tokenizer and chat template formatting tool for every model. Consider putting this effort into improving the community template instead.

Absolutely agree, releasing GGUFs that do not work unless you use mistral_common is not acceptable.

The bare minimum is that llama-chat.cpp is updated with support for the new chat format.

@patrickvonplaten

patrickvonplaten commented Aug 19, 2025

  • Community chat templates can contain errors in the first days/weeks after a release. Making this clear to users and proposing an alternative would be a good addition. Indeed, at Mistral we officially recommend using mistral-common to handle tokenization and detokenization.

Hey @slaren, @CISC,

Mistral employee and long-time open-source contributor here.

Just want to chime in here to give some more context. The reasons for this PR are as follows:

1.) Freedom for the user to pick
The way I understand open source is to give users the freedom to decide which format to use. In this PR we don't force users to use the Mistral format; we just give them the opportunity to do so. Before this PR it would have been impossible to convert a Mistral model without using the chat template. With this PR one can use both community chat templates and the "mistral-certified" format. This PR doesn't delete community templates for Mistral models; it just gives the user the possibility to decide between the two. Users valuing correctness in a server setup will prefer a mistral-common server + ggml model server. Users who want to rely on a single file will prefer community chat templates.

2.) Correctness
As outlined by @ggerganov here: #14737 (comment) and by @juliendenize above, "ensured correctness" is the main reason why someone would want to use mistral_common. New Mistral models will simply not have chat template support out of the box, as we don't have the time to do the conversion. Arguably, the format is also quite error-prone. Here are some links showcasing how often new model releases need chat templating fixes:

It's totally normal that mistakes are made during releases, but arguably chat templates are hard to test and very often need to be corrected a couple of days later. A well-tested library with Pydantic objects is safer here.

For example, right now the v13 chat template defaults to an unsloth Devstral version (here). This community chat template is a) incorrect for our latest reasoning model (here), as we need think chunks for reasoning; the chat template would be silently loaded and give silent errors; and b) wrong for future v13 releases where we might want to add more features.

If llama.cpp wants to force every model provider to use chat templates, it'll be hard for us to release ggml weights, and users will have to rely on untested community versions which will often only be corrected a couple of days after the release.

3.) Dependency injection
If we can inject chat templating logic via mistral_common, we would not need to open a PR to llama.cpp for each release, and new models would be supported out of the box just by updating mistral_common. Also, you would have zero maintenance to do for mistral_common. In addition, it would make it easier for the community to create correct chat templates, because they would always have a working and correct reference implementation in mistral_common. So it should be quite easy to make the correct chat template with unit testing.

4.) Flexibility
I would also make the case that it's good to provide flexibility for the user to just send tokens and receive tokens back for inference (a rough sketch of this flow follows below). This is done in both vLLM and transformers (which have both supported mistral_common for some time).
OpenAI recently released Harmony (https://github.com/openai/harmony), which is a replacement for chat templates as well. Forcing the community to stick to chat templates restricts use cases IMO.
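
For reference, a rough sketch of that tokens-in workflow, assuming a llama-server instance already running on localhost:8080 and that its /completion endpoint accepts a token array as the prompt (so no server-side template is applied):

```python
import requests

from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# 1. Build the prompt tokens outside llama.cpp with mistral-common.
tokenizer = MistralTokenizer.v3()  # version chosen for illustration only
tokens = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Write a haiku about GGUF.")])
).tokens

# 2. Send the raw token ids to llama-server; passing tokens bypasses any chat template.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": tokens, "n_predict": 128},
)
print(resp.json()["content"])
```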

Very curious to hear your thoughts on this, and a gentle ask to give this PR a second look. As said at the beginning, we would like to give users two options instead of one, not replace chat templates.

@CISC
Collaborator

CISC commented Aug 19, 2025

In this PR we don't force users to use the Mistral format; we just give them the opportunity to do so. Before this PR it would have been impossible to convert a Mistral model without using the chat template. With this PR one can use both community chat templates and the "mistral-certified" format.

The problem is that HF will be full of GGUFs long before a chat template exists this way, and they will simply not work without mistral_common, and I guarantee you that nary a single user will understand why.

New Mistral models will simply not have chat template support out of the box, as we don't have the time to do the conversion. Arguably, the format is also quite error-prone. Here are some links showcasing how often new model releases need chat templating fixes:

..and the Mistral chat format is at what version now? :)

If llama.cpp wants to force every model provider to use chat templates, it'll be hard for us to release ggml weights, and users will have to rely on untested community versions which will often only be corrected a couple of days after the release.

Of course not; support for chat templates is in fact quite new. However, for ease of use, we should be able to require an implementation of the chat format before conversion of any new model that does not have a chat template.

@slaren
Member

slaren commented Aug 19, 2025

Hi @patrickvonplaten, I will try to address two main points of your comment.

  1. Freedom of the user to pick and flexibility

This PR does not give anybody more freedom or flexibility. What this PR does is remove the community template from the GGUF file, unless an obscure flag that nobody is going to know about is used. By doing so, you are leaving users without the option to use the community or the llama.cpp template. Users always have the option of submitting and receiving tokens for inference with llama.cpp, without involving the tokenizer or the chat formatter, regardless of whether the GGUF file contains a chat template or not. Likewise, users have the option of using OpenAI Harmony if they wish, but they also have the built-in chat templates if they do not want to deal with this.

  2. Correctness and ease of supporting new models

Correctness and ease of implementation cannot come at the expense of usability. I think there is a fundamental misunderstanding about the way people use llama.cpp and the users that it serves. For most of our users, telling them that they need to create some Python program to be able to use llama.cpp with this model is effectively the same as telling them that they cannot use llama.cpp with this model. I don't think this is beneficial to either the llama.cpp community or Mistral.

@juliendenize
Contributor Author

Answering @CISC

..and the Mistral chat format is at what version now? :)

To give you some context here, the versioning of our tokenizers is not due to errors in previous formats but to how we handle tokenization across releases. The process is not the same for every model (whether to use system prompts or not, for example). This is why, depending on the model, we instantiate v1, v3, v7, v11, or v13! This does not mean v1 was fixed by v2; it's just that our models handle chat completion requests differently.

The nice thing about it is that every version is tested, which is not the case for chat templates.
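
To illustrate, a small sketch of how the different versions are picked (a rough sketch; exact constructor names and arguments may differ slightly across mistral_common releases):

```python
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Each released format has its own constructor; which one applies depends on the model.
tok_v1 = MistralTokenizer.v1()                       # early instruct models
tok_v3 = MistralTokenizer.v3()                       # later SentencePiece-based models
tok_v3_tekken = MistralTokenizer.v3(is_tekken=True)  # Tekken-based models

# Or load exactly what a given checkpoint ships with, so the version is never guessed:
tok = MistralTokenizer.from_file("tekken.json")
```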

Answering @slaren

This PR does not give anybody more freedom or flexibility. What this PR does is remove the community template from the GGUF file, unless an obscure flag that nobody is going to know about is used.

It's not entirely true. This happens only for the Mistral format, which is used only if users passed the --mistral-format flag. This is something I did intentionally so as not to impact users of the HF format, for example.

For most of our users, telling them that they need to create some Python program to be able to use llama.cpp with this model is effectively the same as telling them that they cannot use llama.cpp with this model. I don't think this is beneficial to either the llama.cpp community or Mistral.

We're open to suggestions here on how we could improve this REST API and its usage. However, as you rightfully explained before, users already had the possibility of not using llama.cpp's tokenizers, so we'd like to make use of it.

Additional points

This PR does not aim to get rid of chat templates; we just want users to be informed of the risks and to be able to convert their models if the chat template does not exist yet.

If it helps, we could also change the behavior by requiring the user to explicitly declare that they do not want a chat template, instead of declaring that they want one. This way, the error raised when the chat template does not exist can be skipped.

@CISC
Collaborator

CISC commented Aug 19, 2025

..and the Mistral chat format is at what version now? :)

To give you some context here, the versioning of our tokenizers is not due to errors in previous formats but to how we handle tokenization across releases. The process is not the same for every model (whether to use system prompts or not, for example). This is why, depending on the model, we instantiate v1, v3, v7, v11, or v13! This does not mean v1 was fixed by v2; it's just that our models handle chat completion requests differently.

Sorry, I was just being snarky. :)

The thing is though that the constant revisioning of your chat format (quite often by just changing pre/post-pending spaces/newlines) is the main reason community chat templates have required so many fixes, hence the snark.

@patrickvonplaten

patrickvonplaten commented Aug 19, 2025

This PR does not give anybody more freedom or flexibility. What this PR does is remove the community template from the GGUF file, unless an obscure flag that nobody is going to know about is used. By doing so, you are leaving users without the option to use the community or the llama.cpp template. Users always have the option of submitting and receiving tokens for inference with llama.cpp, without involving the tokenizer or the chat formatter, regardless of whether the GGUF file contains a chat template or not. Likewise, users have the option of using OpenAI Harmony if they wish, but they also have the built-in chat templates if they do not want to deal with this.

At the moment, you cannot create a GGUF file without a chat template, even if you want to convert it from a mistral-common tokenizer. Or maybe I'm misunderstanding something?

Correctness and ease of supporting new models
Correctness and ease of implementation cannot come at the expense of usability. I think there is a fundamental misunderstanding about the way people use llama.cpp and the users that it serves. For most of our users, telling them that they need to create some Python program to be able to use llama.cpp with this model is effectively the same as telling them that they cannot use llama.cpp with this model. I don't think this is beneficial to either the llama.cpp community or Mistral.

OK, I understand. That's a good point, and if that's the vision of llama.cpp then that makes sense to me. mistral_common + llama.cpp server is indeed less user-friendly. Just note that this is not necessarily beneficial for us if it comes at the expense of correctness; we clearly value correctness more than user-friendliness.

=> So if you feel very strongly about using chat templates, then let's maybe just revert the default and add a disable_mistral_community_chat_template flag instead of use_mistral_community_chat_template, so that we can still release GGUF checkpoints without chat templates?

The thing is though that the constant revisioning of your chat format (quite often by just changing pre/post-pending spaces/newlines) is the main reason community chat templates have required so many fixes, hence the snark.

Hmm ok not sure what point you're trying to make here 😅
From our point of view we always work with Pydantic classes (like the OAI chat format) that map directly to token IDs (list[int]). We don't think about whitespace/newlines etc.; this is an idiosyncrasy of how our models have been translated by the community into chat templates that only work in string format. We're sorry if this was confusing for people, but it might also say something about chat templates in string format maybe not being the right representation of Mistral tokenizers?

When we go from v1 to v2, it's also not really a bug fix; it's just that a new, different capability is added. Also, a "v13" Mistral tokenizer type alone doesn't capture the full capabilities of the tokenizer. You can have a "v13" tokenizer with different configurations: https://github.com/mistralai/mistral-common/blob/38ab6d4b15e67a8d284c1d98253f989e52819320/src/mistral_common/tokens/tokenizers/tekken.py#L78 for image, FIM, audio, ...

But anyway, there's probably no need to fight here. If you don't want it, then that's it; you're the maintainers in the end. If we could have a disable_mistral_community_chat_template: bool = False CLI arg that we can set to True, then I believe everyone in the community can use chat templates by default and we can also release a reference GGUF without a chat template? Would that be OK?

@slaren
Member

slaren commented Aug 19, 2025

If it helps, we could also change the behavior by requiring the user to explicitly declare that they do not want a chat template, instead of declaring that they want one. This way, the error raised when the chat template does not exist can be skipped.

=> So if you feel very strongly about using chat templates, then let's maybe just revert the default and add a disable_mistral_community_chat_template flag instead of use_mistral_community_chat_template, so that we can still release GGUF checkpoints without chat templates?

For my part, an opt-in option to skip the chat template would be acceptable. It doesn't even have to be specific to Mistral; it can be a generic option to disable exporting of the chat template, although I understand that may require more changes, and it is not strictly necessary.

@CISC
Collaborator

CISC commented Aug 19, 2025

At the moment, you cannot create a GGUF file without a chat template, even if you want to convert it from a mistral-common tokenizer. Or maybe I'm misunderstanding something?

It doesn't have to be a jinja chat template. The reason we force something to be defined with the --mistral-format option is that the built-in chat format detection depends on there being something in the chat template metadata to detect what format to use when not using --jinja.
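
For anyone following along, one way to inspect what a converted file actually carries is to read the tokenizer.chat_template metadata with the gguf Python package from gguf-py; a rough sketch (the field decoding follows my understanding of GGUFReader and may need adjusting):

```python
from gguf import GGUFReader  # the Python package under llama.cpp's gguf-py

reader = GGUFReader("model.gguf")
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("No chat template stored in this GGUF.")
else:
    # For a string field, data points at the part holding the raw value bytes.
    template = field.parts[field.data[0]].tobytes().decode("utf-8")
    print(template[:200])
```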

The thing is though that the constant revisioning of your chat format (quite often by just changing pre/post-pending spaces/newlines) is the main reason community chat templates have required so many fixes, hence the snark.

Hmm ok not sure what point you're trying to make here 😅

Just that the "problem" with jinja chat templates is not as systemic as you may think. I realize that the variations mainly come down to tokenization differences, but you sort of set yourself up for the current situation by not addressing this early on, leaving the community to pick up the slack.

But anyway, there's probably no need to fight here.

Let's call it friendly banter. :)

If you don't want it, then that's it; you're the maintainers in the end. If we could have a disable_mistral_community_chat_template: bool = False CLI arg that we can set to True, then I believe everyone in the community can use chat templates by default and we can also release a reference GGUF without a chat template? Would that be OK?

I'm fine with that; it will at least prevent broken drive-by GGUFs.

@juliendenize requested a review from slaren, August 19, 2025 15:33
@juliendenize
Contributor Author

I updated the PR accordingly to keep chat templates by default but allow disabling them with an arg.

@ggerganov requested a review from CISC, August 21, 2025 07:55
@CISC merged commit b2caf67 into ggml-org:master, Aug 21, 2025
6 checks passed
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 22, 2025
Make Mistral community chat templates optional (ggml-org#15420)

* Make Mistral community chat templates optional
* Change the flag arg to disable instead of enable community chat templates
* Improve error message
* Improve help message
* Tone down the logger messages
@TheLocalDrummer

Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert_hf_to_gguf.py", line 32, in <module>
    from mistral_common.tokens.tokenizers.base import TokenizerVersion
ModuleNotFoundError: No module named 'mistral_common'

Thanks.

@CISC
Collaborator

CISC commented Aug 23, 2025

Traceback (most recent call last):
  File "/workspace/./llama.cpp/convert_hf_to_gguf.py", line 32, in <module>
    from mistral_common.tokens.tokenizers.base import TokenizerVersion
ModuleNotFoundError: No module named 'mistral_common'

Thanks.

Unrelated: the dependencies changed (for simplicity) in #14737; you have to update your venv from requirements/requirements-convert_hf_to_gguf.txt.

@TheLocalDrummer

Unrelated: the dependencies changed (for simplicity) in #14737; you have to update your venv from requirements/requirements-convert_hf_to_gguf.txt.

I understand, but I never had to pip install anything after git clone until now.

@CISC
Collaborator

CISC commented Aug 23, 2025

I understand, but I never had to pip install anything after git clone until now.

Just lucky then, convert_hf_to_gguf.py does have dependencies. :)

@TheLocalDrummer

I understand, but I never had to pip install anything after git clone until now.

Just lucky then, convert_hf_to_gguf.py does have dependencies. :)

Guess Mistral wanted to spoil the fun. I spin up runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 all the time.
