feat: added support for Mistral models in Pytorch workflow and HF quantization script #3843


Closed

Conversation

@hypdeb hypdeb commented Apr 24, 2025

No description provided.

@hypdeb hypdeb requested a review from FrankD412 April 24, 2025 15:48
@hypdeb hypdeb self-assigned this Apr 24, 2025
hypdeb commented Apr 24, 2025

Hello @litaotju, do you know who is the best person to review modelling changes in the PyTorch workflow?


## Quantizing from the HuggingFace format to the HuggingFace format
It is useful to be able to quantize HuggingFace models without changing their format, for example if you plan to use them in TensorRT-LLM's PyTorch-based workflow. The `quantize_hf_to_hf.py` script serves that purpose. It is a reduced version of ModelOpt's [example post-training quantization script](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/hf_ptq.py); please refer to the original for a more up-to-date version. For example, it can be used to quantize a model to `fp8` for tensor-parallelism 4 (the flags below are illustrative; check the script's `--help` for the actual argument names):
```
# Illustrative invocation; the flag names are assumptions modeled on
# ModelOpt's hf_ptq.py example and are not verified against this script.
python quantize_hf_to_hf.py \
    --model_dir <path/to/hf/model> \
    --qformat fp8 \
    --tp_size 4 \
    --output_dir <path/to/quantized/output>
```
Collaborator

Hi @hypdeb, I don't think it's a good idea to copy scripts from ModelOpt to TRT-LLM. Can we add links to ModelOpt instead, like this one: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch.md#quantization

Collaborator Author

I agree that copying sources is generally not a good idea. I did it here because this is an example usage script: it is essentially meant to be copied and adjusted by users, or at least that's how I see it. What do you think?

Collaborator

Could you please create a PR for the Mistral models only? We can merge that first. As for the HF quantization script, I think the user workflow is still under debate.

Collaborator Author

Here is the MR with only the added Mistral modelling code: #3845

Collaborator Author

> For the HF quantization script, I think it's still under debate about the user workflow.

Could you maybe CC me on these discussions or ping me on Slack about this? I am interested because, in the long term, I would like to add a more lightweight, TRT-LLM-independent quantization flow to my automation.

@hypdeb hypdeb closed this Apr 26, 2025