
Conversation

@hypdeb (Collaborator) commented Apr 9, 2025

  • Added an output-format argument to prepare_dataset.py, allowing it to generate datasets compatible with trtllm-bench without having to go through stdout (see the sketch after this list)
  • Added explicit support for Mistral models in the PyTorch modeling code, since their configurations differ slightly from those of the Llama models (even though the architectures are largely similar)
  • Added an example script showing how to quantize a HuggingFace checkpoint into a quantized HuggingFace checkpoint for use in the PyTorch workflow
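
A minimal sketch, assuming an argparse-based CLI, of what the new argument could look like in prepare_dataset.py. Only the flag name and the gptManagerBenchmark default are confirmed by this PR; the trtllm-bench choice value, the --output-file flag, and the help text are illustrative assumptions, not the actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the dataset-generation CLI, not the actual script."""
    parser = argparse.ArgumentParser(description="Synthetic dataset generation")
    parser.add_argument(
        "--output-format",
        # The default preserves the script's current behaviour.
        choices=["gptManagerBenchmark", "trtllm-bench"],  # second value assumed
        default="gptManagerBenchmark",
        help="Which benchmark tool the emitted dataset should target.",
    )
    parser.add_argument(
        "--output-file",  # assumed flag: write to a file instead of stdout
        default=None,
        help="Optional output path; stdout is used when omitted.",
    )
    return parser
```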

hypdeb self-assigned this on Apr 9, 2025
@hypdeb (Collaborator, Author) commented Apr 24, 2025

/bot run

@hypdeb (Collaborator, Author) commented Apr 24, 2025

@kaiyux @FrankD412 I would like to reopen the discussion on the output format of the synthetic dataset generation scripts.

Personally, I think it makes sense for the script to be able to generate different output formats. This set of changes adds support for the trtllm-bench format without breaking current behaviour, since the default value of the new output-format argument is gptManagerBenchmark, the format the script generates today.
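
As a tiny self-contained illustration of the compatibility claim, omitting the flag leaves the parsed value at the confirmed default (the flag name comes from the PR description; everything else here is a sketch):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output-format", default="gptManagerBenchmark")

# Callers that pass no flag get the same format as before this change.
args = parser.parse_args([])
assert args.output_format == "gptManagerBenchmark"
```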

hypdeb changed the title from "DRAFT: gathering changes to allow for smoother benchmarking in some scenarios" to "feat: output-format arg for synthetic data generation script and improved support for Mistral models in the PyTorch workflow" on Apr 24, 2025
@hypdeb (Collaborator, Author) commented Apr 24, 2025

That being said, of course, it would be preferable if, in the future, tools that ingest datasets could conform to an existing format.

@tensorrt-cicd (Collaborator)

PR_Github #3302 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #3302 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2300 completed with status: 'FAILURE'

@FrankD412 (Collaborator) commented Apr 24, 2025

> That being said, of course, it would be preferable if, in the future, tools that ingest datasets could conform to an existing format.

I agree. The whole reason I introduced the second format was that gptManagerBenchmark was limited to input ids: being a C++ tool, it didn't have access to the transformers library from Python. When I started trtllm-bench, an initial goal was to allow it to take either a prompt or input ids, and to accept requests from stdin. The existing JSON didn't lend itself to the latter, but that requirement ended up becoming less important (though it might be coming back up again).
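
To make the distinction concrete, here is a hedged sketch of the two record shapes under discussion; the field names are illustrative assumptions, not the exact schemas of either tool:

```python
import json

# Token-id-only record: all a C++ consumer without a tokenizer can use
# (the gptManagerBenchmark-style limitation described above; fields assumed).
ids_record = {"input_ids": [1, 5124, 319, 278], "output_len": 128}

# Text-based record: trtllm-bench's goal of accepting a prompt directly
# (fields assumed).
text_record = {"prompt": "Hello, world", "output_len": 128}

# Newline-delimited JSON, one request per line, is the kind of layout
# that lends itself to streaming requests over stdin.
for record in (ids_record, text_record):
    print(json.dumps(record))
```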

Also, maybe this is personal preference, but this PR now addresses two unrelated things, which makes it harder to roll back or track changes in trtllm-bench. Mind if we split these sorts of things up in the future?

@hypdeb (Collaborator, Author) commented Apr 24, 2025

I will split the Mistral stuff out.

@hypdeb (Collaborator, Author) commented Apr 24, 2025

The Mistral changes are here: #3843. I'll rebase this one on top of it once merged, which should leave only the dataset changes.

@hypdeb (Collaborator, Author) commented Apr 25, 2025

Moved the dataset generation changes to #3866 and expanded them. Closing this one, as all changes are now represented on other branches.

hypdeb closed this on Apr 25, 2025