
Conversation

@hypdeb (Collaborator) commented Apr 9, 2025

  • Added an output-format argument to prepare_dataset.py, allowing it to generate datasets compatible with trtllm-bench without having to go through stdout (see the sketch after this list)
  • Added explicit support for Mistral models in the PyTorch modeling code, since their configurations differ slightly from those of the Llama models (even though the architectures are largely similar)
  • Added an example script showing how to quantize a HuggingFace checkpoint into a quantized HuggingFace checkpoint for use in the PyTorch workflow
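
A minimal sketch, assuming an argparse-based CLI, of what the new argument could look like in prepare_dataset.py. Only the flag name and the gptManagerBenchmark default are confirmed by this PR; the trtllm-bench choice value, the --output-file flag, and the help text are illustrative assumptions, not the actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the dataset-generation CLI, not the actual script."""
    parser = argparse.ArgumentParser(description="Synthetic dataset generation")
    parser.add_argument(
        "--output-format",
        # The default preserves the script's current behaviour.
        choices=["gptManagerBenchmark", "trtllm-bench"],  # second value assumed
        default="gptManagerBenchmark",
        help="Which benchmark tool the emitted dataset should target.",
    )
    parser.add_argument(
        "--output-file",  # assumed flag: write to a file instead of stdout
        default=None,
        help="Optional output path; stdout is used when omitted.",
    )
    return parser
```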

hypdeb self-assigned this on Apr 9, 2025
@hypdeb (Collaborator, Author) commented Apr 24, 2025

/bot run

@hypdeb (Collaborator, Author) commented Apr 24, 2025

@kaiyux @FrankD412 I would like to reopen the discussion on the output format of the synthetic dataset generation scripts.

Personally, I think it makes sense for the script to be able to generate different output formats. This set of changes adds support for the trtllm-bench format without breaking current behaviour, since the default value of the new output-format argument is gptManagerBenchmark, the format the script generates today.
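
As a tiny self-contained illustration of the compatibility claim, omitting the flag leaves the parsed value at the confirmed default (the flag name comes from the PR description; everything else here is a sketch):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output-format", default="gptManagerBenchmark")

# Callers that pass no flag get the same format as before this change.
args = parser.parse_args([])
assert args.output_format == "gptManagerBenchmark"
```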

hypdeb changed the title from "DRAFT: gathering changes to allow for smoother benchmarking in some scenarios" to "feat: output-format arg for synthetic data generation script and improved support for Mistral models in the PyTorch workflow" on Apr 24, 2025
@hypdeb (Collaborator, Author) commented Apr 24, 2025

That being said, of course, it would be preferable if, in the future, tools that ingest datasets could conform to an existing format.

@tensorrt-cicd (Collaborator)

PR_Github #3302 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #3302 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2300 completed with status: 'FAILURE'

@FrankD412 (Collaborator) commented Apr 24, 2025

> That being said, of course, it would be preferable if, in the future, tools that ingest datasets could conform to an existing format.

I agree. The whole reason I introduced the second format was that gptManagerBenchmark was limited to input ids: being a C++ tool, it didn't have access to the transformers library from Python. When I started trtllm-bench, an initial goal was to allow it to take either a prompt or input ids, and to accept requests from stdin. The existing JSON didn't lend itself to the latter, but that requirement ended up becoming less important (though it might be coming back up again).
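
To make the distinction concrete, here is a hedged sketch of the two record shapes under discussion; the field names are illustrative assumptions, not the exact schemas of either tool:

```python
import json

# Token-id-only record: all a C++ consumer without a tokenizer can use
# (the gptManagerBenchmark-style limitation described above; fields assumed).
ids_record = {"input_ids": [1, 5124, 319, 278], "output_len": 128}

# Text-based record: trtllm-bench's goal of accepting a prompt directly
# (fields assumed).
text_record = {"prompt": "Hello, world", "output_len": 128}

# Newline-delimited JSON, one request per line, is the kind of layout
# that lends itself to streaming requests over stdin.
for record in (ids_record, text_record):
    print(json.dumps(record))
```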

Also, maybe this is personal preference, but this PR now addresses two unrelated things, which makes it harder to roll back or track changes in trtllm-bench. Mind if we split these sorts of things up in the future?

@hypdeb (Collaborator, Author) commented Apr 24, 2025

I will split the Mistral stuff out.

@hypdeb (Collaborator, Author) commented Apr 24, 2025

The Mistral changes are here: #3843. I'll rebase this one on top of it once merged, which should leave only the dataset changes.

@hypdeb (Collaborator, Author) commented Apr 25, 2025

Moved the dataset generation changes to #3866 and expanded them. Closing this one, as all changes are now represented on other branches.

hypdeb closed this on Apr 25, 2025