feat: output-format arg for synthetic data generation script and improved support for Mistral models in the PyTorch workflow #3419
Signed-off-by: jdebache <[email protected]>
…ystem cli in container Signed-off-by: jdebache <[email protected]>
Signed-off-by: Julien Debache <[email protected]>
/bot run
@kaiyux @FrankD412 I would like to reopen the discussion on the output format of the synthetic dataset generation scripts. Personally, I think it makes sense for the script to be able to generate different output formats. This set of changes should add support for the tllm-bench format without breaking the current behaviour, as the default value for the new […]

That being said, it would of course be preferable if, in the future, tools that ingest datasets could conform to an existing format.
PR_Github #3302 [ run ] triggered by Bot
PR_Github #3302 [ run ] completed with state […]
I agree, the whole reason I introduced the second format was because […] Also, maybe this is personal preference, but this PR now addresses two unrelated things, which makes it harder to roll back or track changes in […]
I will split the Mistral stuff out.
Here are the Mistral changes: #3843. I'll rebase this one on top of it once it is merged, which should leave only the dataset changes.
Moved the dataset generation stuff to #3866, and expanded it. Closing this one as all changes are now represented on other branches.
`output-format` argument to `prepare_dataset.py`, allowing the generation of datasets compatible with `tllm-bench` without having to go through `stdout`
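
For illustration only, here is a rough sketch of how a format switch of this kind could be wired up so that a dataset can be written straight to a file instead of being piped through `stdout`. It is not the code from this PR: the format names (`default`, `bench`), the `--output` option, the record fields and the helper names `render`, `build_parser` and `main` are all assumptions made up for the example, and the real `prepare_dataset.py` may be structured quite differently.

```python
import argparse
import json
import sys

# Hypothetical format names and record layouts, for illustration only;
# the real script's options and schemas may differ.
FORMATS = ("default", "bench")


def render(samples, output_format):
    """Serialize the samples according to the chosen (hypothetical) format."""
    if output_format == "bench":
        # One JSON object per line.
        return "\n".join(json.dumps(s) for s in samples) + "\n"
    # Default: a single JSON document wrapping all samples.
    return json.dumps({"samples": samples}, indent=2) + "\n"


def build_parser():
    parser = argparse.ArgumentParser(description="Synthetic dataset sketch")
    # Keeping "default" as the default value preserves the existing behaviour.
    parser.add_argument("--output-format", choices=FORMATS, default="default")
    parser.add_argument("--output", default=None,
                        help="File to write; stdout is used when omitted")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder samples standing in for the generated synthetic data.
    samples = [{"task_id": i, "prompt": f"sample {i}"} for i in range(2)]
    text = render(samples, args.output_format)
    if args.output:
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(text)
    else:
        sys.stdout.write(text)
```

Under these assumptions, `main(["--output-format", "bench", "--output", "dataset.jsonl"])` would write one JSON object per line to `dataset.jsonl`, while calling `main([])` keeps printing the default layout to `stdout`.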