feat: Enable streaming in data preprocessor #437
base: main
Conversation
…r future tests, add streaming to config Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Thanks for making a pull request! 😃
Signed-off-by: Will Johnson <[email protected]>
Thanks @willmj for integrating usage of Iterable datasets. Just some initial thoughts.
Shouldn't streaming be a top-level option instead of a per-dataset one? Is it possible to mix streaming and non-streaming datasets using concat?
…nstead of dataset config Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
@willmj Is this PR in a usable state? We need to run an EPT with large datasets. Without streaming, the data processing is failing; we want the streaming feature to address this issue.
@willmj I would request your attention to this fms-hf-tuning/tuning/utils/preprocessing_utils.py Lines 49 to 60 in 224f35b
map operations applied), so it's good to be defensive when retrieving columns wherever necessary.
Signed-off-by: Will Johnson <[email protected]>
tests/test_sft_trainer.py
Outdated
[
    (
        [TWITTER_COMPLAINTS_DATA_DIR_JSON],
        DATA_CONFIG_TOKENIZE_AND_APPLY_INPUT_MASKING_YAML,
I assume this yaml file is to be used for this test case: DATA_CONFIG_YAML_STREAMING
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
@seshapad I have now had a successful tuning job with streaming on multi-GPU. You should be able to try it out; let me know if you run into any errors.
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Tuning + inference works! Only 200 steps (the equivalent of less than an epoch), which is why the result is wrong, but the format is right.
Inference result on
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
@willmj The
I can share the dataset with you if you wish to attempt reproducing this bug.
cc: @ashokponkumar
… pretokenized case in data collator Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
data = processor.load_dataset(
    None,
    streaming=processor.processor_config.streaming,
    splitName="train[:1]",
Do we still need to specify streaming if all we do is load just the first line of the train split?
Can you please check what the HF docs say about this?
For checking the columns, it seems fine in unit tests to not pass streaming; however, it then loads the example as a Dataset instead of an IterableDataset. If this is okay with you, we can either pass streaming through the kwargs of load_dataset, default streaming to False in load_dataset, or just set it to False for this load. Let me know what you think will work best.
If for checking columns a single sample can be loaded without streaming, you can choose that route and force-disable streaming in this call; I would be fine with it.
My question was whether a single sample can be loaded in all cases without performance concerns, even for large datasets. In other words: (1) does HF load only one sample from disk, or (2) does HF load all samples and then drop all but one?
In case (2), performance can take a hit.
From the HF documentation on slice splits, it seems load_dataset goes for option (1).
tuning/sft_trainer.py
Outdated
    "Setting `split_batches` to true - splitting batches among devices \
    `per_device_train_batch_size` is now the global batch size, and \
    should be treated as such."
)
While I can live with this check for now, we should be clearer here; we must not waste a run where the user's job fails and they have to scramble through the logs to find this.
Is there a suggestion to make this explicit?
Also, please move this inside process_dataargs itself; we do have train_args available, so it can be done inside that function. And do we need to set accelerator_config to this dict, or do we need to append to it?
Yes, per Mehant's suggestion we set accelerator_config to this dict. You bring up a good point that documentation should be added for this PR; I will add it soon.
I added documentation in advanced-data-preprocessing.md, and made the warning more explicit.
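A sketch of how the merge could look (a hypothetical helper, not the PR's exact code), warning explicitly and appending split_batches to any existing accelerator_config rather than overwriting it:

```python
# Hypothetical helper: when streaming, warn loudly and merge split_batches
# into whatever accelerator_config the user already supplied.
import logging


def apply_streaming_config(train_args_dict, streaming):
    if streaming:
        logging.warning(
            "Setting `split_batches` to True: batches are split among devices, "
            "so `per_device_train_batch_size` is now the global batch size."
        )
        existing = train_args_dict.get("accelerator_config") or {}
        # Merge rather than overwrite, preserving the user's other settings.
        train_args_dict["accelerator_config"] = {**existing, "split_batches": True}
    return train_args_dict


args = apply_streaming_config({"accelerator_config": {"dispatch_batches": None}}, True)
print(args["accelerator_config"])  # {'dispatch_batches': None, 'split_batches': True}
```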
Signed-off-by: Will Johnson <[email protected]>
@seshapad @HarikrishnanBalagopal can we take this branch now and shoot an EPT run which ended in error before? @willmj has made all the code correctness changes, so request you to do a sanity check before we go for merge.
@HarikrishnanBalagopal please provide an image for this branch. I will start an EPT.
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
…evious PRs Signed-off-by: Will Johnson <[email protected]>
… is merged in Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
According to this comment, once TRL is upgraded the highlighted case shouldn't be needed, so I have removed it. This may cause some tests to fail until the TRL upgrade is merged, specifically
This means this PR is waiting on the TRL upgrade.
Signed-off-by: Will Johnson <[email protected]>
Removing the following test case:
as it will be resolved by #468
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Description of the change
These changes enable streaming and test streaming datasets.
Added:
- streaming as an arg in DataSetConfig, similarly to sampling
- Since IterableDatasets can't be indexed, the first example is used where column names are needed
- max_steps instead of num_train_epochs if using streaming
Related issue number
How to verify the PR
- streaming in the data config returns an IterableDataset
Was the PR tested