test: Addition of data preprocessing dry run script #459
Conversation
Signed-off-by: Abhishek <[email protected]>
Thanks for making a pull request! 😃
Signed-off-by: Abhishek <[email protected]>
scripts/dry_run_data_processor.py
Outdated
# Save train dataset
if save_train_dataset:
    logger.info("Saving processed train dataset to %s", save_train_dataset)
    formatted_train_dataset.to_json(save_train_dataset)
Can we just dump the dataset to a JSON?
Should we not allow users to pass in additional arguments, like the number of shards, in case of very large datasets where this feature might be used more?
Sure. Currently I save the entire dataset in a single JSON file, as I assumed small-scale usage: the user analyzes a small processed data chunk in a dry run (to see how the data preprocessing is done) and then uses the large dataset for the actual processing and training. But if the use case involves processing large datasets, then yes, it could be a good idea to do that.
We could do something like taking --num_dataset_shards from the user and saving the dataset in multiple JSON files whose names are derived from the --save-train-dataset string passed by the user? Does this sound relevant?
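As a rough illustration of the naming scheme being proposed here, one could derive per-shard filenames from the single `--save-train-dataset` path. This is only a sketch; the helper name `shard_paths` and the `_NNNNN_of_NNNNN` suffix convention are assumptions, not code from the PR.

```python
import os

def shard_paths(base_path: str, num_shards: int) -> list[str]:
    """Derive per-shard output paths from a single base path,
    e.g. 'train.json' -> 'train_00000_of_00004.json'."""
    root, ext = os.path.splitext(base_path)
    return [f"{root}_{i:05d}_of_{num_shards:05d}{ext}" for i in range(num_shards)]

print(shard_paths("processed/train.json", 4))
```

Each shard would then be written to its own derived path, so the user still passes only one output argument.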
Yes. It need not be JSON, to be honest, but please see if we can save using HF APIs and not write anything on our end to handle this. I think the confusion is arising because we initially called this a dry run, while the actual use case has moved to more of data preprocessing only.
Yeah, sure. I have used dataset.shard(), since HF has already written the code for this. Pushed the changes.
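For readers unfamiliar with `datasets.Dataset.shard(num_shards, index)`, here is a small stdlib-only sketch of which row indices a shard ends up with. The helper name is mine, not from the PR or the HF library; HF exposes both selection strategies via its `contiguous` flag, and this mock just reproduces that index arithmetic.

```python
def shard_indices(n_rows: int, num_shards: int, index: int,
                  contiguous: bool = False) -> list[int]:
    """Mimic the row selection of datasets.Dataset.shard(num_shards, index)."""
    if contiguous:
        # Split into num_shards contiguous chunks; earlier chunks absorb the remainder.
        div, mod = divmod(n_rows, num_shards)
        start = index * div + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Stride-based: every num_shards-th row, starting at `index`.
    return list(range(index, n_rows, num_shards))

print(shard_indices(10, 4, 0))        # stride shard 0
print(shard_indices(10, 4, 1, True))  # contiguous shard 1
```

Writing each shard out is then just `dataset.shard(num_shards, i).to_json(path_i)` in a loop, with no custom splitting logic on our end.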
Signed-off-by: Abhishek <[email protected]>
Description of the change
Addition of a data preprocessing dry run script. The user can run our script via the command
python dry_run_data_processor.py
and pass arguments as command-line args to our script, alongside a data_config JSON/YAML file.
Related issue number
Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1578
How to verify the PR
Add arguments to
python dry_run_data_processor.py
including data_config_path, to test.
Was the PR tested