
test: Addition of data preprocessing dry run script #459

Draft · wants to merge 6 commits into main

Conversation

Abhishek-TAMU (Collaborator) commented Feb 8, 2025

Description of the change

This PR adds a data preprocessing dry run script. Users can run the script via python dry_run_data_processor.py, passing arguments as command-line args alongside a data_config JSON/YAML file.

Related issue number

Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1578

How to verify the PR

Run python dry_run_data_processor.py with the required arguments, including data_config_path, to test.
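
For example, a hypothetical invocation might look like the following (the config file name and output path are illustrative; --save-train-dataset follows the review discussion below and is not confirmed as the final flag name):

python dry_run_data_processor.py --data_config_path data_config.yaml --save-train-dataset processed_train.json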

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


github-actions bot commented Feb 8, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

github-actions bot added the test label Feb 8, 2025

Signed-off-by: Abhishek <[email protected]>
# Save train dataset
if save_train_dataset:
    logger.info("Saving processed train dataset to %s", save_train_dataset)
    formatted_train_dataset.to_json(save_train_dataset)
Contributor

Rather than just dumping the dataset to a single JSON, should we not allow users to pass in additional arguments, like the number of shards, in case of very large datasets where this feature might be more used?

Collaborator (Author)

Sure. Currently I save the entire dataset in a single JSON because I assumed small-scale usage: the user analyzes a small processed data chunk in a dry run (to see how preprocessing is done) and then uses the large dataset for the actual processing and training. But if the use case involves processing large datasets, then yes, that could be a good idea.

We could do something like taking --num_dataset_shards from the user and saving the dataset in multiple JSON files, with the file names derived from the --save-train-dataset string passed by the user? Does this sound relevant?

Contributor

Yes. It need not be JSON, to be honest, but please see if we can save using HF APIs and not write anything on our end to handle this. I think the confusion is arising because we initially called this a dry run, while the actual use case has moved toward data preprocessing only.

Collaborator (Author)

Yeah, sure. I have used dataset.shard(), following the code you had already written. Pushed the changes.
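
For reference, a minimal sketch of what sharded saving via the HF datasets API might look like, reusing the variable names from the snippet above (num_dataset_shards and the shard file-naming scheme are illustrative assumptions from this thread, not necessarily the merged code):

# Save train dataset, sharded via the built-in HF datasets API (illustrative sketch)
if save_train_dataset:
    if num_dataset_shards > 1:
        for index in range(num_dataset_shards):
            # Dataset.shard() returns the index-th of num_shards pieces;
            # contiguous=True preserves the original row ordering.
            shard = formatted_train_dataset.shard(
                num_shards=num_dataset_shards, index=index, contiguous=True
            )
            # Derive shard file names from the user-supplied path
            # (naming scheme assumed for illustration).
            shard.to_json(f"{save_train_dataset}_{index:05d}.json")
    else:
        formatted_train_dataset.to_json(save_train_dataset)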
