
test: Addition of data preprocessing dry run script #459

Draft · wants to merge 6 commits into main

Conversation

Abhishek-TAMU (Collaborator) commented Feb 8, 2025

Description of the change

This PR adds a data preprocessing dry run script. Users can run the script via python dry_run_data_processor.py, passing arguments as command-line args alongside a data_config JSON/YAML file.

Related issue number

Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1578

How to verify the PR

Run python dry_run_data_processor.py with the required arguments, including data_config_path, to test.
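
For example, a hypothetical invocation might look like the following (the config file name and output path are illustrative; --save-train-dataset follows the review discussion below and is not confirmed as the final flag name):

python dry_run_data_processor.py --data_config_path data_config.yaml --save-train-dataset processed_train.json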

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


github-actions bot commented Feb 8, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

github-actions bot added the test label Feb 8, 2025

Signed-off-by: Abhishek <[email protected]>
# Save train dataset
if save_train_dataset:
    logger.info("Saving processed train dataset to %s", save_train_dataset)
    formatted_train_dataset.to_json(save_train_dataset)
Contributor

Rather than just dumping the dataset to a single JSON, should we not allow users to pass in additional arguments, like the number of shards, in case of very large datasets where this feature might be more used?

Collaborator (Author)

Sure. Currently I save the entire dataset in a single JSON because I assumed small-scale usage: the user analyzes a small processed data chunk in a dry run (to see how preprocessing is done) and then uses the large dataset for the actual processing and training. But if the use case involves processing large datasets, then yes, that could be a good idea.

We could do something like taking --num_dataset_shards from the user and saving the dataset in multiple JSON files, with the file names derived from the --save-train-dataset string passed by the user? Does this sound relevant?

Contributor

Yes. It need not be JSON, to be honest, but please see if we can save using HF APIs and not write anything on our end to handle this. I think the confusion is arising because we initially called this a dry run, while the actual use case has moved toward data preprocessing only.

Collaborator (Author)

Yeah, sure. I have used dataset.shard(), following the code you had already written. Pushed the changes.
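
For reference, a minimal sketch of what sharded saving via the HF datasets API might look like, reusing the variable names from the snippet above (num_dataset_shards and the shard file-naming scheme are illustrative assumptions from this thread, not necessarily the merged code):

# Save train dataset, sharded via the built-in HF datasets API (illustrative sketch)
if save_train_dataset:
    if num_dataset_shards > 1:
        for index in range(num_dataset_shards):
            # Dataset.shard() returns the index-th of num_shards pieces;
            # contiguous=True preserves the original row ordering.
            shard = formatted_train_dataset.shard(
                num_shards=num_dataset_shards, index=index, contiguous=True
            )
            # Derive shard file names from the user-supplied path
            # (naming scheme assumed for illustration).
            shard.to_json(f"{save_train_dataset}_{index:05d}.json")
    else:
        formatted_train_dataset.to_json(save_train_dataset)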
