Open
Labels
bug: Something isn't working
Description
Describe the bug
If you first create a custom dataset with a specific set of splits, generate metadata with `datasets-cli test ... --save_info`, and then change your script to include more splits, loading the dataset fails.
That's what happened in https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/discussions/2#6385fd1269634850f8ddff48.
Steps to reproduce the bug
- Create a dataset with a custom loading script whose `_split_generators` returns, for example, only the `"train"` split. Specifically, if you really want to reproduce, copy https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/blob/main/food_vision_199_classes.py
- Run `datasets-cli test dataset_script.py --save_info --all_configs`. This generates metadata YAML in `README.md` that contains info about the splits, for example:
    splits:
    - name: train
      num_bytes: 2973286
      num_examples: 19747
- Make changes to your script so that it returns another set of splits, for example, `"train"` and `"test"` (uncomment these lines).
- Run `load_dataset` and get the following error:
Traceback (most recent call last):
  File "/home/daniel/code/pytorch/env/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/test.py", line 141, in run
    builder.download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1555, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 913, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1356, in _prepare_split
    split_info = self.info.splits[split_generator.name]
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/splits.py", line 525, in __getitem__
    instructions = make_file_instructions(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 111, in make_file_instructions
    name2filenames = {
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 112, in <dictcomp>
    info.name: filenames_for_dataset_split(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 78, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 57, in filename_prefix_for_split
    if os.path.basename(name) != name:
  File "/home/daniel/code/pytorch/env/lib/python3.8/posixpath.py", line 143, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
- Bonus: try to regenerate the metadata in `README.md` with `datasets-cli` as in step 2 and get the same error.
This happens because `dataset.info.splits` contains only the `"train"` split. When we do `self.info.splits[split_generator.name]` with a split name that is not in it, the lookup treats the name as a slicing instruction, something like `info.splits['train[50%]']`. That is not the case here, so it fails.
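To make the failure mode concrete, here is a toy model of that lookup. It is only a sketch inferred from the traceback, not the library's code: `ToySplitDict` and its `dataset_name` attribute are hypothetical stand-ins, and the assumption (taken from the `NoneType` in the error) is that no dataset name is available when the splits come from the `README.md` metadata.

```python
import os


class ToySplitDict(dict):
    """Toy stand-in for the splits mapping; NOT the implementation in `datasets`."""

    def __init__(self, *args, dataset_name=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumption: the name is None when splits were read from README.md metadata.
        self.dataset_name = dataset_name

    def __getitem__(self, key):
        # Case 1: the split is known, e.g. splits["train"].
        if key in self:
            return super().__getitem__(key)
        # Case 2: an unknown key is treated like a slicing instruction
        # (e.g. "train[50%]"), which needs a file prefix built from the name.
        prefix = os.path.basename(self.dataset_name)  # TypeError if the name is None
        return f"{prefix}-{key}"


# Metadata saved before the script change knows only about "train".
splits = ToySplitDict({"train": {"num_examples": 19747}})
print(splits["train"])

try:
    splits["test"]  # the newly added split is missing from the saved metadata
except TypeError as err:
    print(f"lookup of 'test' failed: {err}")
```

In this model the error has nothing to do with the `"test"` data itself. The stale metadata sends the lookup down the slicing branch, which then fails on the missing name.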
Expected behavior
To be discussed?
This can be worked around by first removing the splits information from the metadata file, but I wonder if there is a better way.
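The workaround could be scripted roughly as follows. This is a minimal sketch, assuming the metadata lives in a YAML front matter block delimited by `---` at the top of `README.md`. `strip_splits_metadata` is a hypothetical helper, not part of `datasets`, and it removes only `splits:` blocks; whether other size-related keys also need to be dropped is not covered here.

```python
import re


def strip_splits_metadata(readme_text: str) -> str:
    """Return the README text with `splits:` blocks removed from its YAML front matter.

    Hypothetical helper: plain line-based editing, no YAML parser involved.
    """
    lines = readme_text.splitlines(keepends=True)
    # The front matter must open with '---' on the very first line.
    if not lines or lines[0].strip() != "---":
        return readme_text
    # Find the closing '---'; without one there is no front matter to edit.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return readme_text

    kept = [lines[0]]
    skip_indent = None  # indentation of the `splits:` key currently being dropped
    for line in lines[1:end]:
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if skip_indent is not None:
            # Children are indented deeper, or are list items at the key's own indent.
            is_child = indent > skip_indent or (
                indent == skip_indent and stripped.startswith("- ")
            )
            if is_child and stripped.strip():
                continue
            skip_indent = None
        if re.match(r"splits:\s*$", stripped):
            skip_indent = indent
            continue
        kept.append(line)
    # Everything from the closing '---' onward is left untouched.
    return "".join(kept + lines[end:])


sample = (
    "---\n"
    "dataset_info:\n"
    "  splits:\n"
    "  - name: train\n"
    "    num_bytes: 2973286\n"
    "    num_examples: 19747\n"
    "---\n"
    "# Dataset card\n"
)
print(strip_splits_metadata(sample))
```

After writing the cleaned text back to `README.md`, rerunning the `datasets-cli test ... --save_info` command from the steps above should regenerate the split information for the new set of splits.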
Environment info
- Datasets version: 2.7.1
- Python version: 3.8.13