
Adding new splits to a dataset script with existing old splits info in metadata's dataset_info fails #5315

@polinaeterna


Describe the bug

If you first create a custom dataset with a specific set of splits, generate metadata with `datasets-cli test ... --save_info`, and then change your script to return more splits, loading the dataset fails.

That's what happened in https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/discussions/2#6385fd1269634850f8ddff48.

Steps to reproduce the bug

  1. Create a dataset script whose `_split_generators` returns, for example, only a "train" split (see the sketch after this list). Specifically, if you really want to reproduce, copy `https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/blob/main/food_vision_199_classes.py`.
  2. Run `datasets-cli test dataset_script.py --save_info --all_configs`; this generates metadata YAML in README.md that contains info about the splits, for example:
  splits:
  - name: train
    num_bytes: 2973286
    num_examples: 19747
  3. Make changes to your script so that it returns another set of splits, for example, "train" and "test" (uncomment these lines).
  4. Run `load_dataset` and get the following error:
Traceback (most recent call last):
  File "/home/daniel/code/pytorch/env/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/test.py", line 141, in run
    builder.download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1555, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 913, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1356, in _prepare_split
    split_info = self.info.splits[split_generator.name]
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/splits.py", line 525, in __getitem__
    instructions = make_file_instructions(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 111, in make_file_instructions
    name2filenames = {
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 112, in <dictcomp>
    info.name: filenames_for_dataset_split(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 78, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 57, in filename_prefix_for_split
    if os.path.basename(name) != name:
  File "/home/daniel/code/pytorch/env/lib/python3.8/posixpath.py", line 143, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
  5. Bonus: try to regenerate the metadata in README.md with `datasets-cli` as in step 2 and get the same error.
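
For context, here is a minimal sketch of what such a script's split generation looks like. Everything in it (class name, features, file names) is illustrative rather than taken from `food_vision_199_classes.py`; only the pattern of adding a split after `--save_info` was run matters:

```python
import datasets


class FoodVisionSketch(datasets.GeneratorBasedBuilder):
    """Illustrative builder: first published with only a "train" split."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": "train.txt"},  # placeholder data file
            ),
            # Step 3: uncommenting this after `--save_info` was run with only
            # "train" in the metadata reproduces the TypeError above.
            # datasets.SplitGenerator(
            #     name=datasets.Split.TEST,
            #     gen_kwargs={"filepath": "test.txt"},
            # ),
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}
```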

This happens because `dataset.info.splits`, loaded from the saved metadata, contains only the "train" split. So when `self.info.splits[split_generator.name]` is called for the new split, `SplitDict.__getitem__` does not find the key and falls back to interpreting it as a split instruction (something like `info.splits['train[50%]']`), which is not the case here, and it fails.
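
A minimal sketch of that lookup, assuming datasets 2.7.1; the bare `SplitDict` below stands in for the splits loaded from the README metadata, where `dataset_name` ends up as `None`:

```python
from datasets.splits import SplitDict, SplitInfo

# Stand-in for self.info.splits as loaded from the saved metadata:
# only "train" is known, and dataset_name is left as None.
splits = SplitDict()
splits.add(SplitInfo(name="train", num_bytes=2973286, num_examples=19747))

splits["train"]  # known split: returned directly
splits["test"]   # unknown key: treated as a split instruction and routed
                 # through make_file_instructions(), which ends in
                 # "TypeError: expected str, bytes or os.PathLike object, not NoneType"
```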

Expected behavior

To be discussed?

This can be worked around by removing the splits information from the metadata file first (a rough sketch of that cleanup is below), but I wonder if there is a better way.
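
A rough sketch of that cleanup, assuming a single config whose generated metadata sits under a `dataset_info` key in the README.md front matter; this helper is hypothetical and not part of `datasets` or its CLI:

```python
import re

import yaml  # PyYAML

# Hypothetical cleanup: drop the stale "splits" entry from the dataset_info
# block in README.md, then re-run
# `datasets-cli test dataset_script.py --save_info --all_configs`.
with open("README.md", encoding="utf-8") as f:
    readme = f.read()

front_matter = re.match(r"^---\n(.*?)\n---\n", readme, flags=re.DOTALL).group(1)
metadata = yaml.safe_load(front_matter)

# Only "splits" appears in the example above; drop any size fields as well
# if `--save_info` saved them alongside it.
metadata.get("dataset_info", {}).pop("splits", None)

new_front_matter = yaml.safe_dump(metadata, sort_keys=False).strip()
with open("README.md", "w", encoding="utf-8") as f:
    f.write(readme.replace(front_matter, new_front_matter, 1))
```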

Environment info

  • Datasets version: 2.7.1
  • Python version: 3.8.13
