
Custom dataset: unclarities / manual_dir confusion / errors #2894

Open
@wegmatho

Description of issue

After following the tutorial for creating a custom dataset (https://www.tensorflow.org/datasets/add_dataset) I got an error when trying to actually use it. The error disappeared for no clear reason after I built the dataset again. Defining and building the dataset worked fine in the end, but several points were unclear, and I think some clarifications in the documentation would help.

Dataset

  • custom segmentation dataset
  • consists of two folders "images" and "annotations" and two text files containing image-lists/splits

Problems I've encountered

  1. I was only able to come up with a useful dataset definition after looking at similar existing definitions, e.g. https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/oxford_iiit_pet.py. The documentation alone did not make this clear to me (though maybe I'm just missing something).
  2. This led me to use the deprecated SplitGenerator, which causes errors when building. Converting to the new style of declaration was easy, but a hint in the docs would be nice.
  3. The part on "manual_dir" needs clarification. It says to use "dl_manager.manual_dir", but not what to do with the file placed there (extract it!) or how to use it afterwards. A pointer to the TFDS CLI argument --manual_dir would also be great.
  4. Error when trying to use the dataset with tfds.load: initially I got "google.protobuf.json_format.ParseError: Message type "tensorflow_datasets.DatasetInfo" has no field named "moduleName"". It disappeared after building the dataset again, however.
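For points 1 and 2, the pattern I eventually arrived at can be sketched without any TFDS dependency: each split .txt file lists one image basename per line, and the generator yields a unique key plus a feature record per name. The helper name `iter_split_records` below is hypothetical and only illustrates the shape of the data:

```python
import os

def iter_split_records(images_dir, annotations_dir, split_file):
    """Yield (key, record) pairs for each basename listed in split_file.

    Mirrors the _generate_examples pattern: split files contain one image
    basename (no extension) per line; images are .jpg, masks are .png.
    """
    with open(split_file, "r") as names:
        for line in names:
            stem = line.strip()
            if not stem:
                continue  # tolerate blank lines
            key = stem + ".jpg"  # keys must be unique within a split
            yield key, {
                "image": os.path.join(images_dir, key),
                "label": 0,  # single-class dataset
                "file_name": key,
                "segmentation_mask": os.path.join(annotations_dir, stem + ".png"),
            }
```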

Setup

  • Windows 10
  • Anaconda 3 environment
  • Python 3.6
  • tensorflow-datasets 4.1.0
  • tfds-nightly 4.1.0.dev202012260107

Dataset definition

`"""dataset_seg dataset."""
import os
import tensorflow_datasets as tfds
from tensorflow.io.gfile import GFile

_DESCRIPTION = """
Dataset for image segmentation of microscope images

images: JPG, square shaped, zero-padded
annotations: PNG, color-indexed masks, 0=background, 1=foreground
"""

class DatasetSeg(tfds.core.GeneratorBasedBuilder):
"""DatasetBuilder for dataset_seg dataset."""

MANUAL_DOWNLOAD_INSTRUCTIONS = """
    Register into https://example.org/login to get the data. Place the 
    file in the manual_dir/.
    """

VERSION = tfds.core.Version('1.0.0')
RELEASE_NOTES = {
  '1.0.0': 'Initial release.',
}

def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            # These are the features of your dataset like images, labels ...
            'image': tfds.features.Image(),
            'label': tfds.features.ClassLabel(names=["foreground"]),
            'file_name': tfds.features.Text(),
            'segmentation_mask': tfds.features.Image(shape=(None, None, 1)),
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # as_supervised=True in builder.as_dataset.
        supervised_keys=('image', 'label'),  # Set to None to disable
        homepage='https://dataset-homepage/',
    )

def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Returns SplitGenerators."""
    # Downloads the data and defines the splits
    data = dl_manager.manual_dir / "dataset_seg.zip"
    data_extracted = dl_manager.extract(data)
    
    images_path_dir = os.path.join(data_extracted, "images")
    annotations_path_dir = os.path.join(data_extracted, "annotations")

    # Setup train and test splits
    return {
        "train": self._generate_examples(images_path_dir, 
                                         annotations_path_dir,
                                         os.path.join(data_extracted, "train.txt")),
        "val": self._generate_examples(images_path_dir, 
                                       annotations_path_dir,
                                       os.path.join(data_extracted, "val.txt")),
    }

def _generate_examples(self, images_dir_path, annotations_path_dir, images_list_file):
    with GFile(images_list_file, "r") as images_list:
        for image_name_no_suffix in images_list:
            mask_name = image_name_no_suffix.strip() + ".png"
            image_name = image_name_no_suffix.strip() + ".jpg"
            record = {
                "image": os.path.join(images_dir_path, image_name),
                "label": 0, # as of now there is only one label..
                "file_name": image_name,
                "segmentation_mask": os.path.join(annotations_path_dir, mask_name)
            }
            yield image_name, record`
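On the manual_dir point: _split_generators above expects a single dataset_seg.zip placed in manual_dir, containing images/, annotations/ and the two split files. A stdlib sketch that writes a matching dummy archive (placeholder bytes instead of real JPG/PNG data; the helper name is hypothetical), which can also be handy for tests:

```python
import os
import zipfile

def make_dummy_archive(target_dir):
    """Write a dataset_seg.zip with the layout _split_generators expects.

    The archive root must contain images/, annotations/, train.txt and
    val.txt; the file bytes here are placeholders, not valid images.
    """
    zip_path = os.path.join(target_dir, "dataset_seg.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for stem in ("img_001", "img_002"):
            zf.writestr("images/%s.jpg" % stem, b"fake-jpg-bytes")
            zf.writestr("annotations/%s.png" % stem, b"fake-png-bytes")
        zf.writestr("train.txt", "img_001\n")
        zf.writestr("val.txt", "img_002\n")
    return zip_path
```

With the real archive in place, building via the TFDS CLI with `tfds build --manual_dir=/path/to/manual_dir` (or placing the file under the default manual download directory) should let the builder find it.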

Labels

documentation