
Custom dataset: unclarities / manual_dir confusion / errors #2894

Open
@wegmatho

Description of issue

After following the tutorial for creating a custom dataset (https://www.tensorflow.org/datasets/add_dataset) I got an error when trying to actually use it. The error disappeared for no clear reason after I built the dataset again. Defining and building the dataset worked fine in the end, but several points were unclear, and I think some clarifications in the documentation would help.

Dataset

  • custom segmentation dataset
  • consists of two folders "images" and "annotations" and two text files containing image-lists/splits

Problems I've encountered

  1. I was only able to come up with a useful dataset definition after looking at similar existing definitions, e.g. https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/oxford_iiit_pet.py. The documentation alone did not make this clear to me (though maybe I'm just missing something).
  2. This led me to use the deprecated SplitGenerator, which causes errors when building. Converting to the new style of declaration was easy, but a hint in the docs would be nice.
  3. The part on "manual_dir" needs clarification. It says to use "dl_manager.manual_dir", but not what to do with the file placed there (extract it!) or how to use it afterwards. A pointer to the TFDS CLI argument --manual_dir would also be great.
  4. Error when trying to use the dataset with tfds.load: initially I got "google.protobuf.json_format.ParseError: Message type "tensorflow_datasets.DatasetInfo" has no field named "moduleName"". It disappeared after building the dataset again, however.
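For points 1 and 2, the pattern I eventually arrived at can be sketched without any TFDS dependency: each split .txt file lists one image basename per line, and the generator yields a unique key plus a feature record per name. The helper name `iter_split_records` below is hypothetical and only illustrates the shape of the data:

```python
import os

def iter_split_records(images_dir, annotations_dir, split_file):
    """Yield (key, record) pairs for each basename listed in split_file.

    Mirrors the _generate_examples pattern: split files contain one image
    basename (no extension) per line; images are .jpg, masks are .png.
    """
    with open(split_file, "r") as names:
        for line in names:
            stem = line.strip()
            if not stem:
                continue  # tolerate blank lines
            key = stem + ".jpg"  # keys must be unique within a split
            yield key, {
                "image": os.path.join(images_dir, key),
                "label": 0,  # single-class dataset
                "file_name": key,
                "segmentation_mask": os.path.join(annotations_dir, stem + ".png"),
            }
```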

Setup

  • Windows 10
  • Anaconda 3 environment
  • Python 3.6
  • tensorflow-datasets 4.1.0
  • tfds-nightly 4.1.0.dev202012260107

Dataset definition

`"""dataset_seg dataset."""
import os
import tensorflow_datasets as tfds
from tensorflow.io.gfile import GFile

_DESCRIPTION = """
Dataset for image segmentation of microscope images

images: JPG, square shaped, zero-padded
annotations: PNG, color-indexed masks, 0=background, 1=foreground
"""

class DatasetSeg(tfds.core.GeneratorBasedBuilder):
"""DatasetBuilder for dataset_seg dataset."""

MANUAL_DOWNLOAD_INSTRUCTIONS = """
    Register into https://example.org/login to get the data. Place the 
    file in the manual_dir/.
    """

VERSION = tfds.core.Version('1.0.0')
RELEASE_NOTES = {
  '1.0.0': 'Initial release.',
}

def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            # These are the features of your dataset like images, labels ...
            'image': tfds.features.Image(),
            'label': tfds.features.ClassLabel(names=["foreground"]),
            'file_name': tfds.features.Text(),
            'segmentation_mask': tfds.features.Image(shape=(None, None, 1)),
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # as_supervised=True in builder.as_dataset.
        supervised_keys=('image', 'label'),  # Set to None to disable
        homepage='https://dataset-homepage/',
    )

def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Returns SplitGenerators."""
    # Downloads the data and defines the splits
    data = dl_manager.manual_dir / "dataset_seg.zip"
    data_extracted = dl_manager.extract(data)
    
    images_path_dir = os.path.join(data_extracted, "images")
    annotations_path_dir = os.path.join(data_extracted, "annotations")

    # Setup train and test splits
    return {
        "train": self._generate_examples(images_path_dir, 
                                         annotations_path_dir,
                                         os.path.join(data_extracted, "train.txt")),
        "val": self._generate_examples(images_path_dir, 
                                       annotations_path_dir,
                                       os.path.join(data_extracted, "val.txt")),
    }

def _generate_examples(self, images_dir_path, annotations_path_dir, images_list_file):
    with GFile(images_list_file, "r") as images_list:
        for image_name_no_suffix in images_list:
            mask_name = image_name_no_suffix.strip() + ".png"
            image_name = image_name_no_suffix.strip() + ".jpg"
            record = {
                "image": os.path.join(images_dir_path, image_name),
                "label": 0, # as of now there is only one label..
                "file_name": image_name,
                "segmentation_mask": os.path.join(annotations_path_dir, mask_name)
            }
            yield image_name, record`
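On the manual_dir point: _split_generators above expects a single dataset_seg.zip placed in manual_dir, containing images/, annotations/ and the two split files. A stdlib sketch that writes a matching dummy archive (placeholder bytes instead of real JPG/PNG data; the helper name is hypothetical), which can also be handy for tests:

```python
import os
import zipfile

def make_dummy_archive(target_dir):
    """Write a dataset_seg.zip with the layout _split_generators expects.

    The archive root must contain images/, annotations/, train.txt and
    val.txt; the file bytes here are placeholders, not valid images.
    """
    zip_path = os.path.join(target_dir, "dataset_seg.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for stem in ("img_001", "img_002"):
            zf.writestr("images/%s.jpg" % stem, b"fake-jpg-bytes")
            zf.writestr("annotations/%s.png" % stem, b"fake-png-bytes")
        zf.writestr("train.txt", "img_001\n")
        zf.writestr("val.txt", "img_002\n")
    return zip_path
```

With the real archive in place, building via the TFDS CLI with `tfds build --manual_dir=/path/to/manual_dir` (or placing the file under the default manual download directory) should let the builder find it.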

Labels

documentation