Minimizing memory usage with a large custom dataset (possible memory leak with first epoch) #4072

Open
@Dragonfire3900

I've written a custom dataset with the tfds cli (a GeneratorBasedBuilder without Beam). Overall the dataset is ~60 GB and is sourced from manually downloaded hdf5 files with mostly float32s inside.

I'm encountering an issue where iterating through the dataset consumes a huge amount of memory, much more than I think it should. It seems that while iterating over the dataset, TensorFlow is either caching data or losing track of memory. Specifically, when iterating over 20% (12 GB) of the dataset, memory usage tops out at around 17 GB. After the first epoch, the growth slows down dramatically.
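For anyone trying to reproduce the growth pattern, here is a minimal sketch of how it could be measured. It is not the exact script used for the numbers above. The helper names `peak_rss_kib` and `iterate_and_sample` are made up for illustration, `range(250)` stands in for the real dataset iterator, and it assumes Linux, where `ru_maxrss` is reported in KiB.

```python
import resource


def peak_rss_kib():
    # On Linux, ru_maxrss is the peak resident set size of this process in KiB.
    # (Other platforms may report different units; this sketch assumes Linux.)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def iterate_and_sample(examples, every=100):
    """Iterate over `examples`, recording (step, peak RSS) every `every` steps."""
    samples = []
    for step, _ in enumerate(examples, start=1):
        if step % every == 0:
            samples.append((step, peak_rss_kib()))
    return samples


# Stand-in iterable; in practice this would be the tf.data.Dataset being read.
samples = iterate_and_sample(range(250), every=100)
for step, kib in samples:
    print(f"step {step}: peak RSS {kib} KiB")
```

Because this records peak RSS, the values never decrease. A curve that climbs during the first epoch and then flattens would match the behaviour described above.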

What I'm wondering
In what ways does tfds apply caching when building a dataset? In addition, are there any configurations (at build or load time) that I could try in order to limit the memory impact of my dataset?

On the tfds side of things, I have already tried setting the following read configuration:
tfds.ReadConfig(try_autocache=False, skip_prefetch=True)
However, this seemingly only affected the speed of iterating through the dataset, not the amount of memory used, which is what I expected it to reduce.
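For context, this is roughly how the read configuration is passed when loading. Treat it as a sketch. The dataset name `my_custom_dataset` is a placeholder for the custom builder, the `load_split` wrapper is only there for illustration, and the `train[:20%]` split is an assumption matching the 20% slice mentioned above.

```python
def load_split(name="my_custom_dataset", split="train[:20%]"):
    """Load a split with auto-caching and the trailing prefetch disabled.

    `name` is a placeholder; substitute the registered name of the custom dataset.
    """
    # Imported lazily so the sketch can be read and imported without TFDS installed.
    import tensorflow_datasets as tfds

    read_config = tfds.ReadConfig(
        try_autocache=False,  # do not let tfds cache the dataset automatically
        skip_prefetch=True,   # do not append a prefetch op at the end of the pipeline
    )
    return tfds.load(name, split=split, read_config=read_config)
```

Calling `load_split()` returns a `tf.data.Dataset` that can then be iterated as usual.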

I've been trying to read through the documentation of both tfds.ReadConfig and tfds.load but haven't really seen anything other than these two options.

In addition, I've profiled my heap using tcmalloc and have found that the allocations are coming from reading in the data. Most of those allocations are sitting in memory and not being used at any specific time.

Environment information

  • Operating System: WSL 2.0 with Ubuntu 20.04
  • Python version: 3.8.10
  • tfds-nightly version: 4.6.0.dev202207180044
  • tf-nightly version: 2.11.0.dev20220805
