Description
I've written a custom dataset with the tfds CLI (a GeneratorBasedBuilder without Beam). Overall, the dataset is ~60 GB and is sourced from manually downloaded HDF5 files containing mostly float32s.
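For reference, a trimmed sketch of roughly what the builder looks like (the feature names, shapes, and HDF5 layout below are placeholders rather than my exact code):

```python
import h5py
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds


class MyHdf5Dataset(tfds.core.GeneratorBasedBuilder):
    """Sketch of the builder; names and shapes are placeholders."""

    VERSION = tfds.core.Version("1.0.0")
    MANUAL_DOWNLOAD_INSTRUCTIONS = "Place the source .hdf5 files in the manual dir."

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                # Mostly float32 tensors, as in the real dataset.
                "signal": tfds.features.Tensor(shape=(128,), dtype=tf.float32),
                "label": tfds.features.ClassLabel(num_classes=10),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # The manually downloaded files live under dl_manager.manual_dir.
        return {
            "train": self._generate_examples(dl_manager.manual_dir / "train.hdf5"),
        }

    def _generate_examples(self, path):
        with h5py.File(path, "r") as f:
            for i, row in enumerate(f["signal"]):
                yield i, {
                    "signal": np.asarray(row, dtype=np.float32),
                    "label": int(f["label"][i]),
                }
```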
I'm encountering an issue where iterating through the dataset consumes far more memory than I'd expect. It seems that as the dataset is iterated, TensorFlow is either caching the data or losing track of memory. Specifically, when iterating over 20% (12 GB) of the dataset, memory usage tops out at around 17 GB. After the first epoch, the growth slows down dramatically.
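Roughly, the iteration in question looks like the minimal sketch below (the dataset name and the psutil-based memory reporting are illustrative placeholders, not my exact code):

```python
import os

import psutil
import tensorflow_datasets as tfds

# Placeholder name for the custom builder described above.
ds = tfds.load("my_hdf5_dataset", split="train")

# Watch resident memory grow while iterating.
process = psutil.Process(os.getpid())
for step, example in enumerate(ds):
    if step % 1000 == 0:
        rss_gb = process.memory_info().rss / 1e9
        print(f"step={step} rss={rss_gb:.1f} GB")
```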
What I'm wondering
I am wondering: in what ways does tfds apply a cache when building? In addition, are there any configurations (when building or loading) that I could try in order to limit the memory footprint of my dataset?
On the tfds side of things, I have already tried setting the following read configuration:
```python
tfds.ReadConfig(try_autocache=False, skip_prefetch=True)
```
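For completeness, this is roughly how that read config is wired into `tfds.load`. The interleave fields are just the other `ReadConfig` knobs I'm aware of that might influence how much data is buffered at read time; they are shown here only for illustration, and the dataset name is a placeholder:

```python
import tensorflow_datasets as tfds

read_config = tfds.ReadConfig(
    try_autocache=False,          # don't auto-cache small datasets in memory
    skip_prefetch=True,           # let the caller add its own prefetch
    interleave_cycle_length=4,    # read fewer files concurrently
    interleave_block_length=1,
)

ds = tfds.load(
    "my_hdf5_dataset",            # placeholder name
    split="train",
    read_config=read_config,
    shuffle_files=False,
)
```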
However, this seemingly only affected the speed of iterating through the dataset and not, as I had expected, the amount of memory used.
I've been reading through the documentation for both tfds.ReadConfig and tfds.load, but haven't seen anything beyond these two options.
In addition, I've profiled the heap using tcmalloc and found that the allocations come from reading in the data. Most of those allocations sit in memory without being actively used at any given time.
Environment information
- Operating System: WSL 2.0 with Ubuntu 20.04
- Python version: 3.8.10
- tfds-nightly version: 4.6.0.dev202207180044
- tf-nightly version: 2.11.0.dev20220805