Description
I've written a custom dataset with the tfds CLI (a GeneratorBasedBuilder without Beam). Overall, the dataset is ~60 GB and is sourced from manually downloaded HDF5 files containing mostly float32s.
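For reference, a trimmed sketch of roughly what the builder looks like (the feature names, shapes, and HDF5 layout below are placeholders rather than my exact code):

```python
import h5py
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds


class MyHdf5Dataset(tfds.core.GeneratorBasedBuilder):
    """Sketch of the builder; names and shapes are placeholders."""

    VERSION = tfds.core.Version("1.0.0")
    MANUAL_DOWNLOAD_INSTRUCTIONS = "Place the source .hdf5 files in the manual dir."

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                # Mostly float32 tensors, as in the real dataset.
                "signal": tfds.features.Tensor(shape=(128,), dtype=tf.float32),
                "label": tfds.features.ClassLabel(num_classes=10),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # The manually downloaded files live under dl_manager.manual_dir.
        return {
            "train": self._generate_examples(dl_manager.manual_dir / "train.hdf5"),
        }

    def _generate_examples(self, path):
        with h5py.File(path, "r") as f:
            for i, row in enumerate(f["signal"]):
                yield i, {
                    "signal": np.asarray(row, dtype=np.float32),
                    "label": int(f["label"][i]),
                }
```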
I'm encountering an issue where iterating through the dataset consumes far more memory than I'd expect. It seems that as the dataset is iterated, TensorFlow is either caching the data or losing track of memory. Specifically, when iterating over 20% (12 GB) of the dataset, memory usage tops out at around 17 GB. After the first epoch, the growth slows down dramatically.
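Roughly, the iteration in question looks like the minimal sketch below (the dataset name and the psutil-based memory reporting are illustrative placeholders, not my exact code):

```python
import os

import psutil
import tensorflow_datasets as tfds

# Placeholder name for the custom builder described above.
ds = tfds.load("my_hdf5_dataset", split="train")

# Watch resident memory grow while iterating.
process = psutil.Process(os.getpid())
for step, example in enumerate(ds):
    if step % 1000 == 0:
        rss_gb = process.memory_info().rss / 1e9
        print(f"step={step} rss={rss_gb:.1f} GB")
```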
What I'm wondering
I am wondering: in what ways does tfds apply a cache when building? In addition, are there any configurations (when building or loading) that I could try in order to limit the memory footprint of my dataset?
On the tfds side of things, I have already tried setting the following read configuration:
```python
tfds.ReadConfig(try_autocache=False, skip_prefetch=True)
```
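For completeness, this is roughly how that read config is wired into `tfds.load`. The interleave fields are just the other `ReadConfig` knobs I'm aware of that might influence how much data is buffered at read time; they are shown here only for illustration, and the dataset name is a placeholder:

```python
import tensorflow_datasets as tfds

read_config = tfds.ReadConfig(
    try_autocache=False,          # don't auto-cache small datasets in memory
    skip_prefetch=True,           # let the caller add its own prefetch
    interleave_cycle_length=4,    # read fewer files concurrently
    interleave_block_length=1,
)

ds = tfds.load(
    "my_hdf5_dataset",            # placeholder name
    split="train",
    read_config=read_config,
    shuffle_files=False,
)
```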
However, this seemingly only affected the speed of iterating through the dataset and not, as I had expected, the amount of memory used.
I've been reading through the documentation for both tfds.ReadConfig and tfds.load, but haven't seen anything beyond these two options.
In addition, I've profiled the heap using tcmalloc and found that the allocations come from reading in the data. Most of those allocations sit in memory without being actively used at any given time.
Environment information
- Operating System: WSL 2.0 with Ubuntu 20.04
- Python version: 3.8.10
- tfds-nightly version: 4.6.0.dev202207180044
- tf-nightly version: 2.11.0.dev20220805