-
Hi! I have several questions about how to use Google Cloud Storage with the timm bits_and_tpu branch:
Thanks a lot!
-
Another problem I ran into is that PyTorch's IterableDataset doesn't work with any sampler. That's fine for most scenarios, since parser_tfds itself acts as an implicit distributed sampler, but if I want to use another sampler, say RASampler (repeated augmentation) from DeiT training, the program just throws an error. The PyTorch docs say the sampler and batch_sampler options are incompatible with iterable-style datasets.
I wonder if there is any workaround for RASampler with GCS?
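For reference, a minimal sketch of the incompatibility the error comes from: torch.utils.data.DataLoader refuses any explicit sampler when the dataset is iterable-style. The toy dataset below is just a stand-in for the TFDS-backed parser, not anything from timm:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, SequentialSampler

# Toy iterable-style dataset, standing in for timm's TFDS-backed parser.
class ToyStream(IterableDataset):
    def __iter__(self):
        return iter(range(8))

ds = ToyStream()

# Iterating without a sampler works: each worker just consumes its stream.
print(list(DataLoader(ds, batch_size=4)))

# Passing any sampler (RASampler, DistributedSampler, ...) raises a
# ValueError: DataLoader with IterableDataset: expected unspecified sampler option ...
try:
    DataLoader(ds, batch_size=4, sampler=SequentialSampler(range(8)))
except ValueError as e:
    print(e)
```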
-
@zeyuwang615 you can specify a different batch size for validation, and the bits_and_tpu train script will automatically lower the number of workers for the validation dataset to reduce issues caused by high parallelism on a small val set. You should really only run into problems if you have a really small validation set.

You cannot use samplers with iterable datasets, that's a limitation of the approach; each dataloader worker is completely independent for iterable datasets. So yes, repeat aug doesn't work. I tried doing repeat aug at a local (per worker) level but it didn't have the same impact (repeating within the same batch, since each worker generates its own batches, vs repeating across batches on the different distributed processes).

Your crash is likely due to not enough system memory (or possibly you are running in a docker container without enabling the shmem flag?). You can try using tcmalloc via LD_PRELOAD if it's the overall system memory, or you can reduce the size of the buffer/shuffle constants a bit. The TFDS dataset is very memory hungry; I always use LD_PRELOAD.
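For anyone curious, here is a rough sketch of what "repeat aug at a local (per worker) level" can look like, written as a generic wrapper rather than anything from the timm codebase; the class name and repeat count are made up for illustration:

```python
from torch.utils.data import IterableDataset

class LocalRepeatAug(IterableDataset):
    """Yield each sample `repeats` times so a single worker's batches contain
    repeated-augmentation copies (the transform is applied per yield).

    Unlike RASampler, the repeats land inside one worker's stream instead of
    being spread across distributed processes, which is why the effect on
    training is not the same.
    """

    def __init__(self, dataset, transform, repeats=3):
        self.dataset = dataset      # any iterable dataset yielding (sample, target)
        self.transform = transform  # augmentation applied independently per repeat
        self.repeats = repeats

    def __iter__(self):
        for sample, target in self.dataset:
            for _ in range(self.repeats):
                yield self.transform(sample), target
```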
-
Well, I am quite confused about how you access the TFRecord data on Google Cloud Storage. The dataset handling in bits_and_tpu is complex to me.