-
Hi! I have several questions about how to use Google Cloud Storage with the timm bits_and_tpu branch:
Thanks a lot!
-
Another problem I ran into is that PyTorch's IterableDataset doesn't work with any sampler. That's fine for most scenarios, since parser_tfds itself acts as an implicit distributed sampler, but if I want to use another sampler, say RASampler (repeated augmentation) from DeiT training, the program just throws an error. The PyTorch docs say the sampler and batch_sampler options are incompatible with iterable-style datasets.
I wonder if there is any workaround for RASampler with GCS?
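For reference, a minimal sketch of the incompatibility the error comes from: torch.utils.data.DataLoader refuses any explicit sampler when the dataset is iterable-style. The toy dataset below is just a stand-in for the TFDS-backed parser, not anything from timm:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, SequentialSampler

# Toy iterable-style dataset, standing in for timm's TFDS-backed parser.
class ToyStream(IterableDataset):
    def __iter__(self):
        return iter(range(8))

ds = ToyStream()

# Iterating without a sampler works: each worker just consumes its stream.
print(list(DataLoader(ds, batch_size=4)))

# Passing any sampler (RASampler, DistributedSampler, ...) raises a
# ValueError: DataLoader with IterableDataset: expected unspecified sampler option ...
try:
    DataLoader(ds, batch_size=4, sampler=SequentialSampler(range(8)))
except ValueError as e:
    print(e)
```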
-
@zeyuwang615 you can specify a different batch size for validation, and the bits_and_tpu train script will automatically lower the number of workers for the validation dataset to reduce issues caused by high parallelism on a small val set. You should really only run into problems if you have a really small validation set.

You cannot use samplers with iterable datasets, that's a limitation of the approach; each dataloader worker is completely independent for iterable datasets. So yes, repeat aug doesn't work. I tried doing repeat aug at a local (per worker) level but it didn't have the same impact (repeating within the same batch, since each worker generates its own batches, vs repeating across batches on the different distributed processes).

Your crash is likely due to not enough system memory (or possibly you are running in a docker container without enabling the shmem flag?). You can try using tcmalloc via LD_PRELOAD if it's the overall system memory, or you can reduce the size of the buffer/shuffle constants a bit. The TFDS dataset is very memory hungry; I always use LD_PRELOAD.
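For anyone curious, here is a rough sketch of what "repeat aug at a local (per worker) level" can look like, written as a generic wrapper rather than anything from the timm codebase; the class name and repeat count are made up for illustration:

```python
from torch.utils.data import IterableDataset

class LocalRepeatAug(IterableDataset):
    """Yield each sample `repeats` times so a single worker's batches contain
    repeated-augmentation copies (the transform is applied per yield).

    Unlike RASampler, the repeats land inside one worker's stream instead of
    being spread across distributed processes, which is why the effect on
    training is not the same.
    """

    def __init__(self, dataset, transform, repeats=3):
        self.dataset = dataset      # any iterable dataset yielding (sample, target)
        self.transform = transform  # augmentation applied independently per repeat
        self.repeats = repeats

    def __iter__(self):
        for sample, target in self.dataset:
            for _ in range(self.repeats):
                yield self.transform(sample), target
```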
-
Well, I am quite confused about how you access the TFRecord data on Google Cloud Storage. The dataset handling in bits_and_tpu is complex to me.