-
Hi, I am trying to adapt main_finetune.py and engine.py from the MAE codebase to run on TPU with pytorch-xla. I followed the official pytorch-xla tutorial and made the three changes it describes.
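For reference, a minimal single-process sketch of the standard edits the pytorch-xla tutorial walks through (XLA device, device loader, xm.optimizer_step); the toy model, data, and optimizer below are placeholders, not the actual MAE fine-tuning code:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

# toy stand-ins for the real model / dataloader
model = nn.Linear(16, 10)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(8, 16), torch.randint(0, 10, (8,))) for _ in range(4)]

device = xm.xla_device()              # run on the XLA (TPU) device
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# wrap the loader so batches are asynchronously sent to the device
loader = pl.MpDeviceLoader(data, device)

for samples, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(samples), targets)
    loss.backward()
    # reduces gradients across cores and steps the optimizer
    xm.optimizer_step(optimizer)
```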
Besides, I have also managed to merge the parser_tfds-related code to enable reading data from a GCS bucket (thanks, Ross!). I thought this would be easy, as the modification is not that complicated. The problem is that as training goes on, my program eats more and more CPU memory and eventually crashes. This is especially serious for larger models like ViT-Large.

Another problem is that the first few steps take an extremely long time to finish in my code; the larger the batch size, the longer they take. Only after several epochs does the per-epoch training time shrink to a stable value. I think this has something to do with XLA graph compilation.

Interestingly, timm doesn't suffer from these two problems: both the training time and the CPU memory usage converge to stable values very quickly. I cannot tell why from the pytorch-xla documentation or from reading the code, but I assume someone must have encountered similar issues, since timm manages to avoid them.
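One hedged way to confirm whether recompilation or graph growth is behind the slow early steps is torch_xla's metrics report, e.g.:

```python
import torch_xla.debug.metrics as met

# Print after a few training steps: a "CompileTime" count that keeps growing
# step after step, or aten::* counters (ops falling back to CPU), usually
# points at recompilation or an unintentionally growing graph.
print(met.metrics_report())
```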
-
@zeyuwang615 there is a separate branch for TPU use, `bits_and_tpu`, with tested modifications .... I use a `launch_xla` helper for the multiprocess launching. When using TFDS it is extremely useful to set `LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4` to hook the malloc calls with tcmalloc, which improves the memory allocations significantly. The TFDS buffering eats up a LOT of memory; keep workers per process at around 6-8, but no more.
-
Some more tidbits in the ...
-
Another question I want to ask is: what is the correct way to implement divide_iter, i.e. read a large batch from the dataloader, split it into small batches, and forward them one at a time?

The reason I need this is that the model I am training is very sensitive to the diversity of the samples used for mixup/cutmix: if the batch size is too small, its performance degrades. However, if I simply create a large batch, the total batch size becomes too large and I run into a new training instability issue. Maybe an advanced optimizer like LARS could fix it, but I want to keep things simple and don't want to spend more time finding a new training recipe. So the workaround I came up with is to read a large batch, so it is diverse enough, but only use a subset of it to train the model each time.

Here is the code I have used to replace these lines: `for divide_step_idx, (divide_sample, divide_target) in enumerate(`
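A rough sketch of what such a divide_iter loop could look like — the divide_iter value, toy model, mixup settings, and the choice of one optimizer step per sub-batch are all assumptions for illustration, not the original code:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

device = xm.xla_device()
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)  # toy stand-in for the ViT
criterion = SoftTargetCrossEntropy()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=10)

divide_iter = 4  # assumption: split each large batch into 4 sub-batches
# one toy "large" batch of 32 images; in practice this comes from the dataloader
loader = [(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,)))]

for data_iter_step, (samples, targets) in enumerate(loader):
    samples, targets = samples.to(device), targets.to(device)
    # mixup/cutmix over the full (large, diverse) batch, before splitting
    samples, targets = mixup_fn(samples, targets)

    # forward the sub-batches one at a time, stepping the optimizer on each
    for divide_step_idx, (divide_sample, divide_target) in enumerate(
            zip(samples.chunk(divide_iter), targets.chunk(divide_iter))):
        optimizer.zero_grad()
        loss = criterion(model(divide_sample), divide_target)
        loss.backward()
        xm.optimizer_step(optimizer)  # plain optimizer.step() off-TPU
```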