-
Hi, I am trying to adapt main_finetune.py and engine.py from the MAE codebase to run on TPU with pytorch-xla. I followed the official pytorch-xla tutorial and made the three changes it describes.
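For reference, a minimal single-process sketch of the standard edits the pytorch-xla tutorial walks through (XLA device, device loader, xm.optimizer_step); the toy model, data, and optimizer below are placeholders, not the actual MAE fine-tuning code:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

# toy stand-ins for the real model / dataloader
model = nn.Linear(16, 10)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(8, 16), torch.randint(0, 10, (8,))) for _ in range(4)]

device = xm.xla_device()              # run on the XLA (TPU) device
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# wrap the loader so batches are asynchronously sent to the device
loader = pl.MpDeviceLoader(data, device)

for samples, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(samples), targets)
    loss.backward()
    # reduces gradients across cores and steps the optimizer
    xm.optimizer_step(optimizer)
```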
Besides, I have also managed to merge the parser_tfds-related code to enable reading data from a GCS bucket (thanks, Ross!). I thought this would be easy, as the modification is not that complicated. The problem is that as training goes on, my program eats more and more CPU memory and eventually crashes. This is especially serious for larger models like ViT-Large.

Another problem is that the first few steps take an extremely long time to finish in my code; the larger the batch size, the longer they take. Only after several epochs does the per-epoch training time shrink to a stable value. I think this has something to do with XLA graph compilation.

Interestingly, timm doesn't suffer from these two problems: both the training time and the CPU memory usage converge to stable values very quickly. I cannot tell why from the pytorch-xla documentation or from reading the code, but I assume someone must have encountered similar issues, since timm manages to avoid them.
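One hedged way to confirm whether recompilation or graph growth is behind the slow early steps is torch_xla's metrics report, e.g.:

```python
import torch_xla.debug.metrics as met

# Print after a few training steps: a "CompileTime" count that keeps growing
# step after step, or aten::* counters (ops falling back to CPU), usually
# points at recompilation or an unintentionally growing graph.
print(met.metrics_report())
```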
-
@zeyuwang615 there is a separate branch for TPU use, `bits_and_tpu`, with tested modifications .... I use a `launch_xla` helper for the multiprocess launching. When using TFDS it is extremely useful to set `LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4` to hook the malloc calls with tcmalloc, which improves the memory allocations significantly. The TFDS buffering eats up a LOT of memory; keep workers per process at around 6-8, but no more.
-
Some more tidbits in the ...
-
Another question I want to ask is: what is the correct way to implement divide_iter, i.e. read a large batch from the dataloader, split it into small batches, and forward them one at a time?

The reason I need this is that the model I am training is very sensitive to the diversity of the samples used for mixup/cutmix: if the batch size is too small, its performance degrades. However, if I simply create a large batch, the total batch size becomes too large and I run into a new training instability issue. Maybe an advanced optimizer like LARS could fix it, but I want to keep things simple and don't want to spend more time finding a new training recipe. So the workaround I came up with is to read a large batch, so it is diverse enough, but only use a subset of it to train the model each time.

Here is the code I have used to replace these lines: `for divide_step_idx, (divide_sample, divide_target) in enumerate(`
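A rough sketch of what such a divide_iter loop could look like — the divide_iter value, toy model, mixup settings, and the choice of one optimizer step per sub-batch are all assumptions for illustration, not the original code:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

device = xm.xla_device()
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)  # toy stand-in for the ViT
criterion = SoftTargetCrossEntropy()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=10)

divide_iter = 4  # assumption: split each large batch into 4 sub-batches
# one toy "large" batch of 32 images; in practice this comes from the dataloader
loader = [(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,)))]

for data_iter_step, (samples, targets) in enumerate(loader):
    samples, targets = samples.to(device), targets.to(device)
    # mixup/cutmix over the full (large, diverse) batch, before splitting
    samples, targets = mixup_fn(samples, targets)

    # forward the sub-batches one at a time, stepping the optimizer on each
    for divide_step_idx, (divide_sample, divide_target) in enumerate(
            zip(samples.chunk(divide_iter), targets.chunk(divide_iter))):
        optimizer.zero_grad()
        loss = criterion(model(divide_sample), divide_target)
        loss.backward()
        xm.optimizer_step(optimizer)  # plain optimizer.step() off-TPU
```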