GPU memory usage going up every epoch #1196
-
I am trying to figure out why the GPU memory usage keeps going up after each epoch. What causes this to happen? I was under the impression that it should remain constant. These are my hyperparameters:
Replies: 1 comment
-
@AlejandroRigau With caching allocators (like the PyTorch one), that's not a reliable indicator of a leak; memory churn can increase the overall amount allocated without actually causing any issues (just more cached allocations that may get reused, and that are only reclaimed if needed). I've used these scripts for training runs on the order of months, so I'm fairly certain there are no issues. If you're only using 70-something% at the start, try pushing your batch size up so that you use 90-95%; going all the way to the limit can result in an OOM when you transition between train-eval-train.
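As a side note, here is a minimal sketch (assuming a standard PyTorch setup; the training-loop calls are hypothetical) of how to tell allocator caching apart from a real leak: `torch.cuda.memory_allocated()` reports bytes held by live tensors, while `torch.cuda.memory_reserved()` also includes the allocator's cache, so comparing the two across epochs shows whether the growth is just cached memory.

```python
import torch

def log_cuda_memory(tag: str) -> None:
    # memory_allocated(): bytes currently occupied by live tensors
    # memory_reserved(): bytes held by the caching allocator, including its cache
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")

# Hypothetical usage inside a training loop:
# for epoch in range(num_epochs):
#     train_one_epoch(model, loader)
#     log_cuda_memory(f"epoch {epoch}")
#
# If 'allocated' stays roughly flat while 'reserved' grows, the increase is
# allocator caching rather than a leak.
```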