Efficientnet_b0 overfitting using "known good hparams" #1159

dimitry12 · 2022-02-25T21:46:42Z

I've been trying to train tf_efficientnetv2_b0 and have been running into the problem of eval (non-EMA) loss going up while training loss goes down.

I decided to try efficientnet_b0 (because it has known good hparams) and am observing very similar behavior.

I suspect I am doing something wrong and that it's my (user) error, but I fail to see what. This is the first time I am using timm.

Here are my hparams (I'm on this commit):

        python train.py /mnt/ssd_dataset/imagenet/ILSVRC/Data/CLS-LOC \
        --workers 24 \
        --num-classes 1000 \
        --experiment efficientnetv1b0a \
        --start-epoch 0 \
        --log-wandb \
        --log-interval 10 \
        --model efficientnet_b0 \
        --input-size 3 224 224 \
        --batch-size 384 \
        --lr .024 \
        --warmup-lr 0.0032 \
        --warmup-epochs 3 \
        --sched step \
        --epochs 87 \
        --decay-epochs 2.4 \
        --decay-rate .97 \
        --opt rmsproptf \
        --opt-eps .001 \
        --weight-decay 1e-5 \
        --drop 0.2 \
        --drop-path 0.2 \
        --model-ema \
        --model-ema-decay 0.9999 \
        --aa rand-m9-mstd0.5 \
        --mixup 0.0 \
        --remode pixel \
        --reprob 0.2 \
        --native-amp
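For orientation, the learning-rate flags above (`--lr`, `--warmup-lr`, `--warmup-epochs`, `--decay-epochs`, `--decay-rate`) describe a linear warmup followed by a staircase decay. The sketch below is a simplified per-epoch model of that schedule, not timm's scheduler code. The function name `step_lr` is hypothetical, and counting decay steps from epoch 0 rather than from the end of warmup is an assumption.

```python
import math


def step_lr(epoch, base_lr=0.024, warmup_lr=0.0032, warmup_epochs=3,
            decay_epochs=2.4, decay_rate=0.97):
    """Simplified per-epoch learning rate: linear warmup, then staircase decay.

    Defaults mirror the flags in the command above. This is an illustrative
    model only; the real scheduler may differ in details.
    """
    if epoch < warmup_epochs:
        # Linear ramp from warmup_lr towards base_lr over the warmup epochs.
        return warmup_lr + epoch * (base_lr - warmup_lr) / warmup_epochs
    # Multiply by decay_rate once for every full decay_epochs interval elapsed.
    return base_lr * decay_rate ** math.floor(epoch / decay_epochs)


for epoch in (0, 1, 2, 3, 10, 86):
    print(f"epoch {epoch:2d}: lr = {step_lr(epoch):.5f}")
```

Under this model the learning rate is still ramping up during the first three epochs, so those epochs say little about how the run will behave at the full rate.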

And here is what I am seeing:

dimitry12 (Author) · 2022-02-25T22:15:12Z

I wonder if I'm being very impatient:

with tf_efficientnetv2_b0 I've waited for 10 epochs with different hparams - seeing eval (non-EMA) and train losses moving in different directions;
with efficientnet_b0 I'm now past 7 epochs with eval (non-EMA) loss staying up and far from train loss.

Looking at the summary of the efficientnet_b2 run in #45 (comment), it seems that I should be seeing improvement in eval loss and metrics by now.

@michaelklachko can you recall what the training curves looked like when you trained efficientnet_b0?

1 reply

michaelklachko Feb 25, 2022

@dimitry12 I don't remember much about how I trained it, but looking at your plots it does appear somewhat sluggish. Try removing warmup and see how it goes. Losses should go down right away.

rwightman (Maintainer) · 2022-02-26T01:25:13Z

@dimitry12 I don't see anything obviously wrong with the hparams. There is a significant lag before EMA results start getting decent (and they can go the wrong direction for a while), so the non-EMA numbers are more important to look at early on, and they don't appear good.
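A rough way to see where the EMA lag comes from is to estimate how much of the averaged model still consists of the initial weights after a few epochs. The sketch below is back-of-the-envelope arithmetic. It assumes a single training process, one EMA update per optimizer step, and the standard ImageNet-1k training set of 1,281,167 images. The helper name `init_weight_share` is hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k training set size
BATCH_SIZE = 384                   # --batch-size from the command above
EMA_DECAY = 0.9999                 # --model-ema-decay from the command above

# Assumption: one process, so the global batch equals --batch-size.
steps_per_epoch = IMAGENET_TRAIN_IMAGES // BATCH_SIZE


def init_weight_share(epochs, decay=EMA_DECAY, steps=steps_per_epoch):
    """Fraction of the EMA that is still the initial weights.

    With ema = decay * ema + (1 - decay) * weights applied every step,
    the initial weights keep a factor of decay ** n after n updates.
    """
    return decay ** (epochs * steps)


# Rule of thumb: the EMA averages over roughly 1 / (1 - decay) recent steps.
window_steps = 1 / (1 - EMA_DECAY)
print(f"steps per epoch: {steps_per_epoch}")
print(f"EMA window: ~{window_steps:.0f} steps "
      f"(~{window_steps / steps_per_epoch:.1f} epochs)")
for epochs in (1, 3, 7, 10):
    print(f"after {epochs:2d} epochs: "
          f"{init_weight_share(epochs):.1%} initial weights")
```

Under these assumptions the averaging window spans about three epochs, which is consistent with EMA metrics trailing the non-EMA ones early in training.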

You might want to check your dataset setup. CLS-LOC looks like it might be the Kaggle version? Some ImageNet data layouts are a bit odd. timm expects one folder per class, with class order given by the lexical sort of the nXXXXXXXX WordNet ID. That holds for validation too: validation is often flat, so you need to turn it into folders. There should be 1000 class folders in both the /train and /val (/validation works too) folders.
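The layout described above can be sanity-checked before training. The following is a minimal sketch, not part of timm. The helper names `class_folders` and `check_layout` are hypothetical, and the split names `train` and `val` are assumed.

```python
from pathlib import Path


def class_folders(split_dir):
    """Return the lexically sorted sub-folder names of a split directory."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


def check_layout(root, splits=("train", "val")):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    reference = None
    for split in splits:
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            problems.append(f"{split}: directory missing")
            continue
        # Files sitting directly in the split folder suggest a flat layout.
        loose = [p.name for p in split_dir.iterdir() if p.is_file()]
        if loose:
            problems.append(f"{split}: {len(loose)} files not in a class folder")
        classes = class_folders(split_dir)
        if reference is None:
            reference = classes
        elif classes != reference:
            problems.append(f"{split}: class folders differ from the first split")
    return problems


def class_to_index(split_dir):
    """Map each class folder to its index under lexical sort order."""
    return {name: i for i, name in enumerate(class_folders(split_dir))}
```

Calling `check_layout` on the dataset root returns an empty list when every split has the same class folders and no loose image files. A flat validation folder shows up as loose files plus a class-folder mismatch.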

1 reply

dimitry12 Feb 26, 2022
Author

Thank you @rwightman! You are right, I did download the ImageNet dataset from Kaggle and yes, /val was flat.

I fixed the dataset using this script and it solved the problem:
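The linked script is not reproduced in this thread. As a rough illustration only, a fix of this kind can be sketched as below. It assumes a labels CSV with `ImageId` and `PredictionString` columns where the first token of `PredictionString` is the WordNet ID, and images named `<ImageId>.JPEG`. The function name `folderize_val` and its parameters are hypothetical, so check them against your own download before running anything.

```python
import csv
import shutil
from pathlib import Path


def folderize_val(val_dir, labels_csv, ext=".JPEG"):
    """Move flat validation images into one sub-folder per WordNet ID.

    Assumes `labels_csv` has 'ImageId' and 'PredictionString' columns and
    that the first token of 'PredictionString' is the class WordNet ID.
    Returns the number of files moved.
    """
    val_dir = Path(val_dir)
    moved = 0
    with open(labels_csv, newline="") as f:
        for row in csv.DictReader(f):
            wnid = row["PredictionString"].split()[0]
            src = val_dir / (row["ImageId"] + ext)
            if not src.is_file():
                continue  # already moved, or not present in this folder
            dst_dir = val_dir / wnid
            dst_dir.mkdir(exist_ok=True)
            shutil.move(str(src), str(dst_dir / src.name))
            moved += 1
    return moved
```

Running it a second time moves nothing, because images already inside class folders are skipped.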
