Replies: 1 comment
-
@mayukh18 have you tried taking the EMA out? And are you sure it's not just the delayed EMA? Also, are you sure you haven't forced bfloat16 mode? That doesn't work well... otherwise the hparams look reasonable.
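For what it's worth, a minimal sketch of that suggestion against the command quoted in the question below: drop the EMA flags and make sure PyTorch/XLA isn't being forced into bfloat16 via its `XLA_USE_BF16` environment variable (all other flags are copied unchanged from the original command):

```
# Make sure bfloat16 is not being forced; PyTorch/XLA maps float32 to
# bfloat16 when XLA_USE_BF16=1 is set in the environment.
unset XLA_USE_BF16

# Same hyperparameters as in the question below, minus --model-ema /
# --model-ema-decay, to rule the EMA out entirely.
python3 launch_xla.py --num-devices 8 train.py /imagenet/path/ \
  --model vit_base_patch16_224 --opt adamw --opt-eps 1e-6 --clip-grad 1.0 \
  --drop-path 0.2 --mixup 0.8 --cutmix 1.0 --aa rand-m6-n4-mstd1.0-inc1 \
  --weight-decay .08 --sched cosine -j 4 --warmup-lr 1e-6 \
  --warmup-epochs 10 --epochs 100 --lr 5e-4 -b 128
```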
-
I have been trying to train ViT Base on ImageNet-1k on a TPU v3-8. Somehow my model's top-1 accuracy falls to ~0.1 after 20 epochs and doesn't improve anymore. It reaches a peak of 0.2 to 0.3 in those early epochs, then falls off and stays mostly constant. I am not sure if this is some kind of overfitting, or whether there is something I am doing wrong.
I have closely followed the README in the `bits_and_tpu` branch and also tried different variations of the hyperparameters. Below is roughly the median of the hyperparameters I tried:

```
python3 launch_xla.py --num-devices 8 train.py /imagenet/path/ --model vit_base_patch16_224 --opt adamw --opt-eps 1e-6 --clip-grad 1.0 --drop-path 0.2 --mixup 0.8 --cutmix 1.0 --aa rand-m6-n4-mstd1.0-inc1 --weight-decay .08 --model-ema --model-ema-decay 0.999 --sched cosine -j 4 --warmup-lr 1e-6 --warmup-epochs 10 --epochs 100 --lr 5e-4 -b 128
```

It did seem that with more `warmup_epochs` the falloff was delayed further; I have tried the 5-20 range. I should also mention that I am using persistent disks, though I don't think that makes a difference.
Any help is appreciated. Thanks.
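In case it helps with debugging: one way to check whether it is the EMA weights that are collapsing (per the reply above) is to validate a saved checkpoint both with and without its EMA weights using timm's `validate.py`; the checkpoint path below is a hypothetical placeholder:

```
# Evaluate the raw training weights from a saved checkpoint
python3 validate.py /imagenet/path/ --model vit_base_patch16_224 \
  --checkpoint output/train/<run>/last.pth.tar

# Evaluate the EMA weights stored in the same checkpoint (--use-ema
# loads the EMA copy if it is present in the checkpoint file)
python3 validate.py /imagenet/path/ --model vit_base_patch16_224 \
  --checkpoint output/train/<run>/last.pth.tar --use-ema
```

If only the `--use-ema` run shows the low accuracy, the drop is in the delayed EMA copy rather than in the underlying training.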