It looks like most transformers are trained at square image sizes around 224 or 384 (and occasionally 512).
Unfortunately, I need to train mine at 1024x512.
So, a few questions:
It seems I can fine-tune ViT at the new resolution if I just specify img_size in create_model (see the sketch after these questions). However, training is very choppy and does not go as well as, say, EfficientNet. Are there any hyperparameter considerations that might affect this? Learning rate? I tried several and it didn't seem to make much difference. Also, it looked like I could fit a pretty big batch of ViT at 1024x512. Is this normal, or an indication that I did something wrong?
Is training a better-behaved transformer like BEiT or Swin from scratch feasible? My dataset is not very similar to ImageNet (it is medical imagery), but from previous experience, training definitely benefits from ImageNet pretraining. The dataset is about 70k images over 2 classes.
If it is feasible, any hints about the training setup? Learning rate, etc.?
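For reference, this is roughly what I mean by specifying img_size in create_model. It is only a minimal sketch, assuming a reasonably recent timm version (where pretrained position embeddings are interpolated when img_size differs from the checkpoint); the model name, the (height, width) layout, and num_classes=2 are placeholders for my setup, not something fixed in this question.

```python
import timm
import torch

# Sketch: fine-tune a pretrained ViT at a non-square resolution.
# Assumes timm interpolates the pretrained position embeddings to the new grid;
# 'vit_base_patch16_384' and the 2-class head are just illustrative choices.
model = timm.create_model(
    'vit_base_patch16_384',
    pretrained=True,
    img_size=(512, 1024),   # (height, width); both divisible by the 16px patch size
    num_classes=2,
)

# Quick sanity check with a dummy batch of shape (B, C, H, W).
x = torch.randn(2, 3, 512, 1024)
out = model(x)
print(out.shape)  # expected: torch.Size([2, 2])
```

With this setup I then attach my usual optimizer and loss; the choppy behaviour I describe above shows up regardless of which learning rate I pick from the few I tried.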
Best regards,
Moshe