It looks like most transformers are trained at square image sizes around 224 or 384 (and occasionally 512).
Unfortunately, I need to train mine at 1024x512.
So, a few questions:
It seems I can fine-tune ViT at the new resolution if I just specify img_size in create_model (see the sketch after these questions). However, training is very choppy and does not go as well as, say, EfficientNet. Are there any hyperparameter considerations that might affect this? Learning rate? I tried several and it didn't seem to make much difference. Also, it looked like I could fit a pretty big batch of ViT at 1024x512. Is this normal, or an indication that I did something wrong?
Is training a better-behaved transformer like BEiT or Swin from scratch feasible? My dataset is not very similar to ImageNet (it is medical imagery), but from previous experience, training definitely benefits from ImageNet pretraining. The dataset is about 70k images over 2 classes.
If it is feasible, any hints about the training setup? Learning rate, etc.?
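For reference, this is roughly what I mean by specifying img_size in create_model. It is only a minimal sketch, assuming a reasonably recent timm version (where pretrained position embeddings are interpolated when img_size differs from the checkpoint); the model name, the (height, width) layout, and num_classes=2 are placeholders for my setup, not something fixed in this question.

```python
import timm
import torch

# Sketch: fine-tune a pretrained ViT at a non-square resolution.
# Assumes timm interpolates the pretrained position embeddings to the new grid;
# 'vit_base_patch16_384' and the 2-class head are just illustrative choices.
model = timm.create_model(
    'vit_base_patch16_384',
    pretrained=True,
    img_size=(512, 1024),   # (height, width); both divisible by the 16px patch size
    num_classes=2,
)

# Quick sanity check with a dummy batch of shape (B, C, H, W).
x = torch.randn(2, 3, 512, 1024)
out = model(x)
print(out.shape)  # expected: torch.Size([2, 2])
```

With this setup I then attach my usual optimizer and loss; the choppy behaviour I describe above shows up regardless of which learning rate I pick from the few I tried.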
Best regards,
Moshe