Hi again, I've been looking into compressing some of the models in the repo. For this, one usually needs the augmentation hyper-parameters used for training or fine-tuning the model (so the input pipeline matches), and the LR schedule (since fine-tuning usually starts at a slightly higher LR). I was wondering if these could be shared somewhere.
Many thanks,
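For context, the closest thing I've found so far is the eval-time data config that timm stores alongside each pretrained model. A minimal sketch (the model name is just an example); this only covers input size / interpolation / normalization, not the training-time augmentation or LR schedule I'm asking about:

```python
# Sketch: recover the eval-time data config timm keeps with a pretrained model.
import timm
from timm.data import resolve_data_config, create_transform

model = timm.create_model('resnet50', pretrained=True)  # example model only
config = resolve_data_config({}, model=model)            # input_size, mean, std, crop_pct, ...
transform = create_transform(**config)                   # matching eval transform
print(config)
```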
I'm not sure when/if I'll end up reporting all pretraining hparams consistently. It's extra overhead, and more importantly there will likely be a big compat break at some point when I change the config system for the future timm bits code. There will be a dump of quite a few recent hparam sets, covering several different strategies and model-specific procedures involving a few different optimizer + LR schedule combos. This will be timed with an upcoming paper.
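For illustration only, a minimal sketch of what one such optimizer + LR schedule combo looks like using timm's factories; the values here are placeholders, not the hparams used for any released weights:

```python
# Sketch: one optimizer + LR schedule combo built with timm's factory helpers.
import timm
from timm.optim import create_optimizer_v2
from timm.scheduler import CosineLRScheduler

model = timm.create_model('resnet50')  # example model only
optimizer = create_optimizer_v2(model, opt='adamw', lr=1e-3, weight_decay=0.05)
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=300,       # total epochs (placeholder)
    lr_min=1e-5,
    warmup_t=5,          # warmup epochs (placeholder)
    warmup_lr_init=1e-6,
)

for epoch in range(300):
    # ... train one epoch ...
    scheduler.step(epoch + 1)  # advance the schedule at the end of each epoch
```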
There is research suggesting a relationship between (pre)training hyper-params, augmentations, etc. and how well the weights transfer to various tasks. Usually the fine-tuning aug + reg are held constant in these analyses though. One of the most extensive sets of experiments here was for the ViT architectures in the "How to train your ViT?" paper that I was involved with. There is a big spreadsheet with ~50k transfer weights (https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/index.csv) with the hparams for each pre-training weight listed, in1k vs in21k for pretraining, and an LR sweep for each transferred weight (but aug was low and fixed for transfer). I'd say it's a fairly…
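For anyone who wants to poke at that spreadsheet, a rough sketch follows. The direct storage.googleapis.com URL is inferred from the console link above, and the column names in the commented-out filter are assumptions, so inspect df.columns first:

```python
# Sketch: browse the AugReg index CSV with pandas.
import pandas as pd

url = 'https://storage.googleapis.com/vit_models/augreg/index.csv'  # assumed public URL for the bucket above
df = pd.read_csv(url)
print(df.columns)  # check the real column names before filtering
# Hypothetical filter, column names are guesses:
# subset = df[(df['name'] == 'B/16') & (df['ds'] == 'i21k')]
```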