Replies: 2 comments
-
Hi @vedantroy, by using:

```python
def add_weight_decay(model, weight_decay=1e-5, skip_list=()):
    decay = []
    no_decay = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {'params': no_decay, 'weight_decay': 0.},
        {'params': decay, 'weight_decay': weight_decay}]
```

Thank you,
hankyul
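For completeness, a minimal usage sketch (the model choice and hyper-parameter values below are placeholders, not from the original post): the returned groups are passed straight to the optimizer, and each group's `'weight_decay'` entry overrides the optimizer's own default.

```python
import timm
import torch

# placeholder model / values, purely for illustration
model = timm.create_model('resnet50', pretrained=True)
param_groups = add_weight_decay(model, weight_decay=0.05)

# biases and 1-d (norm) params end up in the weight_decay=0. group
optimizer = torch.optim.AdamW(param_groups, lr=1e-3)
```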
-
@vedantroy I feel @hankyul2's reply should answer your question; the optim factory specifically does exactly that by default (it splits params into two groups, those with and without weight decay, based on the shape of the param or on it existing in the dict returned by the model's `no_weight_decay()` method).
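Roughly, from the user side that default path looks like this (a sketch; the model name and hyper-parameter values are placeholders):

```python
import timm
from timm.optim import create_optimizer_v2

model = timm.create_model('vit_base_patch16_224', pretrained=True)

# with a non-zero weight_decay, the factory filters biases / 1-d params
# (and anything the model reports via no_weight_decay()) into a
# weight_decay=0 group before building the optimizer
optimizer = create_optimizer_v2(model, opt='adamw', lr=1e-3, weight_decay=0.05)
```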
Also, 0.6.x (recent master) has additional functionality for fine-tuning, namely layer-wise learning-rate decay. This uses per-model metadata to logically group parameters (roughly by stem / block / head) and applies an exponential decay (via an `lr_scale` in the param groups) to those groupings as they move away from the head... it is used in MAE / BEiT and related fine-tuning to weight the application of the LR towards the head.
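A minimal sketch of that idea, not timm's actual implementation (the `get_layer_id` callable and all values here are assumptions): parameters are bucketed by depth and each bucket gets an `lr_scale` that shrinks exponentially with distance from the head.

```python
def param_groups_layer_decay(model, get_layer_id, num_layers,
                             layer_decay=0.75, base_lr=1e-3, weight_decay=0.05):
    """Sketch of layer-wise LR decay via param groups.

    `get_layer_id(name)` is an assumed callable mapping a parameter name to a
    depth in [0, num_layers] (stem ~ 0, head ~ num_layers); real implementations
    derive this from per-model metadata.
    """
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        layer_id = get_layer_id(name)
        scale = layer_decay ** (num_layers - layer_id)  # exponential decay away from the head
        wd = 0. if param.ndim <= 1 or name.endswith('.bias') else weight_decay
        key = (layer_id, wd)
        if key not in groups:
            groups[key] = {'params': [], 'lr': base_lr * scale,
                           'lr_scale': scale, 'weight_decay': wd}
        groups[key]['params'].append(param)
    return list(groups.values())
```

The resulting groups are again just passed to the optimizer constructor, the same way as above.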
-
I'm a bit confused about the following snippet from: https://github.com/facebookresearch/mae
Specifically:
https://github.com/facebookresearch/mae/blob/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py#L179
By default, `AdamW` will specify `weight_decay=1e-2`. How does this interact with timm's optim factory?
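For reference, a minimal sketch of how PyTorch combines the two settings (the param shapes and values below are made up for illustration): a group-level `'weight_decay'` entry takes precedence, and the constructor default only fills in groups that omit the key.

```python
import torch

# groups shaped like the ones add_weight_decay / the optim factory produce
param_groups = [
    {'params': [torch.nn.Parameter(torch.zeros(10))], 'weight_decay': 0.},        # biases / norms
    {'params': [torch.nn.Parameter(torch.zeros(10, 10))], 'weight_decay': 0.05},  # everything else
]

# AdamW's constructor-level weight_decay (default 1e-2) is only applied to
# groups that do not set their own value, so it is unused here
optimizer = torch.optim.AdamW(param_groups, lr=1.5e-4)
print([g['weight_decay'] for g in optimizer.param_groups])  # -> [0.0, 0.05]
```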