Replies: 2 comments
-
Hi @vedantroy, by using:

```python
def add_weight_decay(model, weight_decay=1e-5, skip_list=()):
    decay = []
    no_decay = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {'params': no_decay, 'weight_decay': 0.},
        {'params': decay, 'weight_decay': weight_decay}]
```

Thank you,
hankyul
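For completeness, a minimal usage sketch (the model choice and hyper-parameter values below are placeholders, not from the original post): the returned groups are passed straight to the optimizer, and each group's `'weight_decay'` entry overrides the optimizer's own default.

```python
import timm
import torch

# placeholder model / values, purely for illustration
model = timm.create_model('resnet50', pretrained=True)
param_groups = add_weight_decay(model, weight_decay=0.05)

# biases and 1-d (norm) params end up in the weight_decay=0. group
optimizer = torch.optim.AdamW(param_groups, lr=1e-3)
```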
-
@vedantroy I feel @hankyul2's reply should answer your question; the optim factory specifically does exactly that by default (it splits params into two groups, those with and without weight decay, based on the shape of the param or on it existing in the dict returned by the model's `no_weight_decay()` method).
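Roughly, from the user side that default path looks like this (a sketch; the model name and hyper-parameter values are placeholders):

```python
import timm
from timm.optim import create_optimizer_v2

model = timm.create_model('vit_base_patch16_224', pretrained=True)

# with a non-zero weight_decay, the factory filters biases / 1-d params
# (and anything the model reports via no_weight_decay()) into a
# weight_decay=0 group before building the optimizer
optimizer = create_optimizer_v2(model, opt='adamw', lr=1e-3, weight_decay=0.05)
```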
Also, 0.6.x (recent master) has additional functionality for fine-tuning, namely layer-wise learning-rate decay. This uses per-model metadata to logically group parameters (roughly by stem / block / head) and applies an exponential decay (via an `lr_scale` in the param groups) to those groupings as they move away from the head... it is used in MAE / BEiT and related fine-tuning to weight the application of the LR towards the head.
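A minimal sketch of that idea, not timm's actual implementation (the `get_layer_id` callable and all values here are assumptions): parameters are bucketed by depth and each bucket gets an `lr_scale` that shrinks exponentially with distance from the head.

```python
def param_groups_layer_decay(model, get_layer_id, num_layers,
                             layer_decay=0.75, base_lr=1e-3, weight_decay=0.05):
    """Sketch of layer-wise LR decay via param groups.

    `get_layer_id(name)` is an assumed callable mapping a parameter name to a
    depth in [0, num_layers] (stem ~ 0, head ~ num_layers); real implementations
    derive this from per-model metadata.
    """
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        layer_id = get_layer_id(name)
        scale = layer_decay ** (num_layers - layer_id)  # exponential decay away from the head
        wd = 0. if param.ndim <= 1 or name.endswith('.bias') else weight_decay
        key = (layer_id, wd)
        if key not in groups:
            groups[key] = {'params': [], 'lr': base_lr * scale,
                           'lr_scale': scale, 'weight_decay': wd}
        groups[key]['params'].append(param)
    return list(groups.values())
```

The resulting groups are again just passed to the optimizer constructor, the same way as above.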
-
I'm a bit confused about the following snippet from: https://github.com/facebookresearch/mae
Specifically:
https://github.com/facebookresearch/mae/blob/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py#L179
By default, `AdamW` will specify `weight_decay=1e-2`. How does this interact with timm's optim factory?
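For reference, a minimal sketch of how PyTorch combines the two settings (the param shapes and values below are made up for illustration): a group-level `'weight_decay'` entry takes precedence, and the constructor default only fills in groups that omit the key.

```python
import torch

# groups shaped like the ones add_weight_decay / the optim factory produce
param_groups = [
    {'params': [torch.nn.Parameter(torch.zeros(10))], 'weight_decay': 0.},        # biases / norms
    {'params': [torch.nn.Parameter(torch.zeros(10, 10))], 'weight_decay': 0.05},  # everything else
]

# AdamW's constructor-level weight_decay (default 1e-2) is only applied to
# groups that do not set their own value, so it is unused here
optimizer = torch.optim.AdamW(param_groups, lr=1.5e-4)
print([g['weight_decay'] for g in optimizer.param_groups])  # -> [0.0, 0.05]
```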