For which params do we not want weight decay? #894
-
@alexander-soare I feel it's pretty standard. That condition is actually a bit redundant: way back I started with just bias and then realized it was supposed to cover other 1d weights as well (there are references, I don't have them handy). See the Flax example as a minimal case: https://github.com/google/flax/blob/main/examples/imagenet/train.py#L122-L126 There may be situations where this is incorrect, but it is far more correct than the default (not doing this) and produces better results in the majority of situations. I believe the best way to improve this would be to make that weight decay block a fn that can be overridden as an arg... I also have a note to do this for learning rate, so that both learning rate and weight decay can be applied per param based on name/shape to get a set of param groups to pass to the opt...
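A minimal sketch of that idea (hypothetical names, not timm's actual API), assuming the exclusion rule is exposed as a plain callable whose output is a set of param groups handed to the optimizer:

```python
import torch


def default_no_decay(name, param):
    # Skip weight decay for biases and any other 1d params (norm scales/shifts, etc.)
    return param.ndim <= 1 or name.endswith(".bias")


def make_param_groups(model, weight_decay=1e-5, no_decay_fn=default_no_decay):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen params
        (no_decay if no_decay_fn(name, param) else decay).append(param)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]


# Usage: pass the groups to the optimizer instead of model.parameters().
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.LayerNorm(8))
optimizer = torch.optim.AdamW(make_param_groups(model, weight_decay=0.05), lr=1e-3)
```

Overriding `no_decay_fn` (or an equivalent hook for learning rate) is what would let the grouping be customised per param based on name/shape.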
-
`create_optimizer_v2` has a kwarg controlling this. It seems fairly intuitive that for some of these params we wouldn't want weight decay, but I'm wondering why it's clear that we'd do this for the broad case of all 1d params. And further to that, shouldn't we be careful, given that this is somewhat dependent on how PyTorch or the author of a custom module decides to represent params?
While I'm at it, any idea why we don't consider bias as 1d here? https://github.com/rwightman/pytorch-image-models/blob/3f9959cdd28cb959980abf81fc4cf34f32e18399/timm/optim/optim_factory.py#L37
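For reference, the check at that line is roughly of this form (paraphrased, not an exact copy of the linked code); the explicit `.bias` clause is what the reply above calls redundant, since biases are themselves 1d:

```python
def no_decay(name, param, skip_list=()):
    # 1d params, explicit ".bias" names, and skip-listed names get no weight decay.
    return len(param.shape) == 1 or name.endswith(".bias") or name in skip_list
```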