Centred RMSProp #51
Conversation
I have a bit of a hard time believing the "no tuning required" part... we could always make a kwarg constructor if it gets annoying. |
Maybe a keyword is best; it's different from the others, and 4 positional arguments is a lot. Probably FluxML/Flux.jl#1778 should match this. Needs a few words before merging. |
Why is FluxML/Flux.jl#1778 using …? Besides this point, is there anything else holding this from merge? |
All that holds this back is needing a sentence or two saying what this option actually does. |
Honestly, I could write it. I'd probably also change how the rule is implemented, since it should look very similar to AdaBelief. I think it's easier if I make another PR? |
Note the package name, with an "s" --- it does not follow US spellings, although the keyword here accepts both. Maybe make suggestions if you have ideas for what to change. |
Besides matters of taste on naming things, this scatter of suggestions makes some logic changes. Can you please write these (and these alone) clearly in one place with an explanation of what & why? Before/after. What source is this following? Etc. Is there a paper with clear formulas? Make it easy for when someone has bandwidth to look closely. |
OK, I agree; that's why I suggested a separate PR. The naming is of course not essential, but the logic change amounts to just this: it now carries an estimate of the variance of the gradient directly, instead of carrying the second moment and then subtracting the squared mean. As far as I understand, that's how it's implemented in Jax: https://github.com/deepmind/optax/blob/b4aa6657bbf79985279dea76eaf6d53b25d7e8d9/optax/_src/transform.py#L247. I can make a separate PR, since I think that'll make it easier to compare. |
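(For concreteness, a minimal sketch of the two accumulations being discussed; the names `centred_step_a`, `centred_step_b`, `quad`, `lin`, `var` are made up for illustration and are not the PR's actual code.)

```julia
# Sketch only: dx is the minibatch gradient; ρ, η, ϵ are decay, learning rate, small constant.

# (a) Difference of squares: track E[g²] and E[g], subtract at update time.
function centred_step_a(dx, quad, lin; ρ = 0.9, η = 0.001, ϵ = 1e-8)
    quad = ρ .* quad .+ (1 - ρ) .* abs2.(dx)             # second-moment estimate
    lin  = ρ .* lin  .+ (1 - ρ) .* dx                    # first-moment estimate
    return η .* dx ./ sqrt.(quad .- abs2.(lin) .+ ϵ), quad, lin   # ϵ inside the sqrt
end

# (b) Variance tracked directly: accumulate the squared deviation from the running mean.
function centred_step_b(dx, var, lin; ρ = 0.9, η = 0.001, ϵ = 1e-8)
    lin = ρ .* lin .+ (1 - ρ) .* dx                      # first-moment estimate
    var = ρ .* var .+ (1 - ρ) .* abs2.(dx .- lin)        # variance estimate
    return η .* dx ./ (sqrt.(var) .+ ϵ), var, lin
end
```

Both versions carry two arrays of state per parameter; the only difference is whether the second array holds the raw second moment or the variance estimate.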
I don't think so. Centered RMSprop was introduced by http://arxiv.org/abs/1308.0850 without discussing details and without giving an implementation. There he gives the formulas (reproduced below), which coincide with the current status of this PR (notation: epsilon = minibatch gradient, n_i = gradient second moment, g_i = gradient first moment). However, in Jax they implement it by estimating the variance of the gradient directly. This is the same thing AdaBelief does (also below). So what I was proposing amounts to setting beta1 = 0 in the AdaBelief pseudo-code. You also need to put epsilon inside the square root, because the variance can get numerically negative. Sorry for the noise with all the scattered suggestions. However, I'm not sure if this is actually better in practice. |
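(For reference, the two sets of formulas referred to above, as I recall them from the cited papers; worth checking against the sources.)

```latex
% Graves (2013), arXiv:1308.0850: centred RMSProp, with \epsilon_i the minibatch gradient,
% n_i the second-moment estimate, g_i the first-moment estimate (constants as in the paper):
\begin{aligned}
n_i &\leftarrow 0.95\, n_i + 0.05\, \epsilon_i^2 \\
g_i &\leftarrow 0.95\, g_i + 0.05\, \epsilon_i \\
\Delta_i &\leftarrow 0.9\, \Delta_i - 10^{-4} \frac{\epsilon_i}{\sqrt{n_i - g_i^2 + 10^{-4}}} \\
w_i &\leftarrow w_i + \Delta_i
\end{aligned}

% AdaBelief (Zhuang et al., 2020): core recurrence without bias correction, with g_t the
% gradient, m_t its moving mean, s_t the variance estimate, \alpha the step size:
\begin{aligned}
m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
s_t &\leftarrow \beta_2 s_{t-1} + (1 - \beta_2)\,(g_t - m_t)^2 + \epsilon \\
\theta_t &\leftarrow \theta_{t-1} - \alpha\, \frac{m_t}{\sqrt{s_t} + \epsilon}
\end{aligned}
```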
Actually, looking at the Optax code more closely, I now think they do this difference-of-squares thing instead of estimating the variance directly. See https://github.com/deepmind/optax/blob/a124552d0fc9f81812cd82da0d22528b7a17a847/optax/_src/transform.py#L247. So that's probably the way to go here too. I have removed my previous suggestions. |
I re-added the suggestions for the name (…). A suggestion for the docstring (I cannot add this as a GitHub suggestion because it's not in the PR):
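(A rough sketch of what such a docstring could say; the signature shown, the keyword name `centred`, and the defaults are assumptions, not the wording actually proposed.)

```julia
"""
    RMSProp(η = 0.001, ρ = 0.9, ϵ = 1e-8; centred = false)

Optimisation rule using the RMSProp algorithm: gradients are divided by a moving
average of their recent magnitude.

If `centred = true` (`centered` is also accepted), the rule additionally keeps a moving
average of the gradient itself and subtracts its square from the moving average of the
squared gradient. The step is then divided by an estimate of the variance of the
gradient, rather than by its uncentred second moment.
"""
```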
Perhaps we can merge this? |
@mcabbott, saw you added a couple of changes. Is there anything left on the docket, or is this good to go? |
Maybe the only Q is whether you give the constructor a verb "centre this" or an adjective "make the centred version". |
PyTorch and TF both use the adjective form, let's go with that. |
Done. But what is wrong with the tests? (Locally fine now.) It's getting IRTools v0.3.3, Zygote v0.4.20, maybe because of Compat v4.1.0 |
Parallel to FluxML/Flux.jl#1778.
But if "Parameters other than learning rate generally don't need tuning", then having to type them out to get to the boolean one seems awkward. Cleaner to call it a new optimiser?