
Problem resuming training with RectifiedAdam+Lookahead (Ranger) #1911

Closed

Description

@gtg740x

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow version and how it was installed (source or binary): Tensorflow 2.1 from official Tensorflow Docker container
  • TensorFlow-Addons version and how it was installed (source or binary): 0.9.1
  • Python version: 3.6.9
  • Is GPU used? (yes/no): Yes.

Describe the bug

If I train a model using the Ranger scheme (a RectifiedAdam optimizer wrapped in a Lookahead optimizer), I cannot interrupt and resume training as usual. With the exact same code but a standard Adam optimizer, training resumes as expected; with the Ranger scheme it does not.

When we resume training, the model restores to the same accuracy it paused at. But once training steps resume, the accuracy curve drops for many steps before slowly climbing back to the trend it was following before the pause. The result is much slower and choppier convergence than a run that is never paused.

If the Ranger setup is used for an uninterrupted run, training progresses smoothly and converges to the expected accuracy in the expected number of steps.
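For context, Lookahead maintains a second, slowly updated copy of every trainable variable and synchronizes the two copies every sync_period steps. A rough sketch of that sync step (following the Lookahead paper; this is an illustration, not tfa.optimizers.Lookahead's actual code):

# Illustration of the Lookahead sync rule (Zhang et al., 2019) with
# hypothetical fast_vars/slow_vars lists; not tfa's actual implementation.
def lookahead_sync(fast_vars, slow_vars, slow_step_size=0.5):
    for fast, slow in zip(fast_vars, slow_vars):
        # Move the slow weights part of the way toward the fast weights...
        slow.assign(slow + slow_step_size * (fast - slow))
        # ...then reset the fast weights to the updated slow weights.
        fast.assign(slow)

If the slow weights (or the step counter that schedules the sync) were not restored exactly, the first post-resume sync would pull the model toward stale weights, which would look like the dip described above.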

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

# Set up a model
model = tf.keras.Model(...)

# Set up the optimizer (Ranger = RectifiedAdam wrapped in Lookahead):
optimizer = tfa.optimizers.RectifiedAdam(lr=learning_rate, total_steps=max_train_steps,
                                         warmup_proportion=0.1, min_lr=min_learning_rate)
optimizer = tfa.optimizers.Lookahead(optimizer, sync_period=6, slow_step_size=0.5)

# Create the checkpoint manager
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=2)

total_steps = 0
for i in range(max_train_steps):
    total_steps += 1
    # run a generic train step on the model using the optimizer
    train_step()
    if i % 1000 == 0:
        ckpt_save_path = ckpt_manager.save(total_steps)
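Here train_step() stands in for a generic training step. A hypothetical minimal version (not taken from the original report; train_dataset is an assumed tf.data.Dataset of (x, y) batches) might look like:

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
train_iter = iter(train_dataset)  # hypothetical dataset yielding (x, y) batches

@tf.function
def train_step():
    # One forward/backward pass through the model.
    x_batch, y_batch = next(train_iter)
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = loss_fn(y_batch, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss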

Run training and stop at a global_step > 1000 but < max_train_steps.

Then when I try to resume training from a saved model:

# Set up a model:
model = tf.keras.Model(...)

# Set up the optimizer:
optimizer = tfa.optimizers.RectifiedAdam(lr=learning_rate, total_steps=max_train_steps,
                                         warmup_proportion=0.1, min_lr=min_learning_rate)
optimizer = tfa.optimizers.Lookahead(optimizer, sync_period=6, slow_step_size=0.5)

# Restore the model:
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=2)
status = ckpt.restore(ckpt_manager.latest_checkpoint)

I then re-enter the training loop above, with total_steps starting from the restored step count. The model restores to its previous accuracy, but as soon as training steps resume there is an immediate dip in accuracy, as if the optimizer had to "warm up" again. Possibly due to the Lookahead slow weights?
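One way to test that hypothesis is to inspect the restored optimizer state directly. A minimal diagnostic sketch, assuming the "slow" slot name that tfa's Lookahead uses, and keeping in mind that TF2 creates and restores slot variables lazily:

# Diagnostic sketch (assumes tfa Lookahead's "slow" slot name). Slot
# variables are created lazily, so run one step before inspecting them.
status = ckpt.restore(ckpt_manager.latest_checkpoint)
train_step()  # forces slot creation so deferred restoration can complete
status.assert_existing_objects_matched()  # raises if optimizer state was missing

for var in model.trainable_variables:
    slow = optimizer.get_slot(var, "slow")
    # A large gap right after a sync point would suggest stale slow weights.
    print(var.name, float(tf.reduce_max(tf.abs(var - slow))))

If the slow slots or the iteration counter that drives sync_period came back empty, that would explain the warm-up-like dip.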

