
Problem resuming training with RectifiedAdam+Lookahead (Ranger) #1911

Closed

Description

@gtg740x

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow version and how it was installed (source or binary): Tensorflow 2.1 from official Tensorflow Docker container
  • TensorFlow-Addons version and how it was installed (source or binary): 0.9.1
  • Python version: 3.6.9
  • Is GPU used? (yes/no): Yes.

Describe the bug

If I train a model using the Ranger scheme (a RectifiedAdam optimizer wrapped in a Lookahead optimizer), I cannot interrupt and resume training as usual. With the exact same code but a standard Adam optimizer, training resumes as expected; with the Ranger scheme it does not.

When we resume training, the model restores to the same accuracy it paused at. But once training steps resume, the accuracy curve drops for many steps before slowly climbing back to the trend it was following before the pause. The result is much slower and choppier convergence than a run that is never paused.

If the Ranger setup is used for an uninterrupted run, training progresses smoothly and converges to the expected accuracy in the expected number of steps.
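For context, Lookahead maintains a second, slowly updated copy of every trainable variable and synchronizes the two copies every sync_period steps. A rough sketch of that sync step (following the Lookahead paper; this is an illustration, not tfa.optimizers.Lookahead's actual code):

# Illustration of the Lookahead sync rule (Zhang et al., 2019) with
# hypothetical fast_vars/slow_vars lists; not tfa's actual implementation.
def lookahead_sync(fast_vars, slow_vars, slow_step_size=0.5):
    for fast, slow in zip(fast_vars, slow_vars):
        # Move the slow weights part of the way toward the fast weights...
        slow.assign(slow + slow_step_size * (fast - slow))
        # ...then reset the fast weights to the updated slow weights.
        fast.assign(slow)

If the slow weights (or the step counter that schedules the sync) were not restored exactly, the first post-resume sync would pull the model toward stale weights, which would look like the dip described above.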

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

# Set up a model
model = tf.keras.Model(...)

# Set up the optimizer (Ranger = RectifiedAdam wrapped in Lookahead):
optimizer = tfa.optimizers.RectifiedAdam(lr=learning_rate, total_steps=max_train_steps,
                                         warmup_proportion=0.1, min_lr=min_learning_rate)
optimizer = tfa.optimizers.Lookahead(optimizer, sync_period=6, slow_step_size=0.5)

# Create the checkpoint manager
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=2)

total_steps = 0
for i in range(max_train_steps):
    total_steps += 1
    # run a generic train step on the model using the optimizer
    train_step()
    if i % 1000 == 0:
        ckpt_save_path = ckpt_manager.save(total_steps)
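Here train_step() stands in for a generic training step. A hypothetical minimal version (not taken from the original report; train_dataset is an assumed tf.data.Dataset of (x, y) batches) might look like:

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
train_iter = iter(train_dataset)  # hypothetical dataset yielding (x, y) batches

@tf.function
def train_step():
    # One forward/backward pass through the model.
    x_batch, y_batch = next(train_iter)
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = loss_fn(y_batch, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss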

Run training and stop at a global_step > 1000 but < max_train_steps.

Then when I try to resume training from a saved model:

# Set up a model:
model = tf.keras.Model(...)

# Set up the optimizer:
optimizer = tfa.optimizers.RectifiedAdam(lr=learning_rate, total_steps=max_train_steps,
                                         warmup_proportion=0.1, min_lr=min_learning_rate)
optimizer = tfa.optimizers.Lookahead(optimizer, sync_period=6, slow_step_size=0.5)

# Restore the model:
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=2)
status = ckpt.restore(ckpt_manager.latest_checkpoint)

I then re-enter the training loop above, with total_steps starting from the restored step count. The model restores to its previous accuracy, but as soon as training steps resume there is an immediate dip in accuracy, as if the optimizer had to "warm up" again. Possibly due to the Lookahead slow weights?
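One way to test that hypothesis is to inspect the restored optimizer state directly. A minimal diagnostic sketch, assuming the "slow" slot name that tfa's Lookahead uses, and keeping in mind that TF2 creates and restores slot variables lazily:

# Diagnostic sketch (assumes tfa Lookahead's "slow" slot name). Slot
# variables are created lazily, so run one step before inspecting them.
status = ckpt.restore(ckpt_manager.latest_checkpoint)
train_step()  # forces slot creation so deferred restoration can complete
status.assert_existing_objects_matched()  # raises if optimizer state was missing

for var in model.trainable_variables:
    slow = optimizer.get_slot(var, "slow")
    # A large gap right after a sync point would suggest stale slow weights.
    print(var.name, float(tf.reduce_max(tf.abs(var - slow))))

If the slow slots or the iteration counter that drives sync_period came back empty, that would explain the warm-up-like dip.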

