
Pathfinder task cannot converge. #37

Closed
liuyang148 opened this issue Aug 31, 2021 · 12 comments
Comments

@liuyang148 commented Aug 31, 2021

I tried running pathfinder32 on this dataset five times: 3 out of 5 runs did not converge, with the loss stuck at 0.6933 until the end, while the other 2 converged normally and reached a final accuracy of 75% (BigBird). It seems pretty random. I then tried a different model (Performer), and it never converged at all. But the cifar10 task, which uses the same training code as pathfinder32, converges every time. Is this a problem with the dataset?

I0831 15:32:43.578928 140327465420608 train.py:276] eval in step: 16224, loss: 0.6931, acc: 0.5017
I0831 15:33:00.912956 140327465420608 train.py:242] train in step: 16536, loss: 0.6932, acc: 0.5011
I0831 15:33:02.813938 140327465420608 train.py:276] eval in step: 16536, loss: 0.6932, acc: 0.4983
I0831 15:33:21.293757 140327465420608 train.py:242] train in step: 16848, loss: 0.6931, acc: 0.5018
I0831 15:33:23.183998 140327465420608 train.py:276] eval in step: 16848, loss: 0.6932, acc: 0.4983
I0831 15:33:41.210031 140327465420608 train.py:242] train in step: 17160, loss: 0.6932, acc: 0.4997
I0831 15:33:43.294295 140327465420608 train.py:276] eval in step: 17160, loss: 0.6931, acc: 0.4983
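
(For reference, a loss pinned at ~0.6931 is ln 2, the binary cross-entropy of a classifier that always predicts 50/50, which matches the ~50% accuracy above; a quick check:)

import math

# Cross-entropy of a two-class model that always outputs 0.5:
# -log(0.5) = log(2) ~= 0.6931, i.e. the "stuck" loss in the log above.
print(round(math.log(2), 4))  # 0.6931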
@MostafaDehghani (Collaborator)

@liuyang148 I think by "coverage" you mean "converge" (or please correct me if I'm wrong)?
In that case, I should say that Pathfinder is a difficult task for transformers (and any other architecture that has no recurrence or inductive bias for modeling transitivity). So what you're observing is simply these models struggling to pick up the task. That's actually one of the main reasons we included Pathfinder in LRA.

@liuyang148 (Author) commented Aug 31, 2021

Yes, I meant 'converge'; forgive my poor English.
Then which results did the paper report: only the runs that converged, ignoring the non-converged ones?

@MostafaDehghani (Collaborator)

No problem at all!
As far as I remember, in almost all models you will see no improvement in the metrics we care about after some number of training steps, even if the loss is still changing (mostly fluctuating). So we chose to fix the number of epochs at 200. I had runs with 1000 epochs, but you don't see a significant improvement in accuracy.

@liuyang148 liuyang148 changed the title Pathfinder task cannot coverage. Pathfinder task cannot converge. Aug 31, 2021
@liuyang148 (Author)

OK, I got it. Thanks for your help.

@jnhwkim (Contributor) commented Sep 2, 2021

@MostafaDehghani I understand the task is difficult to learn and converge on. I tried three times with different config.random_seed values for Performer, but it keeps failing to converge and the test accuracies are around 50%. How can I reproduce the number in the paper, i.e., 77.05 (the best score in Table 1)?

@yinzhangyue

@jnhwkim I encountered the same situation as you.

@MostafaDehghani (Collaborator)

@jnhwkim @yinzhangyue
Can you point me to the exact config file you're using in the LRA codebase?

@yinzhangyue

I didn't change the config file; here is base_pathfinder32_config.py:

# Copyright 2021 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Base Configuration."""

import ml_collections

NUM_EPOCHS = 200
TRAIN_EXAMPLES = 160000
VALID_EXAMPLES = 20000


def get_config():
  """Get the default hyperparameter configuration."""
  config = ml_collections.ConfigDict()
  config.batch_size = 512
  config.eval_frequency = TRAIN_EXAMPLES // config.batch_size
  config.num_train_steps = (TRAIN_EXAMPLES // config.batch_size) * NUM_EPOCHS
  config.num_eval_steps = VALID_EXAMPLES // config.batch_size
  config.weight_decay = 0.
  config.grad_clip_norm = None

  config.save_checkpoints = True
  config.restore_checkpoints = True
  config.checkpoint_freq = (TRAIN_EXAMPLES //
                            config.batch_size) * NUM_EPOCHS // 2
  config.random_seed = 0

  config.learning_rate = .001
  config.factors = 'constant * linear_warmup * cosine_decay'
  config.warmup = (TRAIN_EXAMPLES // config.batch_size) * 1
  config.steps_per_cycle = (TRAIN_EXAMPLES // config.batch_size) * NUM_EPOCHS

  # model params
  config.model = ml_collections.ConfigDict()
  config.model.num_layers = 1
  config.model.num_heads = 2
  config.model.emb_dim = 32
  config.model.dropout_rate = 0.1

  config.model.qkv_dim = config.model.emb_dim // 2
  config.model.mlp_dim = config.model.qkv_dim * 2
  config.model.attention_dropout_rate = 0.1
  config.model.classifier_pool = 'MEAN'
  config.model.learn_pos_emb = False

  config.trial = 0  # dummy for repeated runs.
  return config
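
For reference, the training schedule these defaults imply works out as follows (a quick sanity check using the config's own values, not part of the repo):

# Derived schedule from the base Pathfinder32 config above.
TRAIN_EXAMPLES, VALID_EXAMPLES, NUM_EPOCHS, batch_size = 160000, 20000, 200, 512

steps_per_epoch = TRAIN_EXAMPLES // batch_size   # 312 (also eval_frequency)
num_train_steps = steps_per_epoch * NUM_EPOCHS   # 62400
num_eval_steps = VALID_EXAMPLES // batch_size    # 39
warmup_steps = steps_per_epoch * 1               # 312

print(steps_per_epoch, num_train_steps, num_eval_steps, warmup_steps)

The eval steps in the log at the top of the thread (16224, 16536, 16848, 17160) are multiples of 312, consistent with this eval_frequency.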

@yinzhangyue

My run script:

PYTHONPATH="$(pwd)":"$PYTHON_PATH" python lra_benchmarks/image/train.py \
      --config=lra_benchmarks/image/configs/pathfinder32/performer_base.py \
      --model_dir=./tmp/pathfinder_F \
      --task_name=pathfinder32_hard

@MostafaDehghani (Collaborator)

I just checked, and it seems the configs in the repo are not synced with the internal configs we used for getting the results in the paper. Not sure what went wrong, but sorry about that. I'll work on updating the repo, but in the meantime, here are the settings you should use in the Performer config file to get the reported score:

# Overrides for the Performer Pathfinder32 config (e.g. performer_base.py).
# The import path below is assumed from the repo layout; adjust it if needed.
from lra_benchmarks.image.configs.pathfinder32 import base_pathfinder32_config


def get_config():
  """Get the default hyperparameter configuration."""
  config = base_pathfinder32_config.get_config()
  config.model_type = "performer"

  config.model.num_layers = 1
  config.model.num_heads = 8
  config.model.emb_dim = 128
  config.model.dropout_rate = 0.2
  config.model.qkv_dim = 64
  config.model.mlp_dim = 128

  return config

@yinzhangyue

Thank you! I will try it immediately.

@yinzhangyue

It works! Thank you very much! ^o^
