
Pathfinder task cannot converge. #37

Closed
liuyang148 opened this issue Aug 31, 2021 · 12 comments
Comments

@liuyang148 commented Aug 31, 2021

I tried running pathfinder32 on this dataset five times: 3 out of 5 runs did not converge, with the loss stuck at 0.6933 until the end, while the other 2 converged normally and reached a final accuracy of 75% (BigBird). It seems pretty random. I then tried a different model (Performer), and it never converged at all. But the cifar10 task, which uses the same training code as pathfinder32, converges every time. Is this a problem with the dataset?

I0831 15:32:43.578928 140327465420608 train.py:276] eval in step: 16224, loss: 0.6931, acc: 0.5017
I0831 15:33:00.912956 140327465420608 train.py:242] train in step: 16536, loss: 0.6932, acc: 0.5011
I0831 15:33:02.813938 140327465420608 train.py:276] eval in step: 16536, loss: 0.6932, acc: 0.4983
I0831 15:33:21.293757 140327465420608 train.py:242] train in step: 16848, loss: 0.6931, acc: 0.5018
I0831 15:33:23.183998 140327465420608 train.py:276] eval in step: 16848, loss: 0.6932, acc: 0.4983
I0831 15:33:41.210031 140327465420608 train.py:242] train in step: 17160, loss: 0.6932, acc: 0.4997
I0831 15:33:43.294295 140327465420608 train.py:276] eval in step: 17160, loss: 0.6931, acc: 0.4983
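
(For reference, a loss pinned at ~0.6931 is ln 2, the binary cross-entropy of a classifier that always predicts 50/50, which matches the ~50% accuracy above; a quick check:)

import math

# Cross-entropy of a two-class model that always outputs 0.5:
# -log(0.5) = log(2) ~= 0.6931, i.e. the "stuck" loss in the log above.
print(round(math.log(2), 4))  # 0.6931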
@MostafaDehghani (Collaborator)

@liuyang148 I think by "coverage" you mean "converge" (or please correct me if I'm wrong)?
In that case, I should say that Pathfinder is a difficult task for transformers (and any other architecture that has no recurrence or inductive bias for modeling transitivity). So what you're observing is simply these models struggling to pick up the task. That's actually one of the main reasons we included Pathfinder in LRA.

@liuyang148 (Author) commented Aug 31, 2021

Yes, I meant 'converge'; forgive my poor English.
Then which results did the paper report: only the runs that converged, ignoring the non-converged ones?

@MostafaDehghani (Collaborator)

No problem at all!
As far as I remember, in almost all models you will see no improvement in the metrics we care about after some number of training steps, even if the loss is still changing (mostly fluctuating). So we chose to fix the number of epochs at 200. I had runs with 1000 epochs, but you don't see a significant improvement in accuracy.

@liuyang148 liuyang148 changed the title Pathfinder task cannot coverage. Pathfinder task cannot converge. Aug 31, 2021
@liuyang148 (Author)

OK, I got it. Thanks for your help.

@jnhwkim (Contributor) commented Sep 2, 2021

@MostafaDehghani I understand the task is difficult to learn and converge on. I tried three times with different config.random_seed values for Performer, but it keeps failing to converge and the test accuracies are around 50%. How can I reproduce the number in the paper, i.e., 77.05 (the best score in Table 1)?

@yinzhangyue

@jnhwkim I encountered the same situation as you.

@MostafaDehghani (Collaborator)

@jnhwkim @yinzhangyue
Can you point me to the exact config file you're using in the LRA codebase?

@yinzhangyue

I didn't change the config file; here is base_pathfinder32_config.py:

# Copyright 2021 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Base Configuration."""

import ml_collections

NUM_EPOCHS = 200
TRAIN_EXAMPLES = 160000
VALID_EXAMPLES = 20000


def get_config():
  """Get the default hyperparameter configuration."""
  config = ml_collections.ConfigDict()
  config.batch_size = 512
  config.eval_frequency = TRAIN_EXAMPLES // config.batch_size
  config.num_train_steps = (TRAIN_EXAMPLES // config.batch_size) * NUM_EPOCHS
  config.num_eval_steps = VALID_EXAMPLES // config.batch_size
  config.weight_decay = 0.
  config.grad_clip_norm = None

  config.save_checkpoints = True
  config.restore_checkpoints = True
  config.checkpoint_freq = (TRAIN_EXAMPLES //
                            config.batch_size) * NUM_EPOCHS // 2
  config.random_seed = 0

  config.learning_rate = .001
  config.factors = 'constant * linear_warmup * cosine_decay'
  config.warmup = (TRAIN_EXAMPLES // config.batch_size) * 1
  config.steps_per_cycle = (TRAIN_EXAMPLES // config.batch_size) * NUM_EPOCHS

  # model params
  config.model = ml_collections.ConfigDict()
  config.model.num_layers = 1
  config.model.num_heads = 2
  config.model.emb_dim = 32
  config.model.dropout_rate = 0.1

  config.model.qkv_dim = config.model.emb_dim // 2
  config.model.mlp_dim = config.model.qkv_dim * 2
  config.model.attention_dropout_rate = 0.1
  config.model.classifier_pool = 'MEAN'
  config.model.learn_pos_emb = False

  config.trial = 0  # dummy for repeated runs.
  return config
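
For reference, the training schedule these defaults imply works out as follows (a quick sanity check using the config's own values, not part of the repo):

# Derived schedule from the base Pathfinder32 config above.
TRAIN_EXAMPLES, VALID_EXAMPLES, NUM_EPOCHS, batch_size = 160000, 20000, 200, 512

steps_per_epoch = TRAIN_EXAMPLES // batch_size   # 312 (also eval_frequency)
num_train_steps = steps_per_epoch * NUM_EPOCHS   # 62400
num_eval_steps = VALID_EXAMPLES // batch_size    # 39
warmup_steps = steps_per_epoch * 1               # 312

print(steps_per_epoch, num_train_steps, num_eval_steps, warmup_steps)

The eval steps in the log at the top of the thread (16224, 16536, 16848, 17160) are multiples of 312, consistent with this eval_frequency.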

@yinzhangyue

My run script:

PYTHONPATH="$(pwd)":"$PYTHON_PATH" python lra_benchmarks/image/train.py \
      --config=lra_benchmarks/image/configs/pathfinder32/performer_base.py \
      --model_dir=./tmp/pathfinder_F \
      --task_name=pathfinder32_hard

@MostafaDehghani (Collaborator)

I just checked, and it seems the configs in the repo are not synced with the internal configs we used for getting the results in the paper. Not sure what went wrong, but sorry about that. I'll work on updating the repo, but in the meantime, here are the settings you should use in the Performer config file to get the reported score:

# Overrides for the Performer Pathfinder32 config (e.g. performer_base.py).
# The import path below is assumed from the repo layout; adjust it if needed.
from lra_benchmarks.image.configs.pathfinder32 import base_pathfinder32_config


def get_config():
  """Get the default hyperparameter configuration."""
  config = base_pathfinder32_config.get_config()
  config.model_type = "performer"

  config.model.num_layers = 1
  config.model.num_heads = 8
  config.model.emb_dim = 128
  config.model.dropout_rate = 0.2
  config.model.qkv_dim = 64
  config.model.mlp_dim = 128

  return config

@yinzhangyue

Thank you! I will try it immediately.

@yinzhangyue

It works! Thank you very much! ^o^
