RuntimeError: RNG state is wrong size #18
Comments
Hi @Lzcstan, I guess this might be due to the CUDA and PyTorch versions. My CUDA version is 11 (with PyTorch 1.9). Can you try downgrading them?
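A minimal sketch for double-checking the installed stack, assuming the maintainer's reported setup (PyTorch 1.9 built against CUDA 11) is the target; the expected values in the comments are assumptions, not confirmed output:

```python
import torch

# Verify the local stack matches the setup the maintainer used.
print(torch.__version__)                    # expected: something like '1.9.0'
print(torch.version.cuda)                   # expected: '11.x'
print(torch.cuda.get_device_capability(0))  # compute capability of GPU 0
```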
But I guess it is an incompatibility with the H800 instead. Here is the support for my guess:

```
[2024-01-10 03:20:35,380] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/oem/anaconda3/envs/mol_stm/lib/python3.7/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
arguments Namespace(CL_neg_samples=1, JK='last', SSL_emb_dim=256, SSL_loss='EBM_NCE', T=0.1, batch_size=32, dataset='PubChemSTM', dataspace_path='../data', decay=0, device=0, dropout_ratio=0.5, epochs=2, gnn_emb_dim=300, gnn_type='gin', graph_pooling='mean', max_seq_len=512, megamolbart_input_dir='../data/pretrained_MegaMolBART/checkpoints', mol_lr=1e-05, mol_lr_scale=1, molecule_type='SMILES', normalize=True, num_layer=5, num_workers=8, output_model_dir=None, pretrain_gnn_mode='GraphMVP_G', representation_frozen=False, seed=42, text_lr=0.0001, text_lr_scale=1, text_type='SciBERT', verbose=True, vocab_path='../MoleculeSTM/bart_vocab.txt')
Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/home/oem/anaconda3/envs/mol_stm/lib/python3.7/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA H800 with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 sm_80 sm_86 compute_37.
If you want to use the NVIDIA H800 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
```
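The sm_90 warning above is the key line: the installed wheel was not compiled for the H800's architecture. A minimal sketch for confirming the mismatch, assuming a single visible GPU:

```python
import torch

# Compare the GPU's compute capability against the kernel
# architectures this PyTorch build was compiled for.
cap = torch.cuda.get_device_capability(0)  # (9, 0) on an H800
arch = f"sm_{cap[0]}{cap[1]}"
supported = torch.cuda.get_arch_list()     # e.g. ['sm_37', ..., 'sm_86']
print(arch, arch in supported)             # False -> this build lacks sm_90 kernels
```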
Hi @Lzcstan, I am afraid I don't have an H800 to check the code on, but according to the exception messages you listed above, it is an incompatibility between the H800 (its CUDA and PyTorch requirements) and Megatron.
Hi, I checked the shape of the CUDA RNG state and found that the H800's state does not match the one stored in the checkpoint. Switching GPUs solved my problem, so I will close this issue. Thank you for your kind reply :-)
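A minimal sketch of that RNG-state check, assuming the checkpoint loads as a plain dict with a `cuda_rng_state` entry (both the path and the key here are hypothetical; Megatron-style checkpoints may nest the state under different keys):

```python
import torch

ckpt = torch.load("../data/pretrained_MegaMolBART/checkpoints/checkpoint.ckpt",
                  map_location="cpu")  # hypothetical path
saved = ckpt.get("cuda_rng_state")     # hypothetical key
if saved is not None:
    current = torch.cuda.get_rng_state()
    # A size mismatch here is what torch.cuda.set_rng_state(saved)
    # reports as "RuntimeError: RNG state is wrong size" during loading.
    print(saved.numel(), current.numel())
```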
Hello, can you tell me how to fix this problem?
I just switched from the H800 to an A800.
Hello!
Thank you for your excellent work! I wanted to try the scripts you provided and downloaded the relevant checkpoints following your tutorial. But when I ran the pre-training script with

```
python pretrain.py --verbose --batch_size=32 --molecule_type=SMILES --epochs=2
```

the error in the title (`RuntimeError: RNG state is wrong size`) occurred. How should I fix it? I'm using a server with an NVIDIA H800, which has `cuda==12.1` and `pytorch==2.1.2`.
Thanks again 🙏