How can I reproduce the results from the article? #16

Open
yxliu0907 opened this issue Jan 3, 2024 · 10 comments

@yxliu0907

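Hello! I am using the code and checkpoint you provided for molecule editing, but with the default parameters I cannot seem to reproduce the results reported in the article. Are some of my parameter settings wrong? :)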

@yxliu0907
Author

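For example, I want to reproduce the result in p1, but running the code gives the result in p2, and the SMILES I get do not seem to match those shown in p1. The checkpoint I used is 'MoleculeSTM/pretrained_MoleculeSTM/SciBERT-Graph-3e-5-1-1e-4-1-InfoNCE-0.1-32-32', the input text is 'This molecule issoluble in water.', and the input SMILES is FC(F)(F)OC(C=C1)=CC=C1C(C=N2)=CC=C2OC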

@chao1224
Owner

chao1224 commented Jan 3, 2024

Hi @yxliu0907,

Thank you for raising this question. The results are reproducible if you use the following:

  • The exact checkpoints listed here, which should be pretrained_MoleculeSTM/SciBERT-Graph-3e-5-1-1e-4-1-EBM_NCE-0.1-32-32 pretrained_MoleculeSTM_Raw/SciBERT-MegaMolBART-3e-5-1-1e-4-1-InfoNCE-0.1-32-32 for this demo.
  • The text prompt is This molecule insoluble in water.
  • The SMILES are canonical ones, like the 200 listed here.
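One way to check whether an input is already in canonical form is to round-trip it through RDKit. This is only a minimal sketch: it assumes RDKit is installed and that RDKit's canonical SMILES follows the same convention as the 200 listed inputs, which this thread does not confirm. `canonicalize` is a hypothetical helper, not part of the repository.

```python
# Sketch only: RDKit is an assumed dependency, so the import is guarded.
try:
    from rdkit import Chem
except ImportError:
    Chem = None


def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    if Chem is None:
        raise RuntimeError("RDKit is required for this sketch")
    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output by default


if Chem is not None:
    # The Kekule-style input from the first comment in this thread.
    print(canonicalize("FC(F)(F)OC(C=C1)=CC=C1C(C=N2)=CC=C2OC"))
```

Comparing the printed string with the entry in the list of 200 inputs shows whether the two forms agree before running the editing script.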

@yxliu0907
Author

Many thanks for your advice!
I followed your lead: I used the checkpoint you mentioned, pretrained_MoleculeSTM/SciBERT-Graph-3e-5-1-1e-4-1-EBM_NCE-0.1-32-32; the text prompt you mentioned, This molecule insoluble in water; and the canonical SMILES from the list, COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1. However, I still can't reproduce it: the SMILES I get are either identical to the input or invalid.

@chao1224
Owner

chao1224 commented Jan 8, 2024

Hi @yxliu0907,

We just checked the log files, and here are more details.

  • Use the checkpoint pretrained_MoleculeSTM_Raw/SciBERT-MegaMolBART-3e-5-1-1e-4-1-InfoNCE-0.1-32-32 here.
    • Sorry for the earlier mistake. It should be SMILES, not Graph.
  • Key hyperparameters: --use_noise_for_init, --normalize
  • The optimal l2_lambda is 0.1.
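To make the role of l2_lambda concrete, here is a toy sketch of how such a weight typically trades off the two logged terms. It assumes the edit objective is a text-alignment term (negative cosine similarity, labelled "clip loss" here) plus an L2 penalty that keeps the edited latent close to its starting point; the thread does not spell out the exact formula, so `edit_objective` and the toy vectors are illustrative, not the repository's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def edit_objective(latent, latent_init, text_emb, l2_lambda):
    """Assumed objective: alignment term plus weighted distance from the start."""
    clip_loss = -cosine_similarity(latent, text_emb)
    l2_loss = sum((x - y) ** 2 for x, y in zip(latent, latent_init)) / len(latent)
    total = clip_loss + l2_lambda * l2_loss
    return total, clip_loss, l2_loss


# Toy vectors: the latent has drifted from its start toward the text embedding.
latent_init = [1.0, 0.0]
text_emb = [0.0, 1.0]
drifted = [0.5, 0.5]

for lam in [10.0, 1.0, 0.1, 0.01, 0.001]:
    total, clip_loss, l2_loss = edit_objective(drifted, latent_init, text_emb, lam)
    print(f"l2 lambda: {lam}  total: {total:.5f}")
```

Under this assumption, a large weight penalizes any drift and tends to return the input unchanged, while a very small weight lets the latent move far enough that decoding may yield invalid SMILES.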

The result w.r.t. this subfigure is:

l2 lambda: 0.1
Use random noise for init
clip loss: -0.96586	L2 loss: 0.07372
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.05080

If you use our script (with all 200 SMILES as inputs), more complete results for this molecule are:

===== for SMILES COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 =====
Use random noise for init
l2 lambda: 10.0
Use random noise for init
clip loss: -0.13243	L2 loss: 0.09747
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580

l2 lambda: 1.0
Use random noise for init
clip loss: -0.94003	L2 loss: 0.18903
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ncc(-c2ccc(OC(F)(F)F)cc2)cc1-c1cnn(C)c1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ncc(-c2ccc(OC(F)(F)F)cc2)cc1-c1cnn(C)c1 & 4.05630

l2 lambda: 0.1
Use random noise for init
clip loss: -0.96586	L2 loss: 0.07372
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.05080

l2 lambda: 0.01
Use random noise for init
clip loss: -0.94089	L2 loss: 0.02474
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(OC(F)(F)F)(=O)N[C@H](C)CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)CCCCCCC)COCCC)CCCCOCCCCCOCOCOCOCOCOCOCCCCCCCCOCOCOCCCCCCCCCCCCCCCCCCCCC(=COCCCCCCCCCCCC(=COCOC(=OC(=COCCCCCCOCOCCCCCCCCCCCCCCCCCCC(=OC(=OCCCCCCCCCCCCC(=OC(=OC(=OCC(=C(=OCOCOC(=OC(=OC(=OC(=OC(=OCC(=OCCCCCCC(=OC(=OC(=OC(=OC(=OC(=OCC(=OC(=OC)(=OC(=OC)C)C)(=OC(=OC)C)C(=OC)C)C)C)C)COC(=OC)(=OC)(=OC)C(=C(=C)C)CCOC(=OC(=OCC(=OC(=OC)C(=OC(=OC)C(=OC(=OCOCOCOC(=OC)(=C)(=OC)C(=OC(F)(=OC(=OC)(=OC(=OC(=OCOC)(=C)(']
valid mol list: 2

l2 lambda: 0.001
Use random noise for init
clip loss: -0.93510	L2 loss: 0.00295
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(OC(F)(F)C)[C@H]1C[C@H]1CC[C@H]1CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCOCOCOCOC)CCCCOCOCOCCCOCOCOCOCOCOCCCCCCCCOCOCOCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCCCCCOCOCCCCCCCCOCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCCCCCCCCCCOCOCOCOCOCOCCCCCCCCC(=COCOC(=CCCCCCCCCCCCCCCCCCCCCCOCCCCCCCCCCCCCCCCOCOCOCOCOCOCCCCCCCCCOCOCCCCCCCCCC(=CCCCCCC(=COC(=COCOCOCOCOCC(=C(=COCCCCCOCOCOCOCCCCCOCOC(=COCOC(=C(=CCCCC(=C(=C(=COCOC(=C(=C(=CCCCCC(=C(=C(FC(=C(F)(F)C(=C(=C(=C(=C(=C(=C']
valid mol list: 2
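For comparing runs like these side by side, a small parser can pull out the per-lambda numbers. This is a minimal sketch that assumes the log keeps exactly the line shapes shown above ("l2 lambda:", "clip loss: ... L2 loss: ...", "valid mol list:", and "SMILES & value" rows); `parse_edit_log` is a hypothetical helper, and the meaning of the value after "&" is not stated in this thread, so it is stored under the neutral name "value".

```python
import re

LAMBDA_RE = re.compile(r"^l2 lambda:\s*([0-9.eE+-]+)$")
LOSS_RE = re.compile(r"^clip loss:\s*(-?\d+(?:\.\d+)?)\s+L2 loss:\s*(-?\d+(?:\.\d+)?)$")
VALID_RE = re.compile(r"^valid mol list:\s*(\d+)$")
RESULT_RE = re.compile(r"^(\S+) & (-?\d+(?:\.\d+)?)$")


def parse_edit_log(text):
    """Group log lines into one record per 'l2 lambda' block."""
    runs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = LAMBDA_RE.match(line)
        if m:
            current = {"l2_lambda": float(m.group(1)), "clip_loss": None,
                       "l2_loss": None, "valid": None, "results": []}
            runs.append(current)
            continue
        if current is None:
            continue  # skip anything before the first lambda block
        m = LOSS_RE.match(line)
        if m:
            current["clip_loss"] = float(m.group(1))
            current["l2_loss"] = float(m.group(2))
            continue
        m = VALID_RE.match(line)
        if m:
            current["valid"] = int(m.group(1))
            continue
        m = RESULT_RE.match(line)
        if m:
            current["results"].append({"smiles": m.group(1), "value": float(m.group(2))})
    return runs


# Abbreviated excerpt of the log above.
SAMPLE_LOG = """l2 lambda: 1.0
Use random noise for init
clip loss: -0.94003\tL2 loss: 0.18903
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ncc(-c2ccc(OC(F)(F)F)cc2)cc1-c1cnn(C)c1 & 4.05630

l2 lambda: 0.01
Use random noise for init
clip loss: -0.94089\tL2 loss: 0.02474
valid mol list: 2
"""

for run in parse_edit_log(SAMPLE_LOG):
    print(run["l2_lambda"], run["clip_loss"], run["valid"], len(run["results"]))
```

Running the same parser on two logs makes differences in clip loss at a given l2 lambda easy to spot.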

@yxliu0907
Author

yxliu0907 commented Jan 9, 2024

I'm really sorry, but I still can't reproduce the same results.😭
Are some of my other parameters set incorrectly?
Here are my parameter settings:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--device", type=int, default=0)
parser.add_argument("--verbose", type=int, default=1)

########## for editing ##########
parser.add_argument("--input_description", type=str, default='This molecule is insoluble in water')
parser.add_argument("--input_description_id", type=int, default=None)
parser.add_argument("--input_SMILES", type=str, default='COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1')
parser.add_argument("--input_SMILES_file", type=str, default=None)
parser.add_argument("--output_model_dir", type=str, default=None)
parser.add_argument("--use_noise_for_init", dest="use_noise_for_init", action="store_true")
parser.add_argument("--no_noise_for_init", dest="use_noise_for_init", action="store_false")
parser.set_defaults(use_noise_for_init=True)
parser.add_argument('--normalize', dest='normalize', action='store_true')
parser.add_argument('--no_normalize', dest='normalize', action='store_false')
parser.set_defaults(normalize=True)

parser.add_argument("--dataspace_path", type=str, default="../data")
parser.add_argument("--SSL_emb_dim", type=int, default=256)
parser.add_argument("--max_seq_len", type=int, default=512)

########## for MoleculeSTM ##########
parser.add_argument("--MoleculeSTM_model_dir", type=str, default="../model_save")
parser.add_argument("--MoleculeSTM_molecule_type", type=str, default="SMILES", choices=["SMILES", "Graph"])

########## for MegaMolBART ##########
parser.add_argument("--MegaMolBART_generation_model_dir", type=str, default="../data/pretrained_MegaMolBART/checkpoints")
parser.add_argument("--vocab_path", type=str, default="../MoleculeSTM/bart_vocab.txt")

########## for MoleculeSTM and generation projection ##########
parser.add_argument("--language_edit_model_dir", type=str, default="../model_save")   

########## for editing ##########
parser.add_argument("--lr_rampup", type=float, default=0.05)
parser.add_argument("--lr", type=float, default=0.1)
parser.add_argument("--epochs", type=int, default=50)
args = parser.parse_args()

And here is my result:

description_list ['This molecule is insoluble in water']
===== for description This molecule is insoluble in water =====
===== for SMILES COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 =====
Use random noise for init
l2 lambda: 10.0
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 10%|█         | 5/50 [00:00<00:01, 41.95it/s]
 22%|██▏       | 11/50 [00:00<00:00, 49.55it/s]
 34%|███▍      | 17/50 [00:00<00:00, 53.23it/s]
 48%|████▊     | 24/50 [00:00<00:00, 56.40it/s]
 62%|██████▏   | 31/50 [00:00<00:00, 58.02it/s]
 76%|███████▌  | 38/50 [00:00<00:00, 58.97it/s]
 90%|█████████ | 45/50 [00:00<00:00, 59.56it/s]
100%|██████████| 50/50 [00:00<00:00, 57.24it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: 0.07312	L2 loss: 0.13768
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580

l2 lambda: 1.0
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.37it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.68it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.80it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.88it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.83it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.78it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.82it/s]
100%|██████████| 50/50 [00:00<00:00, 60.77it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.43997	L2 loss: 0.17291
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2cc(-c3ccc(OC(F)(F)F)cc3)cnc2OC)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2cc(-c3ccc(OC(F)(F)F)cc3)cnc2OC)cn1 & 4.72640

l2 lambda: 0.1
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.72it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.82it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.92it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.91it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.96it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.98it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.99it/s]
100%|██████████| 50/50 [00:00<00:00, 60.93it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.38958	L2 loss: 0.09160
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580

l2 lambda: 0.01
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.73it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.81it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.88it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.97it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.96it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.97it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.86it/s]
100%|██████████| 50/50 [00:00<00:00, 60.87it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.35857	L2 loss: 0.02138
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(F)(F)(F)(F)OCCN[C@H](C)CCOCC)c1ccc(-c2cnn(C)c2)n[nH]1']
valid mol list: 2

l2 lambda: 0.001
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.65it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.78it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.89it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.91it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.91it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.92it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.95it/s]
100%|██████████| 50/50 [00:00<00:00, 60.89it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.35523	L2 loss: 0.00235
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(F)(F)(F)(F)F']
valid mol list: 2

result_eval_list_one_pair
 [[ True]]

@chao1224
Owner

chao1224 commented Jan 9, 2024

Hi @yxliu0907,

It seems that you are using the prompt "This molecule is insoluble in water" rather than the "soluble" one, which might be the issue.

For insoluble, the result with l2-lambda=1 gives the right answer.

@yxliu0907
Author

Hello @chao1224! I think the difference is caused by the random seed. Which random seed was used?😶‍🌫️😶‍🌫️😶‍🌫️

@chao1224
Owner

@yxliu0907

The random seed is 1.
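The settings posted earlier in this thread default to --seed 42, so the seed has to be overridden to match. The sketch below shows one way to do that; it assumes the script applies args.seed to Python, NumPy, and PyTorch, which this thread does not confirm, and `set_seed` is a hypothetical helper rather than the repository's function.

```python
import argparse
import random


def set_seed(seed):
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=42)  # same default as the posted settings

# Equivalent to passing "--seed 1" on the command line.
args = parser.parse_args(["--seed", "1"])
set_seed(args.seed)
```

Even with matching seeds, results can still differ across hardware or library versions, so small deviations in the logged losses are not necessarily a sign of a wrong setting.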

@AmT42

AmT42 commented Apr 9, 2024

Hey Chao, do you have the right hyperparameters for the different editing tasks, please?
I'm asking about this answer you gave:
> For insoluble, the result with l2-lambda=1 gives the right answer.
I imagine the same kind of tuning is also needed for the binding, multi-objective, and drug-like editing tasks?

@chao1224
Owner

Hi @AmT42

Yes, I have them in the log files. I will add them ASAP.
