Clarification Needed in Preprocessing PubChem dataset #29

Open
Syzseisus opened this issue Jul 15, 2024 · 8 comments

@Syzseisus

Dear authors,

Thanks for the exciting work.
While working with the code for preprocessing the PubChem dataset, I came across a specific line that I find confusing. Could you please clarify its purpose?

File: ./preprocessing/PubChem/step_01_description_extraction.py
Line number: 160
Code on that line: assert description_data["TotalPages"] == total_page_num

I've observed that the variable total_page_num is set to 290, but the execution result shows description_data["TotalPages"] as 422. When I commented out this line, the code ran without any issues.
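
For context, here is a rough reproduction of the check as I understand it; the endpoint and response layout are my guesses about what the script does around line 160, not the script's actual code:

```python
# Rough reproduction (assumed, not the repository's code): ask PubChem's
# PUG View annotations endpoint for the first page of "Record Description"
# entries and compare the reported TotalPages with the hardcoded 2022 value.
import requests

url = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON"
    "?heading_type=Compound&heading=Record%20Description&page=1"
)
total_page_num = 290  # value hardcoded in the script

description_data = requests.get(url).json()["Annotations"]

# This is the assertion that now fails: PubChem reports 422 pages as of
# October 2024, so the comparison against 290 raises AssertionError.
assert description_data["TotalPages"] == total_page_num
```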

I'm not sure why this line is necessary and how it fits into the overall functionality of the script. Understanding its purpose would help me a lot in my current work and in contributing more effectively to the project.

Thank you for your assistance!

Best regards,

Syzseisus

@Syzseisus
Author

P.S. When I ran the code with that line commented out, the result was:

Total CID (with raw name) 242673
Total CID (with extracted name) 244717
Total CID 244889

@chao1224
Owner

Hi @Syzseisus, this is because we constructed the script in 2022. PubChem keeps updating its data, so TotalPages is larger now than it was then.

BTW, in the README we mentioned this:

python step_01_description_extraction.py. This step extracts and merge all the textual descriptions into a single json file. We run this on May 30th, 2022. The APIs will keep updating, so you may have slightly different versions if you run this script yourself.
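
If you want to rerun the extraction against today's data, one option (a sketch on my side, not the repository's code) is to read TotalPages from the first response and loop over whatever PubChem currently reports, instead of asserting against the 2022 value:

```python
# Sketch only: paginate the PUG View "Record Description" annotations using
# the TotalPages value reported by the API itself rather than a hardcoded
# page count. The endpoint and response layout are assumed.
import requests

BASE_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON"
    "?heading_type=Compound&heading=Record%20Description&page={page}"
)

def fetch_all_description_pages():
    first = requests.get(BASE_URL.format(page=1)).json()["Annotations"]
    total_pages = first["TotalPages"]  # 290 in May 2022, 422 in October 2024
    pages = [first]
    for page in range(2, total_pages + 1):
        data = requests.get(BASE_URL.format(page=page)).json()["Annotations"]
        pages.append(data)
    return pages
```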

@Syzseisus
Author

Thank you for your quick response!

@Syzseisus
Author

Hello again. As I mentioned, TotalPages is 422 as of October 31, 2024. However, the problem cannot be fixed by simply changing total_page_num in ./preprocessing/PubChem/step_01_description_extraction.py to 422: as the number of pages grows, the specific cases handled in the clean_up_description function may also need to be updated.
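
As a first pass at spotting such new cases, something like the sketch below might help; the layout of CID2text_raw.json (CID mapped to a list of raw description strings) is my assumption about the extraction output:

```python
# Hypothetical spot-check: count the most frequent opening phrases in the
# raw descriptions, so that boilerplate introduced since 2022 can be
# compared against the special cases clean_up_description already handles.
import json
from collections import Counter

with open("CID2text_raw.json") as f:  # assumed: CID -> list of raw strings
    cid2text_raw = json.load(f)

openings = Counter()
for descriptions in cid2text_raw.values():
    for text in descriptions:
        openings[" ".join(text.split()[:4]).lower()] += 1

for phrase, count in openings.most_common(30):
    print(f"{count:6d}  {phrase}")
```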

However, I understand that such an update would require a lot of expert time.
Could you therefore provide the "CID2name_raw.json", "CID2name.json", "CID2text_raw.json", "CID2text.json", "CID2SMILES.csv", and "molecules.sdf" files as preprocessed on May 30, 2022?
Alternatively, could you provide the "281K chemical structure and text pairs" themselves, mentioned in the "Results" section on page 3 of the paper and referenced among the many if-cases below line 243 of ./scripts/pretrain.py?

Thank you again for your hard work and wonderful research.

Sincerely, Syzseisus

@Syzseisus reopened this on Oct 30, 2024
@chao1224
Owner

Hi @Syzseisus,

  • Four out of six files you mentioned have already been uploaded to this HuggingFace link.
  • The two other files (CID2text_raw.json and CID2text.json) cannot be released due to a policy issue from PubChem.

The specific cases could be different, but at least the special cases discussed in the paper can still be handled by these lines of the script.
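
If it helps, one way to approximate the May 2022 pairs using the released files is sketched below; the column name "CID" in CID2SMILES.csv and the CID-to-list-of-strings layout of CID2text.json are assumptions on my side, so adjust to the actual formats:

```python
# Sketch: restrict a freshly re-extracted CID2text.json to the CIDs present
# in the released CID2SMILES.csv from the May 2022 run, so the text side
# roughly matches the original snapshot.
import json
import pandas as pd

released_cids = set(pd.read_csv("CID2SMILES.csv")["CID"].astype(str))

with open("CID2text.json") as f:  # produced by rerunning the extraction today
    cid2text = json.load(f)

cid2text_2022_subset = {
    cid: texts for cid, texts in cid2text.items() if cid in released_cids
}

with open("CID2text_2022_subset.json", "w") as f:
    json.dump(cid2text_2022_subset, f)

print(f"Kept {len(cid2text_2022_subset)} of {len(cid2text)} CIDs")
```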

@Syzseisus
Author

Thank you so much for the incredibly quick response.

I’m reopening this issue, even though the question might seem minor, because I am trying to reproduce the results from your paper.

I suspect that because the clean_up_description function has not been updated to handle the additional data, performance could drop despite the increase in data.

Given the current situation, what preprocessing steps would you recommend to ensure I get results closer to those in the original paper?

@chao1224
Owner

Hi @Syzseisus,

Since the checkpoints have been released, you should be able to reproduce the results on the downstream tasks.

@Syzseisus
Author

I’m aware that you’ve provided a checkpoint for the pretrained model.

However, for my research, I’m looking to reproduce the pretraining process itself.

Thank you for your help.
