Clarification Needed in Preprocessing PubChem dataset #29

Open
Syzseisus opened this issue Jul 15, 2024 · 8 comments

@Syzseisus

Dear authors,

Thanks for the exciting work.
While working with the code for preprocessing the PubChem dataset, I came across a specific line that I find confusing. Could you please clarify its purpose?

File: ./preprocessing/PubChem/step_01_description_extraction.py
Line number: 160
Code on that line: assert description_data["TotalPages"] == total_page_num

I've observed that the variable total_page_num is set to 290, but the execution result shows description_data["TotalPages"] as 422. When I commented out this line, the code ran without any issues.
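
For context, here is a rough reproduction of the check as I understand it; the endpoint and response layout are my guesses about what the script does around line 160, not the script's actual code:

```python
# Rough reproduction (assumed, not the repository's code): ask PubChem's
# PUG View annotations endpoint for the first page of "Record Description"
# entries and compare the reported TotalPages with the hardcoded 2022 value.
import requests

url = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON"
    "?heading_type=Compound&heading=Record%20Description&page=1"
)
total_page_num = 290  # value hardcoded in the script

description_data = requests.get(url).json()["Annotations"]

# This is the assertion that now fails: PubChem reports 422 pages as of
# October 2024, so the comparison against 290 raises AssertionError.
assert description_data["TotalPages"] == total_page_num
```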

I'm not sure why this line is necessary and how it fits into the overall functionality of the script. Understanding its purpose would help me a lot in my current work and in contributing more effectively to the project.

Thank you for your assistance!

Best regards,

Syzseisus

@Syzseisus
Author

P.S. When I ran the code with that line commented out, the result was:

Total CID (with raw name) 242673
Total CID (with extracted name) 244717
Total CID 244889

@chao1224
Owner

Hi @Syzseisus, this is because we constructed the script in 2022. PubChem keeps updating its data, so TotalPages is larger now than it was then.

BTW, in the README we mentioned this:

python step_01_description_extraction.py. This step extracts and merge all the textual descriptions into a single json file. We run this on May 30th, 2022. The APIs will keep updating, so you may have slightly different versions if you run this script yourself.
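
If you want to rerun the extraction against today's data, one option (a sketch on my side, not the repository's code) is to read TotalPages from the first response and loop over whatever PubChem currently reports, instead of asserting against the 2022 value:

```python
# Sketch only: paginate the PUG View "Record Description" annotations using
# the TotalPages value reported by the API itself rather than a hardcoded
# page count. The endpoint and response layout are assumed.
import requests

BASE_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON"
    "?heading_type=Compound&heading=Record%20Description&page={page}"
)

def fetch_all_description_pages():
    first = requests.get(BASE_URL.format(page=1)).json()["Annotations"]
    total_pages = first["TotalPages"]  # 290 in May 2022, 422 in October 2024
    pages = [first]
    for page in range(2, total_pages + 1):
        data = requests.get(BASE_URL.format(page=page)).json()["Annotations"]
        pages.append(data)
    return pages
```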

@Syzseisus
Author

Thank you for your quick response!

@Syzseisus
Author

Hello again. As I mentioned, TotalPages is 422 as of October 31, 2024. However, the problem cannot be fixed by simply changing total_page_num in ./preprocessing/PubChem/step_01_description_extraction.py to 422: as the number of pages grows, the specific cases handled in the clean_up_description function may also need to be updated.
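
As a first pass at spotting such new cases, something like the sketch below might help; the layout of CID2text_raw.json (CID mapped to a list of raw description strings) is my assumption about the extraction output:

```python
# Hypothetical spot-check: count the most frequent opening phrases in the
# raw descriptions, so that boilerplate introduced since 2022 can be
# compared against the special cases clean_up_description already handles.
import json
from collections import Counter

with open("CID2text_raw.json") as f:  # assumed: CID -> list of raw strings
    cid2text_raw = json.load(f)

openings = Counter()
for descriptions in cid2text_raw.values():
    for text in descriptions:
        openings[" ".join(text.split()[:4]).lower()] += 1

for phrase, count in openings.most_common(30):
    print(f"{count:6d}  {phrase}")
```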

However, I understand that such an update would require a lot of expert time.
Could you therefore provide the "CID2name_raw.json", "CID2name.json", "CID2text_raw.json", "CID2text.json", "CID2SMILES.csv", and "molecules.sdf" files as preprocessed on May 30, 2022?
Alternatively, could you provide the "281K chemical structure and text pairs" themselves, mentioned in the "Results" section on page 3 of the paper and referenced among the many if-cases below line 243 of ./scripts/pretrain.py?

Thank you again for your hard work and wonderful research.

Sincerely, Syzseisus

@Syzseisus reopened this on Oct 30, 2024
@chao1224
Owner

Hi @Syzseisus,

  • Four out of six files you mentioned have already been uploaded to this HuggingFace link.
  • The two other files (CID2text_raw.json and CID2text.json) cannot be released due to a policy issue from PubChem.

The specific cases could be different, but at least the special cases discussed in the paper can still be handled by these lines of the script.
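
If it helps, one way to approximate the May 2022 pairs using the released files is sketched below; the column name "CID" in CID2SMILES.csv and the CID-to-list-of-strings layout of CID2text.json are assumptions on my side, so adjust to the actual formats:

```python
# Sketch: restrict a freshly re-extracted CID2text.json to the CIDs present
# in the released CID2SMILES.csv from the May 2022 run, so the text side
# roughly matches the original snapshot.
import json
import pandas as pd

released_cids = set(pd.read_csv("CID2SMILES.csv")["CID"].astype(str))

with open("CID2text.json") as f:  # produced by rerunning the extraction today
    cid2text = json.load(f)

cid2text_2022_subset = {
    cid: texts for cid, texts in cid2text.items() if cid in released_cids
}

with open("CID2text_2022_subset.json", "w") as f:
    json.dump(cid2text_2022_subset, f)

print(f"Kept {len(cid2text_2022_subset)} of {len(cid2text)} CIDs")
```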

@Syzseisus
Author

Thank you so much for the incredibly quick response.

I’m reopening this issue, even though the question might seem minor, because I am trying to reproduce the results from your paper.

I suspect that because the clean_up_description function has not been updated to handle the additional data, performance could drop despite the increase in data.

Given the current situation, what preprocessing steps would you recommend to ensure I get results closer to those in the original paper?

@chao1224
Owner

Hi @Syzseisus,

Since the checkpoints have been released, you should be able to reproduce the results on the downstream tasks.

@Syzseisus
Author

I’m aware that you’ve provided a checkpoint for the pretrained model.

However, for my research, I’m looking to reproduce the pretraining process itself.

Thank you for your help.
