Clarification Needed in Preprocessing PubChem dataset #29
P.S. When I ran the code with that line commented out, the result was as follows:
|
Hi @Syzseisus, this is because we wrote the script in 2022, and the PubChem group has been updating the data since then. In the README, we mentioned this:
|
Thank you for your quick response! |
Hello again, as I mentioned, it seems to be However, I know that this update will require a lot of expert time. Thank you again for your hard work and wonderful research. Sincerely, Syzseisus |
Hi @Syzseisus,
The specific cases may differ, but at least the special cases discussed in the paper can still be handled by these lines of the script. |
Thank you so much for the incredibly quick response. I’m reopening an issue, even though the question might seem minor, because I am trying to reproduce the results from your paper. I suspect that due to the Given the current situation, what preprocessing steps would you recommend to ensure I get results closer to those in the original paper? |
Hi @Syzseisus, Since the checkpoints have been reproduced, you should be able to reproduce the results on downstream tasks. |
I’m aware that you’ve provided a checkpoint for the pretrained model. However, for my research, I’m looking to reproduce the pretraining process itself. Thank you for your help. |
Dear authors,
Thanks for the exciting work.
While working with the code of preprocessing PubChem dataset, I came across a specific line in a file that I find confusing. Could you please clarify its purpose?
File: ./preprocessing/PubChemstep_01_description_extraction.py
Line number: 160
Code on that line:
assert description_data["TotalPages"] == total_page_num
I've observed that the variable total_page_num is set to 290, but the execution result shows description_data["TotalPages"] as 422. When I commented out this line, the code ran without any issues.
I'm not sure why this line is necessary and how it fits into the overall functionality of the script. Understanding its purpose would help me a lot in my current work and in contributing more effectively to the project.
Thank you for your assistance!
Best regards,
Syzseisus
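For anyone hitting the same assertion, one possible workaround is to read the page count from the first response instead of hard-coding it, and to warn rather than fail when it differs from the value recorded when the script was written. This is only a minimal sketch, not the authors' method. The names iter_annotation_pages, fetch_page, and fake_fetch are hypothetical, and the response layout (an "Annotations" object holding "Page" and "TotalPages", with pages numbered from 1) is an assumption; only description_data["TotalPages"] appears in the issue itself.

```python
import warnings
from typing import Callable, Dict, Iterator, Optional


def iter_annotation_pages(
    fetch_page: Callable[[int], Dict],
    expected_total_pages: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield each page's payload, sizing the loop from the first response."""
    # Assumption: the payload sits under "Annotations" and pages start at 1.
    first = fetch_page(1)["Annotations"]
    total_pages = first["TotalPages"]
    if expected_total_pages is not None and total_pages != expected_total_pages:
        # Warn instead of asserting, so upstream growth does not stop the run.
        warnings.warn(
            f"TotalPages is {total_pages}, expected {expected_total_pages}; "
            "the upstream data has probably changed since the script was written."
        )
    yield first
    for page_num in range(2, total_pages + 1):
        page = fetch_page(page_num)["Annotations"]
        # Per-page sanity check: the server should return the page requested.
        assert page["Page"] == page_num
        yield page


def fake_fetch(page_num: int) -> Dict:
    # Hypothetical stand-in for the HTTP request, so no network is needed.
    return {"Annotations": {"Page": page_num, "TotalPages": 3, "Annotation": []}}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pages = list(iter_annotation_pages(fake_fetch, expected_total_pages=2))
```

In a real run, fetch_page would wrap the request the script already makes. Note that this only keeps the script running against the current data; it does not recover the 2022 snapshot, so the extracted dataset may still differ from the one used in the paper.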