The sample datasets are distributed with the library and can also be found in the repository. The sources of the datasets are cited below.
Name Formatting: type_size_name_num_of_classes.csv
- type: R (numerical) or C (categorical)
- size: Number of instances in the dataset
- name: Name of dataset
- num_of_classes: Number of classes (Categorical only)
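The naming convention above can be decoded programmatically. The following is a minimal sketch, not part of the library: the `DatasetInfo` type and the `parse_dataset_filename` helper are hypothetical names introduced here for illustration, and the code assumes that every file name follows the pattern exactly as described.

```python
from typing import NamedTuple, Optional


class DatasetInfo(NamedTuple):
    """Fields encoded in a sample dataset file name (hypothetical helper type)."""
    task_type: str              # "R" (numerical) or "C" (categorical)
    size: int                   # number of instances in the dataset
    name: str                   # dataset name
    num_classes: Optional[int]  # number of classes, None for numerical datasets


def parse_dataset_filename(filename: str) -> DatasetInfo:
    """Split 'type_size_name[_num_of_classes].csv' into its parts."""
    stem = filename.rsplit(".", 1)[0]  # drop the ".csv" extension
    parts = stem.split("_")
    task_type = parts[0]
    if task_type not in ("R", "C"):
        raise ValueError(f"Unknown dataset type: {task_type!r}")
    size = int(parts[1])
    if task_type == "C":
        # Categorical files end with the number of classes.
        name = "_".join(parts[2:-1])
        num_classes = int(parts[-1])
    else:
        # Numerical files carry no class count.
        name = "_".join(parts[2:])
        num_classes = None
    return DatasetInfo(task_type, size, name, num_classes)


info = parse_dataset_filename("C_1484_CLINTOX_2.csv")
print(info.name, info.size, info.num_classes)
```

For a numerical file such as `R_642_SAMPL.csv`, the same helper returns `num_classes=None`, since the class count is only present for categorical datasets.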
- Clintox dataset [1-4] (Toxicity) -> C_1484_CLINTOX_2.csv
- BACE dataset [5] (Inhibitor) -> C_1513_BACE_2.csv
- BBBP dataset [6] (Blood-brain barrier penetration) -> C_2039_BBBP_2.csv
- HIV dataset [7] -> C_41127_HIV_2.csv
- HIV dataset [7] -> C_41127_HIV_3.csv
- SAMPL dataset [8] (Hydration free energy) -> R_642_SAMPL.csv
- BACE dataset [5] (Binding affinity) -> R_1513_BACE.csv
- LOGP dataset [9] (Lipophilicity) -> R_4200_LOGP.csv
- LOGS dataset [10] (Aqueous Solubility) -> R_1291_LOGS.csv
- AQSOLDB dataset [11] (Aqueous Solubility) -> R_9982_AQSOLDB.csv
Note: The first eight datasets listed above (CLINTOX through LOGP) are edited versions of datasets from the MoleculeNet repository [12].
[1] Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. "A data-driven approach to predicting successes and failures of clinical trials." Cell chemical biology 23.10 (2016): 1294-1301.
[2] Artemov, Artem V., et al. "Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes." bioRxiv (2016): 095653.
[3] Novick, Paul A., et al. "SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery." PloS one 8.11 (2013): e79568.
[4] Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database
[5] Subramanian, Govindan, et al. "Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches." Journal of chemical information and modeling 56.10 (2016): 1936-1949.
[6] Martins, Ines Filipa, et al. "A Bayesian approach to in silico blood-brain barrier penetration modeling." Journal of chemical information and modeling 52.6 (2012): 1686-1697.
[7] AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data
[8] Mobley, David L., and J. Peter Guthrie. "FreeSolv: a database of experimental and calculated hydration free energies, with input files." Journal of computer-aided molecular design 28.7 (2014): 711-720.
[9] Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361
[10] Huuskonen, J. "Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology." Journal of chemical information and computer sciences 40.3 (2000): 773-777.
[11] Sorkun, M. C., A. Khetan, and S. Er. "AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds." Scientific data 6.1 (2019): 1-8.
[12] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.