| 1 | +# Pseudocode and Explanation for CancerMine |
| 2 | + |
| 3 | +This file gives an overview of the following core files. Most of the functionality is handled by the [Kindred Python package](https://github.com/jakelever/kindred), which is described in its [documentation](http://kindred.readthedocs.io) and [associated paper](http://aclweb.org/anthology/W17-2322). |
| 4 | + |
| 5 | +- **buildModels.py** : Use training data to build Kindred relation classifier models |
| 6 | +- **wordlistLoader.py** : Prepare parsed wordlists from gene names, cancer types and more |
| 7 | +- **parseAndFindEntities.py** : Parse documents and find sentences that mention cancer types and gene names with additional filtering |
| 8 | +- **applyModelsToSentences.py** : Apply Kindred relation classifiers to find mentions of drivers, oncogenes and tumor suppressors |
| 9 | +- **filterAndCollate.py** : Filter for mentions with higher certainty and collate them for counts |
| 10 | + |
| 11 | +## buildModels.py |
| 12 | + |
| 13 | +- For each relation type (Driver, Oncogene, Tumor_Suppressor) |
| 14 | + - Load the Kindred corpus (1500 annotated sentences) |
| 15 | + - Strip all relations that do not match the relation type of interest |
| 16 | + - Create a Kindred classifier with a logistic regression model and threshold of 0.5 |
| 17 | + - Train it on the filtered corpus |
| 18 | + - Save the classifier to a file |
| 19 | + |
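The relation-stripping step above can be illustrated without Kindred. This is a minimal sketch, assuming a toy corpus of plain dictionaries; the field names (`text`, `relations`, `type`) and the helper `strip_other_relations` are hypothetical and are not Kindred's data model. Training and saving the classifier are left out.

```python
from copy import deepcopy

RELATION_TYPES = ["Driver", "Oncogene", "Tumor_Suppressor"]

def strip_other_relations(corpus, relation_type):
    """Return a copy of the corpus that keeps only relations of one type."""
    filtered = deepcopy(corpus)  # leave the loaded corpus untouched for the next relation type
    for sentence in filtered:
        sentence["relations"] = [
            r for r in sentence["relations"] if r["type"] == relation_type
        ]
    return filtered

# Toy stand-in for the annotated corpus (hypothetical structure)
corpus = [
    {"text": "KRAS is an oncogene in lung cancer.",
     "relations": [
         {"type": "Oncogene", "gene": "KRAS", "cancer": "lung cancer"},
         {"type": "Driver", "gene": "KRAS", "cancer": "lung cancer"},
     ]},
    {"text": "TP53 is a tumor suppressor in breast cancer.",
     "relations": [
         {"type": "Tumor_Suppressor", "gene": "TP53", "cancer": "breast cancer"},
     ]},
]

# One filtered corpus per relation type; each would be used to train its own classifier
per_type = {t: strip_other_relations(corpus, t) for t in RELATION_TYPES}
```

Sentences that lose all their relations are kept on purpose in this sketch, since they can act as negative examples for that relation type.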
| 20 | +## wordlistLoader.py |
| 21 | + |
| 22 | +- Take in wordlists for genes, cancers, drugs (to identify ambiguity) and conflicting terms |
| 23 | +- Get Kindred to parse them and prepare a data structure ready for matching |
| 24 | +- Save it to a file as a Python pickle |
| 25 | + |
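As a rough illustration of the idea, the sketch below builds a lookup from tokenized terms to entity types and round-trips it through pickle. It assumes whitespace tokenization and lowercasing, which is simpler than Kindred's own parsing; `build_lookup` and the sample terms are hypothetical.

```python
import io
import pickle

def build_lookup(wordlists):
    """Map each lowercased, whitespace-tokenized term to the entity types it may denote."""
    lookup = {}
    for entity_type, terms in wordlists.items():
        for term in terms:
            key = tuple(term.lower().split())
            lookup.setdefault(key, set()).add(entity_type)
    return lookup

# Hypothetical sample wordlists
wordlists = {
    "gene": ["EGFR", "MET"],
    "cancer": ["lung cancer"],
    "conflicting": ["met"],  # the common English word clashes with the MET gene symbol
}
lookup = build_lookup(wordlists)

# Round-trip through pickle; an in-memory buffer stands in for the output file
buffer = io.BytesIO()
pickle.dump(lookup, buffer)
buffer.seek(0)
restored = pickle.load(buffer)
```

A term that maps to more than one entity type, such as `("met",)` here, is exactly the kind of ambiguity the drug and conflicting-term wordlists are meant to expose.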
| 26 | +## parseAndFindEntities.py |
| 27 | + |
| 28 | +- Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py |
| 29 | +- Read in a BioC corpus file (of abstracts or articles) in chunks: |
| 30 | + - Filter the corpus by removing documents that don't contain keywords (in filterTerms.txt) |
| 31 | + - Parse the documents |
| 32 | + - Annotate them with the EntityRecognizer (for cancer types, genes, etc) |
| 33 | + - For each sentence in each document |
| 34 | + - Ignore if it doesn't contain any of the keywords (in filterTerms.txt) |
| 35 | + - Check if a gene and cancer are mentioned in the sentence and add to output with metadata if so |
| 36 | +- Dump all matching sentences to an output JSON file, with metadata on the source of each sentence |
| 37 | + |
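The two-level keyword filter and the gene/cancer co-occurrence check can be sketched as follows. This is a simplified stand-in, assuming pre-split sentences and naive case-insensitive substring matching instead of Kindred's parser and EntityRecognizer; the function name, field names and sample documents are hypothetical.

```python
import json

def find_candidate_sentences(documents, filter_terms, genes, cancers):
    """Keep sentences that contain a filter term, a gene and a cancer type."""
    filter_terms = [t.lower() for t in filter_terms]
    results = []
    for doc in documents:
        # Document-level filter: skip documents with no keyword at all
        doc_text = " ".join(doc["sentences"]).lower()
        if not any(t in doc_text for t in filter_terms):
            continue
        for sentence in doc["sentences"]:
            lowered = sentence.lower()
            # Sentence-level filter: the keyword must appear in this sentence
            if not any(t in lowered for t in filter_terms):
                continue
            found_genes = [g for g in genes if g.lower() in lowered]
            found_cancers = [c for c in cancers if c.lower() in lowered]
            if found_genes and found_cancers:
                results.append({
                    "source_id": doc["source_id"],
                    "sentence": sentence,
                    "genes": found_genes,
                    "cancers": found_cancers,
                })
    return results

# Hypothetical input documents
documents = [
    {"source_id": "doc-1",
     "sentences": ["EGFR is a driver of lung cancer.",
                   "EGFR was sequenced in lung cancer samples."]},
    {"source_id": "doc-2",
     "sentences": ["EGFR expression was measured in lung cancer."]},
]

candidates = find_candidate_sentences(documents, ["driver"], ["EGFR"], ["lung cancer"])
print(json.dumps(candidates, indent=2))
```

Filtering whole documents first is cheap and avoids parsing text that cannot yield a match, which matters when the corpus is read in chunks.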
| 38 | +## applyModelsToSentences.py |
| 39 | + |
| 40 | +- Load all the models created by buildModels.py |
| 41 | +- Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py |
| 42 | +- Open the JSON file with sentences |
| 43 | +- Parse them and annotate them with the EntityRecognizer (for cancer types, genes, etc) |
| 44 | +- Apply the Kindred relation classifier models to this corpus |
| 45 | +- Iterate over every relation extracted |
| 46 | + - Normalize gene names and cancer names where possible |
| 47 | + - Output the relation with all metadata and normalized terms |
| 48 | + |
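The "normalize where possible" step might look like the sketch below. It assumes normalization is a synonym lookup that yields `None` when no canonical form is known; the synonym tables, field names and `normalize_relation` helper are hypothetical and do not reflect how the script stores its term mappings.

```python
# Hypothetical synonym tables mapping lowercased surface forms to canonical names
GENE_SYNONYMS = {"her2": "ERBB2", "erbb2": "ERBB2"}
CANCER_SYNONYMS = {"nsclc": "non-small cell lung carcinoma"}

def normalize(term, synonyms):
    """Return the canonical name for a term, or None if it is unknown."""
    return synonyms.get(term.strip().lower())

def normalize_relation(relation):
    """Copy a relation and attach normalized gene and cancer names where possible."""
    out = dict(relation)
    out["gene_normalized"] = normalize(relation["gene"], GENE_SYNONYMS)
    out["cancer_normalized"] = normalize(relation["cancer"], CANCER_SYNONYMS)
    return out

relation = {"type": "Oncogene", "gene": "HER2", "cancer": "breast cancer", "probability": 0.9}
normalized = normalize_relation(relation)
```

Keeping the original surface forms alongside the normalized ones means a failed lookup loses no information in the output.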
| 49 | +## filterAndCollate.py |
| 50 | + |
| 51 | +- Define thresholds for relations: |
| 52 | + - Driver = 0.80 |
| 53 | + - Oncogene = 0.76 |
| 54 | + - Tumor_Suppressor = 0.92 |
| 55 | +- Iterate over the combined outputs of all runs of applyModelsToSentences.py |
| 56 | + - Check whether the probability of the relation (as output by the model) is above the threshold for its relation type |
| 57 | + - Get the core relation info of relation type, cancer type and gene name |
| 58 | + - Create a matching ID key that can link back to this core relation info |
| 59 | + - Record the PubMed ID as a citation for this core relation info |
| 60 | +- Output all the core relations with the number of citations |
| 61 | +- Output all the sentences with an additional field of the sentence with simple HTML formatting for the location of entities |
| 62 | + |
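The thresholding and collation steps above can be sketched as follows, using the thresholds listed in this section. It assumes that "above" means strictly greater than the threshold and that citations are counted as distinct PubMed IDs per (relation type, cancer type, gene name) key; the field names and sample mentions are hypothetical.

```python
from collections import defaultdict

# Thresholds from the pseudocode above
THRESHOLDS = {"Driver": 0.80, "Oncogene": 0.76, "Tumor_Suppressor": 0.92}

def filter_and_collate(mentions):
    """Drop low-probability mentions and count distinct PubMed IDs per core relation."""
    citations = defaultdict(set)
    kept = []
    for m in mentions:
        # Assumption: a mention must be strictly above its relation type's threshold
        if m["probability"] <= THRESHOLDS[m["relation_type"]]:
            continue
        key = (m["relation_type"], m["cancer"], m["gene"])
        citations[key].add(m["pmid"])
        kept.append(m)
    counts = {key: len(pmids) for key, pmids in citations.items()}
    return counts, kept

# Hypothetical mentions; the PubMed IDs are placeholders
mentions = [
    {"relation_type": "Oncogene", "cancer": "lung cancer", "gene": "KRAS",
     "probability": 0.90, "pmid": "100"},
    {"relation_type": "Oncogene", "cancer": "lung cancer", "gene": "KRAS",
     "probability": 0.80, "pmid": "101"},
    {"relation_type": "Oncogene", "cancer": "lung cancer", "gene": "KRAS",
     "probability": 0.95, "pmid": "100"},  # same paper as the first mention
    {"relation_type": "Tumor_Suppressor", "cancer": "breast cancer", "gene": "TP53",
     "probability": 0.90, "pmid": "102"},  # below the 0.92 threshold
]

counts, kept = filter_and_collate(mentions)
```

Storing PubMed IDs in a set means several sentences from the same paper count as a single citation, while every passing sentence is still kept for the sentence-level output.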