Commit 0426179

Add pseudocode
1 parent 4a8597e commit 0426179

1 file changed

+62
-0
lines changed

pseudocode.md

@@ -0,0 +1,62 @@
# Pseudocode and Explanation for CancerMine

This file gives an overview of the following core files. Most of the functionality is handled by the [Kindred Python package](https://github.com/jakelever/kindred), which is described in the [documentation](http://kindred.readthedocs.io) and the [associated paper](http://aclweb.org/anthology/W17-2322).

- **buildModels.py** : Use training data to build Kindred relation classifier models
- **wordlistLoader.py** : Prepare parsed wordlists of gene names, cancer types and more
- **parseAndFindEntities.py** : Parse documents and find sentences that mention cancer types and gene names, with additional filtering
- **applyModelsToSentences.py** : Apply Kindred relation classifiers to find mentions of drivers, oncogenes and tumor suppressors
- **filterAndCollate.py** : Filter for mentions with higher certainty and collate them for counts

## buildModels.py

- For each relation type (Driver, Oncogene, Tumor_Suppressor)
  - Load the Kindred corpus (1500 annotated sentences)
  - Strip all relations that do not match the relation type of interest
  - Create a Kindred classifier with a logistic regression model and threshold of 0.5
  - Train it on the filtered corpus
  - Save the classifier to a file
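
The relation-stripping step in the loop above can be sketched without Kindred. In this minimal sketch, sentences are plain dicts with a hypothetical `relations` list; `filter_relations` and the sample data are illustrative assumptions rather than Kindred's API, and the classifier training and saving steps are left out.

```python
import copy

RELATION_TYPES = ["Driver", "Oncogene", "Tumor_Suppressor"]

def filter_relations(corpus, relation_type):
    """Return a deep copy of the corpus keeping only relations of one type."""
    filtered = copy.deepcopy(corpus)
    for sentence in filtered:
        sentence["relations"] = [
            r for r in sentence["relations"] if r["type"] == relation_type
        ]
    return filtered

# Hypothetical annotated sentences standing in for the Kindred corpus
corpus = [
    {"text": "KRAS is an oncogene that drives lung cancer.",
     "relations": [
         {"type": "Oncogene", "gene": "KRAS", "cancer": "lung cancer"},
         {"type": "Driver", "gene": "KRAS", "cancer": "lung cancer"},
     ]},
    {"text": "TP53 acts as a tumor suppressor in breast cancer.",
     "relations": [
         {"type": "Tumor_Suppressor", "gene": "TP53", "cancer": "breast cancer"},
     ]},
]

# One filtered training set per relation type; the original corpus is untouched
training_sets = {rt: filter_relations(corpus, rt) for rt in RELATION_TYPES}
```

Copying the corpus before filtering means each relation type starts from the full set of annotations, which mirrors reloading the corpus on every pass of the loop.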
19+
20+
## wordlistLoader.py
21+
22+
- Take in wordlists for genes, cancers, drugs (to identify ambigiuity) and conflicting terms
23+
- Get Kindred to parse them and prepare a data structure ready for matching
24+
- Save it to a file as a Python pickle
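
As a rough illustration of the idea, the sketch below builds a term lookup from several wordlists and round-trips it through `pickle`. The wordlist layout, the identifiers (`G1`, `D1`, ...) and the placeholder term `examplin` are assumptions made for this example; the real script delegates parsing to Kindred and its data structure may differ.

```python
import pickle
from collections import defaultdict

def build_lookup(wordlists):
    """Map each lowercased term to the set of (entity_type, identifier) pairs it may denote."""
    lookup = defaultdict(set)
    for entity_type, entries in wordlists.items():
        for identifier, synonyms in entries.items():
            for term in synonyms:
                lookup[term.lower()].add((entity_type, identifier))
    return dict(lookup)

# Hypothetical miniature wordlists; "examplin" is a made-up term listed as
# both a gene and a drug to show how ambiguity becomes visible in the lookup
wordlists = {
    "gene": {"G1": ["KRAS", "K-ras"], "G2": ["examplin"]},
    "cancer": {"C1": ["lung cancer", "lung carcinoma"]},
    "drug": {"D1": ["examplin"]},
}

lookup = build_lookup(wordlists)

# Terms that map to more than one entity are candidates for ambiguity handling
ambiguous_terms = sorted(t for t, entities in lookup.items() if len(entities) > 1)

# Serialize and restore, as a stand-in for saving the structure to a pickle file
restored = pickle.loads(pickle.dumps(lookup))
```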

## parseAndFindEntities.py

- Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py
- Read in a BioC corpus file (of abstracts or articles) in chunks:
  - Filter the corpus by removing documents that don't contain keywords (in filterTerms.txt)
  - Parse the documents
  - Annotate them with the EntityRecognizer (for cancer types, genes, etc.)
  - For each sentence in each document
    - Ignore if it doesn't contain any of the keywords (in filterTerms.txt)
    - Check if a gene and cancer are mentioned in the sentence and add to output with metadata if so
- Dump all matching sentences to output JSON with metadata of the source of the sentence
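
The two-level filtering above (first documents, then sentences) can be sketched as follows. Naive lowercase substring matching stands in for Kindred's parsing and entity recognition, and the keyword set, the term sets, the `pmid` key and the sample documents are all hypothetical stand-ins, not the contents of filterTerms.txt or the real wordlists.

```python
import json

FILTER_TERMS = {"oncogene", "tumor suppressor", "driver"}  # hypothetical keywords
GENES = {"kras", "tp53"}                                   # hypothetical gene terms
CANCERS = {"lung cancer", "breast cancer"}                 # hypothetical cancer terms

def contains_any(text, terms):
    """True if any term occurs in the lowercased text."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

def find_candidate_sentences(documents):
    matches = []
    for doc in documents:
        # Document-level filter: skip documents with no keyword at all
        if not contains_any(" ".join(doc["sentences"]), FILTER_TERMS):
            continue
        for sentence in doc["sentences"]:
            # Sentence-level filter: the keyword must occur in this sentence
            if not contains_any(sentence, FILTER_TERMS):
                continue
            # Keep only sentences that mention both a gene and a cancer type
            if contains_any(sentence, GENES) and contains_any(sentence, CANCERS):
                matches.append({"pmid": doc["pmid"], "sentence": sentence})
    return matches

documents = [
    {"pmid": "0000001",
     "sentences": ["KRAS is a known oncogene in lung cancer.",
                   "Samples were collected in 2015."]},
    {"pmid": "0000002",
     "sentences": ["TP53 was sequenced in breast cancer samples."]},
]

results = find_candidate_sentences(documents)
output = json.dumps(results, indent=2)
```

The document-level check is cheap and runs before any per-sentence work, which is presumably why the pseudocode filters the corpus before parsing it.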

## applyModelsToSentences.py

- Load all the models created by buildModels.py
- Create a Kindred parser and EntityRecognizer with the terms prepared by wordlistLoader.py
- Open the JSON file with sentences
- Parse them and annotate them with the EntityRecognizer (for cancer types, genes, etc.)
- Apply the Kindred relation classifier models to this corpus
- Iterate over every relation extracted
  - Normalize gene names and cancer names where possible
  - Output the relation with all metadata and normalized terms
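
One way to read "normalize where possible" is a synonym lookup that falls back to the original term when no canonical name is known. The sketch below assumes exactly that; the synonym tables, the field names and `normalize_relation` are hypothetical and are not taken from the script itself.

```python
# Hypothetical synonym tables, for illustration only
GENE_SYNONYMS = {"kras": "KRAS", "k-ras": "KRAS", "tp53": "TP53", "p53": "TP53"}
CANCER_SYNONYMS = {"lung cancer": "lung cancer", "lung carcinoma": "lung cancer"}

def normalize(term, synonyms):
    """Return the canonical name for a term, or the term unchanged if unknown."""
    return synonyms.get(term.lower(), term)

def normalize_relation(relation):
    """Copy a relation and attach normalized gene and cancer names."""
    normalized = dict(relation)
    normalized["gene_normalized"] = normalize(relation["gene"], GENE_SYNONYMS)
    normalized["cancer_normalized"] = normalize(relation["cancer"], CANCER_SYNONYMS)
    return normalized

known = normalize_relation(
    {"type": "Oncogene", "gene": "K-ras", "cancer": "Lung carcinoma"}
)
unknown = normalize_relation(
    {"type": "Driver", "gene": "GENEX", "cancer": "some rare tumor"}
)
```

Keeping the original fields alongside the normalized ones preserves the text as it appeared in the sentence, which the later output steps still need.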

## filterAndCollate.py

- Define thresholds for relations:
  - Driver = 0.80
  - Oncogene = 0.76
  - Tumor_Suppressor = 0.92
- Iterate over the combined outputs of all runs of applyModelsToSentences.py
  - Check the probability of the relation (as output by the model) and see if it is above the required threshold
  - Get the core relation info of relation type, cancer type and gene name
  - Create a matching ID key that can link back to this core relation info
  - Add the PubMed ID to the number of citations for this core relation info
- Output all the core relations with the number of citations
- Output all the sentences with an additional field containing the sentence with simple HTML formatting that marks the location of entities
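
The thresholding and citation counting can be sketched as below. The thresholds are the ones listed above; the record fields, the strict `>` comparison and the choice to count each PubMed ID once per relation (via a set) are assumptions of this sketch, not confirmed details of the script.

```python
from collections import defaultdict

THRESHOLDS = {"Driver": 0.80, "Oncogene": 0.76, "Tumor_Suppressor": 0.92}

def collate(relations):
    """Count distinct PubMed IDs per (relation type, cancer, gene) above threshold."""
    pmids_by_key = defaultdict(set)
    for rel in relations:
        # Assumption: a relation passes only if strictly above its threshold
        if rel["probability"] <= THRESHOLDS[rel["type"]]:
            continue
        key = (rel["type"], rel["cancer"], rel["gene"])
        pmids_by_key[key].add(rel["pmid"])
    return {key: len(pmids) for key, pmids in pmids_by_key.items()}

# Hypothetical classifier output with placeholder PubMed IDs
relations = [
    {"type": "Oncogene", "cancer": "lung cancer", "gene": "KRAS",
     "probability": 0.95, "pmid": "1"},
    {"type": "Oncogene", "cancer": "lung cancer", "gene": "KRAS",
     "probability": 0.81, "pmid": "2"},
    {"type": "Oncogene", "cancer": "lung cancer", "gene": "KRAS",
     "probability": 0.97, "pmid": "1"},   # same paper again, counted once
    {"type": "Tumor_Suppressor", "cancer": "breast cancer", "gene": "TP53",
     "probability": 0.90, "pmid": "3"},   # below the 0.92 threshold, dropped
]

counts = collate(relations)
```

Using the `(type, cancer, gene)` tuple as the dictionary key plays the role of the matching ID key: any sentence-level record can rebuild the same tuple to find its collated entry.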