Trove refactor commit
jason-fries committed Nov 27, 2020
1 parent ed9da6f commit 0785777
Showing 64 changed files with 13,357 additions and 0 deletions.
12 changes: 12 additions & 0 deletions .gitignore
@@ -1,3 +1,15 @@
# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.d

# Trove specific cache
.trove/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
39 changes: 39 additions & 0 deletions README.md
@@ -0,0 +1,39 @@
# Trove

**Preprint**: [Trove: Ontology-driven Weak Supervision for Medical Entity Classification](https://arxiv.org/abs/2008.01972)

See the `manuscript` branch for the preprint's code.


Trove is a weakly supervised framework for training medical named entity recognition (NER) classifiers without hand-labeled training data.
***Trove is currently in-development software!*** Please let us know if you find bugs.

## Installation

Requirements: Python 3.6, PyTorch 1.0+, Snorkel 0.9.5+

## Tutorials

See `tutorials/`

## Requirements

Tested on OSX and Linux.

## Citation
If you use Trove in your research, please cite:

```bibtex
@ARTICLE{Fries2020-wg,
title = "Trove: Ontology-driven weak supervision
for medical entity classification",
author = "Fries, Jason A and Steinberg, Ethan and Khattar, Saelig and
Fleming, Scott L and Posada, Jose and Callahan, Alison and Shah,
Nigam H",
journal = "ArXiv",
month = aug,
year = 2020,
language = "en"
}
```

11 changes: 11 additions & 0 deletions applications/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Contributed Applications

### NOTE: These are currently being refactored.

Labeling functions for various weakly supervised biomedical classification tasks

- `bc5cdr/` - BioCreative V Chemical Disease Relations Chemical and Disease NER (literature)
- `i2b2drugs/` - n2c2 (formerly i2b2) Drug NER (clinical)
- `shareclef2014/` - ShARe/CLEF 2014 Disorder NER (clinical)
- `thyme/` - DocTimeRel (clinical)
- `covid19/` - COVID-19 exposure (clinical)
40 changes: 40 additions & 0 deletions applications/bc5cdr/cdr_chemical_regexes.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
LABEL NAME TIER IGNORE_CASE REGEX NOTES
0 non_chemicals 4 1 \b([A-Za-z0-9]+?[rlntd]ase[s]*)\b
0 non_chemicals 4 1 [A-Za-z0-9]+ factor[s]*
0 non_chemicals 4 1 (angiotensinogen)
0 non_chemicals 4 1 (guarana|(panax )*ginseng)
0 non_chemicals 4 1 \b(anti[a-z]+)\b
0 non_chemicals 4 1 (cocaine (abuse|addiction|overdose))
0 non_chemicals 4 1 ((renal)\s*(angiotensinogen) (mRNA|expression))
0 non_chemicals 4 1 (atrial natriuretic factor( \[\s*ANF\s*\])*)
0 non_chemicals 4 1 (fibrinolysis inhibitor[s]*)
0 non_chemicals 4 1 ((brain )*biogenic amines)
0 non_chemicals 4 1 ([-]\s*(associated|dependent|related|treated|acting|controlled|induced|containing|fold|increasing|adjusted|month|specific))
1 misc_chemicals 4 1 (oral contraceptives)
1 misc_chemicals 4 1 ([A-Z]){2}[0-9]{3,}
1 misc_chemicals 4 1 \b(ACEi|ACE inhibitor[s]*)\b
1 misc_chemicals 4 1 (corticosteroid[s]*|(oral[- ])*contraceptive[s]*)
1 misc_chemicals 4 1 (calcium|cacl[(]2[)])
1 misc_chemicals 4 1 ([l][- ](glutathione|arginine))
1 misc_chemicals 4 1 (appetite[- ]suppressant[s]*( drugs)*)
1 misc_chemicals 4 1 (calcium channel blocker[s]*|calcium chloride|CaCl)
1 misc_chemicals 4 1 (simvastatin[- ]ezetimibe)
1 misc_chemicals 4 1 ([snp][-](perillyl alcohol|pyrimidinyl|choloroaniline|acetylcysteine|limonene))
1 misc_chemicals 4 1 ((alpha|beta|gamma)[-][T])
1 misc_chemicals 4 0 (PG[-]9|U[-]II)
1 misc_chemicals 4 0 (BPO|GSH|DFU|CsA|Srl|HOE|GVG|PAN|NMDA)
1 misc_chemicals 4 0 (TCR|MZ|HBsAg|AraG|LR132|SSRI[s]*|HBeAg|LR132|BD10[0-9]{2}|GNC92H2|SSR103800|CGRP)
1 misc_chemicals 4 1 (angiotensin([- ]ii)*)
1 misc_chemicals 4 1 (u[- ]ii|urotensin[- ]ii)
1 misc_chemicals 4 1 bradykinin comprised of 9 amino acids
1 misc_chemicals 4 1 (d[- ]pen(icillamine)*)
1 misc_chemicals 4 1 (lipopolysaccharide|alkylating agents)
1 misc_chemicals 4 1 (pegylated (interferon|IFN)( alpha[- ]2[ab])*)
1 misc_chemicals 4 1 (\[3H\])
1 misc_chemicals 4 1 (CaCl|LAM|GSH|PAN|H2O|AVP|LR132)
1 glue_tokens 4 1 (\[[0-9][-.])
1 glue_tokens 4 1 (\[\s*3H\s*\])
1 glue_tokens 4 1 (thiazolyl|amino|phenyl|ethyl|butyl|nonane|3H)[\]]
1 glue_tokens 4 1 ([-](methyl|ethyl|carboline|dimethoxy|alpha|beta|delta|gamma|glyceryl|thiazolyl)[-])
1 glue_tokens 4 0 [-]([1-9]|[A-Z])[-]
0 parentheses 4 1 [(](P|p|n)\s*([><=]+|(less|great)(er)*)|(ml|mg|kg|g|(year|day|month)[s]*)[)]|[(][0-9]+[%][)]
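
The rule table above has six tab-separated columns (`LABEL`, `NAME`, `TIER`, `IGNORE_CASE`, `REGEX`, `NOTES`). The following is a minimal sketch of how such a file could be loaded and applied to text. It is not Trove's own loader: `RegexRule`, `load_regex_rules`, and `match_rules` are hypothetical names, and the sketch assumes that `IGNORE_CASE` set to `1` means case-insensitive matching.

```python
import csv
import io
import re
from typing import List, NamedTuple, Pattern, Tuple


class RegexRule(NamedTuple):
    """One row of the rule table (hypothetical container, not a Trove class)."""
    label: int
    name: str
    tier: int
    pattern: Pattern[str]
    notes: str


def load_regex_rules(handle) -> List[RegexRule]:
    """Parse a tab-separated rule table with a header row."""
    rules = []
    # QUOTE_NONE keeps quote characters inside a regex from being interpreted.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: IGNORE_CASE == "1" requests case-insensitive matching.
        flags = re.IGNORECASE if row["IGNORE_CASE"] == "1" else 0
        rules.append(RegexRule(
            label=int(row["LABEL"]),
            name=row["NAME"],
            tier=int(row["TIER"]),
            pattern=re.compile(row["REGEX"], flags),
            notes=(row.get("NOTES") or "").strip(),
        ))
    return rules


def match_rules(rules: List[RegexRule], text: str) -> List[Tuple[int, int, int, str]]:
    """Return (start, end, label, rule name) for every match, sorted by offset."""
    hits = []
    for rule in rules:
        for m in rule.pattern.finditer(text):
            hits.append((m.start(), m.end(), rule.label, rule.name))
    return sorted(hits)


# Two rows copied from the table above, with real tab separators.
SAMPLE = (
    "LABEL\tNAME\tTIER\tIGNORE_CASE\tREGEX\tNOTES\n"
    "0\tnon_chemicals\t4\t1\t\\b(anti[a-z]+)\\b\t\n"
    "1\tmisc_chemicals\t4\t0\t(PG[-]9|U[-]II)\t\n"
)
rules = load_regex_rules(io.StringIO(SAMPLE))
text = "Antibodies did not react with pg-9 but PG-9 was detected."
hits = match_rules(rules, text)
```

Under that assumption, the case-insensitive `non_chemicals` rule matches `Antibodies`, while the case-sensitive `misc_chemicals` rule matches `PG-9` but skips the lowercase `pg-9`.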
31 changes: 31 additions & 0 deletions applications/bc5cdr/cdr_disease_regexes.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
LABEL NAME TIER IGNORE_CASE REGEX NOTES
1 char_grams 4 1 ^(psych|necro|nephro|hyper|throm|hypo|acro|hemo)[A-Za-z]+?([rlt]ic)$
1 char_grams 4 1 ^(hepato|hemato|nephro|cardio|neuro|myelo|oto)*toxic(ities|ity)*$
1 diseases_1 4 1 (([A-Za-z]+\s*) and ([A-Za-z]+\s*) ((neuro)*toxicity|injury|lesion[s]*|impairment|effusion[s]*|deficit[s]*))
1 diseases_1 4 1 ([A-Za-z]+) and ([A-Za-z]+) (insufficiency|(dysfunction|carcinoma|cancer|syndrome|disorder|disease)[s]*)
1 diseases_1 4 1 \b(non[-](small|hodgkin)|veno[-]occlusive|end[-]stage|HBV[-]HIV|Q[-]T)\b # hyphens
1 diseases_1 4 1 (increase[s]* in (blood pressure|heart rate|locomotor activity|dural( and cortical) blood flow)) # increases/decreases in X
1 diseases_1 4 1 ((reduction|decrease)[s]* in (MAP|glomerular number|(arterial )*blood pressure)) # increases/decreases in X
1 diseases_1 4 1 ((respiratory|hypothalamic|corticostriatal|tubular|biventricular|myocardial|hepatic|systolic|cranial nerve|sexual) dysfunction[s]*) # increases/decreases in X
1 diseases_1 4 1 (myocardial( cell)*|hepatocellular|mitochondrial|proteinuric|hippocampal|cerebellum|myocardial|neuronal|cardiac|hepatic|bladder|tissue|axonal|kidney|renal|liver|cord) (injury|damage) # injuries
1 diseases_1 4 1 (malignant ([A-Za-z]+ )*(glioma|tumor)[s]*)
1 diseases_1 4 1 (([A-Za-z]+)'s|wolff[- ]+parkinson[- ]+white|haemolytic[- ]+uraemic|guillain[- ]+barr|hematologic|cholestatic|rabbit)([- ]+like)* syndrome
1 diseases_1 4 1 diabetic( hyperalgesia)*|diabetes
1 diseases_1 4 1 (adenocarcinoma|calcification|angiosarcoma|enlargement|disorders|cirrhosis|carcinoma|cancer|injury) (in|of) the (central nervous system|oral cavity|bladder|artery|ureter|brain|aorta|liver) # anatomy findings
1 diseases_1 4 1 \b[A-Za-z-]+'s (syndrome|disease)\b # common disease patterns
1 diseases_1 4 1 ((artery )*calcification)|(calcification of the [A-Za-z]+) # common disease patterns
1 diseases_1 4 1 ([Dd]uchenne('s)* (muscular )*dystrophy|DMD) # dystrophy
1 diseases_1 4 1 (ventricular tachyarrhythmias|loss of consciousness|tachyarrhythmias|hyperhidrosis|hypertensive|cardiomegaly|weight gain|hypotension|weight loss|glucosuria|hoarding) # common findings
1 diseases_2 4 1 (hyperactive|convulsive|haemorrhage|depressed|deformation[s]*)
1 diseases_2 4 1 \b((sugar|drug) dependency|nicotine-induced nystagmus|nystagmus|NIN)\b
1 diseases_2 4 1 (weakness of extremities|transverse limb deficiency|increase in locomotor activity|palpebral twitching) # movement/muscle issues
1 diseases_2 4 1 (choreoathetoid movement[s]*|choreatiform hyperkinesias) # movement/muscle issues
1 diseases_2 4 1 (tender joints|tenderness|swelling|morning stiffness|excessive flexion) # movement/muscle issues
1 diseases_2 4 1 (valve|valvular|valvular heart) (regurgitation|abnormalit(y|ies)) # cardiac
1 diseases_2 4 1 (atherosclerotic obstruction|cardiac remodelling) # cardiac
1 diseases_2 4 1 (cholestatic|renovascular|renal and kidney) disease[s]* # neurological/renal
1 diseases_2 4 1 (cranial nerve|hepatic and renal|cardiac|renal) dysfunction[s]* # neurological/renal
1 diseases_2 4 1 (neuronal loss|cranial nerve deficits|hippocampal injury|behavioral abnormalities|deficits in communication|repetitive behaviors|impaired immediate free recall) # neurological/renal
1 diseases_2 4 1 (vanishing bile duct|renal and hepatic failure|hepatic impairment|deterioration of renal function|abnormal liver function) # neurological/renal
0 non_diseases 4 1 ([-]\s*(associated|dependent|related|treated|acting|controlled|induced|containing|fold|increasing|adjusted|month|specific)) # drug induced / associated effects aren't labeled
0 non_diseases 4 1 (toxic ((side )*effect[s]*|agent[s]*|action|state|reaction|range|death[s]*|profile|assault[s]*)|(highly|minimally) toxic) # toxic effects aren't diseases
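
The two `char_grams` rules at the top of this table are anchored with `^` and `$`, so they only fire when the pattern covers the entire string being tested. This suggests they are meant for single tokens or candidate spans rather than whole sentences. The sketch below illustrates the difference under that assumption. `label_tokens` is a hypothetical helper, the tokenizer is a deliberately naive one, and `-1` for "abstain" follows the Snorkel convention rather than anything stated in this commit.

```python
import re

# Second char_grams rule from the table above, compiled case-insensitively.
CHAR_GRAM = re.compile(
    r"^(hepato|hemato|nephro|cardio|neuro|myelo|oto)*toxic(ities|ity)*$",
    re.IGNORECASE,
)

DISEASE = 1    # LABEL column value used by the rule
ABSTAIN = -1   # assumption: Snorkel-style abstain value


def label_tokens(sentence):
    """Apply the anchored rule to each alphabetic token of a sentence."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # naive tokenizer for illustration
    return [(tok, DISEASE if CHAR_GRAM.search(tok) else ABSTAIN) for tok in tokens]


sentence = "Cisplatin nephrotoxicity and nontoxic doses"
token_labels = label_tokens(sentence)

# Applied to the whole sentence, the anchors prevent any match.
sentence_match = CHAR_GRAM.search(sentence)
```

Here only the token `nephrotoxicity` receives the disease label. `nontoxic` is skipped because `non` is not one of the listed prefixes, and the sentence-level search finds nothing because the sentence does not consist solely of one matching word.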