Updated README with Nature Communications manuscript link.
Added skeleton code for integrating unit tests.
Added additional details to applications README.
Misc small code formatting changes.
jason-fries committed Apr 1, 2021
1 parent 9dd17fd commit 9775d41
Showing 15 changed files with 246 additions and 1,636 deletions.
52 changes: 40 additions & 12 deletions README.md
@@ -1,30 +1,58 @@
# Trove
<!--[![Build Status](https://travis-ci.com/som-shahlab/trove.svg?branch=main)](https://travis-ci.com/som-shahlab/trove)-->
<!--[![Documentation Status](https://readthedocs.org/projects/trove/badge/?version=latest)](https://trove.readthedocs.io/en/latest/?badge=latest)-->
[![Documentation Status](https://readthedocs.org/projects/trove/badge/?version=latest)](https://trove.readthedocs.io/en/latest/?badge=latest)
[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Trove is a framework for training weakly supervised (bio)medical named entity recognition (NER) and other entity attribute classifiers without hand-labeled training data.
Trove is a research framework for building weakly supervised (bio)medical named entity recognition (NER) and other entity attribute classifiers without hand-labeled training data.

We combine a range of supervision signals, including common medical ontologies such as the Unified Medical Language System (UMLS), clinical text heuristics, and other noisy labeling sources, for use with weak supervision frameworks such as [Snorkel](https://github.com/snorkel-team/snorkel).
The COVID-19 pandemic has underlined the need for faster, more flexible ways of building and sharing state-of-the-art NLP/NLU tools to analyze electronic health records (EHR), scientific literature, and social media. Trove provides tools for combining freely available supervision sources such as medical ontologies from the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html), common text heuristics, and other noisy labeling sources for use as entity *labelers* in weak supervision frameworks such as [Snorkel](https://github.com/snorkel-team/snorkel), [FlyingSquid](https://github.com/HazyResearch/flyingsquid), and others. Technical details are available in our [manuscript](https://www.nature.com/articles/s41467-021-22328-4).


Technical details are available in our [manuscript](https://arxiv.org/abs/2008.01972).
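As a minimal sketch of the core idea (toy names only, not Trove's actual API), an ontology's term set can be wrapped as a labeling function that votes on each token:

```python
# Toy ontology: surface terms mapped to a "disorder" entity class.
DISORDER_TERMS = {"fever", "sepsis", "pneumonia"}

POSITIVE, ABSTAIN = 1, -1


def lf_disorder_ontology(tokens):
    """Vote POSITIVE for tokens found in the ontology; otherwise abstain."""
    return [POSITIVE if t.lower() in DISORDER_TERMS else ABSTAIN for t in tokens]


# Votes from many such labelers (ontologies, heuristics, etc.) are combined
# by a label model (e.g., Snorkel) into probabilistic training labels.
print(lf_disorder_ontology(["Patient", "denies", "fever", "or", "sepsis", "."]))
# [-1, -1, 1, -1, 1, -1]
```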

Trove has been used as part of several COVID-19 research efforts at Stanford.

## Installation
- [Continuous symptom profiling of patients screened for SARS-CoV-2](https://med.stanford.edu/covid19/research.html#data-science-and-modeling). We used a daily feed of patient notes from Stanford Health Care emergency departments to generate up-to-date [COVID-19 symptom frequency](https://docs.google.com/spreadsheets/d/1iZZvbv94fpZdC6XaiPosiniMOh18etSPliAXVlLLr1w/edit#gid=344371264) data. Funded by the [Bill & Melinda Gates Foundation](https://www.gatesfoundation.org/about/committed-grants/2020/04/inv017214).
- [Estimating the efficacy of symptom-based screening for COVID-19](https://rdcu.be/chSrv) published in *npj Digital Medicine*.
- Our COVID-19 symptom data was used by CMU's [DELPHI group](https://covidcast.cmu.edu/) to prioritize selection of informative features from [Google's Symptom Search Trends dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/docs/table-search-trends.md).

Requirements: python 3.6, pytorch 1.0+, snorkel 0.9.5+

## Tutorials
## Getting Started

See `tutorials/`
### Tutorials

## Requirements
See [`tutorials/`](https://github.com/som-shahlab/trove/tree/dev/tutorials) for Jupyter notebooks walking through an example NER application.

### Installation

Requirements: Python 3.6 or later. We recommend installing with `pip`:

`pip install -r requirements.txt`

## Contributions
We welcome all contributions to the code base! Please submit a pull request and/or start a discussion on GitHub Issues.

Weakly supervised methods for programmatically building and maintaining training labels provide new opportunities for the larger community to participate in the creation of important datasets. This is especially exciting in domains such as medicine, where sharing labeled data is often challenging due to patient privacy concerns.

Inspired by recent efforts such as [HuggingFace's Datasets](https://github.com/huggingface/datasets) library,
we would love to start a conversation around how to support sharing labelers in service of maintaining an open task library, so that it is easier to create, deploy, and version control weakly supervised models.

Tested on OSX and Linux.

## Citation
If you use Trove in your research, please cite [Ontology-driven weak supervision for clinical entity classification in electronic health records]()
If you use Trove in your research, please cite us!

Fries, J.A., Steinberg, E., Khattar, S. et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun 12, 2017 (2021). https://doi.org/10.1038/s41467-021-22328-4

```
@article{fries2021trove,
title={Ontology-driven weak supervision for clinical entity classification in electronic health records},
author={Fries, Jason A and Steinberg, Ethan and Khattar, Saelig and Fleming, Scott L and Posada, Jose and Callahan, Alison and Shah, Nigam H},
journal={Nature Communications},
volume={12},
number={1},
year={2021},
publisher={Nature Publishing Group}
}
```

See the `manuscript` branch for the code used in the paper.

14 changes: 7 additions & 7 deletions applications/README.md
@@ -5,10 +5,10 @@

Labeling functions for various weakly supervised biomedical classification tasks

| Name | Task | Domain | Type | Source |
|------------------|------------------|------------|------|-----------------------------------------------|
| `bc5cdr/` | Chemical/Disease | Literature | NER | BioCreative V Chemical-Disease Relation (CDR) |
| `i2b2drugs/` | Drug | Clinical | NER | n2c2/i2b2 2009 Medication Challenge |
| `shareclef2014/` | Disorder | Clinical | NER | ShARe/CLEF 2014 |
| `thyme/` | DocRelaTime | Clinical | Span | THYME 2017 |
| `covid19/` | Exposure | Clinical | Span | COVID-19 exposure |
| Name | Task | Domain | Type | Source | Access |
|------------------|------------------|------------|------|-----------------------------------------------|------------|
| `bc5cdr/` | Chemical/Disease | Literature | NER | BioCreative V Chemical-Disease Relation (CDR) | Public |
| `i2b2drugs/` | Drug | Clinical | NER | n2c2/i2b2 2009 Medication Challenge | DUA |
| `shareclef2014/` | Disorder | Clinical | NER | ShARe/CLEF 2014 | DUA |
| `thyme/`         | DocRelaTime      | Clinical   | Span | THYME 2017                                    | DUA        |
| `covid19/` | Exposure | Clinical | Span | COVID-19 exposure | - |
13 changes: 10 additions & 3 deletions docs/source/conf.py
@@ -10,9 +10,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))


# -- Project information -----------------------------------------------------
@@ -31,8 +31,15 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.coverage',
    'sphinx.ext.napoleon',
    'sphinx.ext.autosummary'
]

autosummary_generate = True


# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

27 changes: 19 additions & 8 deletions docs/source/index.rst
@@ -1,17 +1,28 @@
.. trove documentation master file, created by
sphinx-quickstart on Mon Mar 22 00:23:28 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to trove's documentation!
Welcome to Trove's documentation!
=================================

Trove is a research framework for building weakly supervised (bio)medical
named entity recognition (NER) and other entity attribute classifiers without hand-labeled training data.

The COVID-19 pandemic has underlined the need for faster, more flexible ways of building
and sharing state-of-the-art NLP/NLU tools to analyze electronic health records (EHR),
scientific literature, and social media. Trove provides tools for combining freely
available supervision sources such as medical ontologies from the Unified Medical
Language System (UMLS), common text heuristics, and other noisy labeling sources for use
as entity *labelers* in weak supervision frameworks such as Snorkel, FlyingSquid, and
others. Technical details are available in our manuscript.

.. autosummary::
   :toctree: _autosummary
   :recursive:

   trove

.. toctree::
   :maxdepth: 2
   :maxdepth: 10
   :caption: Contents:



Indices and tables
==================

16 changes: 16 additions & 0 deletions requirements.txt
@@ -0,0 +1,16 @@
toolz==0.11.1
tqdm==4.59.0
torch==1.8.0
requests==2.25.1
pandas==1.1.5
scipy==1.5.2
lxml==4.6.2
spacy==3.0.5
numpy==1.19.2
joblib==1.0.1
msgpack_python==0.5.6
norm==1.6.0
pytorch_pretrained_bert==0.6.2
scikit_learn==0.24.1
seqeval==1.2.2
stopwords==1.0.0
Empty file added test/__init__.py
Empty file added test/metrics/__init__.py
12 changes: 12 additions & 0 deletions test/metrics/test_metrics.py
@@ -0,0 +1,12 @@
import unittest
import numpy as np


class MetricsTest(unittest.TestCase):
    def test_convert_tag_fmt(self):
        # Skeleton placeholder: a bare `return True` passes without asserting
        # anything, so skip explicitly until the test is implemented.
        self.skipTest("TODO: implement tag format conversion test")



if __name__ == "__main__":
    unittest.main()
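As a sketch of where this skeleton might go (the `convert_tag_fmt` helper below is hypothetical; Trove's actual helper may differ), a filled-in test could assert an IO-to-BIO tag conversion:

```python
import unittest


def convert_tag_fmt(tags):
    """Hypothetical helper: convert IO-encoded tags to BIO encoding."""
    out, prev = [], "O"
    for tag in tags:
        if tag == "O":
            out.append("O")
        elif tag != prev:
            out.append("B-" + tag)  # start of a new entity span
        else:
            out.append("I-" + tag)  # continuation of the current span
        prev = tag
    return out


class MetricsTest(unittest.TestCase):
    def test_convert_tag_fmt(self):
        self.assertEqual(
            convert_tag_fmt(["O", "DRUG", "DRUG", "O"]),
            ["O", "B-DRUG", "I-DRUG", "O"],
        )
```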
30 changes: 14 additions & 16 deletions trove/labelers/abbreviations.py
@@ -14,7 +14,6 @@
"""
import re
import collections
from typing import Set
from trove.dataloaders.contexts import Span
from trove.labelers.labeling import (
LabelingFunction,
@@ -23,13 +22,13 @@
)
from typing import List, Set, Dict

def is_short_form(s, min_length=2):
def is_short_form(text, min_length=2):
""" Rule-based function for determining if a token is likely
an abbreviation, acronym or other "short form" mention
Parameters
----------
s
text
min_length
Returns
@@ -39,22 +38,21 @@ def is_short_form(s, min_length=2):
accept_rgx = '[0-9A-Z-]{2,8}[s]*'
reject_rgx = '([0-9]+/[0-9]+|[0-9]+[-][0-7]+)'

keep = re.search(accept_rgx, s) != None
keep &= re.search(reject_rgx, s) == None
keep &= not s.strip("-").isdigit()
keep &= "," not in s
keep &= len(s) < 15
keep = re.search(accept_rgx, text) is not None
keep &= re.search(reject_rgx, text) is None
keep &= not text.strip("-").isdigit()
keep &= "," not in text
keep &= len(text) < 15

# reject if too short or contains lowercase single letters
reject = (len(s) > 3 and not keep)
reject |= (len(s) <= 3 and re.search("[/,+0-9-]", s) != None)
reject |= (len(s) < min_length)
reject |= (len(s) <= min_length and s.islower()) #
reject = (len(text) > 3 and not keep)
reject |= (len(text) <= 3 and re.search("[/,+0-9-]", text) is not None)
reject |= (len(text) < min_length)
reject |= (len(text) <= min_length and text.islower())

return False if reject else True
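# Usage sketch (illustrative) of the heuristic above:
#   is_short_form("COPD")   -> True   (uppercase acronym-like token)
#   is_short_form("10/20")  -> False  (matches the numeric reject pattern)
#   is_short_form("mg")     -> False  (short lowercase token)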



def get_parenthetical_short_forms(sentence):
"""Generator that returns indices of all words directly
wrapped by parentheses or brackets.
@@ -67,10 +65,10 @@ def get_parenthetical_short_forms(sentence):
-------
"""
for i, w in enumerate(sentence.words):
for i, _ in enumerate(sentence.words):
if i > 0 and i < len(sentence.words) - 1:
window = sentence.words[i - 1:i + 2]
if (window[0] == "(" and window[-1] == ")"):
if window[0] == "(" and window[-1] == ")":
if is_short_form(window[1]):
yield i
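# Illustrative example: for the tokenized sentence
#   ["chronic", "obstructive", "pulmonary", "disease", "(", "COPD", ")"]
# the generator yields index 5, since "COPD" is directly wrapped by
# parentheses and passes is_short_form.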

@@ -83,7 +81,7 @@ def extract_long_form(i, sentence, max_dup_chars=2):
short_form = sentence.words[i]
left_window = [w for w in sentence.words[0:i]]

# strip brackets/parantheses
# strip brackets/parentheses
while left_window and left_window[-1] in ["(", "[", ":"]:
left_window.pop()

17 changes: 9 additions & 8 deletions trove/labelers/core.py
@@ -1,11 +1,12 @@
import logging
import itertools
import numpy as np
from scipy import sparse
from functools import partial
from toolz import partition_all
from joblib import Parallel, delayed
from abc import ABCMeta, abstractmethod

logger = logging.getLogger(__name__)

class Distributed:

@@ -14,7 +15,7 @@ def __init__(self, num_workers=1, backend='multiprocessing'):
backend=backend,
prefer="processes")
self.num_workers = num_workers
print(self.client)
logger.info(self.client)


class SequenceLabelingServer(Distributed):
@@ -29,15 +30,15 @@ def apply(self, lfs, Xs, block_size=None):
block_size = int(
np.ceil(np.sum([len(x) for x in Xs]) / self.num_workers)
)
print(f'auto block size={block_size}')
logger.info("auto block size %s", block_size)

if block_size:
blocks = list(
partition_all(block_size, itertools.chain.from_iterable(Xs))
)

print(f"Partitioned into {len(blocks)} blocks, "
f"{np.unique([len(x) for x in blocks])} sizes")
lens = np.unique([len(x) for x in blocks])
logger.info("Partitioned into %s blocks with sizes %s", len(blocks), lens)

do = delayed(partial(SequenceLabelingServer.worker, lfs))
jobs = (do(batch) for batch in blocks)
@@ -67,15 +68,15 @@ def apply(self, lfs, Xs, block_size=None):
block_size = int(
np.ceil(np.sum([len(x) for x in Xs]) / self.num_workers)
)
print(f'auto block size={block_size}')
logger.info("auto block size %s", block_size)

if block_size:
blocks = list(
partition_all(block_size, itertools.chain.from_iterable(Xs))
)

print(f"Partitioned into {len(blocks)} blocks, "
f"{np.unique([len(x) for x in blocks])} sizes")
lens = np.unique([len(x) for x in blocks])
logger.info("Partitioned into %s blocks with sizes %s", len(blocks), lens)

do = delayed(partial(LabelingServer.worker, lfs))
jobs = (do(batch) for batch in blocks)
31 changes: 17 additions & 14 deletions trove/labelers/labeling.py
@@ -87,7 +87,7 @@ def __init__(self,
name: str,
ontology: Dict[str, np.array],
case_sensitive: bool = False,
max_ngrams: int = 4,
max_ngrams: int = 8,
stopwords = None) -> None:

super().__init__(name, None)
@@ -103,9 +103,17 @@ def __init__(self,
else int(np.argmax(proba) + 1)
self.ontology = frozenset(ontology)

def _get_term_label(self, t):
def _get_term_label(self, term):
"""
Check for term match, given set of simple transformations
(e.g., lowercasing, simple pluralization)
TODO: Consider a proper abstraction for handling valid aliases.
:param term:
:return:
"""
for key in [t, t.lower(), t.rstrip('s'), t + 's']:
for key in [term, term.lower(), term.rstrip('s'), term + 's']:
if key in self.stopwords:
return self.stopwords[key]
if key in self._labels:
@@ -202,17 +210,12 @@ def _get_term_label(self, t):
return None

def _merge_matches(self, matches):
""" Merge all contiguous spans with the same label.
Parameters
----------
matches
Returns
-------
"""
Merge all contiguous spans with the same label.
:param matches:
:return:
"""
terms = [m[-1] for m in matches]
labels = [self._get_term_label(m[-1]) for m in matches]

@@ -387,7 +390,7 @@ def __call__(self, sentence):

class SynSetLabelingFunction(LabelingFunction):
"""
Given a map of TERM -> {t \in SYNONYMS}, if the TERM AND any t
Given a map of TERM -> {t \\in SYNONYMS}, if the TERM AND any t
appear in document, label as a positive instance of the entity.
"""
def __init__(self,
@@ -466,4 +469,4 @@ def __call__(self, sentence):

spans = self._get_contiguous_spans(spans)
spans = list(itertools.chain.from_iterable([s for s in spans if len(s) >= self.min_length]))
return {i:L[i] for i in spans}
return {i:L[i] for i in spans}