Updated README with Nature Communications manuscript link.
Added skeleton code for integrating unit tests.
Added additional details to applications README.
Misc small code formatting changes.
jason-fries committed Apr 1, 2021
1 parent 9dd17fd commit 9775d41
Showing 15 changed files with 246 additions and 1,636 deletions.
52 changes: 40 additions & 12 deletions README.md
@@ -1,30 +1,58 @@
# Trove
<!--[![Build Status](https://travis-ci.com/som-shahlab/trove.svg?branch=main)](https://travis-ci.com/som-shahlab/trove)-->
<!--[![Documentation Status](https://readthedocs.org/projects/trove/badge/?version=latest)](https://trove.readthedocs.io/en/latest/?badge=latest)-->
[![Documentation Status](https://readthedocs.org/projects/trove/badge/?version=latest)](https://trove.readthedocs.io/en/latest/?badge=latest)
[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Trove is a framework for training weakly supervised (bio)medical named entity recognition (NER) and other entity attribute classifiers without hand-labeled training data.
Trove is a research framework for building weakly supervised (bio)medical named entity recognition (NER) and other entity attribute classifiers without hand-labeled training data.

We combine a range of supervision signals, including common medical ontologies such as the Unified Medical Language System (UMLS), clinical text heuristics, and other noisy labeling sources, for use with weak supervision frameworks such as [Snorkel](https://github.com/snorkel-team/snorkel).
The COVID-19 pandemic has underlined the need for faster, more flexible ways of building and sharing state-of-the-art NLP/NLU tools to analyze electronic health records (EHR), scientific literature, and social media. Trove provides tools for combining freely available supervision sources such as medical ontologies from the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html), common text heuristics, and other noisy labeling sources for use as entity *labelers* in weak supervision frameworks such as [Snorkel](https://github.com/snorkel-team/snorkel), [FlyingSquid](https://github.com/HazyResearch/flyingsquid), and others. Technical details are available in our [manuscript](https://www.nature.com/articles/s41467-021-22328-4).


Technical details are available in our [manuscript](https://arxiv.org/abs/2008.01972).
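As a minimal sketch of the core idea (toy names only, not Trove's actual API), an ontology's term set can be wrapped as a labeling function that votes on each token:

```python
# Toy ontology: surface terms mapped to a "disorder" entity class.
DISORDER_TERMS = {"fever", "sepsis", "pneumonia"}

POSITIVE, ABSTAIN = 1, -1


def lf_disorder_ontology(tokens):
    """Vote POSITIVE for tokens found in the ontology; otherwise abstain."""
    return [POSITIVE if t.lower() in DISORDER_TERMS else ABSTAIN for t in tokens]


# Votes from many such labelers (ontologies, heuristics, etc.) are combined
# by a label model (e.g., Snorkel) into probabilistic training labels.
print(lf_disorder_ontology(["Patient", "denies", "fever", "or", "sepsis", "."]))
# [-1, -1, 1, -1, 1, -1]
```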

Trove has been used as part of several COVID-19 research efforts at Stanford.

## Installation
- [Continuous symptom profiling of patients screened for SARS-CoV-2](https://med.stanford.edu/covid19/research.html#data-science-and-modeling). We used a daily feed of patient notes from Stanford Health Care emergency departments to generate up-to-date [COVID-19 symptom frequency](https://docs.google.com/spreadsheets/d/1iZZvbv94fpZdC6XaiPosiniMOh18etSPliAXVlLLr1w/edit#gid=344371264) data. Funded by the [Bill & Melinda Gates Foundation](https://www.gatesfoundation.org/about/committed-grants/2020/04/inv017214).
- [Estimating the efficacy of symptom-based screening for COVID-19](https://rdcu.be/chSrv) published in *npj Digital Medicine*.
- Our COVID-19 symptom data was used by CMU's [DELPHI group](https://covidcast.cmu.edu/) to prioritize selection of informative features from [Google's Symptom Search Trends dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/docs/table-search-trends.md).

Requirements: python 3.6, pytorch 1.0+, snorkel 0.9.5+

## Tutorials
## Getting Started

See `tutorials/`
### Tutorials

## Requirements
See [`tutorials/`](https://github.com/som-shahlab/trove/tree/dev/tutorials) for Jupyter notebooks walking through an example NER application.

### Installation

Requirements: Python 3.6 or later. We recommend installing with `pip`:

`pip install -r requirements.txt`

## Contributions
We welcome all contributions to the code base! Please submit a pull request and/or start a discussion on GitHub Issues.

Weakly supervised methods for programmatically building and maintaining training labels provide new opportunities for the larger community to participate in the creation of important datasets. This is especially exciting in domains such as medicine, where sharing labeled data is often challenging due to patient privacy concerns.

Inspired by recent efforts such as [HuggingFace's Datasets](https://github.com/huggingface/datasets) library,
we would love to start a conversation around how to support sharing labelers in service of maintaining an open task library, so that it is easier to create, deploy, and version control weakly supervised models.

Tested on OSX and Linux.

## Citation
If you use Trove in your research, please cite [Ontology-driven weak supervision for clinical entity classification in electronic health records]()
If you use Trove in your research, please cite us!

Fries, J.A., Steinberg, E., Khattar, S. et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun 12, 2017 (2021). https://doi.org/10.1038/s41467-021-22328-4

```
@article{fries2021trove,
title={Ontology-driven weak supervision for clinical entity classification in electronic health records},
author={Fries, Jason A and Steinberg, Ethan and Khattar, Saelig and Fleming, Scott L and Posada, Jose and Callahan, Alison and Shah, Nigam H},
journal={Nature Communications},
volume={12},
number={1},
year={2021},
publisher={Nature Publishing Group}
}
```

See the `manuscript` branch for the code used in the paper.

14 changes: 7 additions & 7 deletions applications/README.md
@@ -5,10 +5,10 @@

Labeling functions for various weakly supervised biomedical classification tasks

| Name | Task | Domain | Type | Source |
|------------------|------------------|------------|------|-----------------------------------------------|
| `bc5cdr/` | Chemical/Disease | Literature | NER | BioCreative V Chemical-Disease Relation (CDR) |
| `i2b2drugs/` | Drug | Clinical | NER | n2c2/i2b2 2009 Medication Challenge |
| `shareclef2014/` | Disorder | Clinical | NER | ShARe/CLEF 2014 |
| `thyme/` | DocRelaTime | Clinical | Span | THYME 2017 |
| `covid19/` | Exposure | Clinical | Span | COVID-19 exposure |
| Name | Task | Domain | Type | Source | Access |
|------------------|------------------|------------|------|-----------------------------------------------|------------|
| `bc5cdr/` | Chemical/Disease | Literature | NER | BioCreative V Chemical-Disease Relation (CDR) | Public |
| `i2b2drugs/` | Drug | Clinical | NER | n2c2/i2b2 2009 Medication Challenge | DUA |
| `shareclef2014/` | Disorder | Clinical | NER | ShARe/CLEF 2014 | DUA |
| `thyme/`         | DocRelaTime      | Clinical   | Span | THYME 2017                                    | DUA        |
| `covid19/` | Exposure | Clinical | Span | COVID-19 exposure | - |
13 changes: 10 additions & 3 deletions docs/source/conf.py
@@ -10,9 +10,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))


# -- Project information -----------------------------------------------------
@@ -31,8 +31,15 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.coverage',
    'sphinx.ext.napoleon',
    'sphinx.ext.autosummary'
]

autosummary_generate = True


# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

27 changes: 19 additions & 8 deletions docs/source/index.rst
@@ -1,17 +1,28 @@
.. trove documentation master file, created by
sphinx-quickstart on Mon Mar 22 00:23:28 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to trove's documentation!
Welcome to Trove's documentation!
=================================

Trove is a research framework for building weakly supervised (bio)medical
named entity recognition (NER) and other entity attribute classifiers without hand-labeled training data.

The COVID-19 pandemic has underlined the need for faster, more flexible ways of building
and sharing state-of-the-art NLP/NLU tools to analyze electronic health records (EHR),
scientific literature, and social media. Trove provides tools for combining freely
available supervision sources such as medical ontologies from the Unified Medical
Language System (UMLS), common text heuristics, and other noisy labeling sources for use
as entity *labelers* in weak supervision frameworks such as Snorkel, FlyingSquid, and
others. Technical details are available in our manuscript.

.. autosummary::
   :toctree: _autosummary
   :recursive:

   trove

.. toctree::
   :maxdepth: 2
   :maxdepth: 10
   :caption: Contents:



Indices and tables
==================

16 changes: 16 additions & 0 deletions requirements.txt
@@ -0,0 +1,16 @@
toolz==0.11.1
tqdm==4.59.0
torch==1.8.0
requests==2.25.1
pandas==1.1.5
scipy==1.5.2
lxml==4.6.2
spacy==3.0.5
numpy==1.19.2
joblib==1.0.1
msgpack_python==0.5.6
norm==1.6.0
pytorch_pretrained_bert==0.6.2
scikit_learn==0.24.1
seqeval==1.2.2
stopwords==1.0.0
Empty file added test/__init__.py
Empty file added test/metrics/__init__.py
12 changes: 12 additions & 0 deletions test/metrics/test_metrics.py
@@ -0,0 +1,12 @@
import unittest
import numpy as np


class MetricsTest(unittest.TestCase):
    def test_convert_tag_fmt(self):
        # Skeleton placeholder: a bare `return True` passes without asserting
        # anything, so skip explicitly until the test is implemented.
        self.skipTest("TODO: implement tag format conversion test")



if __name__ == "__main__":
    unittest.main()
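As a sketch of where this skeleton might go (the `convert_tag_fmt` helper below is hypothetical; Trove's actual helper may differ), a filled-in test could assert an IO-to-BIO tag conversion:

```python
import unittest


def convert_tag_fmt(tags):
    """Hypothetical helper: convert IO-encoded tags to BIO encoding."""
    out, prev = [], "O"
    for tag in tags:
        if tag == "O":
            out.append("O")
        elif tag != prev:
            out.append("B-" + tag)  # start of a new entity span
        else:
            out.append("I-" + tag)  # continuation of the current span
        prev = tag
    return out


class MetricsTest(unittest.TestCase):
    def test_convert_tag_fmt(self):
        self.assertEqual(
            convert_tag_fmt(["O", "DRUG", "DRUG", "O"]),
            ["O", "B-DRUG", "I-DRUG", "O"],
        )
```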
30 changes: 14 additions & 16 deletions trove/labelers/abbreviations.py
@@ -14,7 +14,6 @@
"""
import re
import collections
from typing import Set
from trove.dataloaders.contexts import Span
from trove.labelers.labeling import (
LabelingFunction,
@@ -23,13 +22,13 @@
)
from typing import List, Set, Dict

def is_short_form(s, min_length=2):
def is_short_form(text, min_length=2):
""" Rule-based function for determining if a token is likely
an abbreviation, acronym or other "short form" mention
Parameters
----------
s
text
min_length
Returns
@@ -39,22 +38,21 @@ def is_short_form(s, min_length=2):
accept_rgx = '[0-9A-Z-]{2,8}[s]*'
reject_rgx = '([0-9]+/[0-9]+|[0-9]+[-][0-7]+)'

keep = re.search(accept_rgx, s) != None
keep &= re.search(reject_rgx, s) == None
keep &= not s.strip("-").isdigit()
keep &= "," not in s
keep &= len(s) < 15
keep = re.search(accept_rgx, text) is not None
keep &= re.search(reject_rgx, text) is None
keep &= not text.strip("-").isdigit()
keep &= "," not in text
keep &= len(text) < 15

# reject if too short or contains lowercase single letters
reject = (len(s) > 3 and not keep)
reject |= (len(s) <= 3 and re.search("[/,+0-9-]", s) != None)
reject |= (len(s) < min_length)
reject |= (len(s) <= min_length and s.islower()) #
reject = (len(text) > 3 and not keep)
reject |= (len(text) <= 3 and re.search("[/,+0-9-]", text) is not None)
reject |= (len(text) < min_length)
reject |= (len(text) <= min_length and text.islower())

return False if reject else True
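# Usage sketch (illustrative) of the heuristic above:
#   is_short_form("COPD")   -> True   (uppercase acronym-like token)
#   is_short_form("10/20")  -> False  (matches the numeric reject pattern)
#   is_short_form("mg")     -> False  (short lowercase token)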



def get_parenthetical_short_forms(sentence):
"""Generator that returns indices of all words directly
wrapped by parentheses or brackets.
@@ -67,10 +65,10 @@ def get_parenthetical_short_forms(sentence):
-------
"""
for i, w in enumerate(sentence.words):
for i, _ in enumerate(sentence.words):
if i > 0 and i < len(sentence.words) - 1:
window = sentence.words[i - 1:i + 2]
if (window[0] == "(" and window[-1] == ")"):
if window[0] == "(" and window[-1] == ")":
if is_short_form(window[1]):
yield i
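# Illustrative example: for the tokenized sentence
#   ["chronic", "obstructive", "pulmonary", "disease", "(", "COPD", ")"]
# the generator yields index 5, since "COPD" is directly wrapped by
# parentheses and passes is_short_form.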

@@ -83,7 +81,7 @@ def extract_long_form(i, sentence, max_dup_chars=2):
short_form = sentence.words[i]
left_window = [w for w in sentence.words[0:i]]

# strip brackets/parantheses
# strip brackets/parentheses
while left_window and left_window[-1] in ["(", "[", ":"]:
left_window.pop()

17 changes: 9 additions & 8 deletions trove/labelers/core.py
@@ -1,11 +1,12 @@
import logging
import itertools
import numpy as np
from scipy import sparse
from functools import partial
from toolz import partition_all
from joblib import Parallel, delayed
from abc import ABCMeta, abstractmethod

logger = logging.getLogger(__name__)

class Distributed:

@@ -14,7 +15,7 @@ def __init__(self, num_workers=1, backend='multiprocessing'):
backend=backend,
prefer="processes")
self.num_workers = num_workers
print(self.client)
logger.info(self.client)


class SequenceLabelingServer(Distributed):
@@ -29,15 +30,15 @@ def apply(self, lfs, Xs, block_size=None):
block_size = int(
np.ceil(np.sum([len(x) for x in Xs]) / self.num_workers)
)
print(f'auto block size={block_size}')
logger.info("auto block size %s", block_size)

if block_size:
blocks = list(
partition_all(block_size, itertools.chain.from_iterable(Xs))
)

print(f"Partitioned into {len(blocks)} blocks, "
f"{np.unique([len(x) for x in blocks])} sizes")
lens = np.unique([len(x) for x in blocks])
logger.info("Partitioned into %s blocks with sizes %s", len(blocks), lens)

do = delayed(partial(SequenceLabelingServer.worker, lfs))
jobs = (do(batch) for batch in blocks)
@@ -67,15 +68,15 @@ def apply(self, lfs, Xs, block_size=None):
block_size = int(
np.ceil(np.sum([len(x) for x in Xs]) / self.num_workers)
)
print(f'auto block size={block_size}')
logger.info("auto block size %s", block_size)

if block_size:
blocks = list(
partition_all(block_size, itertools.chain.from_iterable(Xs))
)

print(f"Partitioned into {len(blocks)} blocks, "
f"{np.unique([len(x) for x in blocks])} sizes")
lens = np.unique([len(x) for x in blocks])
logger.info("Partitioned into %s blocks with sizes %s", len(blocks), lens)

do = delayed(partial(LabelingServer.worker, lfs))
jobs = (do(batch) for batch in blocks)
31 changes: 17 additions & 14 deletions trove/labelers/labeling.py
@@ -87,7 +87,7 @@ def __init__(self,
name: str,
ontology: Dict[str, np.array],
case_sensitive: bool = False,
max_ngrams: int = 4,
max_ngrams: int = 8,
stopwords = None) -> None:

super().__init__(name, None)
@@ -103,9 +103,17 @@ def __init__(self,
else int(np.argmax(proba) + 1)
self.ontology = frozenset(ontology)

def _get_term_label(self, t):
def _get_term_label(self, term):
"""
Check for term match, given set of simple transformations
(e.g., lowercasing, simple pluralization)
TODO: Consider a proper abstraction for handling valid aliases.
:param term:
:return:
"""
for key in [t, t.lower(), t.rstrip('s'), t + 's']:
for key in [term, term.lower(), term.rstrip('s'), term + 's']:
if key in self.stopwords:
return self.stopwords[key]
if key in self._labels:
@@ -202,17 +210,12 @@ def _get_term_label(self, t):
return None

def _merge_matches(self, matches):
""" Merge all contiguous spans with the same label.
Parameters
----------
matches
Returns
-------
"""
Merge all contiguous spans with the same label.
:param matches:
:return:
"""
terms = [m[-1] for m in matches]
labels = [self._get_term_label(m[-1]) for m in matches]

@@ -387,7 +390,7 @@ def __call__(self, sentence):

class SynSetLabelingFunction(LabelingFunction):
"""
Given a map of TERM -> {t \in SYNONYMS}, if the TERM AND any t
Given a map of TERM -> {t \\in SYNONYMS}, if the TERM AND any t
appear in document, label as a positive instance of the entity.
"""
def __init__(self,
@@ -466,4 +469,4 @@ def __call__(self, sentence):

spans = self._get_contiguous_spans(spans)
spans = list(itertools.chain.from_iterable([s for s in spans if len(s) >= self.min_length]))
return {i:L[i] for i in spans}
return {i:L[i] for i in spans}