Genome Manager (#81)
* genome manager

Co-authored-by: Jahnavi Singh <[email protected]>
Co-authored-by: akmorrow13 <[email protected]>
3 people authored Aug 11, 2021
1 parent f823bfd commit a326c67
Showing 16 changed files with 455 additions and 283 deletions.
15 changes: 9 additions & 6 deletions docs/usage/create_dataset.rst
@@ -71,14 +71,17 @@ You can generate your own Epitome dataset from ENCODE using the following comman
http://hgdownload.cse.ucsc.edu/admin/exe/
To use your own dataset in an Epitome model, make sure to set the environment variable
``EPITOME_DATA_PATH`` so that it points to your custom dataset. This will tell Epitome where to load
data from.
To use your own dataset in an Epitome model, make sure to specify the ``data_dir``
and/or ``assembly`` variables when creating the ``EpitomeDataset`` class. This
will tell Epitome where to load data from. If neither variable is specified,
the default assembly will be downloaded from the Epitome AWS S3 cluster into the
default data directory on your machine. See `Load your processed dataset <./dataset.html>`__ for more details.

.. code:: bash
.. code:: python
from epitome.dataset import *
import os
os.environ["EPITOME_DATA_PATH"] = 'path/to/my/epitome/dataset'
dataset = EpitomeDataset(data_dir="path/to/configured/data", assembly="hg19")
...
57 changes: 53 additions & 4 deletions docs/usage/dataset.rst
@@ -15,8 +15,8 @@ Create an Epitome Dataset
-------------------------

First, create an Epitome Dataset. In the dataset, you will define the
ChIP-seq targets you want to predict, the cell types you want to train from,
and the assays you want to use to compute cell type similarity.
ChIP-seq targets you want to predict, the cell types you want to train from and
the assays you want to use to compute cell type similarity.

.. code:: python
@@ -32,7 +32,6 @@ specify the minimum number of cells required for a ChIP-seq target, and the mini
number of ChIP-seq targets required to include a celltype. By default,
``min_cells_per_target = 3`` and ``min_targets_per_cell = 2``.
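
As a quick sketch of how these thresholds can be relaxed (the values below are
illustrative, not recommendations):

.. code:: python

   # Keep sparser targets and cell types by lowering the default thresholds
   dataset = EpitomeDataset(targets=['CTCF','RAD21','SMC3'],
                            min_cells_per_target=2,
                            min_targets_per_cell=1)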


.. code:: python
targets = ['CTCF','RAD21','SMC3']
@@ -48,7 +47,9 @@ cell type similarity, you can specify in the Epitome dataset:

.. code:: python
dataset = EpitomeDataset(targets=targets, cells=celltypes, similarity_targets = ['DNase', 'H3K27ac'])
dataset = EpitomeDataset(targets=targets,
cells=celltypes,
similarity_targets = ['DNase', 'H3K27ac'])
You can then visualize the ChIP-seq targets and cell types in your dataset by
using the ``view()`` function:
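
A minimal sketch of this call (assuming ``view()`` is invoked directly on the
dataset, as the text above suggests):

.. code:: python

   # Plot which ChIP-seq targets are available in which cell types
   dataset.view()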
@@ -72,3 +73,51 @@ in the dataset:
You can now use your ``dataset`` in an Epitome model.

Load your processed dataset
---------------------------
You can specify the data path and/or genome assembly that you would like to use
in the Epitome dataset. You just need to define the ``data_dir`` and/or
``assembly`` variables:

.. code:: python
dataset = EpitomeDataset(data_dir="path/to/configured/data",
assembly="hg19")
Note that if both ``data_dir`` and ``assembly`` are set, the dataset will
append the specified assembly to the ``data_dir`` path (for example,
``path/to/configured/data/hg19/data.h5``) and return the dataset stored at that
path if it exists. If there is no data stored at that path, Epitome will
try to download the specified assembly from the S3 cluster at
https://epitome-data.s3-us-west-1.amazonaws.com.
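
As a sketch of this path resolution (the directory is illustrative; ``h5_path``
is the attribute that holds the resolved file path):

.. code:: python

   dataset = EpitomeDataset(data_dir="path/to/configured/data", assembly="hg19")
   # the file is read from (or downloaded to) <data_dir>/<assembly>/data.h5
   print(dataset.h5_path)  # should print something like path/to/configured/data/hg19/data.h5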

You do not need to define both variables, though. If you leave ``data_dir`` empty,
the Epitome dataset will append the ``assembly`` to the default data path located
at ``~/.epitome/data/`` and return the dataset if it exists at that path.
If there is no existing dataset located at the data path, Epitome will download
the dataset for the specified assembly from S3 to that path:

.. code:: python
dataset = EpitomeDataset(assembly="hg19")
If the assembly is not specified but the ``data_dir`` is, the dataset will treat
the specified ``data_dir`` as the full data path. Like above, if the
dataset exists at the configured data path, Epitome will load the configured data
into the EpitomeDataset. If there is no existing dataset, Epitome will download
the dataset for the default assembly from S3 and store it at the configured data path:

.. code:: python
dataset = EpitomeDataset(data_dir="path/to/configured/data")
If neither ``data_dir`` nor ``assembly`` is set, the dataset will just try to
fetch the ``data.h5`` file in the default data directory. If no data exists in
the default directory, Epitome will download the dataset for the default assembly
from S3 and store it at the default data path:

.. code:: python
dataset = EpitomeDataset()
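
Whichever way you load it, you can check what was resolved (a short sketch; the
printed values depend on your dataset):

.. code:: python

   dataset = EpitomeDataset(assembly="hg19")
   print(dataset.assembly)        # e.g. 'hg19'
   print(dataset.list_targets())  # ChIP-seq targets kept in this dataset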
4 changes: 2 additions & 2 deletions docs/usage/train.rst
@@ -68,7 +68,7 @@ loss stops improving, the model will stop training early:
If you are concerned about the model above overtraining because the model continues
to improve by minuscule amounts, you can specify the min-delta, which is the minimum
change in the train-validation loss required to qualify as an improvement. In the
model below, a minimum improvement of at least 0.1 is required for the model to
model below, a minimum improvement of at least 0.01 is required for the model to
qualify as improving.

If you are concerned about the model above under-fitting (stopping training too
@@ -86,7 +86,7 @@ be found in the `Github repo <https://github.com/YosefLab/epitome>`__.
best_model_batches, total_trained_batches, train_valid_losses = model.train(5000,
patience = 3,
min_delta = 0.1)
min_delta = 0.01)
Test the Model
----------------
19 changes: 1 addition & 18 deletions epitome/__init__.py
@@ -17,24 +17,7 @@
.. automodule:: epitome.conversion
"""
import os

from os.path import expanduser

__path__ = __import__('pkgutil').extend_path(__path__, __name__)


def GET_EPITOME_USER_PATH():
return os.path.join(os.path.expanduser('~'), '.epitome')

def GET_DATA_PATH():
"""
Check if user has set env variable that specifies data path.
Otherwise, use default location.
Returns:
location of epitome data with all required files
"""
if os.environ.get("EPITOME_DATA_PATH") is not None:
return os.environ["EPITOME_DATA_PATH"]
else:
return os.path.join(GET_EPITOME_USER_PATH(),'data','hg19')
3 changes: 0 additions & 3 deletions epitome/conversion.py
@@ -97,10 +97,7 @@ def get_binary_vector(self, vector = None):
assert vector.shape[0] == len(self.compare), "Error: value_vector must be the same shape as self.compare"

base_vector = np.zeros((len(self.base)))
print(base_vector.shape)

tmp = self._get_overlap()
print(tmp)

base_indices = tmp.idx_base.values
convert_indices = tmp.idx.values
112 changes: 80 additions & 32 deletions epitome/dataset.py
@@ -21,16 +21,21 @@
from sklearn.metrics import jaccard_score

# local imports
from epitome import *
from .constants import Dataset
from .functions import download_and_unzip
from .viz import plot_assay_heatmap

################### File accession constants #######################
S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'
S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com'

# os env that should be set by user to explicitly set the data path
EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"
# List of available assemblies in S3_DATA_PATH
EPITOME_GENOME_ASSEMBLIES = ['hg19', 'hg38', 'test']
# default genome assembly
DEFAULT_EPITOME_ASSEMBLY = "hg19"
# default path to where all epitome related information is stored
EPITOME_USER_PATH = os.path.join(os.path.expanduser('~'), '.epitome')
# default path to where epitome data is installed. Subdirectory of EPITOME_USER_PATH
DEFAULT_EPITOME_DATA_PATH = os.path.join(EPITOME_USER_PATH, 'data')

# data files required by epitome
# data.h5 contains data, row information (celltypes and targets) and
@@ -71,7 +76,8 @@ def __init__(self,
cells = None,
min_cells_per_target = 3,
min_targets_per_cell = 2,
similarity_targets = ['DNase']):
similarity_targets = ['DNase'],
assembly=None):
'''
Initializes an EpitomeDataset.
@@ -87,9 +93,10 @@ (ie. DNase, H3K27ac, etc.)
(ie. DNase, H3K27ac, etc.)
'''


if assembly is not None:
self.assembly = assembly
# get directory where h5 file is stored
self.data_dir = EpitomeDataset.get_data_dir(data_dir)
self.data_dir = EpitomeDataset.download_data_dir(data_dir, assembly)
self.h5_path = os.path.join(self.data_dir, EPITOME_H5_FILE)

# save all parameters for any future use
@@ -154,10 +161,13 @@ def __init__(self,
self.valid_chrs = [i.decode() for i in dataset['columns']['index']['valid_chrs'][:]]
self.test_chrs = [i.decode() for i in dataset['columns']['index']['test_chrs'][:]]

self.assembly = dataset['meta']['assembly'][:][0].decode()
dataset_assembly = dataset['meta']['assembly'][:][0].decode()
if assembly is not None:
assert assembly == dataset_assembly, "Different assemblies"
else:
self.assembly = dataset_assembly
self.source = dataset['meta']['source'][:][0].decode()


dataset.close()

def set_train_validation_indices(self, chrom):
@@ -237,7 +247,7 @@ def get_data(self, mode):


@staticmethod
def get_y_indices_for_cell(matrix, cellmap, cell):
def get_y_indices_for_cell(matrix, cellmap, cell):
'''
Gets indices for a cell.
TODO: this function is called in generators.py.
@@ -267,25 +277,65 @@ def get_y_indices_for_target(matrix, targetmap, target):
return np.copy(matrix[:,targetmap[target]])

@staticmethod
def get_data_dir(data_dir=None):
def contains_required_files(data_dir):
# make sure all required files exist
required_paths = [os.path.join(data_dir, x) for x in REQUIRED_FILES]
return np.all([os.path.exists(x) for x in required_paths])

@staticmethod
def get_data_dir(data_dir=None, assembly=None):
'''
Loads data processed from data/download_encode.py. This will check that all required files
exist.
If both data_dir and assembly are set, it will return the data_dir with the specified
assembly. If only the assembly is set, it will return the default data_dir with the specified
assembly. If only the data_dir is set, it will just return the data_path. If neither data_dir
nor assembly are set, it will return the default data_dir with the default assembly.
:param str data_dir: Directory containing data.h5 file saved in data/download_encode.py script.
:param str data_dir: Directory that should contain the data.h5 file.
:param str assembly: Genome assembly that should be saved.
:return: directory containing data.h5 file
:rtype: str
genome assembly of the data
:rtype: tuple
'''
if (data_dir is not None) and (assembly is not None):
epitome_data_dir = os.path.join(data_dir, assembly)
elif (assembly is not None):
epitome_data_dir = os.path.join(DEFAULT_EPITOME_DATA_PATH, assembly)
elif (data_dir is not None):
epitome_data_dir = data_dir
else:
print("Warning: genome assembly was not set in EpitomeDataset. Defaulting assembly to %s." % DEFAULT_EPITOME_ASSEMBLY)
epitome_data_dir = os.path.join(DEFAULT_EPITOME_DATA_PATH, DEFAULT_EPITOME_ASSEMBLY)
assembly = DEFAULT_EPITOME_ASSEMBLY
return epitome_data_dir, assembly

if not data_dir:
data_dir = GET_DATA_PATH()
download_and_unzip(S3_DATA_PATH, os.path.dirname(data_dir))
@staticmethod
def list_genome_assemblies():
return ", ".join(EPITOME_GENOME_ASSEMBLIES)

# make sure all required files exist
required_paths = [os.path.join(data_dir, x) for x in REQUIRED_FILES]
assert(np.all([os.path.exists(x) for x in required_paths]))
@staticmethod
def download_data_dir(data_dir=None, assembly=None):
'''
Loads data processed from data/download_encode.py. This will check that all required files
exist. If both data_dir and assembly are set, it will return the data_dir with the specified
assembly. If only the assembly is set, it will return the default data_dir with the specified
assembly. If only the data_dir is set, it will just return the data_path. If neither data_dir
nor assembly are set, it will return the default data_dir with the default assembly.
return data_dir
:param str data_dir: Directory containing data.h5 file saved in data/download_encode.py script.
:return: directory containing data.h5 file
:rtype: str
'''
epitome_data_dir, assembly = EpitomeDataset.get_data_dir(data_dir, assembly)

if not EpitomeDataset.contains_required_files(epitome_data_dir):
# Grab data directory and download it from S3 if it doesn't have the required files
assert assembly is not None, "Specify assembly to download."
assert assembly in EPITOME_GENOME_ASSEMBLIES, "assembly %s data is not in the S3 cluster. Must be either in %s" % (assembly, EpitomeDataset.list_genome_assemblies())
url_path = os.path.join(S3_DATA_PATH, assembly + ".zip")
download_and_unzip(url_path, epitome_data_dir)
# Make sure all required files exist
assert EpitomeDataset.contains_required_files(epitome_data_dir)
return epitome_data_dir
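
    # Illustrative behaviour of download_data_dir (paths below are hypothetical):
    #   download_data_dir("/my/data", "hg38") resolves to "/my/data/hg38" and, if the
    #   required files are missing there, downloads and unzips
    #   https://epitome-data.s3-us-west-1.amazonaws.com/hg38.zip into that directory.
    #   download_data_dir() falls back to the default assembly under ~/.epitome/data/.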

def list_targets(self):
'''
@@ -297,11 +347,11 @@ def list_targets(self):
'''
return list(self.targetmap)


@staticmethod
def get_assays(targets = None,
cells = None,
data_dir = None,
assembly = None,
min_cells_per_target = 3,
min_targets_per_cell = 2,
similarity_targets = ['DNase']):
@@ -320,8 +370,7 @@ def get_assays(targets = None,
:rtype: tuple
'''

if not data_dir:
data_dir = GET_DATA_PATH()
data_dir = EpitomeDataset.download_data_dir(data_dir, assembly)

data = h5py.File(os.path.join(data_dir, EPITOME_H5_FILE), 'r')

@@ -489,9 +538,9 @@ def save(out_path,
:param list test_chrs: list of test chromsomes, str
'''

if os.path.exists(os.path.join(out_path, EPITOME_H5_FILE)):
raise Exception("%s already exists at %s" % (EPITOME_H5_FILE, out_path))
epitome_data_dir, __ = EpitomeDataset.get_data_dir(out_path, assembly)
if os.path.exists(os.path.join(epitome_data_dir, EPITOME_H5_FILE)):
raise Exception("%s already exists at %s" % (EPITOME_H5_FILE, epitome_data_dir))

# assertions
assert all_data.dtype == np.dtype('int8'), "all_data type should be int8"
@@ -504,12 +553,12 @@ def save(out_path,
assert np.all([type(i)==str for i in valid_chrs]), "valid_chrs elements must be type string"
assert np.all([type(i)==str for i in test_chrs]), "test_chrs elements must be type string"

if not os.path.exists(out_path):
os.makedirs(out_path)
if not os.path.exists(epitome_data_dir):
os.makedirs(epitome_data_dir)

try:
# toy dataset with everything in it
new_data = h5py.File(os.path.join(out_path, EPITOME_H5_FILE), 'w')
new_data = h5py.File(os.path.join(epitome_data_dir, EPITOME_H5_FILE), 'w')

# 0. set data
data = new_data.create_dataset("data", all_data.shape, dtype=all_data.dtype, compression="gzip", compression_opts=9)
@@ -593,7 +642,6 @@ def save(out_path,
compression="gzip", compression_opts=9)
source_ds[:]=source.encode()


# 4. Make sure we have all the correct keys
keys = sorted(set(EpitomeDataset.all_keys(new_data)))
assert np.all([i in keys for i in REQUIRED_KEYS ]), "Error: missing required keys for new dataset"