Genome Manager (#81)
* genome manager

Co-authored-by: Jahnavi Singh <[email protected]>
Co-authored-by: akmorrow13 <[email protected]>
3 people authored Aug 11, 2021
1 parent f823bfd commit a326c67
Showing 16 changed files with 455 additions and 283 deletions.
15 changes: 9 additions & 6 deletions docs/usage/create_dataset.rst
@@ -71,14 +71,17 @@ You can generate your own Epitome dataset from ENCODE using the following comman
http://hgdownload.cse.ucsc.edu/admin/exe/
To use your own dataset in an Epitome model, make sure to set the environment variable
``EPITOME_DATA_PATH`` so that it points to your custom dataset. This will tell Epitome where to load
data from.
To use your own dataset in an Epitome model, make sure to specify the ``data_dir``
and/or ``assembly`` variables when creating the ``EpitomeDataset`` class. This
will tell Epitome where to load data from. If neither variable is specified,
the default assembly will be downloaded from the Epitome AWS S3 cluster into the
default data directory on your machine. See `Load your processed dataset <./dataset.html>`__ for more details.

.. code:: bash
.. code:: python
from epitome.dataset import *
import os
os.environ["EPITOME_DATA_PATH"] = 'path/to/my/epitome/dataset'
dataset = EpitomeDataset(data_dir="path/to/configured/data", assembly="hg19")
...
57 changes: 53 additions & 4 deletions docs/usage/dataset.rst
@@ -15,8 +15,8 @@ Create an Epitome Dataset
-------------------------

First, create an Epitome Dataset. In the dataset, you will define the
ChIP-seq targets you want to predict, the cell types you want to train from,
and the assays you want to use to compute cell type similarity.
ChIP-seq targets you want to predict, the cell types you want to train from and
the assays you want to use to compute cell type similarity.

.. code:: python
@@ -32,7 +32,6 @@ specify the minimum number of cells required for a ChIP-seq target, and the mini
number of ChIP-seq targets required to include a celltype. By default,
``min_cells_per_target = 3`` and ``min_targets_per_cell = 2``.
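
As a quick sketch of how these thresholds can be relaxed (the values below are
illustrative, not recommendations):

.. code:: python

   # Keep sparser targets and cell types by lowering the default thresholds
   dataset = EpitomeDataset(targets=['CTCF','RAD21','SMC3'],
                            min_cells_per_target=2,
                            min_targets_per_cell=1)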


.. code:: python
targets = ['CTCF','RAD21','SMC3']
@@ -48,7 +47,9 @@ cell type similarity, you can specify in the Epitome dataset:

.. code:: python
dataset = EpitomeDataset(targets=targets, cells=celltypes, similarity_targets = ['DNase', 'H3K27ac'])
dataset = EpitomeDataset(targets=targets,
cells=celltypes,
similarity_targets = ['DNase', 'H3K27ac'])
You can then visualize the ChIP-seq targets and cell types in your dataset by
using the ``view()`` function:
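
A minimal sketch of this call (assuming ``view()`` is invoked directly on the
dataset, as the text above suggests):

.. code:: python

   # Plot which ChIP-seq targets are available in which cell types
   dataset.view()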
@@ -72,3 +73,51 @@ in the dataset:
You can now use your ``dataset`` in an Epitome model.

Load your processed dataset
---------------------------
You can specify the data path and/or genome assembly that you would like to use
in the Epitome dataset. You just need to define the ``data_dir`` and/or
``assembly`` variables:

.. code:: python
dataset = EpitomeDataset(data_dir="path/to/configured/data",
assembly="hg19")
Note that if both ``data_dir`` and ``assembly`` are set, the dataset will
append the specified assembly to the ``data_dir`` path (for example,
``path/to/configured/data/hg19/data.h5``) and return the dataset stored at that
path if it exists. If there is no data stored at that path, Epitome will
try to download the specified assembly from the S3 cluster at
https://epitome-data.s3-us-west-1.amazonaws.com.
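
As a sketch of this path resolution (the directory is illustrative; ``h5_path``
is the attribute that holds the resolved file path):

.. code:: python

   dataset = EpitomeDataset(data_dir="path/to/configured/data", assembly="hg19")
   # the file is read from (or downloaded to) <data_dir>/<assembly>/data.h5
   print(dataset.h5_path)  # should print something like path/to/configured/data/hg19/data.h5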

You do not need to define both variables, though. If you leave ``data_dir`` empty,
the Epitome dataset will append the ``assembly`` to the default data path located
at ``~/.epitome/data/`` and return the dataset if it exists at that path.
If there is no existing dataset located at the data path, Epitome will download
the dataset for the specified assembly from S3 to that path:

.. code:: python
dataset = EpitomeDataset(assembly="hg19")
If the assembly is not specified but the ``data_dir`` is, the dataset will treat
the specified ``data_dir`` as the full data path. Like above, if the
dataset exists at the configured data path, Epitome will load the configured data
into the EpitomeDataset. If there is no existing dataset, Epitome will download
the dataset for the default assembly from S3 and store it at the configured data path:

.. code:: python
dataset = EpitomeDataset(data_dir="path/to/configured/data")
If neither ``data_dir`` nor ``assembly`` is set, the dataset will just try to
fetch the ``data.h5`` file in the default data directory. If no data exists in
the default directory, Epitome will download the dataset for the default assembly
from S3 and store it at the default data path:

.. code:: python
dataset = EpitomeDataset()
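
Whichever way you load it, you can check what was resolved (a short sketch; the
printed values depend on your dataset):

.. code:: python

   dataset = EpitomeDataset(assembly="hg19")
   print(dataset.assembly)        # e.g. 'hg19'
   print(dataset.list_targets())  # ChIP-seq targets kept in this dataset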
4 changes: 2 additions & 2 deletions docs/usage/train.rst
@@ -68,7 +68,7 @@ loss stops improving, the model will stop training early:
If you are concerned about the model above overtraining because the model continues
to improve by minuscule amounts, you can specify the min-delta, which is the minimum
change in the train-validation loss required to qualify as an improvement. In the
model below, a minimum improvement of at least 0.1 is required for the model to
model below, a minimum improvement of at least 0.01 is required for the model to
qualify as improving.

If you are concerned about the model above under-fitting (stopping training too
@@ -86,7 +86,7 @@ be found in the `Github repo <https://github.com/YosefLab/epitome>`__.
best_model_batches, total_trained_batches, train_valid_losses = model.train(5000,
patience = 3,
min_delta = 0.1)
min_delta = 0.01)
Test the Model
----------------
19 changes: 1 addition & 18 deletions epitome/__init__.py
@@ -17,24 +17,7 @@
.. automodule:: epitome.conversion
"""
import os

from os.path import expanduser

__path__ = __import__('pkgutil').extend_path(__path__, __name__)


def GET_EPITOME_USER_PATH():
return os.path.join(os.path.expanduser('~'), '.epitome')

def GET_DATA_PATH():
"""
Check if user has set env variable that specifies data path.
Otherwise, use default location.
Returns:
location of epitome data with all required files
"""
if os.environ.get("EPITOME_DATA_PATH") is not None:
return os.environ["EPITOME_DATA_PATH"]
else:
return os.path.join(GET_EPITOME_USER_PATH(),'data','hg19')
3 changes: 0 additions & 3 deletions epitome/conversion.py
@@ -97,10 +97,7 @@ def get_binary_vector(self, vector = None):
assert vector.shape[0] == len(self.compare), "Error: value_vector must be the same shape as self.compare"

base_vector = np.zeros((len(self.base)))
print(base_vector.shape)

tmp = self._get_overlap()
print(tmp)

base_indices = tmp.idx_base.values
convert_indices = tmp.idx.values
112 changes: 80 additions & 32 deletions epitome/dataset.py
@@ -21,16 +21,21 @@
from sklearn.metrics import jaccard_score

# local imports
from epitome import *
from .constants import Dataset
from .functions import download_and_unzip
from .viz import plot_assay_heatmap

################### File accession constants #######################
S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'
S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com'

# os env that should be set by user to explicitly set the data path
EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"
# List of available assemblies in S3_DATA_PATH
EPITOME_GENOME_ASSEMBLIES = ['hg19', 'hg38', 'test']
# default genome assembly
DEFAULT_EPITOME_ASSEMBLY = "hg19"
# default path to where all epitome related information is stored
EPITOME_USER_PATH = os.path.join(os.path.expanduser('~'), '.epitome')
# default path to where epitome data is installed. Subdirectory of EPITOME_USER_PATH
DEFAULT_EPITOME_DATA_PATH = os.path.join(EPITOME_USER_PATH, 'data')

# data files required by epitome
# data.h5 contains data, row information (celltypes and targets) and
@@ -71,7 +76,8 @@ def __init__(self,
cells = None,
min_cells_per_target = 3,
min_targets_per_cell = 2,
similarity_targets = ['DNase']):
similarity_targets = ['DNase'],
assembly=None):
'''
Initializes an EpitomeDataset.
@@ -87,9 +93,10 @@ (ie. DNase, H3K27ac, etc.)
(ie. DNase, H3K27ac, etc.)
'''


if assembly is not None:
self.assembly = assembly
# get directory where h5 file is stored
self.data_dir = EpitomeDataset.get_data_dir(data_dir)
self.data_dir = EpitomeDataset.download_data_dir(data_dir, assembly)
self.h5_path = os.path.join(self.data_dir, EPITOME_H5_FILE)

# save all parameters for any future use
@@ -154,10 +161,13 @@ def __init__(self,
self.valid_chrs = [i.decode() for i in dataset['columns']['index']['valid_chrs'][:]]
self.test_chrs = [i.decode() for i in dataset['columns']['index']['test_chrs'][:]]

self.assembly = dataset['meta']['assembly'][:][0].decode()
dataset_assembly = dataset['meta']['assembly'][:][0].decode()
if assembly is not None:
assert assembly == dataset_assembly, "Different assemblies"
else:
self.assembly = dataset_assembly
self.source = dataset['meta']['source'][:][0].decode()


dataset.close()

def set_train_validation_indices(self, chrom):
@@ -237,7 +247,7 @@ def get_data(self, mode):


@staticmethod
def get_y_indices_for_cell(matrix, cellmap, cell):
def get_y_indices_for_cell(matrix, cellmap, cell):
'''
Gets indices for a cell.
TODO: this function is called in generators.py.
@@ -267,25 +277,65 @@ def get_y_indices_for_target(matrix, targetmap, target):
return np.copy(matrix[:,targetmap[target]])

@staticmethod
def get_data_dir(data_dir=None):
def contains_required_files(data_dir):
# make sure all required files exist
required_paths = [os.path.join(data_dir, x) for x in REQUIRED_FILES]
return np.all([os.path.exists(x) for x in required_paths])

@staticmethod
def get_data_dir(data_dir=None, assembly=None):
'''
Loads data processed from data/download_encode.py. This will check that all required files
exist.
If both data_dir and assembly are set, it will return the data_dir with the specified
assembly. If only the assembly is set, it will return the default data_dir with the specified
assembly. If only the data_dir is set, it will just return the data_path. If neither data_dir
nor assembly are set, it will return the default data_dir with the default assembly.
:param str data_dir: Directory containing data.h5 file saved in data/download_encode.py script.
:param str data_dir: Directory that should contain the data.h5 file.
:param str assembly: Genome assembly that should be saved.
:return: directory containing data.h5 file
:rtype: str
genome assembly of the data
:rtype: tuple
'''
if (data_dir is not None) and (assembly is not None):
epitome_data_dir = os.path.join(data_dir, assembly)
elif (assembly is not None):
epitome_data_dir = os.path.join(DEFAULT_EPITOME_DATA_PATH, assembly)
elif (data_dir is not None):
epitome_data_dir = data_dir
else:
print("Warning: genome assembly was not set in EpitomeDataset. Defaulting assembly to %s." % DEFAULT_EPITOME_ASSEMBLY)
epitome_data_dir = os.path.join(DEFAULT_EPITOME_DATA_PATH, DEFAULT_EPITOME_ASSEMBLY)
assembly = DEFAULT_EPITOME_ASSEMBLY
return epitome_data_dir, assembly

if not data_dir:
data_dir = GET_DATA_PATH()
download_and_unzip(S3_DATA_PATH, os.path.dirname(data_dir))
@staticmethod
def list_genome_assemblies():
return ", ".join(EPITOME_GENOME_ASSEMBLIES)

# make sure all required files exist
required_paths = [os.path.join(data_dir, x) for x in REQUIRED_FILES]
assert(np.all([os.path.exists(x) for x in required_paths]))
@staticmethod
def download_data_dir(data_dir=None, assembly=None):
'''
Loads data processed from data/download_encode.py. This will check that all required files
exist. If both data_dir and assembly are set, it will return the data_dir with the specified
assembly. If only the assembly is set, it will return the default data_dir with the specified
assembly. If only the data_dir is set, it will just return the data_path. If neither data_dir
nor assembly are set, it will return the default data_dir with the default assembly.
return data_dir
:param str data_dir: Directory containing data.h5 file saved in data/download_encode.py script.
:return: directory containing data.h5 file
:rtype: str
'''
epitome_data_dir, assembly = EpitomeDataset.get_data_dir(data_dir, assembly)

if not EpitomeDataset.contains_required_files(epitome_data_dir):
# Grab data directory and download it from S3 if it doesn't have the required files
assert assembly is not None, "Specify assembly to download."
assert assembly in EPITOME_GENOME_ASSEMBLIES, "assembly %s data is not in the S3 cluster. Must be either in %s" % (assembly, EpitomeDataset.list_genome_assemblies())
url_path = os.path.join(S3_DATA_PATH, assembly + ".zip")
download_and_unzip(url_path, epitome_data_dir)
# Make sure all required files exist
assert EpitomeDataset.contains_required_files(epitome_data_dir)
return epitome_data_dir
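
    # Illustrative behaviour of download_data_dir (paths below are hypothetical):
    #   download_data_dir("/my/data", "hg38") resolves to "/my/data/hg38" and, if the
    #   required files are missing there, downloads and unzips
    #   https://epitome-data.s3-us-west-1.amazonaws.com/hg38.zip into that directory.
    #   download_data_dir() falls back to the default assembly under ~/.epitome/data/.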

def list_targets(self):
'''
@@ -297,11 +347,11 @@ def list_targets(self):
'''
return list(self.targetmap)


@staticmethod
def get_assays(targets = None,
cells = None,
data_dir = None,
assembly = None,
min_cells_per_target = 3,
min_targets_per_cell = 2,
similarity_targets = ['DNase']):
@@ -320,8 +370,7 @@ def get_assays(targets = None,
:rtype: tuple
'''

if not data_dir:
data_dir = GET_DATA_PATH()
data_dir = EpitomeDataset.download_data_dir(data_dir, assembly)

data = h5py.File(os.path.join(data_dir, EPITOME_H5_FILE), 'r')

@@ -489,9 +538,9 @@ def save(out_path,
:param list test_chrs: list of test chromsomes, str
'''

if os.path.exists(os.path.join(out_path, EPITOME_H5_FILE)):
raise Exception("%s already exists at %s" % (EPITOME_H5_FILE, out_path))
epitome_data_dir, __ = EpitomeDataset.get_data_dir(out_path, assembly)
if os.path.exists(os.path.join(epitome_data_dir, EPITOME_H5_FILE)):
raise Exception("%s already exists at %s" % (EPITOME_H5_FILE, epitome_data_dir))

# assertions
assert all_data.dtype == np.dtype('int8'), "all_data type should be int8"
@@ -504,12 +553,12 @@ def save(out_path,
assert np.all([type(i)==str for i in valid_chrs]), "valid_chrs elements must be type string"
assert np.all([type(i)==str for i in test_chrs]), "test_chrs elements must be type string"

if not os.path.exists(out_path):
os.makedirs(out_path)
if not os.path.exists(epitome_data_dir):
os.makedirs(epitome_data_dir)

try:
# toy dataset with everything in it
new_data = h5py.File(os.path.join(out_path, EPITOME_H5_FILE), 'w')
new_data = h5py.File(os.path.join(epitome_data_dir, EPITOME_H5_FILE), 'w')

# 0. set data
data = new_data.create_dataset("data", all_data.shape, dtype=all_data.dtype, compression="gzip", compression_opts=9)
@@ -593,7 +642,6 @@ def save(out_path,
compression="gzip", compression_opts=9)
source_ds[:]=source.encode()


# 4. Make sure we have all the correct keys
keys = sorted(set(EpitomeDataset.all_keys(new_data)))
assert np.all([i in keys for i in REQUIRED_KEYS ]), "Error: missing required keys for new dataset"