Commit

fixed documentation bugs
akmorrow13 committed Apr 20, 2020
1 parent 9090807 commit 839a880
Showing 8 changed files with 72 additions and 50 deletions.
5 changes: 2 additions & 3 deletions docs/conf.py
@@ -40,9 +40,8 @@
     "matplotlib.pyplot",
     "matplotlib.backends",
     "matplotlib.lines",
-    "matplotlib.transforms"
-
-
+    "matplotlib.transforms",
+    "sklearn.calibration"
 ]
 
 for mod_name in MOCK_MODULES:
Binary file removed docs/figures/epitome_diagram.png
Binary file not shown.
Binary file added docs/figures/epitome_diagram_celllines.png
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -5,7 +5,7 @@ Introduction
 
 Epitome is a computational model that leverages chromatin accessibility data to predict ChIP-seq peaks on unseen cell types. Epitome computes the chromatin similarity between 11 cell types in ENCODE and the novel cell types, and uses chromatin similarity to transfer ChIP-seq peaks in known cell types to a novel cell type of interest.
 
-.. image:: figures/epitome_diagram.png
+.. image:: figures/epitome_diagram_celllines.png
 
 
 .. toctree::
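The similarity-and-transfer idea in this introduction can be sketched in plain Python. This is a toy illustration of the concept only — Epitome's real model is a trained neural network, and the helper names (`jaccard`, `transfer_peaks`) are hypothetical, not part of the Epitome API:

```python
def jaccard(a, b):
    """Jaccard similarity between two binary accessibility vectors."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def transfer_peaks(novel_accessibility, reference):
    """Score ChIP-seq peaks for a novel cell type by weighting each
    reference cell type's peak calls by its chromatin similarity."""
    weights = {ct: jaccard(novel_accessibility, acc)
               for ct, (acc, _) in reference.items()}
    total = sum(weights.values()) or 1.0
    scores = [0.0] * len(novel_accessibility)
    for ct, (acc, peaks) in reference.items():
        w = weights[ct] / total
        for i, p in enumerate(peaks):
            scores[i] += w * p
    return scores

# Toy example: two reference cell types, binary DNase accessibility
# and binary ChIP-seq peak calls over 5 genomic bins.
reference = {
    "GM12878": ([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]),
    "K562":    ([0, 1, 1, 0, 0], [0, 1, 1, 0, 0]),
}
novel = [1, 1, 0, 0, 1]
print(transfer_peaks(novel, reference))
```

Bins whose peaks come from the more similar reference cell type receive higher scores.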
31 changes: 20 additions & 11 deletions docs/installation/source.rst
@@ -1,5 +1,5 @@
-Building Epitome from Source
-============================
+Installing Epitome
+==================
 
 **Note**: Epitome is configured for tensorflow 2/Cuda 9. If you have a different
 version of cuda, update tensorflow-gpu version accordingly.
@@ -10,6 +10,7 @@ Requirements
 * `conda <https://docs.conda.io/en/latest/miniconda.html>`__
 * python 3.7
 
+
 Installation
 ------------
 
@@ -20,25 +21,33 @@ Installation
    conda create --name EpitomeEnv python=3.7
    source activate EpitomeEnv
 
-2. Get Epitome code:
+2. Install Epitome from Pypi:
 
 .. code:: bash
 
-   git clone [email protected]:akmorrow13/epitome.git
-   cd epitome
+   pip install epitome
+
+From Source
+-----------
+
+1. Create and activate a python 3.7 conda venv:
+
+.. code:: bash
+
+   conda create --name EpitomeEnv python=3.7
+   source activate EpitomeEnv
+
+2. Get Epitome code:
+
+.. code:: bash
+
+   git clone https://github.com/YosefLab/epitome.git
+   cd epitome
 
 3. Install Epitome and its requirements
 
 .. code:: bash
 
    pip install -e .
-
-Configuring Data
-----------------
-
-Epitome requires data for training, validation and test. See `Configuring Epitome data <../usage/data.html>`__ for more information
-on how to download data for Epitome or generate your own dataset.
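After either install path, a quick sanity check confirms the package is importable from the active environment. This is a generic snippet; `check_install` is an illustrative helper, not an Epitome utility:

```python
import importlib.util

def check_install(package="epitome"):
    """Return True if the given package is importable in this environment."""
    return importlib.util.find_spec(package) is not None

if __name__ == "__main__":
    # Prints a hint instead of raising if the package is missing.
    status = "installed" if check_install() else "missing - activate EpitomeEnv first"
    print("epitome:", status)
```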
21 changes: 11 additions & 10 deletions docs/usage/data.rst
@@ -2,31 +2,32 @@ Configuring data
 ================
 
 Epitome pre-processes ChIP-seq peaks and DNase-seq peaks from ENCODE for usage
-in the Epitome models. Pre-processed data can be downloaded from:
+in the Epitome models. Pre-processed datasets are lazily downloaded from `Amazon S3 <https://epitome-data.s3-us-west-1.amazonaws.com/data.zip>`__ when users run an Epitome model.
 
-TODO: upload to AWS
-
-This data contains the following:
-- train.npz, valid.npz, and test.npz: compressed numpy data matrices. Valid.npz includes chr7 data, test.npz includes chr8 and chr8,
-and train.npz includes data from all other chromosomes.
-- all.pos.bed.gz: gunzipped genomic regions matching the numpy data matrices
-- feature_name: ChIP-seq and DNase-seq peaks corresponding to the data matrix.
+This dataset contains the following files:
+
+- **train.npz, valid.npz, and test.npz**: compressed numpy data matrices. Valid.npz includes chr7 data, test.npz includes chr8 and chr9, and train.npz includes data from all other chromosomes.
+
+- **all.pos.bed.gz**: gzipped genomic regions matching the numpy data matrices.
+
+- **feature_name**: ChIP-seq and DNase-seq peaks corresponding to the data matrix.
 
 
 Generating data for Epitome
 ---------------------------
 
-TODO: need to add this script as a binary in the module.
-
 You can generate your own Epitome dataset from ENCODE using the following command:
+``download_encode.py``.
 
 .. code:: bash
 
-   python get_deepsea_data.py -h
+   python download_encode.py -h
+   usage: download_encode.py [-h] [--metadata_url METADATA_URL]
+                             [--min_chip_per_cell MIN_CHIP_PER_CELL]
+                             [--regions_file REGIONS_FILE]
+                             download_path {hg19,mm10,GRCh38} bigBedToBed
+                             output_path
+
+TODO: need to add this script as a binary in the module.
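The chromosome-based split described above (chr7 for validation, chr8/chr9 for test, everything else for training) can be sketched as follows. `split_regions` is a hypothetical helper for illustration, not part of the dataset-generation script:

```python
# Split mirroring the dataset layout: chr7 -> valid, chr8/chr9 -> test.
VALID_CHRS = {"chr7"}
TEST_CHRS = {"chr8", "chr9"}

def split_regions(regions):
    """Partition (chrom, start, end) tuples into train/valid/test by chromosome."""
    split = {"train": [], "valid": [], "test": []}
    for region in regions:
        chrom = region[0]
        if chrom in VALID_CHRS:
            split["valid"].append(region)
        elif chrom in TEST_CHRS:
            split["test"].append(region)
        else:
            split["train"].append(region)
    return split

regions = [("chr1", 0, 200), ("chr7", 0, 200), ("chr8", 0, 200), ("chr9", 400, 600)]
print({k: len(v) for k, v in split_regions(regions).items()})
```

Holding out whole chromosomes (rather than random regions) keeps the evaluation free of positional leakage between train and test.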
8 changes: 5 additions & 3 deletions docs/usage/predict.rst
@@ -12,17 +12,19 @@ To get predictions on the whole genome, run:
 
 .. code:: python
 
    peak_result = model.score_whole_genome(peak_file, # chromatin accessibility peak file
-       output_path, # where to save results
-       chrs=["chr8","chr9"]) # chromosomes you would like to score. Leave blank for whole genome.
+                                          output_path, # where to save results
+                                          chrs=["chr8","chr9"]) # chromosomes you would like to score. Leave blank for whole genome.
 
 **Note:** Scoring on the whole genome scores about 7 million regions and takes about 1.5 hours.
 
 TODO: talk about including histone modification files.
 
 
 You can also get predictions on specific genomic regions:
 
 .. code:: python
 
    results = model.score_peak_file(peak_file, # chromatin accessibility peak file
-       regions_file) # bed file of regions to score
+                                   regions_file) # bed file of regions to score
 
 This method returns a dataframe of the scored predictions.
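The `regions_file` passed to `score_peak_file` is a standard BED file. A minimal parser shows the expected layout (an illustrative sketch — Epitome handles BED parsing internally):

```python
import io

def read_bed(handle):
    """Parse a minimal 3-column BED file into (chrom, start, end) tuples.
    BED coordinates are 0-based, half-open; track/comment lines are skipped."""
    regions = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

bed = io.StringIO("track name=peaks\nchr8\t100\t600\nchr9\t1000\t1400\n")
print(read_bed(bed))
```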
55 changes: 33 additions & 22 deletions docs/usage/train.rst
@@ -1,12 +1,12 @@
 Training an Epitome Model
 =========================
 
-Once you have `installed Epitome and configured the data <../installation/source.html>`__, you are ready to train a model.
+Once you have `installed Epitome <../installation/source.html>`__, you are ready to train a model.
 
 Training a Model
 ----------------
 
-First, import Epitome and specified the `path to Epitome data: <./data.html>`__
+First, import Epitome:
 
 .. code:: python
@@ -16,36 +16,45 @@ First, import Epitome and specified the `path to Epitome data: <./data.html>`__
 
    from epitome.functions import *
    from epitome.viz import *
 
-   epitome_data_path = <path_to_dataset>
+Quick Start
+^^^^^^^^^^^
 
-To train a model, you will need to first specify the assays and cell lines you would like to train with:
+First, define the assays you would like to train. Then you can create a `VLP` model:
 
 TODO: have a function that lists all of the assays you can build a model from.
 
 .. code:: python
 
    assays = ['CTCF','RAD21','SMC3']
+   model = VLP(assays, test_celltypes = ["K562"]) # cell line reserved for testing
+
+To train a model on a specific set of targets and cell lines, you will need to first specify the assays and cell lines you would like to train with:
 
 .. code:: python
 
-   matrix, cellmap, assaymap = get_assays_from_feature_file(feature_path=os.path.join(epitome_data_path,'feature_name'),
-                                                            eligible_assays = None,
-                                                            eligible_cells = None,
-                                                            min_assays_per_cell=2,
-                                                            min_cells_per_assay=2)
+   matrix, cellmap, assaymap = get_assays_from_feature_file(eligible_assays = None,
+                                                            eligible_cells = None,
+                                                            min_assays_per_cell=6,
+                                                            min_cells_per_assay=8)
 
    # visualize cell lines and ChIP-seq peaks you have selected
    plot_assay_heatmap(matrix, cellmap, assaymap)
 
-Next train a model for 5000 iterations:
+Next define a model:
 
 .. code:: python
 
-   # for each, train a model
-   shuffle_size = 2
-   model = MLP(epitome_data_path,
-               ["K562"], # cell line reserved for testing
-               matrix,
-               assaymap,
-               cellmap,
-               shuffle_size=shuffle_size,
-               prefetch_size = 64,
-               debug = False,
-               batch_size=64)
+   model = VLP(list(assaymap),
+               matrix = matrix,
+               assaymap = assaymap,
+               cellmap = cellmap,
+               test_celltypes = ["K562"]) # cell line reserved for testing
+
+Next, train the model. Here, we train the model for 5000 iterations:
 
 .. code:: python
 
    model.train(5000)
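The `min_assays_per_cell` and `min_cells_per_assay` arguments used above prune the cell-by-assay support matrix. A dependency-free sketch of that pruning logic (illustrative only; `filter_matrix` is not the Epitome implementation):

```python
def filter_matrix(available, min_assays_per_cell=2, min_cells_per_assay=2):
    """available: dict mapping cell type -> set of assays with data.
    Repeatedly drop cells and assays below the thresholds until stable,
    since removing one can push the other under its minimum."""
    cells = {c: set(a) for c, a in available.items()}
    changed = True
    while changed:
        changed = False
        # Drop cells supported by too few assays.
        for c in [c for c, a in cells.items() if len(a) < min_assays_per_cell]:
            del cells[c]
            changed = True
        # Drop assays present in too few cells.
        counts = {}
        for a in (x for s in cells.values() for x in s):
            counts[a] = counts.get(a, 0) + 1
        weak = {a for a, n in counts.items() if n < min_cells_per_assay}
        if weak:
            for c in cells:
                cells[c] -= weak
            changed = True
    return cells

kept = filter_matrix({"K562": {"DNase", "CTCF", "RAD21"},
                      "GM12878": {"DNase", "CTCF"},
                      "HepG2": {"DNase"}})
print(kept)
```

Raising the thresholds (e.g. 6 and 8, as in the snippet above) trades coverage for a denser, better-supported training matrix.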
@@ -54,6 +63,8 @@ You can then evaluate model performance on held out test cell lines specified in
 
 .. code:: python
 
-   results = model.test(self, 10000, log=True)
+   results = model.test(10000,
+                        mode = Dataset.TEST,
+                        calculate_metrics=True)
 
 The output of `results` will contain the predictions and truth values, a dictionary of assay specific performance metrics, and the average auROC and auPRC across all evaluated assays.
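The average auROC reported in `results` follows the standard rank-based definition, which can be computed with no dependencies (a sketch, not Epitome's metrics code):

```python
def auroc(y_true, y_score):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a random positive outscores a random negative,
    counting ties as half."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation of peak labels gives an auROC of 1.0.
print(auroc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
```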
