First commit
xapple committed May 19, 2021
0 parents commit 25b78c7
Showing 29 changed files with 4,441 additions and 0 deletions.
24 changes: 24 additions & 0 deletions .gitignore
@@ -0,0 +1,24 @@
# Temporary files #
.DS_Store
*.pyc
*.swp
*.swo
*~
\#*
.#*
._*

# IDE files #
/.idea

# Distribution #
/build/
/dist/
/crest4.egg-info/

# Testing #
/.pytest_cache/

# The results of the tests #
*.hits
assignments.txt
674 changes: 674 additions & 0 deletions LICENSE.txt

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
include README.md
include LICENSE.txt
include pytest.ini
202 changes: 202 additions & 0 deletions README.md
@@ -0,0 +1,202 @@
# CREST version 4.0.1

`crest4` is a python package for automatically assigning taxonomic names to DNA sequences obtained from environmental sequencing.

<p style="text-align:center">
<img height="143" src="docs/logo.png?raw=true" alt="CREST Logo">
</p>

More specifically, the acronym CREST stands for "Classification Resources for Environmental Sequence Tags" and is a collection of software and databases for taxonomic classification of environmental marker genes obtained from community sequencing studies. Such studies are also known as "meta-genomics", "meta-transcriptomics", "meta-barcoding", "taxonomic profiling" or "phylogenetic profiling".

Simply put, given the following fragment of a 16S rRNA sequence from an uncultured microbe:

TGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATACCGCGTGAGGGAGGAAGGCCTTAGGGTT
GTAAACCTCTTTTCTCTGGGAAGAAGATCTGACGGTACCAGAGGAATAAGCCTCGGCTAACTCCGTGCCA
GCAGCCGCGGTAAGACGGAGGAGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGTAGGCGGTT
AATTAAGTCTGTTGTTAAAGCCCACAGCTCAACTGTGGATCGGCAATGGAAACTGGTTGACTAGAGTGTG
GTAGGGGTAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCG

`crest4` will tell you that this gene most likely originates from the following taxonomic order:

Bacteria; Terrabacteria; Cyanobacteria; Oxyphotobacteria; Synechococcales

To produce this result, each input sequence is compared against a built-in reference database of marker genes (such as the SSU rRNA), and the exact position in the phylogenetic tree of life of every high similarity hit is recorded.

Insert brief description of the LCA method and comparison with kmers.


### Citation

If you use CREST in your research, please cite [this publication](https://dx.plos.org/10.1371/journal.pone.0049334):

CREST - Classification Resources for Environmental Sequence Tags, PLoS ONE, 7:e49334
Lanzén A, Jørgensen SL, Huson D, Gorfer M, Grindhaug SH, Jonassen I, Øvreås L, Urich T (2012)


## Installing

Since `crest4` is written in Python, it is compatible with all major operating systems: Linux, macOS and Windows. The only prerequisite is `python3`, which is often installed by default. Simply type one of the two following commands in your terminal, depending on which package manager you prefer to use:

$ pip3 install crest4
$ conda install crest4 -c conda-forge

Once the installation completes, you are ready to use the `crest4` executable command from the shell. The reference databases are downloaded automatically during the first run, so this might take some time depending on your internet connection.

* If you do not have `pip3` on your system you can refer to [this section](docs/installing_tips.md#obtaining-pip3).
* If you do not have `python3` on your system you can refer to [this other section](docs/installing_tips.md#obtaining-python3).
* If none of the above has enabled you to install `crest4`, please open an issue on [the bug tracker](https://github.com/xapple/crest4/issues) and we will get back to you shortly.

### Database location

To download the databases that are used in the classification algorithm, `crest4` needs somewhere to write to on the filesystem. This will default to your home directory at: `~/.crest4/`. If you wish to change this, simply set the environment variable `$CREST4_DIR` to another directory path prior to execution.
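
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the package's actual code: the helper name `resolve_crest4_dir` is hypothetical, and only the `CREST4_DIR` variable and the `~/.crest4/` default come from this README.

```python
import os
from pathlib import Path

def resolve_crest4_dir(environ=None):
    """Return the directory where the databases would be stored.

    Hypothetical helper: it mirrors the behavior described in the README,
    not necessarily the way `crest4` implements it internally.
    """
    environ = os.environ if environ is None else environ
    # Use $CREST4_DIR when it is set and non-empty, else fall back to ~/.crest4
    custom = environ.get("CREST4_DIR")
    base = Path(custom) if custom else Path("~/.crest4")
    # Expand a leading '~' to the current user's home directory
    return base.expanduser()

print(resolve_crest4_dir({"CREST4_DIR": "/tmp/crest_db"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try out without modifying your real environment.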


## Usage

Below are some examples that illustrate the various ways to use this package.

crest4 -f sequences.fasta

Simply specifying a FASTA file is sufficient, and `crest4` will automatically choose default values for all other parameters. The results will be placed in a sub-directory inside the same directory as the FASTA file. Refer to the [results](#results) section below for more information.

To change the output directory, specify the following option:

crest4 -f sequences.fasta --output_dir=~/data/results/crest_test/

To parallelize the sequence similarity search with 32 processes use this option:

crest4 -f sequences.fasta --num_threads=32

Silvamod is the default reference database. To use another database, e.g. Greengenes, pass the `--search_db` option followed by the database name:

crest4 -f sequences.fasta --search_db=greengenes

### All options

The full list of options is as follows:

```
Optional arguments:
  --search_algo ALGORITHM, -a ALGORITHM
                        The algorithm used for the sequence similarity search
                        that will be run to match the sequences against the
                        database chosen. Either 'blast' or 'vsearch'. No
                        other values are currently supported. By default
                        'blast'.
  --num_threads NUM, -t NUM
                        The number of processors to use for the sequence
                        similarity search. By default parallelism is turned
                        off and this value is 1. If you pass the value `True`
                        we will run as many processes as there are CPUs but
                        no more than 32.
  --search_db DATABASE, -d DATABASE
                        The database used for the sequence similarity search.
                        Either 'silvamod128' or 'greengenes'. No other values
                        are currently supported. By default 'silvamod128'.
  --output_dir DIR, -o DIR
                        The directory into which all the classification
                        results will be written. This defaults to a
                        directory with the same name as the original FASTA
                        file and a `.crest4` suffix appended.
  --search_hits PATH, -s PATH
                        The path where the search results will be stored.
                        This defaults to the output directory. However,
                        if the search operation has already been completed
                        beforehand, specify the path here to skip the
                        sequence similarity search step and go directly to
                        the taxonomy step.
  --min_score MIN_SCORE, -m MIN_SCORE
                        a
  --score_drop SCORE_DROP, -c SCORE_DROP
                        a
  --version, -v         Show program's version number and exit.
  --help, -h            Show this help message and exit.
Required arguments:
  --fasta PATH, -f PATH
                        The path to a single FASTA or FASTQ file as a string.
                        These are the sequences that will be taxonomically
                        classified. The file can be gzipped or not.
```

### Python API

If you want to integrate `crest4` directly into your python pipeline, you may do so by accessing the convenient `Classify` object as follows:

from crest4 import Classify
tax = Classify('~/data/sequences.fasta', num_threads=16)
output = tax()
print(output)

The arguments accepted are the same as those of the command line version, and are detailed in the [internal API documentation](http://xapple.github.io/crest4/crest4/classify#Classify).

### Test suite

To test that the installation was successful you can launch the test suite by executing:

crest4 --pytest

### Splitting computation

It is possible to run the sequence similarity search yourself without passing through the `crest4` executable. This is useful for instance if you want to run BLAST on a dedicated server for increased speed and only want to perform the taxonomic assignment on your local computer.

In such a case you just need to copy the hits file that was generated back to your local computer and specify its location with the following parameter:

crest4 -f sequences.fasta --search_hits=~/results/seq_search.hits

To create the hits file on a different server you should call the `blastn` executable with the following options:

blastn -query sequences.fasta -db ~/.crest4/silvamod128/silvamod128.fasta -num_alignments 100 -outfmt "7 qseqid sseqid bitscore length nident" -out seq_search.hits

We also recommend that you use `-num_threads` to enable multi-threading and speed up the alignments.
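
To give an idea of what the resulting hits file contains, here is a minimal Python sketch that reads it. BLAST's output format `7` is tab-separated, with comment lines starting with `#`, and the five columns follow the order given in the `-outfmt` string above. The function name `parse_hits` and the example rows are invented for illustration; this is not the parser used inside `crest4`.

```python
from collections import defaultdict

def parse_hits(lines):
    """Group BLAST hits by query sequence (hypothetical helper).

    Expects lines produced with:
    -outfmt "7 qseqid sseqid bitscore length nident"
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Format 7 interleaves '#' comment lines with the tabular rows
        if not line or line.startswith("#"):
            continue
        qseqid, sseqid, bitscore, length, nident = line.split("\t")
        hits[qseqid].append({
            "subject": sseqid,
            "bitscore": float(bitscore),
            "length": int(length),
            "nident": int(nident),
        })
    return dict(hits)

# Made-up rows, only to show the expected shape of the file
example = [
    "# Query: seq1",
    "seq1\tref_A\t512\t300\t297",
    "seq1\tref_B\t498\t300\t294",
    "# BLAST processed 1 queries",
]
print(parse_hits(example))
```

With a real file you would pass an open file handle instead of the `example` list, since iterating over a file also yields one line at a time.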


## Results

The results produced are as follows:




## More information

### Classification databases

The SilvaMod database was derived by manual curation of the [SILVA NR SSU Ref v.128](https://www.arb-silva.de/documentation/release-128/) release. It supports SSU sequences from bacteria and archaea (16S) as well as eukaryotes (18S), with a high level of manual curation and defined environmental clades. The database was last released in September 2016.

The [Greengenes](http://greengenes.secondgenome.com) database is an alternative reference for classification of prokaryotic 16S, curated and maintained by The Greengenes Database Consortium. The database was last released in May 2013.

### Classification algorithm

The classification is carried out on a subset of the best matching alignments using the [Lowest Common Ancestor](http://en.wikipedia.org/wiki/Lowest_common_ancestor) (LCA) strategy. Briefly, the subset includes sequences that score within x% of the bit-score of the best alignment, provided that the best score is above a minimum value. The default values are `155` for the minimum bit-score and `2%` for the LCA range. Based on cross-validation testing using the non-redundant SilvaMod database, these defaults result in relatively few false positives for most datasets. However, the LCA range can be increased to about `10%` to improve accuracy with short reads and for datasets with many novel sequences.
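
The LCA step can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation found in `crest4`: the function name and the representation of a lineage as a root-first tuple of names are hypothetical, the parameter names `min_score` and `score_drop` are borrowed from the options list on the assumption that they correspond to the two defaults above, and a best score exactly equal to the minimum is treated as acceptable.

```python
def lowest_common_ancestor(hits, min_score=155, score_drop=2.0):
    """Return the deepest lineage shared by all retained hits.

    `hits` is a list of (bitscore, lineage) pairs, where each lineage is a
    tuple of taxon names ordered from the root downwards (assumed layout).
    Returns None when no hit passes the minimum bit-score.
    """
    if not hits:
        return None
    best = max(score for score, _ in hits)
    # Reject the query outright if even its best hit scores too low
    if best < min_score:
        return None
    # Keep only the hits scoring within `score_drop` percent of the best one
    threshold = best * (1 - score_drop / 100)
    lineages = [lineage for score, lineage in hits if score >= threshold]
    # Walk down from the root while all retained lineages agree
    common = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return tuple(common)

# Made-up scores, only to illustrate the behavior
hits = [
    (500.0, ("Bacteria", "Cyanobacteria", "Oxyphotobacteria", "Synechococcales")),
    (495.0, ("Bacteria", "Cyanobacteria", "Oxyphotobacteria", "Nostocales")),
    (400.0, ("Bacteria", "Proteobacteria")),
]
print(lowest_common_ancestor(hits))
```

In this example the third hit falls outside the 2% range (the threshold is 490), so only the first two are compared and the assignment stops at the rank where they diverge.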

In addition to LCA classification, a minimum similarity filter is applied, based on a set of taxon-specific requirements that depend by default on taxonomic rank. A sequence must align with at least 99% nucleotide similarity to the best reference sequence in order to be classified to the species rank. For the genus, family, order, class and phylum ranks, the respective default cut-offs are 97%, 95%, 90%, 85% and 80%. These cut-offs can be changed manually by editing the `.map` file of the respective reference database. This filter ensures that each classification is made to the taxon of the lowest allowed rank, effectively re-assigning sequences to parent taxa until the requirement is met.
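
The re-assignment to parent taxa can be illustrated with a short Python sketch. The cut-off values are the defaults quoted above; everything else is an assumption made for the example: the function name, the lineage layout as root-first `(rank, name)` pairs, the placeholder taxon names, and the choice to accept ranks without a listed cut-off (such as the domain) unconditionally.

```python
# Default cut-offs quoted in the text, as fractions of identical positions
MIN_SIMILARITY = {
    "species": 0.99,
    "genus": 0.97,
    "family": 0.95,
    "order": 0.90,
    "class": 0.85,
    "phylum": 0.80,
}

def apply_similarity_filter(lineage, similarity):
    """Trim a lineage until its lowest rank is allowed (hypothetical helper).

    `lineage` is a root-first list of (rank, name) pairs and `similarity`
    is the fraction of identical nucleotides to the best reference.
    """
    kept = list(lineage)
    while kept:
        rank, _ = kept[-1]
        # Ranks without a listed cut-off are always accepted (assumption)
        if similarity >= MIN_SIMILARITY.get(rank, 0.0):
            break
        # Otherwise move the assignment up to the parent taxon
        kept.pop()
    return kept

# Placeholder names for the lower ranks, only to show the mechanics
lineage = [
    ("domain", "Bacteria"),
    ("phylum", "Cyanobacteria"),
    ("class", "Oxyphotobacteria"),
    ("order", "Synechococcales"),
    ("family", "Example_family"),
    ("genus", "Example_genus"),
]
print(apply_similarity_filter(lineage, 288 / 300))
```

Here a similarity of 288/300 (96%) is below the genus cut-off but above the family one, so the sequence ends up assigned at the family rank.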

When using amplicon sequences, we strongly recommend preparing the sequences by performing a noise reduction step as well as applying chimera removal. This can be achieved with various third party software: vsearch, UPARSE, DADA2, SWARM, etc.

For amplicon sequencing experiments with many replicates or similar samples (>~10), the unique noise-reduced sequences may be further clustered using a similarity threshold (often 97% although larger thresholds are probably preferable) into Operational Taxonomic Units (OTUs), prior to classification.

### Custom databases

It is possible to construct a custom reference database for use with `crest4`. The scripts necessary to do this, along with some documentation, are available in this other git repository:

<https://github.com/xapple/crest4_utils>

### Developer documentation

The internal documentation of the `crest4` python package is available at:

<http://xapple.github.io/crest4/crest4>

This documentation is simply generated from the source code with this command:

$ pdoc3 --html --output-dir docs --force crest4
16 changes: 16 additions & 0 deletions crest4/__init__.py
@@ -0,0 +1,16 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
The `crest4` python package can automatically assign taxonomic names to
DNA sequences obtained from environmental sequencing.
"""

# Special variables #
__version__ = '4.0.1'

# Constants #
project_url = 'https://github.com/xapple/crest4'

# Expose our main object at the module level #
from crest4.classify import Classify
21 changes: 21 additions & 0 deletions crest4/__main__.py
@@ -0,0 +1,21 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Written by Lucas Sinclair.
GNUv3 Licensed.
Contact at www.sinclair.bio
Created in May 2021.
"""

# Use the optmagic library to make a command line tool automatically #
from optmagic import OptMagic

# The main object of our package #
from crest4 import Classify

# The main function #
def main(): return OptMagic(Classify)()

# Execute #
if __name__ == "__main__": main()