Skip to content

Commit d3d08c1

Browse files
committed
added kraken, bracken and pavian
1 parent 6934a60 commit d3d08c1

File tree

1 file changed

+98
-21
lines changed

1 file changed

+98
-21
lines changed

wms.md

Lines changed: 98 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -14,11 +14,13 @@ In this tutorial you'll analyze a sample from Pig Gut Metagenome.
1414

1515
### Introduction
1616

17-
#### The Pig Microbiome
17+
#### Microbiome used
18+
19+
In this tutorial we will compare samples from the Pig Gut Microbiome to samples from the Human Gut Microbiome. Below you'll find a brief description of the two projects:
1820

1921
> Pig is a main species for livestock and biomedicine. The pig genome sequence was recently reported. To boost research, we established a catalogue of the genes of the gut microbiome based on faecal samples of 287 pigs from France, Denmark and China. More than 7.6 million non-redundant genes representing 719 metagenomic species were identified by deep metagenome sequencing, highlighting more similarities with the human than with the mouse catalogue. The pig and human catalogues share only 12.6 and 9.3 % of their genes, respectively, but 70 and 95% of their functional pathways. The pig gut microbiota is influenced by gender, age and breed. Analysis of the prevalence of antibiotics resistance genes (ARGs) reflected antibiotics supplementation in each farm system, and revealed that non-antibiotics-fed animals still harbour ARGs. The pig catalogue creates a resource for whole metagenomics-based studies, highly valuable for research in biomedicine and for sustainable knowledge-based pig farming
2022
21-
To speed up the analysis, we'll only use the first 60K reads from the first sample of the study. The full samples are accessible under BioProject [PRJEB11755](http://www.ncbi.nlm.nih.gov/bioproject/308698)
23+
> We are facing a global metabolic health crisis provoked by an obesity epidemic. Here we report the human gut microbial composition in a population sample of 123 non-obese and 169 obese Danish individuals. We find two groups of individuals that differ by the number of gut microbial genes and thus gut bacterial richness. They harbour known and previously unknown bacterial species at different proportions; individuals with a low bacterial richness (23% of the population) are characterized by more marked overall adiposity, insulin resistance and dyslipidaemia and a more pronounced inflammatory phenotype when compared with high bacterial richness individuals. The obese individuals among the former also gain more weight over time. Only a few bacterial species are sufficient to distinguish between individuals with high and low bacterial richness, and even between lean and obese. Our classifications based on variation in the gut microbiome identify subsets of individuals in the general white adult population who may be at increased risk of progressing to adiposity-associated co-morbidities
2224

2325
#### Whole Metagenome Sequencing
2426

@@ -30,16 +32,18 @@ The choice of shotgun or 16S approaches is usually dictated by the nature of the
3032

3133
* [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
3234
* [Kraken](https://ccb.jhu.edu/software/kraken/)
33-
* kraken_to_krona (script part of [MetLab](https://github.com/norling/metlab))
34-
* [KronaTools](https://github.com/marbl/Krona/wiki)
35+
<!-- * [scythe](https://github.com/vsbuffalo/scythe) -->
36+
<!-- * kraken_to_krona (script part of [MetLab](https://github.com/norling/metlab))
37+
* [KronaTools](https://github.com/marbl/Krona/wiki) -->
3538

3639
### Getting the Data and Checking their Quality
3740

38-
If you are reading this tutorial online and haven't cloned the directory, first download and unzip the data:
41+
If you are reading this tutorial online and haven't cloned the directory, first download and unpack the data:
3942

4043
```
41-
wget https://github.com/HadrienG/tutorials/raw/master/data/DHN_Pig_60K.fastq.gz
42-
gzip -d DHN_Pig_60K.fastq.gz
44+
wget http://77.235.253.14/metlab/wms.tar
45+
tar xvf wms.tar
46+
cd wms
4347
```
4448

4549
We'll use FastQC to check the quality of our data. FastQC can be downloaded and
@@ -49,6 +53,10 @@ Start FastQC and select the fastq file you just downloaded with `file -> open`
4953
What do you think about the quality of the reads? Do they need trimming? Is there still adapters
5054
present? Overrepresented sequences?
5155

56+
Alternatively, run fastqc on the command-line:
57+
58+
`fastqc *.fastq.gz`
59+
5260
If the quality appears to be that good, it's because it was probably the cleaned reads that were deposited into SRA.
5361
We can directly move to the classification step.
5462

@@ -62,37 +70,106 @@ approaches (i.e. Blast)
6270

6371
By default, the authors of kraken built their database based on RefSeq Bacteria, Archea and Viruses. We'll use it for the purpose of this tutorial.
6472

65-
```
73+
**NOTE: The database may have been installed already! Ask your instructor!**
74+
75+
```bash
76+
# You might not need this step (example if you're working on Uppmax!)
6677
wget https://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
6778
tar xzf minikraken.tgz
79+
$KRAKEN_DB=minikraken_20141208
6880
```
6981

7082
Now run kraken on the reads
7183

72-
`kraken --db minikraken_20141208/ --threads 8 --fastq-input DHN_Pig_60K.fastq > DHN_Pig.tab`
84+
```bash
85+
mkdir kraken_results
86+
for i in *_1.fastq.gz
87+
do
88+
prefix=$(basename $i _1.fastq.gz)
89+
kraken --db $KRAKEN_DB --threads 2 --fastq-input --gzip-compressed \
90+
${prefix}_1.fastq.gz ${prefix}_2.fastq.gz > kraken_results/${prefix}.tab
91+
kraken-report --db $KRAKEN_DB \
92+
kraken_results/${prefix}.tab > kraken_results/${prefix}_tax.txt
93+
done
94+
```
7395

74-
which produces a tab-delimited file with an assigned TaxID for each read. Kraken includes a script called `kraken-report` to transform this file into a "tree" view with the percentage of reads assigned to each taxa.
96+
which produces a tab-delimited file with an assigned TaxID for each read.
7597

76-
`kraken-report --db minikraken_20141208/ DHN_Pig.tab > DHN_Pig_tax.txt`
98+
Kraken includes a script called `kraken-report` to transform this file into a "tree" view with the percentage of reads assigned to each taxa. We've run this script at each step in the loop. Take a look at the `_tax.txt` files!
7799

78-
Open this file and take a look!
100+
### Abundance estimation using Bracken
79101

80-
### Visualization
102+
Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample
81103

82-
We'll visualize the composition of our datasets using Krona.
104+
Before starting, you need to install Bracken:
83105

84-
Get the script to transform the kraken results in a format Krona can understand
106+
```bash
107+
cd
108+
git clone https://github.com/jenniferlu717/Bracken.git
109+
chmod 755 Bracken/*.py
110+
chmod 755 Bracken/*.pl
111+
export PATH=$PATH:$HOME/Bracken
112+
```
113+
114+
Unfortunately, Uppmax lacks some perl packages necessary for Bracken to work:
115+
116+
Follow the tutorial [here](http://www.uppmax.uu.se/support/faq/software-faq/installing-local-perl-packages/) to install `cpanm`
117+
118+
then install the two perl libraries that are missing:
85119

120+
```bash
121+
cpanm Parallel::ForkManager
122+
cpanm List::MoreUtils
86123
```
87-
wget https://raw.githubusercontent.com/norling/metlab/master/pipeline_scripts/kraken_to_krona.py
88-
chmod 755 kraken_to_krona.py
124+
125+
Three steps are necessary to set up Kraken abundance estimation.
126+
127+
1. Classify all reads using Kraken and Generate a Kraken report file. We've done this!
128+
129+
2. Search all library input sequences against the database and compute the classifications for each perfect read of ${READ_LENGTH} base pairs from one of the input sequences.
130+
131+
132+
```bash
133+
find -L $KRAKEN_DB/library -name "*.fna" -o -name "*.fa" -o -name "*.fasta" > genomes.list
134+
cat $(grep -v '^#' genomes.list) > genomes.fasta
135+
kraken --db=${KRAKEN_DB} --fasta-input --threads=10 kraken.fasta > database.kraken
136+
count-kmer-abundances.pl --db=${KRAKEN_DB} --read-length=100 database.kraken > database100mers.kraken_cnts
89137
```
90138

91-
Run the script and Krona
139+
3. Generate the kmer distribution file
92140

141+
```bash
142+
python generate_kmer_distribution.py -i database100mers.kraken_cnts -o KMER_DISTR.TXT
93143
```
94-
./kraken_to_krona.py DHN_Pig_tax.txt
95-
ktImportText DHN_Pig_tax.krona.in
144+
145+
Now, given the expected kmer distribution for genomes in a kraken database along
146+
with a kraken report file, the number of reads belonging to each species (or
147+
genus) is estimated using the estimate_abundance.py file, run with the
148+
following command line:
149+
150+
`python estimate_abundance.py -i KRAKEN.REPORT -k KMER_DISTR.TXT -o OUTPUT_FILE.TXT`
151+
152+
Run this command for the six `_tax.txt` files that you generated with kraken!
153+
154+
The following required parameters must be specified:
155+
- KRAKEN.REPORT :: the kraken report generated for a given dataset
156+
- KMER_DISTR.TXT :: the file generated by generate_kmer_distribution.py
157+
- OUTPUT_FILE.TXT :: the desired name of the output file to be generated by the code
158+
159+
### Visualization
160+
161+
#### Alternative 1: Pavian
162+
163+
Pavian is a web application for exploring metagenomics classification results.
164+
165+
Install and run Pavian:
166+
167+
(In R or Rstudio)
168+
169+
```R
170+
## Installs required packages from CRAN and Bioconductor
171+
source("https://raw.githubusercontent.com/fbreitwieser/pavian/master/inst/shinyapp/install-pavian.R")
172+
pavian::runApp(port=5000)
96173
```
97174

98-
And open the generated html file in your browser! What are the most abundant taxa?
175+
Pavian will be available at http://127.0.0.1:5000 .

0 commit comments

Comments
 (0)