---
title: "Microbiome of Amyotrophic Lateral Sclerosis (ALS) using 16S rRNA"
output:
github_document:
toc: true
toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(printr)
library(tictoc)
tic()
```
## Setting up
### Directory Structure
In the working directory, there is a file called `samples.tsv` with a list of the samples and the status (`ALS` vs `Control`) of each sample. The dataset, in **paired-end** FASTQ format (compressed `.fastq.gz`), is located under `data/original/`. There is also the `data/silva/` folder with the [SILVA](https://www.arb-silva.de/) database for taxonomic classification.
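A quick sanity check of this layout can be run from the working directory (a minimal sketch, assuming the paths described above):
```{r eval=FALSE}
# Sanity check of the expected directory layout
file.exists("data/samples.tsv")
head(list.files("data/original", pattern = "\\.fastq\\.gz$"))
list.files("data/silva")
```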
### Installing and Loading Packages
The main package required for this lesson is [`dada2`](https://doi.org/doi:10.18129/B9.bioc.dada2). If it is not installed already, it can be installed through [`BiocManager`](https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html) as follows:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2")
```
Also required for this lesson is [`phyloseq`](https://doi.org/doi:10.18129/B9.bioc.phyloseq). If it is not installed already, it can be installed through [`BiocManager`](https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html) as follows:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")
```
Additionally, [`tidyverse`](https://www.tidyverse.org/) will be required for data reading, handling, and plotting. Finally, [`digest`](https://cran.r-project.org/web/packages/digest/index.html) will be required only to prettify the names of the operational taxonomic units (OTUs). Both packages can be installed from the Comprehensive R Archive Network (CRAN) repository as follows:
```{r eval=FALSE}
install.packages(c("tidyverse", "digest"))
```
After all the required packages are successfully installed (this typically happens once per `R` installation), let us **load** these packages in the current `R` session (workspace):
```{r}
library(dada2)
library(phyloseq)
library(tidyverse)
library(digest)
```
### Loading Meta and Sequence Data
In loading the sample information (metadata), `read_tsv`'s parameter `col_types` takes the value `'cff'` to indicate that the first column is a *character* (string) column while the second and third columns are *factor* (categorical) columns - for more on data types in R, see the episode on [Data Types and Structures](https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/)
```{r}
samples = read_tsv("data/samples.tsv", col_types = 'cff')
glimpse(samples)
```
`glimpse` shows that there are 18 rows (samples) and that the three columns `sample`, `status`, and `pair` were loaded successfully as `chr`, `fct`, and `fct`, respectively.
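As a quick check of the study design, we can tabulate the number of samples per status group (a small sketch using the already-loaded `tidyverse`):
```{r eval=FALSE}
# Number of samples in each status group (ALS vs Control)
count(samples, status)
```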
Now let's start processing the FASTQ files. First, we will list all the files in the `original` subfolder.
```{r}
path = "data/original"
list.files(path)
```
The above list will be split into forward and reverse reads according to the FASTQ file names: file names ending with `1.fastq.gz` are the forward reads, and file names ending with `2.fastq.gz` are the reverse reads.
**The forward reads:**
```{r}
forward.original = sort(list.files(path, pattern="1.fastq.gz", full.names = TRUE))
head(forward.original)
```
**The reverse reads:**
```{r}
reverse.original = sort(list.files(path, pattern="2.fastq.gz", full.names = TRUE))
head(reverse.original)
```
The sample name is extracted from the FASTQ file name using [`strsplit`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strsplit) by taking the part of the file name before the underscore `_`. The extracted sample name should match the `sample` column (first column) in the metadata loaded above. With [`sapply`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/lapply), the extraction (`strsplit`) is applied to the vector of file names in `forward.original`.
```{r}
sample.names = sapply(strsplit(basename(forward.original), "_"), `[`, 1)
head(sample.names)
```
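The extracted names should line up with the metadata; a quick check (a sketch, assuming the `samples` tibble loaded earlier):
```{r eval=FALSE}
# Every extracted sample name should appear in the metadata
all(sample.names %in% samples$sample)
setdiff(sample.names, samples$sample) # should be empty
```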
## Quality Control
### Quality Profiles
#### Forward Reads
We will plot (using [`plotQualityProfile`](https://rdrr.io/bioc/dada2/man/plotQualityProfile.html)) the quality profile of *forward* reads in the first two samples
```{r}
plotQualityProfile(forward.original[1:2])
```
#### Reverse Reads
Then the quality profile of *reverse* reads of the same first two samples above
```{r}
plotQualityProfile(reverse.original[1:2])
```
As expected, the forward reads are typically of higher quality across most of the read length, while for the reverse reads the quality drops about half-way through the read (starting approximately at the 100th nucleotide).
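To look beyond the first two samples, `plotQualityProfile` can also aggregate all files into a single summary plot (a sketch; this can be slow on large datasets):
```{r eval=FALSE}
# Aggregate the quality profiles of all forward reads into one plot
plotQualityProfile(forward.original, aggregate = TRUE)
```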
### Filtering and Trimming
Here, we only define the names (`.1.filtered.fastq.gz` and `.2.filtered.fastq.gz` for the *forward* and *reverse* reads, respectively) and the destination subfolder `data/filtered` of the filtered FASTQ files. The sample names (`sample.names`) are assigned as the names of the two file vectors:
```{r}
forward.filtered = file.path("data/filtered", paste0(sample.names, ".1.filtered.fastq.gz"))
reverse.filtered = file.path("data/filtered", paste0(sample.names, ".2.filtered.fastq.gz"))
names(forward.filtered) = sample.names
names(reverse.filtered) = sample.names
```
Now filter and trim using [`filterAndTrim`](https://rdrr.io/bioc/dada2/man/filterAndTrim.html), which takes input FASTQ files and generates the corresponding output FASTQ files.
```{r}
out = filterAndTrim(forward.original, forward.filtered,
reverse.original, reverse.filtered,
minLen = 150, # Pretty stringent but to show difference between the in and out
multithread = TRUE) # In case of Windows OS, multithread should be FALSE (which is the default)
head(out)
```
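The `out` matrix reports the reads going in and coming out of the filter for each file; the retained fraction per sample is easy to compute (a minimal sketch using the `reads.in`/`reads.out` columns returned by `filterAndTrim`):
```{r eval=FALSE}
# Fraction of reads retained per sample after filtering and trimming
round(out[, "reads.out"] / out[, "reads.in"], 3)
```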
## Sample Composition
Now it comes to the nuts and bolts of using **DADA2** to infer sample composition.
### Learn Error Rates
**DADA2** uses a parametric error model learned from each amplicon dataset. The error learning is performed by [`learnErrors`](https://rdrr.io/bioc/dada2/man/learnErrors.html) directly from the *data*. The approach is to alternate between estimating the error rates and inferring the sample composition until the two converge.
#### Forward Reads Errors
```{r warning=FALSE}
forward.errors = learnErrors(forward.filtered, multithread = TRUE)
plotErrors(forward.errors, nominalQ = TRUE) # Plot observed frequency of each transition
```
#### Reverse Reads Errors
```{r warning=FALSE}
reverse.errors = learnErrors(reverse.filtered, multithread = TRUE)
plotErrors(reverse.errors, nominalQ = TRUE) # Plot observed frequency of each transition
```
### Sample Inference
Now it is time to apply the sample inference algorithm, not surprisingly the main method is called [`dada`](https://rdrr.io/bioc/dada2/man/dada.html) :wink: This method removes all estimated errors from the filtered sequence reads and estimates the composition of each sample.
#### Forward Sample Inference
```{r}
forward.dada = dada(forward.filtered, err = forward.errors, multithread = TRUE)
head(forward.dada)
```
#### Reverse Sample Inference
```{r}
reverse.dada = dada(reverse.filtered, err = reverse.errors, multithread = TRUE)
head(reverse.dada)
```
### Merge Paired Reads
Now the denoised forward and reverse mates of the paired-end reads are merged into full sequences using [`mergePairs`](https://rdrr.io/bioc/dada2/man/mergePairs.html). To merge, the reverse read is *reverse-complemented* and aligned to the forward read. Paired reads that do not overlap exactly are removed from the merged output.
```{r}
mergers = mergePairs(forward.dada, forward.filtered, reverse.dada, reverse.filtered, verbose = TRUE)
head(mergers[[1]]) # Inspect the merger data.frame of the first sample
```
*From the documentation of the method*:
The returned `data.frame`(s) has a row for each unique pairing of forward/reverse denoised sequences, and the following columns (a small sketch of how to use them follows this list):
- `$abundance`: Number of reads corresponding to this forward/reverse combination.
- `$sequence`: The merged sequence.
- `$forward`: The index of the forward denoised sequence.
- `$reverse`: The index of the reverse denoised sequence.
- `$nmatch`: Number of matching nucleotides in the overlap region.
- `$nmismatch`: Number of mismatches in the overlap region.
- `$nindel`: Number of indels in the overlap region.
- `$prefer`: The sequence used for the overlap region. 1=forward; 2=reverse.
- `$accept`: `TRUE` if overlap between forward and reverse denoised sequences was at least `minOverlap` and had at most `maxMismatch` differences. `FALSE` otherwise.
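As a quick sketch of how these columns can be used, the overlap length of each merged pair is `nmatch + nmismatch`, so its distribution for the first sample is:
```{r eval=FALSE}
# Distribution of overlap lengths in the merged pairs of the first sample
summary(mergers[[1]]$nmatch + mergers[[1]]$nmismatch)
```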
## Construct Sequence Table
<a href="https://en.wikipedia.org/wiki/Amplicon_sequence_variant"><img src="https://upload.wikimedia.org/wikipedia/commons/8/8f/ASVs_vs_OTU.png" title="ASVs vs OTUs by Benjamin John Callahan" align="right" width="300px"></a>
We are almost there! Now we can build the [amplicon sequence variant](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) ([ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant)) table. The ASV (DADA2's) approach provides a more precise, tractable, reproducible, and comprehensive estimate of sample composition than the traditional [operational taxonomic unit](https://en.wikipedia.org/wiki/Operational_taxonomic_unit) ([OTU](https://en.wikipedia.org/wiki/Operational_taxonomic_unit)) approach. For more on the topic, read the [microbiome informatics: OTU vs. ASV](https://www.zymoresearch.com/blogs/blog/microbiome-informatics-otu-vs-asv) blog post by [Zymo Research](https://www.zymoresearch.com/).
The construction of the ASV table is performed using [`makeSequenceTable`](https://rdrr.io/bioc/dada2/man/makeSequenceTable.html), which takes the list of per-sample `data.frame`s, each listing the ASVs and their abundances, and identifies the unique ASVs across all samples.
```{r}
seqtab = makeSequenceTable(mergers)
```
The dimensions of the generated sequences (ASVs) table are:
```{r}
dim(seqtab)
```
```{r echo=FALSE, warning=FALSE, message=FALSE}
m1 = nrow(seqtab)
n1 = ncol(seqtab)
```
As shown above, the dimensions of the `seqtab` `matrix` are `r m1` $\times$ `r n1`, indicating that there are `r n1` unique ASVs across the `r m1` processed samples.
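It can also be useful to inspect the length distribution of the recovered sequences, e.g. to spot off-target amplicons (a quick sketch using dada2's `getSequences`):
```{r eval=FALSE}
# Length distribution of the unique ASVs
table(nchar(getSequences(seqtab)))
```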
## Removal of Chimeras
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b3/Chimera_Apulia_Louvre_K362.jpg" title="Chimera Apulia Louvre" align="right" width="200px">
ASVs constructed from two parent sequences (*bimeras*) can now be identified and removed from the denoised ASVs using the [`removeBimeraDenovo`](https://rdrr.io/bioc/dada2/man/removeBimeraDenovo.html) method:
```{r}
seqtab.nochim = removeBimeraDenovo(seqtab,
multithread = TRUE,
verbose = TRUE)
```
After the removal of the bimeras (chimeric sequences), the final dimensions of the ASVs table are:
```{r}
dim(seqtab.nochim)
```
```{r echo=FALSE, warning=FALSE, message=FALSE}
m2 = nrow(seqtab.nochim)
n2 = ncol(seqtab.nochim)
```
`r m2` $\times$ `r n2`, so `r n1 - n2` bimeras were identified and removed. The fraction of reads (overall abundance, not only the unique sequences) remaining after the removal of the bimeras is `r scales::percent(sum(seqtab.nochim)/sum(seqtab))`
### Track reads through the pipeline
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
```{r}
getN = function(x) sum(getUniques(x))
track = cbind(out, sapply(forward.dada, getN), sapply(reverse.dada, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(forward.dada, getN) with getN(forward.dada)
colnames(track) = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) = sample.names
head(track)
```
Looks good! We kept the majority of our raw reads, and there is no overly large drop associated with any single step.
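If proportions are easier to scan, each column of `track` can be divided by the input counts (a minimal sketch on the matrix built above):
```{r eval=FALSE}
# Percentage of the input reads surviving each step of the pipeline
round(100 * track / track[, "input"], 1)
```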
## Taxonomic Classification
### Assign Taxonomy
Using the [Naïve Bayesian Classifier](http://www.ncbi.nlm.nih.gov/pubmed/17586664) algorithm, **DADA2** assigns taxonomic classifications to the sequence variants with the [`assignTaxonomy`](https://rdrr.io/bioc/dada2/man/assignTaxonomy.html) method, which also takes as input a formatted reference FASTA file. A copy of the SILVA training set is located under the `data/silva` subfolder; it can also be downloaded from here <a href="https://doi.org/10.5281/zenodo.3986799"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3986799.svg" alt="DOI"></a>
```{r}
taxa = assignTaxonomy(seqtab.nochim,
"data/silva/silva_nr99_v138_train_set.fa.gz",
multithread = TRUE,
verbose = TRUE)
head(taxa)
```
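Because the row names of `taxa` are the full ASV sequences, printing is unwieldy; a common trick is to drop them on a throwaway copy for inspection (a sketch):
```{r eval=FALSE}
# Inspect the assignments without the long sequence row names
taxa.print = taxa
rownames(taxa.print) = NULL
head(taxa.print)
```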
### Assign Species
In addition to the above taxonomic classification, *DADA2* can also assign taxonomy at the species level through the [`addSpecies`](https://rdrr.io/bioc/dada2/man/addSpecies.html) method, which takes the taxonomic table `taxa` computed above and the reference FASTA file (`data/silva/silva_species_assignment_v138.fa.gz`).
```{r}
taxa = addSpecies(taxa, "data/silva/silva_species_assignment_v138.fa.gz")
head(taxa)
```
### Digest ASVs
The following only converts the ASVs (the actual sequences) into cryptographic hashes using the [`digest`](https://rdrr.io/cran/digest/) method, which by default uses the [MD5 message-digest](https://en.wikipedia.org/wiki/MD5) algorithm - *this is an entirely optional step*
```{r}
md5 = lapply(colnames(seqtab.nochim), digest) # MD5 hash of each ASV sequence
colnames(seqtab.nochim) = md5                 # Replace the sequences with their hashes
otus = as.matrix(seqtab.nochim)
colnames(otus) = md5
otus[1:2, 1:2]
```
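If the original sequences are needed later, it is worth keeping a lookup table before the row names of `taxa` are digested in the next chunk (a sketch, assuming `assignTaxonomy` returned its rows in the same order as the columns of `seqtab.nochim`):
```{r eval=FALSE}
# Map each MD5 hash back to its full ASV sequence for later reference
asv_lookup = tibble(md5 = unlist(md5), sequence = rownames(taxa))
head(asv_lookup)
```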
```{r}
rownames(taxa) = lapply(rownames(taxa), digest) # Digest the taxa row names (ASV sequences) too
head(taxa)
```
## Phyloseq
The phyloseq R package is a powerful framework for further analysis of microbiome data. We now demonstrate how to straightforwardly import the tables produced by the DADA2 pipeline into phyloseq, together with the sample metadata (`ALS` vs `Control` status, and the pairing) loaded earlier.
### Import into phyloseq:
phyloseq expects the sample metadata as a `data.frame` whose row names match the sample names, so we convert the `samples` tibble accordingly:
```{r}
samples = as.data.frame(samples)
row.names(samples) = samples$sample
samples
```
We now construct a phyloseq object directly from the dada2 outputs.
```{r}
ps = phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), sample_data(samples), tax_table(taxa))
ps
```
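A few phyloseq accessors give a quick overview of the object (a minimal sketch):
```{r eval=FALSE}
# Quick overview of the phyloseq object
nsamples(ps)
ntaxa(ps)
sample_variables(ps)
rank_names(ps)
```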
We are now ready to use phyloseq!
Visualize alpha-diversity:
```{r}
plot_richness(ps, x="status", measures=c("Shannon", "Simpson"), color = "pair")
```
There are some differences in alpha-diversity between the ALS and Control samples. Next, we compute a few common count transformations (rank-threshold, log, and relative abundance) for downstream use:
```{r}
ps_rank = transform_sample_counts(ps, threshrankfun(50))
ps_log = transform_sample_counts(ps, log)
ps_norm = transform_sample_counts(ps, function(x) x / sum(x))
```
### Ordinate:
Transform the data to proportions, as appropriate for Bray-Curtis distances:
```{r}
ord.nmds.bray = ordinate(ps_norm, method="NMDS", distance="bray")
plot_ordination(ps, ord.nmds.bray, color="status", title="Bray NMDS") + geom_point(size = 3)
```
The ordination shows how the ALS and Control samples relate to each other in Bray-Curtis space.
### Bar plot
```{r}
ps
ps_norm = transform_sample_counts(ps, function(x) x / sum(x) )
ps_filtered = filter_taxa(ps_norm, function(x) mean(x) > 1e-3, prune = TRUE)
ps_filtered
```
```{r}
plot_bar(ps_filtered, x="status", fill="Phylum") + geom_bar(aes(fill=Phylum), stat="identity", position="stack", color = "white")
```
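A per-sample view, split by status, can make within-group variation easier to see (a hedged variation on the plot above, using ggplot2's `facet_wrap`):
```{r eval=FALSE}
# Per-sample composition, faceted by status
plot_bar(ps_filtered, x = "sample", fill = "Phylum") +
  facet_wrap(~ status, scales = "free_x")
```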
Normalize the number of reads in each sample using the median sequencing depth.
```{r}
total = median(sample_sums(ps))
standf = function(x, t=total) round(t * (x / sum(x)))
ps2 = transform_sample_counts(ps, standf)
ps2
```
```{r}
top20 = names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 = transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 = prune_taxa(top20, ps.top20)
plot_bar(ps.top20, x="status", fill="Phylum")
```
---
```{r}
toc()
```
---
## Session Info
```{r}
sessionInfo()
```