---
title: "Microbiome of Amyotrophic Lateral Sclerosis (ALS) using 16S rRNA"
output:
github_document:
toc: true
toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(printr)
library(tictoc)
tic()
```
## Setting up
### Directory Structure
In the working directory, there is a file called `samples.tsv` with a list of the samples and the status (`ALS` vs `Control`) of each sample. The dataset, in **paired-end** FASTQ format (compressed `.fastq.gz`), is located under `data/original/`. There is also the `data/silva/` folder with the [SILVA](https://www.arb-silva.de/) database for taxonomic classification.
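A quick sanity check of this layout can be run from the working directory (a minimal sketch, assuming the paths described above):
```{r eval=FALSE}
# Sanity check of the expected directory layout
file.exists("data/samples.tsv")
head(list.files("data/original", pattern = "\\.fastq\\.gz$"))
list.files("data/silva")
```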
### Installing and Loading Packages
The main package required for this lesson is [`dada2`](https://doi.org/doi:10.18129/B9.bioc.dada2). If it is not installed already, it can be installed through [`BiocManager`](https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html) as follows:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2")
```
Also required for this lesson is [`phyloseq`](https://doi.org/doi:10.18129/B9.bioc.phyloseq). If it is not installed already, it can be installed through [`BiocManager`](https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html) as follows:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")
```
Additionally, [`tidyverse`](https://www.tidyverse.org/) will be required for data reading, handling, and plotting. Finally, [`digest`](https://cran.r-project.org/web/packages/digest/index.html) will be required only to prettify the names of the operational taxonomic units (OTUs). Both packages can be installed from the Comprehensive R Archive Network (CRAN) repository as follows:
```{r eval=FALSE}
install.packages(c("tidyverse", "digest"))
```
After all the required packages are successfully installed (this typically happens once per `R` installation), let us **load** these packages in the current `R` session (workspace):
```{r}
library(dada2)
library(phyloseq)
library(tidyverse)
library(digest)
```
### Loading Meta and Sequence Data
In loading the sample information (metadata), `read_tsv`'s parameter `col_types` takes the value `'cff'` to indicate that the first column is a *character* (string) column while the second and third columns are *factor* (categorical) columns - for more on data types in R, see the episode on [Data Types and Structures](https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/)
```{r}
samples = read_tsv("data/samples.tsv", col_types = 'cff')
glimpse(samples)
```
`glimpse` shows that there are 18 rows (samples) and that the three columns `sample`, `status`, and `pair` were loaded successfully as `chr`, `fct`, and `fct`, respectively.
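As a quick check of the study design, we can tabulate the number of samples per status group (a small sketch using the already-loaded `tidyverse`):
```{r eval=FALSE}
# Number of samples in each status group (ALS vs Control)
count(samples, status)
```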
Now let's start processing the FASTQ files. First, we will list all the files in the `original` subfolder.
```{r}
path = "data/original"
list.files(path)
```
The above list will be split into forward and reverse reads according to the FASTQ file names: file names ending with `1.fastq.gz` are the forward reads, and file names ending with `2.fastq.gz` are the reverse reads.
**The forward reads:**
```{r}
forward.original = sort(list.files(path, pattern="1.fastq.gz", full.names = TRUE))
head(forward.original)
```
**The reverse reads:**
```{r}
reverse.original = sort(list.files(path, pattern="2.fastq.gz", full.names = TRUE))
head(reverse.original)
```
The sample name is extracted from the FASTQ file name using [`strsplit`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strsplit) by taking the part of the file name before the underscore `_`. The extracted sample name should match the `sample` column (first column) in the metadata loaded above. With [`sapply`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/lapply), the extraction (`strsplit`) is applied to the vector of file names in `forward.original`.
```{r}
sample.names = sapply(strsplit(basename(forward.original), "_"), `[`, 1)
head(sample.names)
```
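The extracted names should line up with the metadata; a quick check (a sketch, assuming the `samples` tibble loaded earlier):
```{r eval=FALSE}
# Every extracted sample name should appear in the metadata
all(sample.names %in% samples$sample)
setdiff(sample.names, samples$sample) # should be empty
```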
## Quality Control
### Quality Profiles
#### Forward Reads
We will plot (using [`plotQualityProfile`](https://rdrr.io/bioc/dada2/man/plotQualityProfile.html)) the quality profile of *forward* reads in the first two samples
```{r}
plotQualityProfile(forward.original[1:2])
```
#### Reverse Reads
Then the quality profile of *reverse* reads of the same first two samples above
```{r}
plotQualityProfile(reverse.original[1:2])
```
As expected, the forward reads are typically of higher quality across most of the read length, while for the reverse reads the quality drops about half-way through the read (starting approximately at the 100th nucleotide).
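To look beyond the first two samples, `plotQualityProfile` can also aggregate all files into a single summary plot (a sketch; this can be slow on large datasets):
```{r eval=FALSE}
# Aggregate the quality profiles of all forward reads into one plot
plotQualityProfile(forward.original, aggregate = TRUE)
```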
### Filtering and Trimming
Here, we only define the names (`.1.filtered.fastq.gz` and `.2.filtered.fastq.gz` for the *forward* and *reverse* reads, respectively) and the destination subfolder `data/filtered` of the filtered FASTQ files. The sample names (`sample.names`) are assigned as the names of the two file vectors:
```{r}
forward.filtered = file.path("data/filtered", paste0(sample.names, ".1.filtered.fastq.gz"))
reverse.filtered = file.path("data/filtered", paste0(sample.names, ".2.filtered.fastq.gz"))
names(forward.filtered) = sample.names
names(reverse.filtered) = sample.names
```
Now filter and trim using [`filterAndTrim`](https://rdrr.io/bioc/dada2/man/filterAndTrim.html), which takes input FASTQ files and generates the corresponding output FASTQ files.
```{r}
out = filterAndTrim(forward.original, forward.filtered,
reverse.original, reverse.filtered,
minLen = 150, # Pretty stringent but to show difference between the in and out
multithread = TRUE) # In case of Windows OS, multithread should be FALSE (which is the default)
head(out)
```
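The `out` matrix reports the reads going in and coming out of the filter for each file; the retained fraction per sample is easy to compute (a minimal sketch using the `reads.in`/`reads.out` columns returned by `filterAndTrim`):
```{r eval=FALSE}
# Fraction of reads retained per sample after filtering and trimming
round(out[, "reads.out"] / out[, "reads.in"], 3)
```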
## Sample Composition
Now it comes to the nuts and bolts of using **DADA2** to infer sample composition.
### Learn Error Rates
**DADA2** uses a parametric error model learned from each amplicon dataset. The error learning is performed by [`learnErrors`](https://rdrr.io/bioc/dada2/man/learnErrors.html) directly from the *data*. The approach is to alternate between estimating the error rates and inferring the sample composition until the two converge.
#### Forward Reads Errors
```{r warning=FALSE}
forward.errors = learnErrors(forward.filtered, multithread = TRUE)
plotErrors(forward.errors, nominalQ = TRUE) # Plot observed frequency of each transition
```
#### Reverse Reads Errors
```{r warning=FALSE}
reverse.errors = learnErrors(reverse.filtered, multithread = TRUE)
plotErrors(reverse.errors, nominalQ = TRUE) # Plot observed frequency of each transition
```
### Sample Inference
Now it is time to apply the sample inference algorithm, not surprisingly the main method is called [`dada`](https://rdrr.io/bioc/dada2/man/dada.html) :wink: This method removes all estimated errors from the filtered sequence reads and estimates the composition of each sample.
#### Forward Sample Inference
```{r}
forward.dada = dada(forward.filtered, err = forward.errors, multithread = TRUE)
head(forward.dada)
```
#### Reverse Sample Inference
```{r}
reverse.dada = dada(reverse.filtered, err = reverse.errors, multithread = TRUE)
head(reverse.dada)
```
### Merge Paired Reads
Now the denoised forward and reverse mates of the paired-end reads are merged into full sequences using [`mergePairs`](https://rdrr.io/bioc/dada2/man/mergePairs.html). To merge, the reverse read is *reverse-complemented* and aligned to the forward read. Paired reads that do not overlap exactly are removed from the merged output.
```{r}
mergers = mergePairs(forward.dada, forward.filtered, reverse.dada, reverse.filtered, verbose = TRUE)
head(mergers[[1]]) # Inspect the merger data.frame of the first sample
```
*From the documentation of the method*:
The returned `data.frame`(s) has a row for each unique pairing of forward/reverse denoised sequences, and the following columns (a small sketch of how to use them follows this list):
- `$abundance`: Number of reads corresponding to this forward/reverse combination.
- `$sequence`: The merged sequence.
- `$forward`: The index of the forward denoised sequence.
- `$reverse`: The index of the reverse denoised sequence.
- `$nmatch`: Number of matching nucleotides in the overlap region.
- `$nmismatch`: Number of mismatches in the overlap region.
- `$nindel`: Number of indels in the overlap region.
- `$prefer`: The sequence used for the overlap region. 1=forward; 2=reverse.
- `$accept`: `TRUE` if overlap between forward and reverse denoised sequences was at least `minOverlap` and had at most `maxMismatch` differences. `FALSE` otherwise.
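As a quick sketch of how these columns can be used, the overlap length of each merged pair is `nmatch + nmismatch`, so its distribution for the first sample is:
```{r eval=FALSE}
# Distribution of overlap lengths in the merged pairs of the first sample
summary(mergers[[1]]$nmatch + mergers[[1]]$nmismatch)
```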
## Construct Sequence Table
<a href="https://en.wikipedia.org/wiki/Amplicon_sequence_variant"><img src="https://upload.wikimedia.org/wikipedia/commons/8/8f/ASVs_vs_OTU.png" title="ASVs vs OTUs by Benjamin John Callahan" align="right" width="300px"></a>
We are almost there! Now we can build the [amplicon sequence variant](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) ([ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant)) table. The ASV (DADA2's) approach provides a more precise, tractable, reproducible, and comprehensive estimate of sample composition than the traditional [operational taxonomic unit](https://en.wikipedia.org/wiki/Operational_taxonomic_unit) ([OTU](https://en.wikipedia.org/wiki/Operational_taxonomic_unit)) approach. For more on the topic, read the [microbiome informatics: OTU vs. ASV](https://www.zymoresearch.com/blogs/blog/microbiome-informatics-otu-vs-asv) blog post by [Zymo Research](https://www.zymoresearch.com/).
The construction of the ASV table is performed using [`makeSequenceTable`](https://rdrr.io/bioc/dada2/man/makeSequenceTable.html), which takes the list of per-sample `data.frame`s, each listing the ASVs and their abundances, and identifies the unique ASVs across all samples.
```{r}
seqtab = makeSequenceTable(mergers)
```
The dimensions of the generated sequences (ASVs) table are:
```{r}
dim(seqtab)
```
```{r echo=FALSE, warning=FALSE, message=FALSE}
m1 = nrow(seqtab)
n1 = ncol(seqtab)
```
As shown above, the dimensions of the `seqtab` `matrix` are `r m1` $\times$ `r n1`, indicating that there are `r n1` unique ASVs across the `r m1` processed samples.
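It can also be useful to inspect the length distribution of the recovered sequences, e.g. to spot off-target amplicons (a quick sketch using dada2's `getSequences`):
```{r eval=FALSE}
# Length distribution of the unique ASVs
table(nchar(getSequences(seqtab)))
```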
## Removal of Chimeras
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b3/Chimera_Apulia_Louvre_K362.jpg" title="Chimera Apulia Louvre" align="right" width="200px">
ASVs constructed from two parent sequences (*bimeras*) can now be identified and removed from the denoised ASVs using the [`removeBimeraDenovo`](https://rdrr.io/bioc/dada2/man/removeBimeraDenovo.html) method:
```{r}
seqtab.nochim = removeBimeraDenovo(seqtab,
multithread = TRUE,
verbose = TRUE)
```
After the removal of the bimeras (chimeric sequences), the final dimensions of the ASVs table are:
```{r}
dim(seqtab.nochim)
```
```{r echo=FALSE, warning=FALSE, message=FALSE}
m2 = nrow(seqtab.nochim)
n2 = ncol(seqtab.nochim)
```
`r m2` $\times$ `r n2`, so `r n1 - n2` bimeras were identified and removed. The fraction of reads (overall abundance, not only the unique sequences) remaining after the removal of the bimeras is `r scales::percent(sum(seqtab.nochim)/sum(seqtab))`
### Track reads through the pipeline
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
```{r}
getN = function(x) sum(getUniques(x))
track = cbind(out, sapply(forward.dada, getN), sapply(reverse.dada, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(forward.dada, getN) with getN(forward.dada)
colnames(track) = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) = sample.names
head(track)
```
Looks good! We kept the majority of our raw reads, and there is no overly large drop associated with any single step.
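If proportions are easier to scan, each column of `track` can be divided by the input counts (a minimal sketch on the matrix built above):
```{r eval=FALSE}
# Percentage of the input reads surviving each step of the pipeline
round(100 * track / track[, "input"], 1)
```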
## Taxonomic Classification
### Assign Taxonomy
Using the [Naïve Bayesian Classifier](http://www.ncbi.nlm.nih.gov/pubmed/17586664) algorithm, **DADA2** assigns taxonomic classifications to the sequence variants with the [`assignTaxonomy`](https://rdrr.io/bioc/dada2/man/assignTaxonomy.html) method, which also takes as input a formatted reference FASTA file. A copy of the SILVA training set is located under the `data/silva` subfolder; it can also be downloaded from here <a href="https://doi.org/10.5281/zenodo.3986799"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3986799.svg" alt="DOI"></a>
```{r}
taxa = assignTaxonomy(seqtab.nochim,
"data/silva/silva_nr99_v138_train_set.fa.gz",
multithread = TRUE,
verbose = TRUE)
head(taxa)
```
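Because the row names of `taxa` are the full ASV sequences, printing is unwieldy; a common trick is to drop them on a throwaway copy for inspection (a sketch):
```{r eval=FALSE}
# Inspect the assignments without the long sequence row names
taxa.print = taxa
rownames(taxa.print) = NULL
head(taxa.print)
```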
### Assign Species
In addition to the above taxonomic classification, *DADA2* can also assign taxonomy at the species level through the [`addSpecies`](https://rdrr.io/bioc/dada2/man/addSpecies.html) method, which takes the taxonomic table `taxa` computed above and the reference FASTA file (`data/silva/silva_species_assignment_v138.fa.gz`).
```{r}
taxa = addSpecies(taxa, "data/silva/silva_species_assignment_v138.fa.gz")
head(taxa)
```
### Digest ASVs
The following only converts the ASVs (the actual sequences) into cryptographic hashes using the [`digest`](https://rdrr.io/cran/digest/) method, which by default uses the [MD5 message-digest](https://en.wikipedia.org/wiki/MD5) algorithm - *this is an entirely optional step*
```{r}
md5 = lapply(colnames(seqtab.nochim), digest) # MD5 hash of each ASV sequence
colnames(seqtab.nochim) = md5                 # Replace the sequences with their hashes
otus = as.matrix(seqtab.nochim)
colnames(otus) = md5
otus[1:2, 1:2]
```
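If the original sequences are needed later, it is worth keeping a lookup table before the row names of `taxa` are digested in the next chunk (a sketch, assuming `assignTaxonomy` returned its rows in the same order as the columns of `seqtab.nochim`):
```{r eval=FALSE}
# Map each MD5 hash back to its full ASV sequence for later reference
asv_lookup = tibble(md5 = unlist(md5), sequence = rownames(taxa))
head(asv_lookup)
```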
```{r}
rownames(taxa) = lapply(rownames(taxa), digest) # Digest the taxa row names (ASV sequences) too
head(taxa)
```
## Phyloseq
The phyloseq R package is a powerful framework for further analysis of microbiome data. We now demonstrate how to straightforwardly import the tables produced by the DADA2 pipeline into phyloseq, together with the sample metadata (`ALS` vs `Control` status, and the pairing) loaded earlier.
### Import into phyloseq:
phyloseq expects the sample metadata as a `data.frame` whose row names match the sample names, so we convert the `samples` tibble accordingly:
```{r}
samples = as.data.frame(samples)
row.names(samples) = samples$sample
samples
```
We now construct a phyloseq object directly from the dada2 outputs.
```{r}
ps = phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), sample_data(samples), tax_table(taxa))
ps
```
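A few phyloseq accessors give a quick overview of the object (a minimal sketch):
```{r eval=FALSE}
# Quick overview of the phyloseq object
nsamples(ps)
ntaxa(ps)
sample_variables(ps)
rank_names(ps)
```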
We are now ready to use phyloseq!
Visualize alpha-diversity:
```{r}
plot_richness(ps, x="status", measures=c("Shannon", "Simpson"), color = "pair")
```
There are some differences in alpha-diversity between the ALS and Control samples. Next, we compute a few common count transformations (rank-threshold, log, and relative abundance) for downstream use:
```{r}
ps_rank = transform_sample_counts(ps, threshrankfun(50))
ps_log = transform_sample_counts(ps, log)
ps_norm = transform_sample_counts(ps, function(x) x / sum(x))
```
### Ordinate:
Transform the data to proportions, as appropriate for Bray-Curtis distances:
```{r}
ord.nmds.bray = ordinate(ps_norm, method="NMDS", distance="bray")
plot_ordination(ps, ord.nmds.bray, color="status", title="Bray NMDS") + geom_point(size = 3)
```
The ordination shows how the ALS and Control samples relate to each other in Bray-Curtis space.
### Bar plot
```{r}
ps
ps_norm = transform_sample_counts(ps, function(x) x / sum(x) )
ps_filtered = filter_taxa(ps_norm, function(x) mean(x) > 1e-3, prune = TRUE)
ps_filtered
```
```{r}
plot_bar(ps_filtered, x="status", fill="Phylum") + geom_bar(aes(fill=Phylum), stat="identity", position="stack", color = "white")
```
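A per-sample view, split by status, can make within-group variation easier to see (a hedged variation on the plot above, using ggplot2's `facet_wrap`):
```{r eval=FALSE}
# Per-sample composition, faceted by status
plot_bar(ps_filtered, x = "sample", fill = "Phylum") +
  facet_wrap(~ status, scales = "free_x")
```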
Normalize the number of reads in each sample using the median sequencing depth.
```{r}
total = median(sample_sums(ps))
standf = function(x, t=total) round(t * (x / sum(x)))
ps2 = transform_sample_counts(ps, standf)
ps2
```
```{r}
top20 = names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 = transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 = prune_taxa(top20, ps.top20)
plot_bar(ps.top20, x="status", fill="Phylum")
```
---
```{r}
toc()
```
---
## Session Info
```{r}
sessionInfo()
```