- Setting up
- Quality Control
- Sample Composition
- Construct Sequence Table
- Removal of Chimeras
- Taxonomic Classification
- Phyloseq
- Session Info
In the working directory, there is a file called samples.tsv with a list of the samples and the status (ALS vs Control) of each sample. The dataset, in paired-end FASTQ format (compressed .fastq.gz), is located under data/original/. There is also the data/silva/ folder with the SILVA database for taxonomic classification.
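Based on the metadata table shown later in this lesson, the first few lines of samples.tsv are tab-separated and look like this (column order: sample, status, pair):

```
sample	status	pair
SRR10153501	ALS	1
SRR10153511	Control	1
SRR10153503	ALS	2
```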
The main package required for this lesson is dada2. If it is not installed already, it can be installed through BiocManager as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2")
Another required package is phyloseq. If it is not installed already, it can also be installed through BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")
Additionally, tidyverse will be required for data reading, handling, and plotting. Finally, digest will be required only to prettify the names of the operational taxonomic units (OTUs). Both packages can be installed from The Comprehensive R Archive Network (CRAN) repository as follows:
install.packages(c("tidyverse", "digest"))
After all the required packages are successfully installed (which typically happens once per R installation), let us load these packages in the current R session (workspace):
library(dada2)
## Loading required package: Rcpp
library(phyloseq)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(digest)
When loading the sample information (metadata), read_tsv's col_types parameter takes the value 'cff' to indicate that the first column is a character (string) while the second and third columns are factors (categorical). For more on data types in R, see the episode on Data Types and Structures.
samples = read_tsv("data/samples.tsv", col_types = 'cff')
glimpse(samples)
## Rows: 18
## Columns: 3
## $ sample <chr> "SRR10153501", "SRR10153511", "SRR10153503", "SRR10153499", "SR…
## $ status <fct> ALS, Control, ALS, Control, ALS, Control, ALS, Control, ALS, Co…
## $ pair <fct> 1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10
glimpse shows that there are 18 rows (samples) and that the three columns (sample, status, and pair) were loaded successfully as chr, fct, and fct, respectively.
Now let’s start processing the FASTQ files. First, we will list all the files in the original subfolder.
path = "data/original"
list.files(path)
## [1] "SRR10153499_1.fastq.gz" "SRR10153499_2.fastq.gz" "SRR10153500_1.fastq.gz"
## [4] "SRR10153500_2.fastq.gz" "SRR10153501_1.fastq.gz" "SRR10153501_2.fastq.gz"
## [7] "SRR10153502_1.fastq.gz" "SRR10153502_2.fastq.gz" "SRR10153503_1.fastq.gz"
## [10] "SRR10153503_2.fastq.gz" "SRR10153504_1.fastq.gz" "SRR10153504_2.fastq.gz"
## [13] "SRR10153505_1.fastq.gz" "SRR10153505_2.fastq.gz" "SRR10153506_1.fastq.gz"
## [16] "SRR10153506_2.fastq.gz" "SRR10153507_1.fastq.gz" "SRR10153507_2.fastq.gz"
## [19] "SRR10153508_1.fastq.gz" "SRR10153508_2.fastq.gz" "SRR10153509_1.fastq.gz"
## [22] "SRR10153509_2.fastq.gz" "SRR10153510_1.fastq.gz" "SRR10153510_2.fastq.gz"
## [25] "SRR10153511_1.fastq.gz" "SRR10153511_2.fastq.gz" "SRR10153512_1.fastq.gz"
## [28] "SRR10153512_2.fastq.gz" "SRR10153513_1.fastq.gz" "SRR10153513_2.fastq.gz"
## [31] "SRR10153514_1.fastq.gz" "SRR10153514_2.fastq.gz" "SRR10153515_1.fastq.gz"
## [34] "SRR10153515_2.fastq.gz" "SRR10153573_1.fastq.gz" "SRR10153573_2.fastq.gz"
The above list will be split into forward and reverse reads according to the FASTQ file name: file names ending with 1.fastq.gz are the forward reads, and file names ending with 2.fastq.gz are the reverse reads.
The forward reads:
forward.original = sort(list.files(path, pattern="1.fastq.gz", full.names = TRUE))
head(forward.original)
## [1] "data/original/SRR10153499_1.fastq.gz"
## [2] "data/original/SRR10153500_1.fastq.gz"
## [3] "data/original/SRR10153501_1.fastq.gz"
## [4] "data/original/SRR10153502_1.fastq.gz"
## [5] "data/original/SRR10153503_1.fastq.gz"
## [6] "data/original/SRR10153504_1.fastq.gz"
The reverse reads:
reverse.original = sort(list.files(path, pattern="2.fastq.gz", full.names = TRUE))
head(reverse.original)
## [1] "data/original/SRR10153499_2.fastq.gz"
## [2] "data/original/SRR10153500_2.fastq.gz"
## [3] "data/original/SRR10153501_2.fastq.gz"
## [4] "data/original/SRR10153502_2.fastq.gz"
## [5] "data/original/SRR10153503_2.fastq.gz"
## [6] "data/original/SRR10153504_2.fastq.gz"
The sample name is extracted from the FASTQ file name using strsplit, taking the part of the file name before the underscore (_). The extracted sample names should match the sample column (first column) in the metadata loaded above. With sapply, the extraction (strsplit) is applied to each element of the forward.original vector.
sample.names = sapply(strsplit(basename(forward.original), "_"), `[`, 1)
head(sample.names)
## [1] "SRR10153499" "SRR10153500" "SRR10153501" "SRR10153502" "SRR10153503"
## [6] "SRR10153504"
We will plot (using plotQualityProfile) the quality profile of the forward reads in the first two samples:
plotQualityProfile(forward.original[1:2])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Then the quality profile of the reverse reads of the same first two samples:
plotQualityProfile(reverse.original[1:2])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
As expected, the forward reads are typically of higher quality for most of the read length, but for the reverse reads the quality drops about half-way through the read length (starting at approximately the 100th nucleotide).
Here we only define the names (.1.filtered.fastq.gz and .2.filtered.fastq.gz for the forward and reverse reads, respectively) and the destination subfolder data/filtered of the filtered FASTQ files. The sample names in sample.names are assigned as the names of the two file-path vectors:
forward.filtered = file.path("data/filtered", paste0(sample.names, ".1.filtered.fastq.gz"))
reverse.filtered = file.path("data/filtered", paste0(sample.names, ".2.filtered.fastq.gz"))
names(forward.filtered) = sample.names
names(reverse.filtered) = sample.names
Now filter and trim using filterAndTrim, which takes the input FASTQ files and generates the corresponding output FASTQ files.
out = filterAndTrim(forward.original, forward.filtered,
reverse.original, reverse.filtered,
minLen = 150, # Pretty stringent but to show difference between the in and out
multithread = TRUE) # In case of Windows OS, multithread should be FALSE (which is the default)
head(out)
 | reads.in | reads.out |
---|---|---|
SRR10153499_1.fastq.gz | 11688 | 9907 |
SRR10153500_1.fastq.gz | 15828 | 11295 |
SRR10153501_1.fastq.gz | 14180 | 11437 |
SRR10153502_1.fastq.gz | 10482 | 8965 |
SRR10153503_1.fastq.gz | 13782 | 11345 |
SRR10153504_1.fastq.gz | 14467 | 11425 |
Now we come to the nuts and bolts of using DADA2 to infer sample composition. DADA2 fits a parametric error model to every amplicon dataset. The error learning is performed by learnErrors from the data: the approach alternates between estimating the error rates and inferring the sample composition until the two converge.
forward.errors = learnErrors(forward.filtered, multithread = TRUE)
## 45461505 total bases in 190008 reads from 18 samples will be used for learning the error rates.
plotErrors(forward.errors, nominalQ = TRUE) # Plot observed frequency of each transition
reverse.errors = learnErrors(reverse.filtered, multithread = TRUE)
## 39575151 total bases in 190008 reads from 18 samples will be used for learning the error rates.
plotErrors(reverse.errors, nominalQ = TRUE) # Plot observed frequency of each transition
Now it is time to apply the sample inference algorithm; not surprisingly, the main method is called dada 😉 This method removes the estimated errors from the filtered sequence reads and infers the composition of each sample.
forward.dada = dada(forward.filtered, err = forward.errors, multithread = TRUE)
## Sample 1 - 9907 reads in 4306 unique sequences.
## Sample 2 - 11295 reads in 6598 unique sequences.
## Sample 3 - 11437 reads in 6130 unique sequences.
## Sample 4 - 8965 reads in 4869 unique sequences.
## Sample 5 - 11345 reads in 5462 unique sequences.
## Sample 6 - 11425 reads in 6348 unique sequences.
## Sample 7 - 11111 reads in 5274 unique sequences.
## Sample 8 - 12067 reads in 5961 unique sequences.
## Sample 9 - 9883 reads in 5960 unique sequences.
## Sample 10 - 9563 reads in 4130 unique sequences.
## Sample 11 - 10508 reads in 5357 unique sequences.
## Sample 12 - 11592 reads in 6122 unique sequences.
## Sample 13 - 8911 reads in 4174 unique sequences.
## Sample 14 - 10840 reads in 4331 unique sequences.
## Sample 15 - 9624 reads in 6037 unique sequences.
## Sample 16 - 9088 reads in 4960 unique sequences.
## Sample 17 - 10429 reads in 3865 unique sequences.
## Sample 18 - 12018 reads in 5879 unique sequences.
head(forward.dada)
## $SRR10153499
## dada-class: object describing DADA2 denoising results
## 63 sequence variants were inferred from 4306 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153500
## dada-class: object describing DADA2 denoising results
## 124 sequence variants were inferred from 6598 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153501
## dada-class: object describing DADA2 denoising results
## 94 sequence variants were inferred from 6130 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153502
## dada-class: object describing DADA2 denoising results
## 78 sequence variants were inferred from 4869 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153503
## dada-class: object describing DADA2 denoising results
## 78 sequence variants were inferred from 5462 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153504
## dada-class: object describing DADA2 denoising results
## 102 sequence variants were inferred from 6348 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
reverse.dada = dada(reverse.filtered, err = reverse.errors, multithread = TRUE)
## Sample 1 - 9907 reads in 6084 unique sequences.
## Sample 2 - 11295 reads in 8328 unique sequences.
## Sample 3 - 11437 reads in 8002 unique sequences.
## Sample 4 - 8965 reads in 6397 unique sequences.
## Sample 5 - 11345 reads in 7343 unique sequences.
## Sample 6 - 11425 reads in 7901 unique sequences.
## Sample 7 - 11111 reads in 6943 unique sequences.
## Sample 8 - 12067 reads in 7889 unique sequences.
## Sample 9 - 9883 reads in 7295 unique sequences.
## Sample 10 - 9563 reads in 5886 unique sequences.
## Sample 11 - 10508 reads in 7123 unique sequences.
## Sample 12 - 11592 reads in 8244 unique sequences.
## Sample 13 - 8911 reads in 5942 unique sequences.
## Sample 14 - 10840 reads in 5829 unique sequences.
## Sample 15 - 9624 reads in 7704 unique sequences.
## Sample 16 - 9088 reads in 6444 unique sequences.
## Sample 17 - 10429 reads in 5279 unique sequences.
## Sample 18 - 12018 reads in 7587 unique sequences.
head(reverse.dada)
## $SRR10153499
## dada-class: object describing DADA2 denoising results
## 36 sequence variants were inferred from 6084 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153500
## dada-class: object describing DADA2 denoising results
## 70 sequence variants were inferred from 8328 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153501
## dada-class: object describing DADA2 denoising results
## 65 sequence variants were inferred from 8002 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153502
## dada-class: object describing DADA2 denoising results
## 33 sequence variants were inferred from 6397 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153503
## dada-class: object describing DADA2 denoising results
## 43 sequence variants were inferred from 7343 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
##
## $SRR10153504
## dada-class: object describing DADA2 denoising results
## 58 sequence variants were inferred from 7901 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
Now the denoised forward and reverse mates of the paired-end reads are merged into full sequences using mergePairs. To merge, the reverse read is reverse-complemented and aligned to the forward read; paired reads that do not overlap exactly are removed from the merged output.
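The core merging idea can be sketched in base R. This is a toy with made-up reads and an exact-overlap search, not dada2's actual alignment:

```r
# Reverse-complement a DNA string: reverse it, then swap complementary bases
revcomp = function(s) chartr("ACGT", "TGCA",
                             paste(rev(strsplit(s, "")[[1]]), collapse = ""))

merge_exact = function(fwd, rev, min_overlap = 4) {
  rc = revcomp(rev)
  if (min(nchar(fwd), nchar(rc)) < min_overlap) return(NA_character_)
  for (k in seq(min(nchar(fwd), nchar(rc)), min_overlap)) {
    # do the last k nt of the forward read equal the first k nt of rc?
    if (substr(fwd, nchar(fwd) - k + 1, nchar(fwd)) == substr(rc, 1, k))
      return(paste0(fwd, substr(rc, k + 1, nchar(rc))))
  }
  NA_character_  # no exact overlap: such a pair is dropped from the output
}

merge_exact("ACGTTGCAAC", "AAAAGTTGC")  # "ACGTTGCAACTTTT"
```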
mergers = mergePairs(forward.dada, forward.filtered, reverse.dada, reverse.filtered, verbose = TRUE)
## 8188 paired-reads (in 34 unique pairings) successfully merged out of 9583 (in 101 pairings) input.
## Duplicate sequences in merged output.
## 8454 paired-reads (in 59 unique pairings) successfully merged out of 10536 (in 246 pairings) input.
## Duplicate sequences in merged output.
## 9263 paired-reads (in 58 unique pairings) successfully merged out of 11048 (in 192 pairings) input.
## 6208 paired-reads (in 30 unique pairings) successfully merged out of 8471 (in 214 pairings) input.
## 8025 paired-reads (in 34 unique pairings) successfully merged out of 10896 (in 153 pairings) input.
## 9086 paired-reads (in 52 unique pairings) successfully merged out of 10891 (in 199 pairings) input.
## Duplicate sequences in merged output.
## 8073 paired-reads (in 35 unique pairings) successfully merged out of 10613 (in 207 pairings) input.
## Duplicate sequences in merged output.
## 9942 paired-reads (in 42 unique pairings) successfully merged out of 11661 (in 153 pairings) input.
## 6890 paired-reads (in 39 unique pairings) successfully merged out of 9242 (in 243 pairings) input.
## 8307 paired-reads (in 34 unique pairings) successfully merged out of 9231 (in 113 pairings) input.
## Duplicate sequences in merged output.
## 8333 paired-reads (in 50 unique pairings) successfully merged out of 9927 (in 199 pairings) input.
## 9652 paired-reads (in 54 unique pairings) successfully merged out of 10969 (in 185 pairings) input.
## 7437 paired-reads (in 27 unique pairings) successfully merged out of 8526 (in 118 pairings) input.
## 9990 paired-reads (in 30 unique pairings) successfully merged out of 10656 (in 82 pairings) input.
## 6714 paired-reads (in 61 unique pairings) successfully merged out of 8721 (in 248 pairings) input.
## 6509 paired-reads (in 32 unique pairings) successfully merged out of 8627 (in 147 pairings) input.
## Duplicate sequences in merged output.
## 8878 paired-reads (in 19 unique pairings) successfully merged out of 10278 (in 79 pairings) input.
## 9486 paired-reads (in 34 unique pairings) successfully merged out of 11590 (in 136 pairings) input.
## Duplicate sequences in merged output.
head(mergers[[1]]) # Inspect the merger data.frame of the first sample
sequence | abundance | forward | reverse | nmatch | nmismatch | nindel | prefer | accept |
---|---|---|---|---|---|---|---|---|
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCTGTCTATTAAGCGTGTTGTGAAATTTACCGGCTCAACCGGTGGCTTGCAGCGCGAACTGGTCGACTTGAGTATGCAGGAAGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGACGAACTCCGATTGCGCAGGCAGCTTACTGTAGCATAACTGACGCTGATGCTCGAAAGTGCGGGTATCAAACAGG | 2368 | 1 | 1 | 247 | 0 | 0 | 1 | TRUE |
TACGGAAGGTTCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGTTTGGTAAGCGTGTTGTGAAATGTAGGAGCTCAACTTCTAGATTGCAGCGCGAACTGTCAGACTTGAGTGCGCACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTTACGGGAGCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCGAACAGG | 916 | 3 | 2 | 247 | 0 | 0 | 1 | TRUE |
TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGCGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCGGTGTAGCAGTGAAATGCGTAGATATCGAGAGGAACACTCGTGGCGAAGGCGGGTTCCTGGACATTAACTGACGCTGAGGCACGAAGGCCAGGGGAGCGAAAGGG | 906 | 2 | 3 | 193 | 0 | 0 | 2 | TRUE |
TACGGAAGGTTCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGTTTGGTAAGCGTGTTGTGAAATGTAGTAGCTCAACTTCTAGATTGCAGCGCGAACTGTCAGACTTGAGTGCGCACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTTACGGGAGCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCGAACAGG | 665 | 4 | 31 | 247 | 0 | 0 | 1 | TRUE |
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCTGTCTATTAAGCGTGTTGTGAAATATACCGGCTCAACCGGTGGCTTGCAGCGCGAACTGGTCGACTTGAGTATGCAGGAAGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGACGAACTCCGATTGCGCAGGCAGCTTACTGTAGCATAACTGACGCTGATGCTCGAAAGTGCGGGTATCAAACAGG | 435 | 10 | 33 | 247 | 0 | 0 | 1 | TRUE |
TACGTATGGAGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTAATAAGTCGAATGTTAAAGATCGGGGCTCAACCCCGTAAAGCATTGGAAACTGATAAACTTGAGTAGTGGAGAGGAAAGTGGAATTCCTAGTGTAGTGGTGAAATACGTAGATATTAGGAGGAATACCAGTAGCGAAGGCGACTTTCTGGACACAAACTGACGCTGAGGTACGAAAGCGTGGGGAGCAAACAGG | 404 | 5 | 5 | 250 | 0 | 0 | 2 | TRUE |
From the documentation of the method: the returned data.frame(s) has a row for each unique pairing of forward/reverse denoised sequences, and the following columns:
- $abundance: number of reads corresponding to this forward/reverse combination.
- $sequence: the merged sequence.
- $forward: the index of the forward denoised sequence.
- $reverse: the index of the reverse denoised sequence.
- $nmatch: number of matching nucleotides in the overlap region.
- $nmismatch: number of mismatches in the overlap region.
- $nindel: number of indels in the overlap region.
- $prefer: the sequence used for the overlap region (1 = forward; 2 = reverse).
- $accept: TRUE if the overlap between the forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences; FALSE otherwise.
We are almost there! Now we can build the amplicon sequence variant (ASV) table. The ASV (DADA2’s) approach provides more precise, tractable, reproducible, and comprehensive estimates of sample composition compared to the traditional operational taxonomic unit (OTU) approach. For more on the topic, read the microbiome informatics: OTU vs. ASV blog post by Zymo Research.
The construction of the ASV table is performed using makeSequenceTable, which takes the list of per-sample data.frames, each listing the ASVs and their abundances, and identifies the unique ASVs across all samples.
seqtab = makeSequenceTable(mergers)
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
The dimensions of the generated sequences (ASVs) table are:
dim(seqtab)
## [1] 18 293
As shown above, the dimensions of the seqtab matrix are 18 × 293, indicating that there are 293 unique ASVs across the 18 processed samples.
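The bookkeeping makeSequenceTable performs can be sketched in base R. A toy with two samples and made-up ASV names and counts:

```r
# Per-sample ASV abundance vectors are aligned on the union of all ASVs
# to form one sample-by-ASV matrix (names and counts are hypothetical)
s1 = c(ASV_A = 10, ASV_B = 5)
s2 = c(ASV_B = 7, ASV_C = 3)

asvs = union(names(s1), names(s2))   # all unique ASVs across samples
tab = rbind(s1[asvs], s2[asvs])      # ASVs missing from a sample come out as NA
tab[is.na(tab)] = 0                  # an absent ASV has abundance 0
dimnames(tab) = list(c("sample1", "sample2"), asvs)
tab
```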
ASVs constructed from two parents (bimeras) can now be identified and removed from the denoised ASVs using the removeBimeraDenovo method:
seqtab.nochim = removeBimeraDenovo(seqtab,
multithread = TRUE,
verbose = TRUE)
## Identified 18 bimeras out of 293 input sequences.
After the removal of the bimeras (chimeric sequences), the final dimensions of the ASVs table are:
dim(seqtab.nochim)
## [1] 18 275
18 × 275, confirming that the 18 bimeras were removed. In terms of total reads (not just unique sequences), about 99% of the merged reads remain after bimera removal.
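The retained fraction is the total read count after chimera removal divided by the total before it; on the real tables this is sum(seqtab.nochim) / sum(seqtab). A self-contained toy with a 2-sample × 3-ASV table where the third ASV is flagged as chimeric (all numbers made up):

```r
seqtab_toy = matrix(c(50, 30, 2,
                      40, 60, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("s1", "s2"), c("asv1", "asv2", "chimera")))
seqtab_toy_nochim = seqtab_toy[, c("asv1", "asv2")]  # drop the flagged column

kept = sum(seqtab_toy_nochim) / sum(seqtab_toy)  # fraction of reads kept
kept
```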
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
getN = function(x) sum(getUniques(x))
track = cbind(out, sapply(forward.dada, getN), sapply(reverse.dada, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
## Duplicate sequences detected and merged.
# If processing a single sample, remove the sapply calls: e.g. replace sapply(forward.dada, getN) with getN(forward.dada)
colnames(track) = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) = sample.names
head(track)
 | input | filtered | denoisedF | denoisedR | merged | nonchim |
---|---|---|---|---|---|---|
SRR10153499 | 11688 | 9907 | 9688 | 9703 | 8188 | 8188 |
SRR10153500 | 15828 | 11295 | 10847 | 10788 | 8454 | 8449 |
SRR10153501 | 14180 | 11437 | 11191 | 11209 | 9263 | 9263 |
SRR10153502 | 10482 | 8965 | 8671 | 8642 | 6208 | 6119 |
SRR10153503 | 13782 | 11345 | 11079 | 11059 | 8025 | 7979 |
SRR10153504 | 14467 | 11425 | 11104 | 11076 | 9086 | 9086 |
Looks good! We kept the majority of our raw reads, and there is no overly large drop associated with any single step.
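Per-step retention is easiest to judge as a fraction of the input reads; on the full matrix this would be sweep(track, 1, track[, "input"], "/"). A sketch using only the first row of the table above:

```r
# Counts for SRR10153499 taken from the tracking table above
track_toy = matrix(c(11688, 9907, 8188), nrow = 1,
                   dimnames = list("SRR10153499",
                                   c("input", "filtered", "nonchim")))

# Divide each row by its input count to get per-step retention fractions
frac = round(sweep(track_toy, 1, track_toy[, "input"], "/"), 2)
frac
```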
Using the Naïve Bayesian Classifier algorithm, DADA2 assigns a taxonomic classification to the sequence variants with the assignTaxonomy method, which also takes as input a reference FASTA file of formatted training sequences. A copy of the SILVA training set is located under the data/silva subfolder; it can also be downloaded from the DADA2 website.
taxa = assignTaxonomy(seqtab.nochim,
"data/silva/silva_nr99_v138_train_set.fa.gz",
multithread = TRUE,
verbose = TRUE)
## Finished processing reference fasta.
head(taxa)
 | Kingdom | Phylum | Class | Order | Family | Genus |
---|---|---|---|---|---|---|
TACGGAAGGTTCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGTTTGGTAAGCGTGTTGTGAAATGTAGGAGCTCAACTTCTAGATTGCAGCGCGAACTGTCAGACTTGAGTGCGCACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTTACGGGAGCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCGAACAGG | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella |
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGTGTAGGCGGCCTGTTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGAACTGCAGCGCGAACTGGCTGGCTTGAGTACACGCAACGTGGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAGGAACTCCTATTGCGAAGGCAGCTCACGGGAGTGTCACTGACGCTTAAGCTCGAAGGTGCGGGTATCAAACAGG | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella |
TACGTATGGAGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTAATAAGTCGAATGTTAAAGATCGGGGCTCAACCCCGTAAAGCATTGGAAACTGATAAACTTGAGTAGTGGAGAGGAAAGTGGAATTCCTAGTGTAGTGGTGAAATACGTAGATATTAGGAGGAATACCAGTAGCGAAGGCGACTTTCTGGACACAAACTGACGCTGAGGTACGAAAGCGTGGGGAGCAAACAGG | Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Finegoldia | NA |
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGTGTAGGCGGTTGGTTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGACTTGCAGCGCGAACTGGCTGACTTGAGTACACACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTTACGGGAGTGTTACTGACGCTTAAGCTCGAAGGTGCGGGTATCGAACAGG | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella |
TACGTAAGGGGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGTGCGTAGGCGGCAAATTAAGTCAGATGTGAAAACTAAGGGCTCAACCCATAGATTGCATCTGAAACTGATATGCTTGAGTCAAGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGGAATGCGTAGATATTAGGAGGAATACCGGTGGCGAAGGCGACTTTCTGGACTTGAACTGACGCTGAGGCACGAAAGCGTGGGGAGCAAACAGG | Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Fenollaria | NA |
TACGGAGGGTGCAAGCGTTACTCGGAATCACTGGGCGTAAAGGGTGCGTAGGTGGATTATCAAGTCTCTTGTGAAATCTAATAGCTTAACTATTAAATTGCTTGGGAAACTGATAGTCTAGAGTAGGGGAGAGGCAGATGGAACTCTTGGTGTAGGAGTAAAATCCGTAGATATCAAGAAGAATACCTATTGCGAAAGCGATCTGCTAGAACCTAACTGACACTGATGCACGAAAGCGTGGGGAGCAAACAGG | Bacteria | Campilobacterota | Campylobacteria | Campylobacterales | Campylobacteraceae | Campylobacter |
In addition to the above taxonomic classification, DADA2 can also make species-level assignments through the addSpecies method, which takes the taxonomic table taxa computed above and the reference FASTA file (data/silva/silva_species_assignment_v138.fa.gz).
taxa = addSpecies(taxa, "data/silva/silva_species_assignment_v138.fa.gz")
head(taxa)
 | Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
TACGGAAGGTTCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGTTTGGTAAGCGTGTTGTGAAATGTAGGAGCTCAACTTCTAGATTGCAGCGCGAACTGTCAGACTTGAGTGCGCACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTTACGGGAGCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCGAACAGG | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | NA |
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGTGTAGGCGGCCTGTTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGAACTGCAGCGCGAACTGGCTGGCTTGAGTACACGCAACGTGGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAGGAACTCCTATTGCGAAGGCAGCTCACGGGAGTGTCACTGACGCTTAAGCTCGAAGGTGCGGGTATCAAACAGG | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | corporis |
TACGTATGGAGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTAATAAGTCGAATGTTAAAGATCGGGGCTCAACCCCGTAAAGCATTGGAAACTGATAAACTTGAGTAGTGGAGAGGAAAGTGGAATTCCTAGTGTAGTGGTGAAATACGTAGATATTAGGAGGAATACCAGTAGCGAAGGCGACTTTCTGGACACAAACTGACGCTGAGGTACGAAAGCGTGGGGAGCAAACAGG | Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Finegoldia | NA | NA |
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGTGTAGGCGGTTGGTTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGACTTGCAGCGCGAACTGGCTGACTTGAGTACACACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTTACGGGAGTGTTACTGACGCTTAAGCTCGAAGGTGCGGGTATCGAACAGG | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | disiens |
TACGTAAGGGGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGTGCGTAGGCGGCAAATTAAGTCAGATGTGAAAACTAAGGGCTCAACCCATAGATTGCATCTGAAACTGATATGCTTGAGTCAAGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGGAATGCGTAGATATTAGGAGGAATACCGGTGGCGAAGGCGACTTTCTGGACTTGAACTGACGCTGAGGCACGAAAGCGTGGGGAGCAAACAGG | Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Fenollaria | NA | NA |
TACGGAGGGTGCAAGCGTTACTCGGAATCACTGGGCGTAAAGGGTGCGTAGGTGGATTATCAAGTCTCTTGTGAAATCTAATAGCTTAACTATTAAATTGCTTGGGAAACTGATAGTCTAGAGTAGGGGAGAGGCAGATGGAACTCTTGGTGTAGGAGTAAAATCCGTAGATATCAAGAAGAATACCTATTGCGAAAGCGATCTGCTAGAACCTAACTGACACTGATGCACGAAAGCGTGGGGAGCAAACAGG | Bacteria | Campilobacterota | Campylobacteria | Campylobacterales | Campylobacteraceae | Campylobacter | hominis |
The following only converts the ASVs (the actual sequences) into a cryptographic hash using the digest function, which by default uses the MD5 message-digest algorithm. This step is entirely optional; it simply gives the ASVs shorter, more readable names.
md5 = lapply(colnames(seqtab.nochim), digest)
colnames(seqtab.nochim) = md5
otus = as.matrix(seqtab.nochim)
colnames(otus) = md5
otus[1:2, 1:2]
 | 6075c62942cb38859a92dd48a72a3885 | 023d4737baa910a431f2ddb5de0e44c8 |
---|---|---|
SRR10153499 | 916 | 0 |
SRR10153500 | 0 | 457 |
rownames(taxa) = lapply(rownames(taxa), digest)
head(taxa)
 | Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
6075c62942cb38859a92dd48a72a3885 | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | NA |
023d4737baa910a431f2ddb5de0e44c8 | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | corporis |
504fb0a312e04e43d371acb19a95d789 | Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Finegoldia | NA | NA |
6619cbcb3bcaafe1e418ad7b9073af70 | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | disiens |
977091fc9d08b0b43c99857ca9976e72 | Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Fenollaria | NA | NA |
e41846b638aa4b7eb2728408abf3700b | Bacteria | Campilobacterota | Campylobacteria | Campylobacterales | Campylobacteraceae | Campylobacter | hominis |
The phyloseq R package is a powerful framework for further analysis of microbiome data. We now demonstrate how to straightforwardly import the tables produced by the DADA2 pipeline into phyloseq, together with the sample metadata (status and pair) loaded earlier from samples.tsv. phyloseq expects the sample data as a data.frame whose row names match the sample names used in the OTU table:
samples = as.data.frame(samples)
row.names(samples) = samples$sample
samples
 | sample | status | pair |
---|---|---|---|
SRR10153501 | SRR10153501 | ALS | 1 |
SRR10153511 | SRR10153511 | Control | 1 |
SRR10153503 | SRR10153503 | ALS | 2 |
SRR10153499 | SRR10153499 | Control | 2 |
SRR10153500 | SRR10153500 | ALS | 3 |
SRR10153504 | SRR10153504 | Control | 3 |
SRR10153502 | SRR10153502 | ALS | 5 |
SRR10153505 | SRR10153505 | Control | 5 |
SRR10153509 | SRR10153509 | ALS | 6 |
SRR10153514 | SRR10153514 | Control | 6 |
SRR10153507 | SRR10153507 | ALS | 7 |
SRR10153512 | SRR10153512 | Control | 7 |
SRR10153510 | SRR10153510 | ALS | 8 |
SRR10153508 | SRR10153508 | Control | 8 |
SRR10153506 | SRR10153506 | ALS | 9 |
SRR10153515 | SRR10153515 | Control | 9 |
SRR10153513 | SRR10153513 | ALS | 10 |
SRR10153573 | SRR10153573 | Control | 10 |
We now construct a phyloseq object directly from the dada2 outputs.
ps = phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), sample_data(samples), tax_table(taxa))
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 275 taxa and 18 samples ]
## sample_data() Sample Data: [ 18 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 275 taxa by 7 taxonomic ranks ]
We are now ready to use phyloseq!
Visualize alpha-diversity:
plot_richness(ps, x="status", measures=c("Shannon", "Simpson"), color = "pair")
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
There are some differences in alpha-diversity between the ALS and Control samples.
ps_rank = transform_sample_counts(ps, threshrankfun(50))
ps_log = transform_sample_counts(ps, log)
ps_norm = transform_sample_counts(ps, function(x) x / sum(x))
The data transformed to proportions (ps_norm, computed above) is appropriate for Bray-Curtis distances:
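For intuition, the Bray-Curtis dissimilarity between two samples' proportion vectors can be computed by hand (a toy with made-up proportions; phyloseq delegates the real computation to the vegan package):

```r
p1 = c(0.5, 0.3, 0.2)  # sample 1 ASV proportions (hypothetical)
p2 = c(0.2, 0.3, 0.5)  # sample 2 ASV proportions (hypothetical)

# Bray-Curtis: 0 = identical composition, 1 = no shared taxa
bc = sum(abs(p1 - p2)) / sum(p1 + p2)
bc
```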
ord.nmds.bray = ordinate(ps_norm, method="NMDS", distance="bray")
## Run 0 stress 0.1867397
## Run 1 stress 0.1978797
## Run 2 stress 0.1695197
## ... New best solution
## ... Procrustes: rmse 0.1882398 max resid 0.6444576
## Run 3 stress 0.1677135
## ... New best solution
## ... Procrustes: rmse 0.06752529 max resid 0.1821645
## Run 4 stress 0.1693977
## Run 5 stress 0.1865755
## Run 6 stress 0.1867397
## Run 7 stress 0.1808857
## Run 8 stress 0.1677134
## ... New best solution
## ... Procrustes: rmse 3.208964e-05 max resid 0.0001121116
## ... Similar to previous best
## Run 9 stress 0.1677132
## ... New best solution
## ... Procrustes: rmse 0.0002691525 max resid 0.0009414973
## ... Similar to previous best
## Run 10 stress 0.1766461
## Run 11 stress 0.1677133
## ... Procrustes: rmse 0.0001029658 max resid 0.0003578937
## ... Similar to previous best
## Run 12 stress 0.183413
## Run 13 stress 0.1677133
## ... Procrustes: rmse 7.457289e-05 max resid 0.0002586777
## ... Similar to previous best
## Run 14 stress 0.1693585
## Run 15 stress 0.1693585
## Run 16 stress 0.261521
## Run 17 stress 0.1677134
## ... Procrustes: rmse 0.0004003549 max resid 0.001402586
## ... Similar to previous best
## Run 18 stress 0.1677134
## ... Procrustes: rmse 0.0002501921 max resid 0.0008759365
## ... Similar to previous best
## Run 19 stress 0.1834109
## Run 20 stress 0.1867397
## *** Solution reached
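The stress value reported in the log measures how faithfully the two-dimensional NMDS embedding preserves the original dissimilarities; as a common rule of thumb, values below roughly 0.2 are considered usable. It can be read directly off the returned `metaMDS` object:

```r
# NMDS stress: lower is better; ~0.17 here is acceptable but not excellent
ord.nmds.bray$stress
```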
plot_ordination(ps, ord.nmds.bray, color="status", title="Bray NMDS") + geom_point(size = 3)
The ordination picks out a separation between the ALS and control samples.
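Whether that separation is statistically supported can be tested with a PERMANOVA on the same Bray-Curtis distances. This is a sketch using `vegan::adonis2` (vegan is already loaded as a phyloseq dependency), assuming the `status` column in the metadata:

```r
# Hedged sketch: PERMANOVA of community composition against disease status
library(vegan)
bray_dist <- phyloseq::distance(ps_norm, method = "bray")
meta <- data.frame(sample_data(ps_norm))
adonis2(bray_dist ~ status, data = meta)
```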
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 275 taxa and 18 samples ]
## sample_data() Sample Data: [ 18 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 275 taxa by 7 taxonomic ranks ]
ps_norm = transform_sample_counts(ps, function(x) x / sum(x) )
ps_filtered = filter_taxa(ps_norm, function(x) mean(x) > 1e-3, prune = TRUE)
ps_filtered
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 127 taxa and 18 samples ]
## sample_data() Sample Data: [ 18 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 127 taxa by 7 taxonomic ranks ]
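The filter above keeps taxa whose mean relative abundance exceeds 0.1%. A common alternative, sketched here with an arbitrary threshold of three samples, is to filter on prevalence instead, which retains rare but consistently observed taxa:

```r
# Hedged sketch: keep taxa detected in at least 3 samples,
# regardless of their abundance
ps_prev <- filter_taxa(ps_norm, function(x) sum(x > 0) >= 3, prune = TRUE)
ps_prev
```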
plot_bar(ps_filtered, x="status", fill="Phylum") + geom_bar(aes(fill=Phylum), stat="identity", position="stack", color = "white")
Normalize the number of reads in each sample using the median sequencing depth:
total = median(sample_sums(ps))
standf = function(x, t=total) round(t * (x / sum(x)))
ps2 = transform_sample_counts(ps, standf)
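After this rescaling, every sample's library size should be (approximately) equal to the median depth; the rounding in `standf` introduces only small deviations. A quick check:

```r
# All samples should now have roughly the same total count
summary(sample_sums(ps2))
```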
ps2
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 275 taxa and 18 samples ]
## sample_data() Sample Data: [ 18 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 275 taxa by 7 taxonomic ranks ]
top20 = names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 = transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 = prune_taxa(top20, ps.top20)
plot_bar(ps.top20, x="status", fill="Phylum")
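Because the bar plot above colors individual OTUs by their phylum, each bar is split into many thin segments. Agglomerating the counts at the phylum level first, as sketched below with `tax_glom`, produces one segment per phylum and a cleaner figure:

```r
# Hedged sketch: merge OTUs sharing a phylum before plotting
# (NArm = FALSE keeps taxa with an unassigned phylum)
ps_phylum <- tax_glom(ps.top20, taxrank = "Phylum", NArm = FALSE)
plot_bar(ps_phylum, x = "status", fill = "Phylum")
```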
toc()
## 335.38 sec elapsed
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] digest_0.6.27 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
## [5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.2
## [9] ggplot2_3.3.4 tidyverse_1.3.1 phyloseq_1.36.0 dada2_1.20.0
## [13] Rcpp_1.0.6 tictoc_1.0.1 printr_0.1.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-1 hwriter_1.3.2
## [3] ellipsis_0.3.2 XVector_0.32.0
## [5] GenomicRanges_1.44.0 fs_1.5.0
## [7] rstudioapi_0.13 farver_2.1.0
## [9] fansi_0.5.0 lubridate_1.7.10
## [11] xml2_1.3.2 codetools_0.2-18
## [13] splines_4.1.0 knitr_1.33
## [15] ade4_1.7-16 jsonlite_1.7.2
## [17] Rsamtools_2.8.0 broom_0.7.7
## [19] cluster_2.1.2 dbplyr_2.1.1
## [21] png_0.1-7 compiler_4.1.0
## [23] httr_1.4.2 backports_1.2.1
## [25] assertthat_0.2.1 Matrix_1.3-4
## [27] cli_2.5.0 htmltools_0.5.1.1
## [29] prettyunits_1.1.1 tools_4.1.0
## [31] igraph_1.2.6 gtable_0.3.0
## [33] glue_1.4.2 GenomeInfoDbData_1.2.6
## [35] reshape2_1.4.4 ShortRead_1.50.0
## [37] Biobase_2.52.0 cellranger_1.1.0
## [39] vctrs_0.3.8 Biostrings_2.60.1
## [41] rhdf5filters_1.4.0 multtest_2.48.0
## [43] ape_5.5 nlme_3.1-152
## [45] iterators_1.0.13 xfun_0.24
## [47] rvest_1.0.0 lifecycle_1.0.0
## [49] zlibbioc_1.38.0 MASS_7.3-54
## [51] scales_1.1.1 hms_1.1.0
## [53] MatrixGenerics_1.4.0 parallel_4.1.0
## [55] SummarizedExperiment_1.22.0 biomformat_1.20.0
## [57] rhdf5_2.36.0 RColorBrewer_1.1-2
## [59] yaml_2.2.1 latticeExtra_0.6-29
## [61] stringi_1.6.2 highr_0.9
## [63] S4Vectors_0.30.0 foreach_1.5.1
## [65] permute_0.9-5 BiocGenerics_0.38.0
## [67] BiocParallel_1.26.0 GenomeInfoDb_1.28.0
## [69] rlang_0.4.11 pkgconfig_2.0.3
## [71] matrixStats_0.59.0 bitops_1.0-7
## [73] evaluate_0.14 lattice_0.20-44
## [75] Rhdf5lib_1.14.1 labeling_0.4.2
## [77] GenomicAlignments_1.28.0 tidyselect_1.1.1
## [79] plyr_1.8.6 magrittr_2.0.1
## [81] R6_2.5.0 IRanges_2.26.0
## [83] generics_0.1.0 DelayedArray_0.18.0
## [85] DBI_1.1.1 pillar_1.6.1
## [87] haven_2.4.1 withr_2.4.2
## [89] mgcv_1.8-36 survival_3.2-11
## [91] RCurl_1.98-1.3 modelr_0.1.8
## [93] crayon_1.4.1 utf8_1.2.1
## [95] rmarkdown_2.9 jpeg_0.1-8.1
## [97] progress_1.2.2 grid_4.1.0
## [99] readxl_1.3.1 data.table_1.14.0
## [101] vegan_2.5-7 reprex_2.0.0
## [103] RcppParallel_5.1.4 stats4_4.1.0
## [105] munsell_0.5.0