-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhabssed-data-ms.Rmd
134 lines (99 loc) · 16.1 KB
/
habssed-data-ms.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
# title: MinION sequencing of environmental DNA from Lake Erie
title: Environmental DNA sequencing data from algal blooms in Lake Erie using Oxford Nanopore MinION
authors:
- name: Alexander F. Koeppel
affiliation: Elder Research
location: Charlottesville, Virginia
- name: Will Goodrum
affiliation: Elder Research
location: Charlottesville, Virginia
- name: Morgan Steffen
department: Biology Department
affiliation: James Madison University
location: Harrisonburg, Virginia
- name: Louie Wurch
department: Biology Department
affiliation: James Madison University
location: Harrisonburg, Virginia
- name: Stephen D. Turner
thanks: "Corresponding author: `sturner at signaturescience dot com`"
affiliation: Signature Science, LLC
location: Charlottesville, Virginia
# email: [email protected] dont use this unless you want the spam cannon
abstract: |
Harmful Algal Blooms (HABs) pose a significant and increasing risk, both to human health and to the Blue Economy. Genomics approaches to early detection promise to help mitigate these risks. We have developed and prototyped HABSSED (HAB Surveillance by Sequencing of Environmental DNA), a portable, reliable, rapid, low-cost pipeline for detecting HABs in the field using 3rd generation sequencing with the Oxford Nanopore MinION device. We demonstrated the efficacy of our approach by sequencing existing samples collected from a National Oceanic and Atmospheric Administration (NOAA) Great Lakes Environmental Research Laboratory (GLERL) on Lake Erie. We sequenced environmental DNA (eDNA) from samples drawn before, during, and after a _Microcystis_ bloom, and estimate the abundanced of HAB-associated taxa. While sequencing results showed some evidence of human and _E. coli_ contamination, we find that the abundance of _Mycrocystis_ and other HAB-associated orgnisms significantly differs between pre- and post-bloom environments. Here we describe the publicly available sequencing data that was generated as part of this research, which is available in the Sequence Read Archive (SRA) under accession number PRJNA812770.
keywords:
- MinION
- Nanopore
- eDNA
- Harmful algal blooms
- Freshwater ecology
bibliography: references.bib
biblio-style: unsrt
output: rticles::arxiv_article
---
```{r, include=FALSE}
suppressPackageStartupMessages({
library(readr)
library(dplyr)
library(knitr)
})
opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE, fig.align="center")
options(knitr.kable.NA = '--')
# theme_set(theme_bw())
```
\clearpage
# Background & Summary
Harmful algal blooms (HABs) represent a significant challenge to the health of the world's waters and the Blue Economies that depend on them. Toxic HABs of many different species represent a direct risk of injury or death to humans [@erdner2008; @jochimsen1998], but even algal blooms that are not directly toxic can cause significant damage by leading to hypoxia or anoxia in the water column, shading critical submerged aquatic vegetation, and degrading ecosystem function [@heisler2008]. As a result, HABs cause significant harm to the Blue Economy, adversely affect beaches [@morgan2009] and fisheries [@holland2020], and cost hundreds of millions of dollars in efforts to manage and mitigate the blooms [@anderson2008]. As the global climate continues to warm, the number and severity of HABs is expected to increase, along with their incumbent deleterious health and economic effects [@anderson2002; @fu2012; @moore2008].
We are developing a genomics-based approach to help the diverse group of Blue Economy stakeholders to address this challenge. We developed a prototype of HABSSED (Harmful Algal Bloom Surveillance by Sequencing of Environmental DNA), an environmental DNA (eDNA) sequencing and bioinformatic analysis pipeline for detection of toxic *Microcystis* blooms using third generation sequencing technology, specifically Oxford Nanopore's MinION device. MinION sequencing technology has numerous advantages over more traditional next generation sequencing platforms (e.g. Illumina HiSeq and/or MiSeq) that make it an ideal choice for HAB monitoring. The MinION sequencer is handheld and robust to vibration and other movement, can be powered by the same laptop used for data analysis, and has a relatively low initial capital investment for the instrument, allowing it to be used in field environments and putting the technology within reach of even small labs. In addition, the MinION is capable of truly de novo sequencing, as no sequence-specific primers are needed; any DNA within the sample can be bound to sequencing adapters and processed. As single molecules of DNA are sequenced, this data is accessible to the user in nearly real-time; analysis can begin as soon as five minutes after sequencing starts, rather than the 24+ hours of a conventional NGS sequencing run. For detection purposes, samples can be effectively sequentially multiplexed onto a single flow cell, decreasing operational costs.
The ultimate goal of the HABSSED is to include all steps necessary for research and monitoring personnel to rapidly obtain an estimate of aquatic microbial species abundance for HAB-related organisms from a raw water sample (Figure \ref{fig:pipeline}). HABSSED is being designed to be field-portable, incorporating genomics tools and workflows that can be run on a well-powered laptop or using an Amazon Web Services (AWS) or other similar cloud instance. Ultimately, field researchers or monitoring professionals will be able to readily detect the presence and abundance of toxic *Microcystis* with minimal specialized training, even aboard ships or in other remote field locations.
For developing and evaluating the HABSSED capability we used MinION to sequence ten samples from several NOAA Great Lakes Environmental Research Laboratories (GLERLs). Here we describe the methods used for sample collection, sequencing, and data deposited to the Sequence Read Archive (SRA) under accession number PRJNA812770.
```{r pipeline, fig.cap="The HABSSED pipeline. eDNA is extracted from water samples and sequenced using the MinION. Bases are called and reads are demultiplexed and filtered for quality. Taxonomy of reads are classified, and abundances of each taxon is calculated based on read counts. ", fig.align='center', out.width="2.8in"}
knitr::include_graphics(here::here("figures/pipeline.jpg"))
```
\clearpage
# Methods
## Sample collection
We collected water samples from NOAA GLERL stations WE02 and WE13 during a 2019 *Mycrocystis* bloom in Lake Erie. Collection sites are show in the map in Figure \ref{fig:map} (adapted from www.glerl.noaa.gov). Sampling excursions from these NOAA GLERL stations in the 2019 bloom season took place during pre-, peak, and post-bloom conditions. Samples were drawn using standard collection protocols [@steffen2017] at a water depth of 0.5m-1.5m. Metadata on the physical and chemical properties of the water at the time of sampling were also collected [@boedecker2020].
```{r map, fig.cap="GLERL sites WE02 and WE13 from which water samples sequenced here were collected (adapted from https://www.glerl.noaa.gov/).", fig.align='center', out.width="4in"}
knitr::include_graphics(here::here("figures/map2.jpg"))
```
## Sequencing and data analysis
Biomass was collected onto 0.2 um Sterivex filters and kept on ice until they were returned to the lab where they were stored at -80C until extraction. Extraction protocol was adapted according to [@cruaud2017]. We then prepared the libraries for MinION sequencing using Oxford Nanopore Technologies (ONT) sequencing kit (initially the Rapid Sequencing Kit (SQK-RAD004) but transitioning to the Rapid Barcoding Kit 96 (SQK-RBK110.96) for later runs). We sequenced the eDNA from those samples using the MinION Mk1C device, which performed base-calling and quality filtering using ONT's embedded MinKNOW software using the default settings. We then performed taxonomic classification using the ONT What's In My Pot (WIMP) pipeline and counted the abundance of *Microcystis* reads, as well as species and strains within the *Microcystis* genus where possible. We computed the relative abundance by dividing these counts by the total number of reads passing the QC filters for the same samples.
\clearpage
# Data Records
All sequencing data is available in the Sequence Read Archive (SRA) at accession PRJNA812770. Table 1 provides a subset of the metadata associated with these samples. The full metadata is available on the SRA bioproject page.
\small
```{r sampletable}
read_csv(here::here("data/manifest.csv"), col_types="cDccccciciiic") %>%
select(sample:samples_on_cell) %>%
rename(`# Multiplexed`=samples_on_cell) %>%
arrange(run_date) %>%
select(-protocol) %>%
mutate(run_type=stringr::str_to_sentence(run_type)) %>%
mutate(condition=stringr::str_to_sentence(condition)) %>%
rlang::set_names(stringr::str_replace_all, "_", " ") %>%
rlang::set_names(stringr::str_to_sentence) %>%
rename(`Input (ng/uL)`=`Qbit ng ul`) %>%
rename(GLERL=Glerl) %>%
kable(caption="Subset of the available metadata about samples sequenced in this study, available at SRA accession number PRJNA812770.")
```
\normalsize
# Technical Validation
We found that the relative abundance values are comparable between samples and sampling stations (GLERLs). Though the abundance does appear to be higher at GLERL WE02, this difference is not statistically significant due to the wide range of variation within the samples from each GLERL (t-test, $p=0.407$). This resulted a broad baseline range of pre-bloom *Microcystis* abundance, ranging over almost a full order of magnitude from 0.016% to 0.15% of reads in a given sample.
```{r}
read_csv(here::here("data/relabundance.csv"), col_types="cccc") %>%
knitr::kable(caption="Average abundance of \\emph{Microcystis} between stations and across samples.")
```
Additionally, there was notable human and *E. coli* contamination present in several samples that was not revealed until taxonomic classification. While we took steps to track down and eliminate the contamination for later samples, we were unable to conclusively discover the source. Based on the quantities observed in the negative controls and reagent blanks, these included at minimum, 1,400 to several thousand reads classifying as *E. coli*, and from 50 to several hundred reads classifying as *H. sapiens*, and hundreds more classifying to other taxa, including *Shigella*, *Acinetobacter*, and *Microcystis* (though never more than 10 reads in any of the blanks or negative control). In some of the bloom-drawn samples (those from GLERL WE13) the abundance of *E. coli* was greater than that of *Microcystis* despite the fact that these were *Microcystis* blooms. These contaminants could therefore have had a skewing effect on the abundance results.
Detecting the occurrence blooms is important for mitigation, but the ability to distinguish toxic from non-toxic blooms is also critically important. The *mcy* genes code for the microcystin toxin that makes *Microcystis* blooms so harmful. We estimated the abundance of reads aligning to the *Microcystis* *mcy*-family genes as part of this objective. We used the reads from our initial set of pre-bloom samples for this analysis. We aligned all the reads classified to the *Microcystis* genus against a reference consisting of all annotated *Microcystis* genomes using the aligner minimap2 [@li2018], and then filtering the alignments to those intersecting the genomic intervals annotated as containing *mcy*-family genes using bedtools [@quinlan2010]. We saw four hits in the full run pre-bloom sample from GLERL WE13, two to *mcyD*, and one each to *mcyC*, and *mcyE*. The WE02 sample produced no hits at all.
We also subsampled the pre-bloom sequencing runs using the timestamps added by the MinION software for all reads as they are sequenced. We generated a table of read IDs and their time stamps, which were then joined back to the WIMP classification results. We binned the reads into cumulative one-hour and four-hour buckets based on the time of sequencing.
Based on the accumulation data, we can conclude that for this protocol, the majority of the reads for all samples were sequenced within 24 hours of beginning the run. Having subsampled the reads, we then evaluated the length of run time necessary to accurately approximate the range of relative abundances of the genus *Microcystis*, and the species and strains of that that we observed in our samples. By 12 hours, all but the rarest taxa are detectable, and the overall relative abundance of taxa appears consistent from that point forward. By 24 hours, the rarer taxa are detected and the pattern of relative abundances changes very little. Similarly, the relative abundance of both the genus *Microcystis* and the species *Microcystis aeruginosa* converge very close to their final values by 16 hours into the run, with almost no deviation after 24 hours. Our preliminary conclusion is that a 16 to 24 hour run (equivalent to multiplexing two to three samples on a flow cell) is sufficient to capture *Microcystis* abundance reliably for water samples of this type. However, there are significant caveats to this conclusion based on two factors. First, the contaminants that we discovered may have affected our abundance estimates, especially relative abundance. Second, the low yields of DNA extracted from our samples mean that our overall read counts were lower than expected for this MinION protocol.
We next sequenced samples drawn from bloom conditions from the same GLERL stations during the same year as our pre-bloom samples. Downstream processing and taxonomic classification were performed using the same MinKNOW and WIMP pipelines as used for the initial pre-bloom samples. We multiplexed the bloom and pre-bloom samples on our two remaining flow-cells using the ONT Rapid Barcoding Kit 96 (SQK-RBK110.96). Two bloom samples and two pre-bloom samples (one of each from each GLERL) were sequenced on each flow cell to avoid the possibility of batch effects. Using the per-taxon read counts generated for each sample with the WIMP pipeline, we then performed a differential abundance analysis using DESeq2 [@love2014]. This analysis passed an essential quality assurance check, in that all taxa of *Microcystis* were significantly more abundant in the bloom condition than in the pre-bloom condition. However, the range of different taxa that are differentially abundant suggests the possibility of a pre-bloom taxon profile that might be used as features in a predictive model for forecasting blooms. Taxa such as *Methylobacterium* and *Stenotrophomonas* that are significantly more abundant pre-bloom compared to during the bloom might be of particular interest as indicator taxa, for which a shift in abundance might presage a HAB. As with the pre-bloom samples, we also investigated the abundance of the *mcy*-family genes for the bloom samples generated as part of this objective. Reads classified by WIMP as belonging *Microcystis aeruginosa* were aligned against a reference set of all annotated *Mycrocystis* genome sequences, and then the alignments were then intersected with the annotated *mcy* gene positions. While pre-bloom samples in few hits (four reads mapping to *mcyC*, *mcyD*, and *mcyE* in one full-run sample) we did observed more hits in the bloom-drawn samples. The bloom samples in GLERL WE02 contained hits to all 10 *mcy* genes. By contrast, in station WE13, which suffered a much less significant bloom in 2019 (the year of sampling) had reads mapping only to some members of the gene family.
# Acknowledgements, Author Contributions & Competing Interests
The authors would like to express sincere gratitude and acknowledge Jessica Bouchet, David Russell, and Hunter Baylous for assisting with laboratory work on this project, and Dr. Stephanie Guertin for consultation on design and execution.
AK performed sequencing and conducted all data analysis. All authors contributed to study design, results interpretation, and drafting and reviewing the manuscript.
This material is based upon work supported by the SBIR Program within the NOAA Technology Partnerships Office under Grant No. NA21OAR0210481.
AK and WG are employees of Elder Research, recipient of the NOAA SBIR grant that funded the collection of this data. ST is an employee of Signature Science, LLC.
# References