Skip to content

Commit 6934a60

Browse files
committed
added phyloseq to 16s
1 parent 87d8932 commit 6934a60

File tree

1 file changed

+284
-1
lines changed

1 file changed

+284
-1
lines changed

16s.md

Lines changed: 284 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -324,6 +324,18 @@ If you have time, copy all the commands from this tutorial in a file, a try to m
324324

325325
### PhyloSeq Analysis
326326

327+
First, install and load the phyloseq package:
328+
329+
```R
330+
source('http://bioconductor.org/biocLite.R')
331+
biocLite('phyloseq')
332+
333+
library("phyloseq")
334+
library("ggplot2")
335+
library("plyr")
336+
theme_set(theme_bw()) # set the ggplot theme
337+
```
338+
327339
The PhyloSeq package has an `import_mothur` function that you can use to import the files you generated with mothur. As an example, import the example mothur data provided by phyloseq as an example:
328340

329341
```R
@@ -348,5 +360,276 @@ For the rest of this tutorial, we will work with an example dataset provided by
348360

349361
```R
350362
data(enterotype)
351-
enterotype
363+
data("GlobalPatterns")
364+
```
365+
366+
#### Ordination and distance-based analysis
367+
368+
Let's do some preliminary filtering. Remove the OTUs that included all unassigned sequences ("-1")
369+
370+
```R
371+
enterotype <- subset_species(enterotype, Genus != "-1")
372+
```
373+
374+
The available distance methods coded in the phyloseq package:
375+
376+
```R
377+
dist_methods <- unlist(distanceMethodList)
378+
print(dist_methods)
379+
380+
## UniFrac1 UniFrac2 DPCoA JSD vegdist1
381+
## "unifrac" "wunifrac" "dpcoa" "jsd" "manhattan"
382+
## vegdist2 vegdist3 vegdist4 vegdist5 vegdist6
383+
## "euclidean" "canberra" "bray" "kulczynski" "jaccard"
384+
## vegdist7 vegdist8 vegdist9 vegdist10 vegdist11
385+
## "gower" "altGower" "morisita" "horn" "mountford"
386+
## vegdist12 vegdist13 vegdist14 vegdist15 betadiver1
387+
## "raup" "binomial" "chao" "cao" "w"
388+
## betadiver2 betadiver3 betadiver4 betadiver5 betadiver6
389+
## "-1" "c" "wb" "r" "I"
390+
## betadiver7 betadiver8 betadiver9 betadiver10 betadiver11
391+
## "e" "t" "me" "j" "sor"
392+
## betadiver12 betadiver13 betadiver14 betadiver15 betadiver16
393+
## "m" "-2" "co" "cc" "g"
394+
## betadiver17 betadiver18 betadiver19 betadiver20 betadiver21
395+
## "-3" "l" "19" "hk" "rlb"
396+
## betadiver22 betadiver23 betadiver24 dist1 dist2
397+
## "sim" "gl" "z" "maximum" "binary"
398+
## dist3 designdist
399+
## "minkowski" "ANY"
400+
```
401+
402+
Remove the two distance-methods that require a tree, and the generic custom method that requires user-defined distance arguments.
403+
404+
```R
405+
# These require tree
406+
dist_methods[(1:3)]
407+
408+
# Remove them from the vector
409+
dist_methods <- dist_methods[-(1:3)]
410+
# This is the user-defined method:
411+
dist_methods["designdist"]
412+
413+
# Remove the user-defined distance
414+
dist_methods = dist_methods[-which(dist_methods=="ANY")]
415+
```
416+
417+
Loop through each distance method, save each plot to a list, called plist.
418+
419+
420+
```R
421+
plist <- vector("list", length(dist_methods))
422+
names(plist) = dist_methods
423+
for( i in dist_methods ){
424+
# Calculate distance matrix
425+
iDist <- distance(enterotype, method=i)
426+
# Calculate ordination
427+
iMDS <- ordinate(enterotype, "MDS", distance=iDist)
428+
## Make plot
429+
# Don't carry over previous plot (if error, p will be blank)
430+
p <- NULL
431+
# Create plot, store as temp variable, p
432+
p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
433+
# Add title to each plot
434+
p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
435+
# Save the graphic to file.
436+
plist[[i]] = p
437+
}
438+
```
439+
440+
Combine results and shade according to Sequencing technology:
441+
442+
```R
443+
df = ldply(plist, function(x) x$data)
444+
names(df)[1] <- "distance"
445+
p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
446+
p = p + geom_point(size=3, alpha=0.5)
447+
p = p + facet_wrap(~distance, scales="free")
448+
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
449+
p
450+
```
451+
452+
Print individual plots:
453+
454+
```R
455+
print(plist[["jsd"]])
456+
print(plist[["jaccard"]])
457+
print(plist[["bray"]])
458+
print(plist[["euclidean"]])
459+
```
460+
461+
#### Alpha diversity graphics
462+
463+
Here is the default graphic produced by the plot_richness function on the GP example dataset:
464+
465+
```R
466+
GP <- prune_species(speciesSums(GlobalPatterns) > 0, GlobalPatterns)
467+
plot_richness(GP)
468+
```
469+
470+
Note that in this case, the Fisher calculation results in a warning (but still plots). We can avoid this by specifying a measures argument to plot_richness, which will include just the alpha-diversity measures that we want.
471+
472+
```R
473+
plot_richness(GP, measures=c("Chao1", "Shannon"))
474+
```
475+
476+
We can specify a sample variable on which to group/organize samples along the horizontal (x) axis. An experimentally meaningful categorical variable is usually a good choice – in this case, the "SampleType" variable works much better than attempting to interpret the sample names directly (as in the previous plot):
477+
478+
```R
479+
plot_richness(GP, x="SampleType", measures=c("Chao1", "Shannon"))
480+
```
481+
482+
Now suppose we wanted to use an external variable in the plot that isn’t in the GP dataset already – for example, a logical that indicated whether or not the samples are human-associated. First, define this new variable, human, as a factor (other vectors could also work; or other data you might have describing the samples).
483+
484+
```R
485+
sampleData(GP)$human <- getVariable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
486+
```
487+
488+
Now tell plot_richness to map the new human variable on the horizontal axis, and shade the points in different color groups, according to which "SampleType" they belong.
489+
490+
```R
491+
plot_richness(GP, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
492+
```
493+
494+
We can merge samples that are from the environment (SampleType), and make the points bigger with a ggplot2 layer. First, merge the samples.
495+
496+
```R
497+
GPst = merge_samples(GP, "SampleType")
498+
# repair variables that were damaged during merge (coerced to numeric)
499+
sample_data(GPst)$SampleType <- factor(sample_names(GPst))
500+
sample_data(GPst)$human <- as.logical(sample_data(GPst)$human)
501+
502+
p = plot_richness(GPst, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
503+
p + geom_point(size=5, alpha=0.7)
504+
```
505+
506+
#### Trees
507+
508+
```R
509+
head(phy_tree(GlobalPatterns)$node.label, 10)
510+
```
511+
512+
The node data from the `GlobalPatterns` dataset are strange. They look like they might be bootstrap values, but they sometimes have two decimals.
513+
514+
```R
515+
phy_tree(GlobalPatterns)$node.label = substr(phy_tree(GlobalPatterns)$node.label, 1, 4)
516+
```
517+
518+
Additionally, the dataset has many OTUs, too many to fit them all on a tree. Let's take the 50 more abundant and plot a basic tree:
519+
520+
```R
521+
physeq = prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)
522+
plot_tree(physeq)
523+
```
524+
525+
dots are annotated next to tips (OTUs) in the tree, one for each sample in which that OTU was observed. Let's color the dots by taxonomic ranks, and sample covariates:
526+
527+
```R
528+
plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="SampleType")
529+
```
530+
531+
by taxonomic class:
532+
533+
```R
534+
plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="Class")
535+
```
536+
537+
It can be useful to label the tips:
538+
539+
```
540+
plot_tree(physeq, color="SampleType", label.tips="Genus")
541+
```
542+
543+
Making a radial tree is easy with ggplot2, simply recognizing that our vertically-oriented tree is a cartesian mapping of the data to a graphic – and that a radial tree is the same mapping, but with polar coordinates instead.
544+
545+
```R
546+
plot_tree(physeq, nodelabf=nodeplotboot(60,60,3), color="SampleType", shape="Class", ladderize="left") + coord_polar(theta="y")
547+
```
548+
549+
#### Bar plots
550+
551+
Bar plots are one of the easiest way to vizualize your data. But be careful, they can be misleading if grouping sample!
552+
553+
Let's take a subset of the GlobalPatterns dataset, and produce a basic bar plot:
554+
555+
```R
556+
gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
557+
plot_bar(gp.ch)
558+
```
559+
560+
The dataset is plotted with every sample mapped individually to the horizontal (x) axis, and abundance values mapped to the veritcal (y) axis. At each sample’s horizontal position, the abundance values for each OTU are stacked in order from greatest to least, separate by a thin horizontal line. As long as the parameters you choose to separate the data result in more than one OTU abundance value at the respective position in the plot, the values will be stacked in order as a means of displaying both the sum total value while still representing the individual OTU abundances.
561+
562+
The bar plot will be clearer with color to represent the Genus to which each OTU belongs.
563+
564+
```R
565+
plot_bar(gp.ch, fill="Genus")
566+
```
567+
568+
Now keep the same fill color, and group the samples together by the SampleType variable; essentially, the environment from which the sample was taken and sequenced.
569+
570+
```R
571+
plot_bar(gp.ch, x="SampleType", fill="Genus")
572+
```
573+
574+
A more complex example using facets:
575+
576+
```R
577+
plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)
578+
```
579+
580+
#### Heatmaps
581+
582+
The following two lines subset the dataset to just the top 300 most abundant Bacteria taxa across all samples (in this case, with no prior preprocessing. Not recommended, but quick).
583+
584+
```R
585+
data("GlobalPatterns")
586+
gpt <- subset_taxa(GlobalPatterns, Kingdom=="Bacteria")
587+
gpt <- prune_taxa(names(sort(taxa_sums(gpt),TRUE)[1:300]), gpt)
588+
plot_heatmap(gpt, sample.label="SampleType")
589+
```
590+
591+
subset a smaller dataset based on an Archaeal phylum
592+
593+
```R
594+
gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota")
595+
plot_heatmap(gpac)
596+
```
597+
598+
#### Plot microbiome network
599+
600+
There is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:
601+
602+
`set.seed(711L)`
603+
604+
Because we want to use the enterotype designations as a plot feature in these plots, we need to remove the 9 samples for which no enterotype designation was assigned (this will save us the hassle of some pesky warning messages, but everything still works; the offending samples are anyway omitted).
605+
606+
```R
607+
enterotype = subset_samples(enterotype, !is.na(Enterotype))
608+
```
609+
610+
Create an igraph-based network based on the default distance method, “Jaccard”, and a maximum distance between connected nodes of 0.3.
611+
612+
```R
613+
ig <- make_network(enterotype, max.dist=0.3)
614+
plot_network(ig, enterotype)
615+
```
616+
617+
The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation in the sample naming scheme and position within the network. Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:
618+
619+
```R
620+
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
621+
```
622+
623+
In the previous examples, the choice of maximum-distance and distance method were informed, but arbitrary. Let’s see what happens when the maximum distance is lowered, decreasing the number of edges in the network
624+
625+
```R
626+
ig <- make_network(enterotype, max.dist=0.2)
627+
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
628+
```
629+
630+
Let’s repeat the previous exercise, but replace the Jaccard (default) distance method with Bray-Curtis
631+
632+
```R
633+
ig <- make_network(enterotype, dist.fun="bray", max.dist=0.3)
634+
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
352635
```

0 commit comments

Comments
 (0)