added phyloseq to 16s

HadrienG · HadrienG · commit 6934a6057f3c · 2017-02-21T09:10:23.000+01:00
diff --git a/16s.md b/16s.md
@@ -324,6 +324,18 @@ If you have time, copy all the commands from this tutorial in a file, a try to m
 
 ### PhyloSeq Analysis
 
+First, install and load the phyloseq package:
+
+```R
+source('http://bioconductor.org/biocLite.R')
+biocLite('phyloseq')
+
+library("phyloseq")
+library("ggplot2")
+library("plyr")
+theme_set(theme_bw())  # set the ggplot theme
+```
+
 The PhyloSeq package has an `import_mothur` function that you can use to import the files you generated with mothur. As an example, import the example mothur data provided by phyloseq as an example:
 
 ```R
@@ -348,5 +360,276 @@ For the rest of this tutorial, we will work with an example dataset provided by
 
 ```R
 data(enterotype)
-enterotype
+data("GlobalPatterns")
+```
+
+#### Ordination and distance-based analysis
+
+Let's start with some preliminary filtering. Remove the OTUs that group all unassigned sequences (genus "-1"):
+
+```R
+enterotype <- subset_taxa(enterotype, Genus != "-1")
+```
+
+List the distance methods available in the phyloseq package:
+
+```R
+dist_methods <- unlist(distanceMethodList)
+print(dist_methods)
+
+##     UniFrac1     UniFrac2        DPCoA          JSD     vegdist1
+##    "unifrac"   "wunifrac"      "dpcoa"        "jsd"  "manhattan"
+##     vegdist2     vegdist3     vegdist4     vegdist5     vegdist6
+##  "euclidean"   "canberra"       "bray" "kulczynski"    "jaccard"
+##     vegdist7     vegdist8     vegdist9    vegdist10    vegdist11
+##      "gower"   "altGower"   "morisita"       "horn"  "mountford"
+##    vegdist12    vegdist13    vegdist14    vegdist15   betadiver1
+##       "raup"   "binomial"       "chao"        "cao"          "w"
+##   betadiver2   betadiver3   betadiver4   betadiver5   betadiver6
+##         "-1"          "c"         "wb"          "r"          "I"
+##   betadiver7   betadiver8   betadiver9  betadiver10  betadiver11
+##          "e"          "t"         "me"          "j"        "sor"
+##  betadiver12  betadiver13  betadiver14  betadiver15  betadiver16
+##          "m"         "-2"         "co"         "cc"          "g"
+##  betadiver17  betadiver18  betadiver19  betadiver20  betadiver21
+##         "-3"          "l"         "19"         "hk"        "rlb"
+##  betadiver22  betadiver23  betadiver24        dist1        dist2
+##        "sim"         "gl"          "z"    "maximum"     "binary"
+##        dist3   designdist
+##  "minkowski"        "ANY"
+```
+
+Remove the three distance methods that require a tree, and the generic custom method that requires user-defined distance arguments.
+
+```R
+# These three methods require a tree
+dist_methods[(1:3)]
+
+# Remove them from the vector
+dist_methods <- dist_methods[-(1:3)]
+# This is the user-defined method:
+dist_methods["designdist"]
+
+# Remove the user-defined distance
+dist_methods <- dist_methods[dist_methods != "ANY"]
+```
+
+Loop through each distance method and save each plot to a list called `plist`:
+
+
+```R
+plist <- vector("list", length(dist_methods))
+names(plist) = dist_methods
+for( i in dist_methods ){
+    # Calculate distance matrix
+    iDist <- distance(enterotype, method=i)
+    # Calculate ordination
+    iMDS  <- ordinate(enterotype, "MDS", distance=iDist)
+    ## Make plot
+    # Don't carry over previous plot (if error, p will be blank)
+    p <- NULL
+    # Create plot, store as temp variable, p
+    p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
+    # Add title to each plot
+    p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
+    # Store the plot in the list, keyed by distance method
+    plist[[i]] = p
+}
+```
+
+Combine the results and color the points by sequencing technology:
+
+```R
+df = ldply(plist, function(x) x$data)
+names(df)[1] <- "distance"
+p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
+p = p + geom_point(size=3, alpha=0.5)
+p = p + facet_wrap(~distance, scales="free")
+p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
+p
+```
+
+Print individual plots:
+
+```R
+print(plist[["jsd"]])
+print(plist[["jaccard"]])
+print(plist[["bray"]])
+print(plist[["euclidean"]])
+```
+
+#### Alpha diversity graphics
+
+Here is the default graphic produced by the `plot_richness` function on the `GlobalPatterns` dataset, after removing the OTUs that are absent from every sample (stored as `GP`):
+
+```R
+GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
+plot_richness(GP)
+```
+
+Note that in this case, the Fisher calculation results in a warning (but still plots). We can avoid this by specifying a measures argument to plot_richness, which will include just the alpha-diversity measures that we want.
+
+```R
+plot_richness(GP, measures=c("Chao1", "Shannon"))
+```
+
+We can specify a sample variable on which to group/organize samples along the horizontal (x) axis. An experimentally meaningful categorical variable is usually a good choice – in this case, the "SampleType" variable works much better than attempting to interpret the sample names directly (as in the previous plot):
+
+```R
+plot_richness(GP, x="SampleType", measures=c("Chao1", "Shannon"))
+```
+
+Now suppose we want to use an external variable in the plot that isn’t in the GP dataset already – for example, a logical that indicates whether or not the samples are human-associated. First, define this new variable, `human`, as a logical vector (other vectors could also work, as could any other data you have describing the samples).
+
+```R
+sample_data(GP)$human <- get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
+```
+
+Now tell `plot_richness` to map the new `human` variable on the horizontal axis, and color the points according to the "SampleType" they belong to.
+
+```R
+plot_richness(GP, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
+```
+
+We can merge the samples that come from the same environment (SampleType), and make the points bigger with a ggplot2 layer. First, merge the samples.
+
+```R
+GPst = merge_samples(GP, "SampleType")
+# repair variables that were damaged during merge (coerced to numeric)
+sample_data(GPst)$SampleType <- factor(sample_names(GPst))
+sample_data(GPst)$human <- as.logical(sample_data(GPst)$human)
+
+p = plot_richness(GPst, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
+p + geom_point(size=5, alpha=0.7)
+```
+
+#### Trees
+
+```R
+head(phy_tree(GlobalPatterns)$node.label, 10)
+```
+
+The node labels of the `GlobalPatterns` tree are strange. They look like they might be bootstrap values, but some of them contain two decimal points. Keep only the first four characters of each label:
+
+```R
+phy_tree(GlobalPatterns)$node.label = substr(phy_tree(GlobalPatterns)$node.label, 1, 4)
+```
+
+Additionally, the dataset has too many OTUs to fit them all on a tree. Let's take the first 50 OTUs and plot a basic tree:
+
+```R
+physeq = prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)
+plot_tree(physeq)
+```
+
+Dots are drawn next to the tips (OTUs) of the tree, one for each sample in which that OTU was observed. Let's color the dots, first by a sample covariate (SampleType):
+
+```R
+plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="SampleType")
+```
+
+Then by taxonomic class:
+
+```R
+plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="Class")
+```
+
+It can be useful to label the tips:
+
+```R
+plot_tree(physeq, color="SampleType", label.tips="Genus")
+```
+
+Making a radial tree is easy with ggplot2, simply recognizing that our vertically-oriented tree is a cartesian mapping of the data to a graphic – and that a radial tree is the same mapping, but with polar coordinates instead.
+
+```R
+plot_tree(physeq, nodelabf=nodeplotboot(60,60,3), color="SampleType", shape="Class", ladderize="left") + coord_polar(theta="y")
+```
+
+#### Bar plots
+
+Bar plots are one of the easiest ways to visualize your data. But be careful, they can be misleading when samples are grouped!
+
+Let's take a subset of the GlobalPatterns dataset, and produce a basic bar plot:
+
+```R
+gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
+plot_bar(gp.ch)
+```
+
+The dataset is plotted with every sample mapped individually to the horizontal (x) axis, and abundance values mapped to the vertical (y) axis. At each sample’s horizontal position, the abundance values for each OTU are stacked in order from greatest to least, separated by a thin horizontal line. Whenever the parameters you choose to separate the data leave more than one OTU abundance value at a given position in the plot, the values are stacked in order, which displays the total while still showing the individual OTU abundances.
+
+The bar plot will be clearer with color to represent the Genus to which each OTU belongs.
+
+```R
+plot_bar(gp.ch, fill="Genus")
+```
+
+Now keep the same fill color, and group the samples together by the SampleType variable; essentially, the environment from which the sample was taken and sequenced.
+
+```R
+plot_bar(gp.ch, x="SampleType", fill="Genus")
+```
+
+A more complex example using facets:
+
+```R
+plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)
+```
+
+#### Heatmaps
+
+The first two lines after loading the data subset the dataset to the 300 most abundant Bacteria taxa across all samples. In this case there is no prior preprocessing, which is not recommended, but quick.
+
+```R
+data("GlobalPatterns")
+gpt <- subset_taxa(GlobalPatterns, Kingdom=="Bacteria")
+gpt <- prune_taxa(names(sort(taxa_sums(gpt), decreasing=TRUE)[1:300]), gpt)
+plot_heatmap(gpt, sample.label="SampleType")
+```
+
+Subset a smaller dataset based on an archaeal phylum:
+
+```R
+gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota")
+plot_heatmap(gpac)
+```
+
+#### Plot microbiome network
+
+There is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:
+
+`set.seed(711L)`
+
+Because we want to use the enterotype designations as a plot feature, we need to remove the 9 samples for which no enterotype designation was assigned. Everything would still work without this step, since the offending samples are omitted anyway, but it saves us the hassle of some pesky warning messages.
+
+```R
+enterotype = subset_samples(enterotype, !is.na(Enterotype))
+```
+
+Create an igraph-based network based on the default distance method, “Jaccard”, and a maximum distance between connected nodes of 0.3.
+
+```R
+ig <- make_network(enterotype, max.dist=0.3)
+plot_network(ig, enterotype)
+```
+
+The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation between the sample naming scheme and the position within the network. Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:
+
+```R
+plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
+```
+
+In the previous examples, the choices of maximum distance and distance method were informed, but arbitrary. Let’s see what happens when the maximum distance is lowered, decreasing the number of edges in the network:
+
+```R
+ig <- make_network(enterotype, max.dist=0.2)
+plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
+```
+
+Let’s repeat the previous exercise, but replace the Jaccard (default) distance method with Bray-Curtis:
+
+```R
+ig <- make_network(enterotype, distance="bray", max.dist=0.3)
+plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
 ```