added tutorial for R-based beta diversity analysis

KunDHuang · KunDHuang · commit 5a775eb3f526 · 2023-10-31T10:06:02.000+08:00
diff --git a/README.md b/README.md
@@ -7,3 +7,4 @@ Computational workflows for reproducing analysis in the study of Huang et al., 2
 3. Detailed tutorials
     * [Reads quality inspection with cumulative distribution function](./docs/cumulative_distribution_function.md)
     * [Alpha diversity analysis](./docs/alpha_diversity_analysis.md)
+    * [Beta diversity analysis](./docs/beta_diversity_analysis.md)
diff --git a/docs/alpha_diversity_analysis.md b/docs/alpha_diversity_analysis.md
@@ -19,7 +19,7 @@ Open a new working R script, and load our funtion-packed R script from which you
 Load a merged metaphlan profile which contains metadata and taxonomic abundances. Here, we are going to use an example file for demostration.
 
 ```{r}
->mpa_df <- data.frame(read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/>merged_abundance_table_species_sgb_md.tsv",
+>mpa_df <- data.frame(read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/merged_abundance_table_species_sgb_md.tsv",
                       header = TRUE,
                       sep = "\t"))
 ```
diff --git a/docs/beta_diversity_analysis.md b/docs/beta_diversity_analysis.md
@@ -0,0 +1,131 @@
+# Beta Diversity Analysis
+This tutorial shows how to estimate the beta diversity of microbiomes from MetaPhlAn profiles using R-based functions as well as Python scripts.
+
+## R-based method
+
+#### R packages required
+
+* [vegan](https://cran.r-project.org/web/packages/vegan/index.html)
+* [ggplot2](https://ggplot2.tidyverse.org/)
+* [ape](https://cran.r-project.org/web/packages/ape/index.html)
+* [tidyverse](https://www.tidyverse.org/packages/)
+* [ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html): provides `ggarrange`, which is used below to combine the plots
+
+#### Beta diversity analysis, visualization and significance assessment
+
+Open a new working R script and load our function-packed R script, which provides the relevant functions.
+
+```{r}
+>source(file = "path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/scripts/functions/beta_diversity_funcs.R")
+```
+
+Load a [matrix table](../example_data/matrix_species_relab.tsv) of species relative abundances quantified by MetaPhlAn and a [metadata table](../example_data/metadata_of_matrix_species_relab.tsv) that matches the matrix table row by row; that is, the same row in both tables refers to the same sample.
+
+```{r}
+>matrix <- read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/matrix_species_relab.tsv",
+                    header = TRUE,
+                    sep = "\t")
+>metadata <- read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/metadata_of_matrix_species_relab.tsv",
+                    header = TRUE,
+                    sep = "\t")
+```
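+
+Because the functions below rely on the two tables matching row by row, it can be worth checking this before going on. The following is a minimal sketch on toy data; the objects `toy_matrix` and `toy_metadata` are made up for illustration, and it assumes (hypothetically) that sample identifiers are available as row names of the matrix and as a `sample` column of the metadata.
+
+```{r}
+# Toy tables, invented for illustration only (not part of the package)
+toy_matrix <- data.frame(species_a = c(0.6, 0.1, 0.3),
+                         species_b = c(0.4, 0.9, 0.7),
+                         row.names = c("S1", "S2", "S3"))
+toy_metadata <- data.frame(sample = c("S1", "S2", "S3"),
+                           condom_use = c("always", "no", "sometimes"))
+
+# Both tables must have the same number of rows ...
+stopifnot(nrow(toy_matrix) == nrow(toy_metadata))
+# ... and the same sample must sit at the same position in both
+rows_match <- identical(rownames(toy_matrix), toy_metadata$sample)
+rows_match
+```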
+
+Next, test whether samples segregate significantly by the variable of interest while adjusting for covariables such as BMI and disease status. Here, we use the function `est_permanova`, which implements [PERMANOVA](https://rdrr.io/rforge/vegan/man/adonis.html) analysis and takes the following arguments:
+  * `mat`: the loaded matrix from metaphlan-style table, [dataframe].
+  * `md`: the metadata table pairing with the matrix, [dataframe].
+  * `variable`: specify the variable for testing, [string].
+  * `covariables`: give a vector of covariables for adjustment, [vector].
+  * `nper`: the number of permutations, [int], default: [999].
+  * `to_rm`: a vector of values in "variable" column where the corresponding rows will be removed first.
+  * `by_method`: "terms" will assess significance for each term, sequentially; "margin" will assess the marginal effects of the terms.
+
+Here, we show an example that tests the variable *condom use* while adjusting for covariables including *antibiotics use*, *HIV status*, *BMI*, *diet* and *inflammatory bowel disease*, which might help explain the inter-individual variation in gut microbiome composition.
+
+```{r}
+>est_permanova(mat = matrix, 
+              md = metadata, 
+              variable = "condom_use", 
+              covariables = c("Antibiotics_6mo", "HIV_status", "inflammatory_bowel_disease", "BMI_kg_m2_WHO", "diet"),
+              nper = 999, 
+              to_rm = c("no_receptive_anal_intercourse"),
+              by_method = "margin")
+
+                           Df SumOfSqs      R2      F Pr(>F)   
+condom_use                  4   1.2161 0.08194 1.5789  0.008 **
+Antibiotics_6mo             2   0.4869 0.03281 1.2643  0.160   
+HIV_status                  1   0.3686 0.02484 1.9146  0.030 * 
+inflammatory_bowel_disease  1   0.2990 0.02015 1.5529  0.066 . 
+BMI_kg_m2_WHO               5   1.8376 0.12382 1.9087  0.002 **
+diet                        3   0.8579 0.05781 1.4853  0.036 * 
+Residual                   49   9.4347 0.63571                 
+Total                      65  14.8412 1.00000                 
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+```
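+
+For readers who want to see what such a test looks like without the wrapper, below is a minimal sketch that calls `vegan::adonis2` directly on simulated data. The toy objects, the seed, and the column names `group` and `covar` are assumptions made for illustration; the sketch is not necessarily identical to what `est_permanova` does internally.
+
+```{r}
+library(vegan)
+
+set.seed(1)
+# 20 simulated samples x 5 species, converted to relative abundances
+counts <- matrix(rpois(100, lambda = 10) + 1, nrow = 20)
+toy_mat <- as.data.frame(counts / rowSums(counts))
+# Hypothetical metadata: one variable of interest and one covariable
+toy_md <- data.frame(group = rep(c("a", "b"), each = 10),
+                     covar = rep(c("x", "y"), times = 10))
+
+# Marginal test of each term on Bray-Curtis dissimilarities
+res <- adonis2(toy_mat ~ group + covar, data = toy_md,
+               permutations = 199, method = "bray", by = "margin")
+res
+```
+
+With `by = "margin"`, each term is tested after accounting for all other terms in the model, so the order of the terms in the formula does not affect the result; with `by = "terms"` the terms are tested sequentially and the order matters.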
+
+Next, to visualize how samples segregate based on microbiome beta diversity, we can use the function `pcoa_plot`, which takes the following arguments:
+  * `mat`: the loaded matrix from metaphlan-style table, [dataframe].
+  * `md`: the metadata table pairing with the matrix, [dataframe].
+  * `dist_method`: the method for calculating beta diversity, [string]. default: ["bray"]. For other methods, refer to [vegdist()](https://rdrr.io/cran/vegan/man/vegdist.html). 
+  * `fsize`: the font size of labels, [int]. default: [11]
+  * `dsize`: the dot size of scatter plot, [int]. default: [1]
+  * `fstyle`: the font style, [string]. default: ["Arial"]
+  * `variable`: specify the variable name based on which to group samples, [string].
+  * `to_rm`: a vector of values in "variable" column where the corresponding rows will be excluded first before analysis.
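+
+The coordinates being plotted come from a distance matrix followed by a principal coordinates analysis (PCoA). The sketch below illustrates these two steps on a toy table with `vegan::vegdist` and `ape::pcoa`, and recomputes one Bray-Curtis value by hand; the table `toy_mat` and its values are invented for illustration.
+
+```{r}
+library(vegan)
+library(ape)
+
+# Toy relative abundances: 5 samples x 3 species (each row sums to 1)
+toy_mat <- data.frame(species_a = c(0.6, 0.1, 0.3, 0.5, 0.2),
+                      species_b = c(0.3, 0.7, 0.3, 0.1, 0.5),
+                      species_c = c(0.1, 0.2, 0.4, 0.4, 0.3),
+                      row.names = c("S1", "S2", "S3", "S4", "S5"))
+
+# Pairwise Bray-Curtis dissimilarities between samples
+bc <- vegdist(toy_mat, method = "bray")
+
+# The same quantity by hand for S1 vs S2: sum(|a - b|) / sum(a + b)
+bc_s1_s2 <- sum(abs(toy_mat["S1", ] - toy_mat["S2", ])) /
+  sum(toy_mat["S1", ] + toy_mat["S2", ])
+
+# PCoA on the distance matrix; each row of $vectors is one sample
+ord <- pcoa(bc)
+head(ord$vectors)
+```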
+
+Below, we showcase how to inspect the beta diversity of microbiomes from the angle of six different variables.
+
+```{r}
+>pcoa_condom_use <- pcoa_plot(mat = matrix,
+                             md = metadata,
+                             dist_method = "bray",
+                             fsize = 11,
+                             dsize = 3,
+                             fstyle = "Arial",
+                             variable = "condom_use",
+                             to_rm = c("no_receptive_anal_intercourse"))
+>pcoa_STI <- pcoa_plot(mat = matrix,
+                             md = metadata,
+                             dist_method = "bray",
+                             fsize = 11,
+                             dsize = 3,
+                             fstyle = "Arial",
+                             variable = "STI")
+>pcoa_number_of_partners <- pcoa_plot(mat = matrix,
+                             md = metadata,
+                             dist_method = "bray",
+                             fsize = 11,
+                             dsize = 3,
+                             fstyle = "Arial",
+                             variable = "number_partners")
+>pcoa_rai <- pcoa_plot(mat = matrix,
+                             md = metadata,
+                             dist_method = "bray",
+                             fsize = 11,
+                             dsize = 3,
+                             fstyle = "Arial",
+                             variable = "receptive_anal_intercourse")
+>pcoa_oral_sex <- pcoa_plot(mat = matrix,
+                             md = metadata,
+                             dist_method = "bray",
+                             fsize = 11,
+                             dsize = 3,
+                             fstyle = "Arial",
+                             variable = "oral.sex")
+>pcoa_lubricant_use <- pcoa_plot(mat = matrix,
+                             md = metadata,
+                             dist_method = "bray",
+                             fsize = 11,
+                             dsize = 3,
+                             fstyle = "Arial",
+                             variable = "lubricant")
+
+>ggarrange(pcoa_rai, pcoa_lubricant_use, pcoa_STI,
+           pcoa_oral_sex, pcoa_number_of_partners, pcoa_condom_use,
+           nrow = 2, ncol = 3) 
+```
+
+![Combined beta diversities](../images/beta_diversity_R_outputs.png)
+
+## Python-based method
+
+## A method mixing R and Python
diff --git a/example_data/matrix_species_relab.tsv b/example_data/matrix_species_relab.tsv
diff --git a/example_data/metadata_of_matrix_species_relab.tsv b/example_data/metadata_of_matrix_species_relab.tsv
@@ -0,0 +1,67 @@
+sample	sexual_orientation.x	HIV_status	Ethnicity	Antibiotics_6mo	BMI_kg_m2_WHO	STI	diet	number_partners	receptive_anal_intercourse	oral sex	lubricant	condom_use	inflammatory_bowel_disease
+P057	MSM	negative	Caucasian	Yes	ObeseClassI	negative	no	>3	no	yes	no	no_receptive_anal_intercourse	no
+P054	MSM	positive	Caucasian	No	Overweight	negative	vegetarian	0_3	yes	yes	always	always	no
+P052	MSM	positive	Caucasian	No	Normal	positive	carb_and_proteinrich	>3	yes	yes	sometimes	sometimes	no
+P050	MSM	negative	Caucasian	No	Normal	negative	no	>3	yes	no	always	sometimes	no
+P049	MSM	negative	Caucasian	Yes	Overweight	negative	no	>3	yes	yes	sometimes	sometimes	yes
+P048	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	sometimes	no
+P047	MSM	negative	Caucasian	No	ObeseClassIII	negative	no	>3	yes	yes	always	sometimes	no
+P046	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	sometimes	yes
+P044	MSM	positive	Caucasian	Yes	Overweight	positive	no	>3	yes	yes	always	no	yes
+P043	MSM	negative	Caucasian	No	Overweight	positive	no	0_3	yes	yes	sometimes	sometimes	yes
+P042	MSM	positive	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	sometimes	yes
+P039	MSM	positive	Caucasian	Yes	Overweight	positive	no	>3	yes	yes	always	sometimes	yes
+P037	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	sometimes	sometimes	yes
+P036	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	sometimes	no	yes
+P035	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	no	yes
+P034	MSM	positive	Caucasian		Overweight	negative	no	0_3	no	no	no	no_receptive_anal_intercourse	no
+P029	MSM	positive	Caucasian	No	Overweight	positive	no	>3	yes	yes	always	always	no
+P028	MSM	negative	Caucasian	No	Normal	positive	no	>3	yes	yes	sometimes	no	no
+P027	MSM	positive	Caucasian	No	Normal	negative	no	0_3	yes	yes	always	always	no
+P014	MSM	positive	Caucasian	No	Normal	negative	no	0_3	yes	yes	always	always	no
+P007	MSM	positive	Caucasian	No	Overweight	negative	no	0_3	yes	no	always	always	no
+P006	MSM	positive	Caucasian	No	ObeseClassII	negative	no	>3	yes	yes	sometimes	always	no
+P003	MSM	negative	Caucasian	No	Normal	negative	no	0_3	no	no	no	no_receptive_anal_intercourse	yes
+P002	MSM	negative	Caucasian	No	Overweight	negative	low_carb	0_3	no	yes	no	no_receptive_anal_intercourse	no
+MSM_L2_F2	MSM	positive	Caucasian	No	Overweight	positive	no	0_3	yes	yes		always	yes
+MSM_L2_F1	MSM	positive	Caucasian	No	Normal	negative	no	0_3	no	no	no	no_receptive_anal_intercourse	no
+MSM_L2_E9	MSM	negative	Caucasian	Yes	ObeseClassI	negative	no	>3	no	yes	no	no_receptive_anal_intercourse	no
+MSM_L2_E8	MSM	positive	Caucasian	Yes	Overweight	negative	no	>3	yes	yes	sometimes	sometimes	no
+MSM_L2_E7	MSM	positive	Caucasian	No	Overweight	negative	no	0_3	yes	yes	always	always	no
+MSM_L2_E6	MSM	positive	Caucasian	No	Overweight	negative	vegetarian	0_3	yes	yes	always	always	no
+MSM_L2_E5	MSM	positive	Caucasian	Yes	Normal	negative	low_carb	0_3	yes	yes	sometimes	sometimes	no
+MSM_L2_E4	MSM	positive	Caucasian	No	Normal	positive	carb_and_proteinrich	>3	yes	yes	sometimes	sometimes	no
+MSM_L2_E3	MSM	positive	Caucasian	No	Normal	negative	no	>3	yes	yes	sometimes	sometimes	no
+MSM_L2_E2	MSM	negative	Caucasian	No	Normal	negative	no	>3	yes	no	always	sometimes	no
+MSM_L2_E1	MSM	negative	Caucasian	Yes	Overweight	negative	no	>3	yes	yes	sometimes	sometimes	yes
+MSM_L2_E12	MSM	positive	Caucasian	No	Overweight	negative	no	0_3	yes	no	always	always	no
+MSM_L2_E11	MSM	positive	Caucasian	No	Normal	negative	no	0_3	yes	yes	no	no	no
+MSM_L2_D9	MSM	positive	Caucasian	No	Overweight	negative	no	0_3	yes	yes	sometimes	no	no
+MSM_L2_D8	MSM	positive	Caucasian	Yes	Overweight	positive	no	>3	yes	yes	always	no	yes
+MSM_L2_D7	MSM	negative	Caucasian	No	Overweight	positive	no	0_3	yes	yes	sometimes	sometimes	yes
+MSM_L2_D6	MSM	positive	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	sometimes	yes
+MSM_L2_D5	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes		sometimes	no
+MSM_L2_D3	MSM	positive	Caucasian	Yes	Overweight	positive	no	>3	yes	yes	always	sometimes	yes
+MSM_L2_D2	MSM	positive	Caucasian	Yes	Overweight	positive	no	0_3	yes	no	sometimes	sometimes	yes
+MSM_L2_D1	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	sometimes	sometimes	yes
+MSM_L2_D12	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	sometimes	no
+MSM_L2_D11	MSM	negative	Caucasian	No	ObeseClassIII	negative	no	>3	yes	yes	always	sometimes	no
+MSM_L2_D10	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	always	sometimes	yes
+MSM_L2_C9	MSM	positive	Caucasian	No	Normal	positive	vegetarian	0_3	yes	yes	always	always	no
+MSM_L2_C8	MSM	positive	Caucasian	Yes	Normal	positive	no	>3	yes			no	no
+MSM_L2_C7	MSM	positive	Caucasian	No	Normal	positive	no	0_3	yes	yes	always	always	no
+MSM_L2_C6	MSM	positive	Caucasian	Yes	Underweight	negative	no	0_3	no		no	no_receptive_anal_intercourse	yes
+MSM_L2_C2	MSM	positive	Caucasian	No	Overweight	negative	no	0_3	yes	no	always		no
+MSM_L2_C1	MSM	positive	Caucasian	No	Overweight	negative	no	0_3	yes	no			no
+MSM_L2_C12	MSM	negative	Caucasian	Yes	Normal	positive	no	>3	yes	yes	sometimes	no	yes
+MSM_L2_B9	MSM	positive	Caucasian	No	Normal	positive	no	0_3	yes	yes	sometimes	no	no
+MSM_L2_B8	MSM	positive	Caucasian	No	Normal	negative	no	>3	yes	yes	sometimes	no	no
+MSM_L2_B6	MSM	positive	Caucasian	Yes	Overweight	positive	no	0_3	yes	yes	always	no	no
+MSM_L2_B5	MSM	positive	Caucasian	No	Normal	negative	no	>3	yes	yes		no	no
+MSM_L2_B12	MSM	positive	Caucasian	No	Normal	negative	no	>3	yes	yes	always	always	no
+MSM_L2_B11	MSM	positive	Caucasian	No	Normal	negative	low_carb	>3	yes			no	yes
+MSM_L2_A8	MSM	positive	Caucasian	No	Normal	positive	no	>3	yes	yes	sometimes	no	yes
+MSM_L2_A5	MSM	positive	Caucasian	No	Overweight	positive	no	>3	yes	yes	sometimes	no	no
+MSM_L2_A1	MSM	positive	Asian	No	Normal	negative	no	>3	yes	yes	always	sometimes	no
+MSM_L2_A12	MSM	positive	Caucasian	No	Overweight	negative	no	>3	yes	yes	always	no	no
+MSM_L2_A10	MSM	positive	Caucasian	No	Normal	positive	no	0_3	yes	yes	sometimes	no	no
diff --git a/images/beta_diversity_R_outputs.png b/images/beta_diversity_R_outputs.png
diff --git a/scripts/alpha_diversity_analysis.R b/scripts/alpha_diversity_analysis.R
@@ -1,22 +1,26 @@
 
 
 ###### Testing code  ######
-mpa_df <- data.frame(read.csv("/Users/kunhuang/R_analysis_mirror/msm_analysis/manuscript_prep/alpha_diversity_msm_nomsm_reproduce/merged_abundance_table_species_sgb_md.tsv",
-                     header = TRUE,
-                     sep = "\t"))
+mat <- read.csv("/Users/kunhuang/R_analysis_mirror/msm_analysis/sexual_practice_analysis/msm_mpa4_run2_matrix.tsv",
+                    header = TRUE,
+                    sep = "\t")
+md <- read.csv("/Users/kunhuang/R_analysis_mirror/msm_analysis/sexual_practice_analysis/msm_mpa4_run2_metadata.tsv",
+                    header = TRUE,
+                    sep = "\t")
 
-View(mpa_df)
-source(file = "/Users/kunhuang/repos/KunDH-2023-CRM-MSM_metagenomics/scripts/functions/alpha_diversity_funcs.R")
-SE <- SE_converter(1:5, 6, mpa_df)
 
 
-alpha_df <- est_alpha_diversity(SE)
+source(file = "/Users/kunhuang/repos/KunDH-2023-CRM-MSM_metagenomics/scripts/functions/beta_diversity_funcs.R")
 
-View(alpha_df)
+coor_df <- generate_coordis_df(mat, md, "euclidean")
+View(coor_df)
 
-make_boxplot(alpha_df, "sexual_orientation", "shannon", stats = FALSE, pal = c("#888888", "#eb2525"),
-                       font_size = 18)
-##### Testing code  ######
-
-a <- felm_fixed(alpha_df, c("HIV_status", "antibiotics_6month"), "sexual_orientation", "shannon")
-summary(a)
+pcoa_plot(mat, md, "bray", "condom_use", 20, 4, to_rm = c("no_receptive_anal_intercourse"))
+est_permanova(mat, md, "condom_use", c("Antibiotics_6mo", "HIV_status", "inflammatory_bowel_disease", "BMI_kg_m2_WHO", "diet"))
+est_permanova(mat = mat, 
+              md = md, 
+              variable = "condom_use", 
+              covariables = c("Antibiotics_6mo", "HIV_status", "inflammatory_bowel_disease", "BMI_kg_m2_WHO", "diet"),
+              nper = 999, 
+              to_rm = c("no_receptive_anal_intercourse"),
+              by_method = "margin")
diff --git a/scripts/functions/beta_diversity_funcs.R b/scripts/functions/beta_diversity_funcs.R
@@ -0,0 +1,76 @@
+
+generate_coordis_df <- function(mat, md, dist_method = "bray") {
+  # mat: the loaded matrix from mpa-style dataframe.
+  # md: the dataframe containing metadata.
+  # dist_method: the method for calculating beta diversity. default: ["bray"]. For other methods, refer to vegdist().
+  # This function prepares a dataframe of PCoA coordinates with the metadata appended.
+  dist_mat <- vegan::vegdist(mat, method = dist_method)
+  coordinates <- as.data.frame(ape::pcoa(dist_mat)$vectors)
+  coor_df <- cbind(coordinates, md)
+  coor_df
+}
+
+pcoa_plot <- function(mat,
+                      md, 
+                      dist_method = "bray", 
+                      variable, 
+                      fsize = 11, 
+                      dsize = 1, 
+                      fstyle = "Arial", 
+                      to_rm = NULL) {
+  # mat: the loaded matrix from mpa-style dataframe, [dataframe].
+  # md: the dataframe containing metadata, [dataframe].
+  # dist_method: the method for calculating beta diversity, [string]. default: ["bray"]. For other method, refer to vegdist(). 
+  # fsize: the font size, [int].
+  # dsize: the dot size, [int].
+  # fstyle: the font style, [string].
+  # variable: specify the variable name for separating groups, [string].
+  # to_rm: a vector of values in "variable" column where the corresponding rows will be removed first.
+  # this function is to draw pcoa plot with confidence ellipse
+  coordis_df <- generate_coordis_df(mat, md, dist_method)
+  # Drop samples whose value is missing, empty, or listed in to_rm (to_rm = NULL removes nothing).
+  coordis_df <- coordis_df[!(is.na(coordis_df[, variable]) | coordis_df[, variable] == "" | coordis_df[, variable] %in% to_rm), ]
+  eval(substitute(ggplot(coordis_df, aes(Axis.1, Axis.2, color = c)),list(c = as.name(variable)))) +
+    geom_point(size = dsize) + 
+    theme_bw() +
+    eval(substitute(geom_polygon(stat = "ellipse", aes(fill = c), alpha = 0.1), list(c = as.name(variable)))) +
+    labs(x = "PC1", y = "PC2") +
+    theme(text = element_text(size = fsize, family = fstyle)) +
+    theme(legend.position="bottom") 
+}
+
+est_permanova <- function(mat, 
+                          md, 
+                          variable, 
+                          covariables = NULL, 
+                          nper = 999, 
+                          to_rm = NULL, 
+                          by_method = "margin"){
+  # mat: the loaded matrix from mpa-style dataframe, [dataframe].
+  # md: the dataframe containing metadata, [dataframe].
+  # variable: specify the variable for testing, [string].
+  # covariables: give a vector of covariables for adjustment, [vector].
+  # nper: the number of permutation, [int], default: [999].
+  # to_rm: a vector of values in "variable" column where the corresponding rows will be removed first.
+  # by_method: "terms" will assess significance for each term, sequentially; "margin" will assess the marginal effects of the terms.
+  # Keep samples whose value of `variable` is not missing, not empty, and not listed in to_rm
+  # (to_rm = NULL removes nothing). Rows of mat and md are assumed to match one by one.
+  keep <- !(is.na(md[, variable]) | md[, variable] == "" | md[, variable] %in% to_rm)
+  clean_md <- md[keep, , drop = FALSE]
+  clean_mat <- mat[keep, , drop = FALSE]
+  # Build the model formula, e.g. clean_mat ~ variable + covariable1 + covariable2.
+  # c(variable, NULL) is just variable, so this also covers the case without covariables.
+  rhs <- paste(c(variable, covariables), collapse = " + ")
+  fml <- stats::as.formula(paste("clean_mat ~", rhs))
+  # Run PERMANOVA on the filtered matrix and metadata.
+  est <- vegan::adonis2(fml, data = clean_md, permutations = nper, by = by_method)
+  est
+}