Skip to content

Commit 5a775eb

Browse files
committed
added tutorial for R-based beta diversity analysis
1 parent 36b0e11 commit 5a775eb

8 files changed

+361
-15
lines changed

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,3 +7,4 @@ Computational workflows for reproducing analysis in the study of Huang et al., 2
77
3. Detailed tutorials
88
* [Reads quality inspection with cumulative distribution function](./docs/cumulative_distribution_function.md)
99
* [Alpha diversity analysis](./docs/alpha_diversity_analysis.md)
10+
* [Beta diversity analysis](./docs/beta_diversity_analysis.md)

docs/alpha_diversity_analysis.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ Open a new working R script, and load our funtion-packed R script from which you
1919
Load a merged metaphlan profile which contains metadata and taxonomic abundances. Here, we are going to use an example file for demostration.
2020

2121
```{r}
22-
>mpa_df <- data.frame(read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/>merged_abundance_table_species_sgb_md.tsv",
22+
>mpa_df <- data.frame(read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/merged_abundance_table_species_sgb_md.tsv",
2323
header = TRUE,
2424
sep = "\t"))
2525
```

docs/beta_diversity_analysis.md

Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
# Beta Diversity Analysis
2+
This tutorial is to use R-based functions as well as Python scripts to estimate the beta diversity of microbiomes using metaphlan profiles.
3+
4+
## R-based method
5+
6+
#### R packages required
7+
8+
* [vegan](https://cran.r-project.org/web/packages/vegan/index.html)
9+
* [ggplot2](https://ggplot2.tidyverse.org/)
10+
* [ape](https://cran.r-project.org/web/packages/ape/index.html)
11+
* [tidyverse](https://www.tidyverse.org/packages/)
12+
13+
#### Beta diversity analysis, visualization and significance assessment
14+
15+
Open a new working R script, and load our funtion-packed R script from which you can use relavant modules.
16+
17+
```{r}
18+
>source(file = "path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/scripts/functions/beta_diversity_funcs.R")
19+
```
20+
21+
Load a [matrix table](../example_data/matrix_species_relab.tsv) of species relative abundances quantified by MetaPhlAn and a [metadata table](../example_data/metadata_of_matrix_species_relab.tsv) which matches the matrix table row by row, namely in both matrix table and metadata table each row indicates the sample sample.
22+
23+
```{r}
24+
>matrix <- read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/matrix_species_relab.tsv",
25+
header = TRUE,
26+
sep = "\t")
27+
>metadata <- read.csv("path_to_the_package/KunDH-2023-CRM-MSM_metagenomics/example_data/metadata_of_matrix_species_relab.tsv",
28+
header = TRUE,
29+
sep = "\t")
30+
```
31+
32+
Now, you would like to test the significance of the sample segragating due to the variable of interest while adjusting covariables such as BMI and disease status, etc. Here, we use function `est_permanova` which implements [PERMANOVA](https://rdrr.io/rforge/vegan/man/adonis.html) analysis, specifying arguments:
33+
* `mat`: the loaded matrix from metaphlan-style table, [dataframe].
34+
* `md`: the metadata table pairing with the matrix, [dataframe].
35+
* `variable`: specify the variable for testing, [string].
36+
* `covariables`: give a vector of covariables for adjustment, [vector].
37+
* `nper`: the number of permutation, [int], default: [999].
38+
* `to_rm`: a vector of values in "variable" column where the corresponding rows will be removed first.
39+
* `by_method`: "terms" will assess significance for each term, sequentially; "margin" will assess the marginal effects of the terms.
40+
41+
Here, we show an example by testing variable *condom use* while adjusting covariables including *antibiotics use*, *HIV status*, *BMI*, *Diet* and *Inflamatory bowel diseases* which might play a role in explaining the inter-individual variation in the gut microbiome composition.
42+
43+
```{r}
44+
>est_permanova(mat = matrix,
45+
md = metadata,
46+
variable = "condom_use",
47+
covariables = c("Antibiotics_6mo", "HIV_status", "inflammatory_bowel_disease", "BMI_kg_m2_WHO", "diet"),
48+
nper = 999,
49+
to_rm = c("no_receptive_anal_intercourse"),
50+
by_method = "margin")
51+
52+
Df SumOfSqs R2 F Pr(>F)
53+
condom_use 4 1.2161 0.08194 1.5789 0.008 **
54+
Antibiotics_6mo 2 0.4869 0.03281 1.2643 0.160
55+
HIV_status 1 0.3686 0.02484 1.9146 0.030 *
56+
inflammatory_bowel_disease 1 0.2990 0.02015 1.5529 0.066 .
57+
BMI_kg_m2_WHO 5 1.8376 0.12382 1.9087 0.002 **
58+
diet 3 0.8579 0.05781 1.4853 0.036 *
59+
Residual 49 9.4347 0.63571
60+
Total 65 14.8412 1.00000
61+
---
62+
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
63+
```
64+
65+
Next, to visualize the sample segragation based on microbiome beta diversity we can use function `plot_pcoa` function which needs input arguments:
66+
* `mat`: the loaded matrix from metaphlan-style table, [dataframe].
67+
* `md`: the metadata table pairing with the matrix, [dataframe].
68+
* `dist_method`: the method for calculating beta diversity, [string]. default: ["bray"]. For other methods, refer to [vegdist()](https://rdrr.io/cran/vegan/man/vegdist.html).
69+
* `fsize`: the font size of labels, [int]. default: [11]
70+
* `dsize`: the dot size of scatter plot, [int]. default: [3]
71+
* `fstyle`: the font style, [string]. default: ["Arial"]
72+
* `variable`: specify the variable name based on which to group samples, [string].
73+
* `to_rm`: a vector of values in "variable" column where the corresponding rows will be excluded first before analysis.
74+
75+
Below, we are showcasing how to inspect the beta diversity of microbiomes from the angle of five different variables.
76+
77+
```{r}
78+
>pcoa_condom_use <- pcoa_plot(mat = matrix,
79+
md = metadata,
80+
dist_method = "bray",
81+
fsize = 11,
82+
dsize = 3,
83+
fstyle = "Arial",
84+
variable = "condom_use",
85+
to_rm = c("no_receptive_anal_intercourse"))
86+
>pcoa_STI <- pcoa_plot(mat = matrix,
87+
md = metadata,
88+
dist_method = "bray",
89+
fsize = 11,
90+
dsize = 3,
91+
fstyle = "Arial",
92+
variable = "STI")
93+
>pcoa_number_of_partners <- pcoa_plot(mat = matrix,
94+
md = metadata,
95+
dist_method = "bray",
96+
fsize = 11,
97+
dsize = 3,
98+
fstyle = "Arial",
99+
variable = "number_partners")
100+
>pcoa_rai <- pcoa_plot(mat = matrix,
101+
md = metadata,
102+
dist_method = "bray",
103+
fsize = 11,
104+
dsize = 3,
105+
fstyle = "Arial",
106+
variable = "receptive_anal_intercourse")
107+
>pcoa_oral_sex <- pcoa_plot(mat = matrix,
108+
md = metadata,
109+
dist_method = "bray",
110+
fsize = 11,
111+
dsize = 3,
112+
fstyle = "Arial",
113+
variable = "oral.sex")
114+
>pcoa_lubricant_use <- pcoa_plot(mat = matrix,
115+
md = metadata,
116+
dist_method = "bray",
117+
fsize = 11,
118+
dsize = 3,
119+
fstyle = "Arial",
120+
variable = "lubricant")
121+
122+
>ggarrange(pcoa_rai, pcoa_lubricant_use, pcoa_STI,
123+
pcoa_oral_sex, pcoa_number_of_partners, pcoa_condom_use,
124+
nrow = 2, ncol = 3)
125+
```
126+
127+
![Combined beta diversities](../images/beta_diversity_R_outputs.png)
128+
129+
## Python-based method
130+
131+
## A method mixing R and Python

example_data/matrix_species_relab.tsv

Lines changed: 67 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,67 @@
1+
sample sexual_orientation.x HIV_status Ethnicity Antibiotics_6mo BMI_kg_m2_WHO STI diet number_partners receptive_anal_intercourse oral sex lubricant condom_use inflammatory_bowel_disease
2+
P057 MSM negative Caucasian Yes ObeseClassI negative no >3 no yes no no_receptive_anal_intercourse no
3+
P054 MSM positive Caucasian No Overweight negative vegetarian 0_3 yes yes always always no
4+
P052 MSM positive Caucasian No Normal positive carb_and_proteinrich >3 yes yes sometimes sometimes no
5+
P050 MSM negative Caucasian No Normal negative no >3 yes no always sometimes no
6+
P049 MSM negative Caucasian Yes Overweight negative no >3 yes yes sometimes sometimes yes
7+
P048 MSM negative Caucasian Yes Normal positive no >3 yes yes always sometimes no
8+
P047 MSM negative Caucasian No ObeseClassIII negative no >3 yes yes always sometimes no
9+
P046 MSM negative Caucasian Yes Normal positive no >3 yes yes always sometimes yes
10+
P044 MSM positive Caucasian Yes Overweight positive no >3 yes yes always no yes
11+
P043 MSM negative Caucasian No Overweight positive no 0_3 yes yes sometimes sometimes yes
12+
P042 MSM positive Caucasian Yes Normal positive no >3 yes yes always sometimes yes
13+
P039 MSM positive Caucasian Yes Overweight positive no >3 yes yes always sometimes yes
14+
P037 MSM negative Caucasian Yes Normal positive no >3 yes yes sometimes sometimes yes
15+
P036 MSM negative Caucasian Yes Normal positive no >3 yes yes sometimes no yes
16+
P035 MSM negative Caucasian Yes Normal positive no >3 yes yes always no yes
17+
P034 MSM positive Caucasian Overweight negative no 0_3 no no no no_receptive_anal_intercourse no
18+
P029 MSM positive Caucasian No Overweight positive no >3 yes yes always always no
19+
P028 MSM negative Caucasian No Normal positive no >3 yes yes sometimes no no
20+
P027 MSM positive Caucasian No Normal negative no 0_3 yes yes always always no
21+
P014 MSM positive Caucasian No Normal negative no 0_3 yes yes always always no
22+
P007 MSM positive Caucasian No Overweight negative no 0_3 yes no always always no
23+
P006 MSM positive Caucasian No ObeseClassII negative no >3 yes yes sometimes always no
24+
P003 MSM negative Caucasian No Normal negative no 0_3 no no no no_receptive_anal_intercourse yes
25+
P002 MSM negative Caucasian No Overweight negative low_carb 0_3 no yes no no_receptive_anal_intercourse no
26+
MSM_L2_F2 MSM positive Caucasian No Overweight positive no 0_3 yes yes always yes
27+
MSM_L2_F1 MSM positive Caucasian No Normal negative no 0_3 no no no no_receptive_anal_intercourse no
28+
MSM_L2_E9 MSM negative Caucasian Yes ObeseClassI negative no >3 no yes no no_receptive_anal_intercourse no
29+
MSM_L2_E8 MSM positive Caucasian Yes Overweight negative no >3 yes yes sometimes sometimes no
30+
MSM_L2_E7 MSM positive Caucasian No Overweight negative no 0_3 yes yes always always no
31+
MSM_L2_E6 MSM positive Caucasian No Overweight negative vegetarian 0_3 yes yes always always no
32+
MSM_L2_E5 MSM positive Caucasian Yes Normal negative low_carb 0_3 yes yes sometimes sometimes no
33+
MSM_L2_E4 MSM positive Caucasian No Normal positive carb_and_proteinrich >3 yes yes sometimes sometimes no
34+
MSM_L2_E3 MSM positive Caucasian No Normal negative no >3 yes yes sometimes sometimes no
35+
MSM_L2_E2 MSM negative Caucasian No Normal negative no >3 yes no always sometimes no
36+
MSM_L2_E1 MSM negative Caucasian Yes Overweight negative no >3 yes yes sometimes sometimes yes
37+
MSM_L2_E12 MSM positive Caucasian No Overweight negative no 0_3 yes no always always no
38+
MSM_L2_E11 MSM positive Caucasian No Normal negative no 0_3 yes yes no no no
39+
MSM_L2_D9 MSM positive Caucasian No Overweight negative no 0_3 yes yes sometimes no no
40+
MSM_L2_D8 MSM positive Caucasian Yes Overweight positive no >3 yes yes always no yes
41+
MSM_L2_D7 MSM negative Caucasian No Overweight positive no 0_3 yes yes sometimes sometimes yes
42+
MSM_L2_D6 MSM positive Caucasian Yes Normal positive no >3 yes yes always sometimes yes
43+
MSM_L2_D5 MSM negative Caucasian Yes Normal positive no >3 yes yes sometimes no
44+
MSM_L2_D3 MSM positive Caucasian Yes Overweight positive no >3 yes yes always sometimes yes
45+
MSM_L2_D2 MSM positive Caucasian Yes Overweight positive no 0_3 yes no sometimes sometimes yes
46+
MSM_L2_D1 MSM negative Caucasian Yes Normal positive no >3 yes yes sometimes sometimes yes
47+
MSM_L2_D12 MSM negative Caucasian Yes Normal positive no >3 yes yes always sometimes no
48+
MSM_L2_D11 MSM negative Caucasian No ObeseClassIII negative no >3 yes yes always sometimes no
49+
MSM_L2_D10 MSM negative Caucasian Yes Normal positive no >3 yes yes always sometimes yes
50+
MSM_L2_C9 MSM positive Caucasian No Normal positive vegetarian 0_3 yes yes always always no
51+
MSM_L2_C8 MSM positive Caucasian Yes Normal positive no >3 yes no no
52+
MSM_L2_C7 MSM positive Caucasian No Normal positive no 0_3 yes yes always always no
53+
MSM_L2_C6 MSM positive Caucasian Yes Underweight negative no 0_3 no no no_receptive_anal_intercourse yes
54+
MSM_L2_C2 MSM positive Caucasian No Overweight negative no 0_3 yes no always no
55+
MSM_L2_C1 MSM positive Caucasian No Overweight negative no 0_3 yes no no
56+
MSM_L2_C12 MSM negative Caucasian Yes Normal positive no >3 yes yes sometimes no yes
57+
MSM_L2_B9 MSM positive Caucasian No Normal positive no 0_3 yes yes sometimes no no
58+
MSM_L2_B8 MSM positive Caucasian No Normal negative no >3 yes yes sometimes no no
59+
MSM_L2_B6 MSM positive Caucasian Yes Overweight positive no 0_3 yes yes always no no
60+
MSM_L2_B5 MSM positive Caucasian No Normal negative no >3 yes yes no no
61+
MSM_L2_B12 MSM positive Caucasian No Normal negative no >3 yes yes always always no
62+
MSM_L2_B11 MSM positive Caucasian No Normal negative low_carb >3 yes no yes
63+
MSM_L2_A8 MSM positive Caucasian No Normal positive no >3 yes yes sometimes no yes
64+
MSM_L2_A5 MSM positive Caucasian No Overweight positive no >3 yes yes sometimes no no
65+
MSM_L2_A1 MSM positive Asian No Normal negative no >3 yes yes always sometimes no
66+
MSM_L2_A12 MSM positive Caucasian No Overweight negative no >3 yes yes always no no
67+
MSM_L2_A10 MSM positive Caucasian No Normal positive no 0_3 yes yes sometimes no no

images/beta_diversity_R_outputs.png

667 KB
Loading

scripts/alpha_diversity_analysis.R

Lines changed: 18 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,22 +1,26 @@
11

22

33
###### Testing code ######
4-
mpa_df <- data.frame(read.csv("/Users/kunhuang/R_analysis_mirror/msm_analysis/manuscript_prep/alpha_diversity_msm_nomsm_reproduce/merged_abundance_table_species_sgb_md.tsv",
5-
header = TRUE,
6-
sep = "\t"))
4+
mat <- read.csv("/Users/kunhuang/R_analysis_mirror/msm_analysis/sexual_practice_analysis/msm_mpa4_run2_matrix.tsv",
5+
header = TRUE,
6+
sep = "\t")
7+
md <- read.csv("/Users/kunhuang/R_analysis_mirror/msm_analysis/sexual_practice_analysis/msm_mpa4_run2_metadata.tsv",
8+
header = TRUE,
9+
sep = "\t")
710

8-
View(mpa_df)
9-
source(file = "/Users/kunhuang/repos/KunDH-2023-CRM-MSM_metagenomics/scripts/functions/alpha_diversity_funcs.R")
10-
SE <- SE_converter(1:5, 6, mpa_df)
1111

1212

13-
alpha_df <- est_alpha_diversity(SE)
13+
source(file = "/Users/kunhuang/repos/KunDH-2023-CRM-MSM_metagenomics/scripts/functions/beta_diversity_funcs.R")
1414

15-
View(alpha_df)
15+
coor_df <- generate_coordis_df(mat, md, "euclidean")
16+
View(coor_df)
1617

17-
make_boxplot(alpha_df, "sexual_orientation", "shannon", stats = FALSE, pal = c("#888888", "#eb2525"),
18-
font_size = 18)
19-
##### Testing code ######
20-
21-
a <- felm_fixed(alpha_df, c("HIV_status", "antibiotics_6month"), "sexual_orientation", "shannon")
22-
summary(a)
18+
pcoa_plot(mat, md, "bray", "condom_use", 20, 4, to_rm = c("no_receptive_anal_intercourse"))
19+
est_permanova(mat, md, "condom_use", c("Antibiotics_6mo", "HIV_status", "inflammatory_bowel_disease", "BMI_kg_m2_WHO", "diet"))
20+
est_permanova(mat = mat,
21+
md = md,
22+
variable = "condom_use",
23+
covariables = c("Antibiotics_6mo", "HIV_status", "inflammatory_bowel_disease", "BMI_kg_m2_WHO", "diet"),
24+
nper = 999,
25+
to_rm = c("no_receptive_anal_intercourse"),
26+
by_method = "margin")
Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
2+
generate_coordis_df <- function(mat, md, dist_method = "bray") {
3+
# mat: the loaded matrix from mpa-style dataframe.
4+
# md: the dataframe containing metadata.
5+
# dist_method: the method for calculating beta diversity. default: ["bray"]. For other method, refer to vegdist()
6+
# this function is to prepare metadata-added coordinates dataframe.
7+
bray_dist <- vegan::vegdist(mat, dist_method)
8+
coordinates <- as.data.frame(ape::pcoa(bray_dist)$vectors)
9+
coor_df <- cbind(coordinates, md)
10+
coor_df
11+
}
12+
13+
pcoa_plot <- function(mat,
14+
md,
15+
dist_method,
16+
variable,
17+
fsize = 11,
18+
dsize = 1,
19+
fstyle = "Arial",
20+
to_rm = NULL) {
21+
# mat: the loaded matrix from mpa-style dataframe, [dataframe].
22+
# md: the dataframe containing metadata, [dataframe].
23+
# dist_method: the method for calculating beta diversity, [string]. default: ["bray"]. For other method, refer to vegdist().
24+
# fsize: the font size, [int].
25+
# dsize: the dot size, [int].
26+
# fstyle: the font style, [string].
27+
# variable: specify the variable name for separating groups, [string].
28+
# to_rm: a vector of values in "variable" column where the corresponding rows will be removed first.
29+
# this function is to draw pcoa plot with confidence ellipse
30+
coordis_df <- generate_coordis_df(mat, md, dist_method)
31+
if (is.null(to_rm)) {
32+
coordis_df <- coordis_df[!(is.na(coordis_df[, variable]) | coordis_df[, variable] == ""), ]
33+
}
34+
else {
35+
coordis_df <- coordis_df[!(is.na(coordis_df[, variable]) | coordis_df[, variable] == "" | coordis_df[, variable] %in% to_rm), ]
36+
}
37+
eval(substitute(ggplot(coordis_df, aes(Axis.1, Axis.2, color = c)),list(c = as.name(variable)))) +
38+
geom_point(size = dsize) +
39+
theme_bw() +
40+
eval(substitute(geom_polygon(stat = "ellipse", aes(fill = c), alpha = 0.1), list(c = as.name(variable)))) +
41+
labs(x = "PC1", y = "PC2") +
42+
theme(text = element_text(size = fsize, family = fstyle)) +
43+
theme(legend.position="bottom")
44+
}
45+
46+
est_permanova <- function(mat,
47+
md,
48+
variable,
49+
covariables = NULL,
50+
nper = 999,
51+
to_rm = NULL,
52+
by_method = "margin"){
53+
# mat: the loaded matrix from mpa-style dataframe, [dataframe].
54+
# md: the dataframe containing metadata, [dataframe].
55+
# variable: specify the variable for testing, [string].
56+
# covariables: give a vector of covariables for adjustment, [vector].
57+
# nper: the number of permutation, [int], default: [999].
58+
# to_rm: a vector of values in "variable" column where the corresponding rows will be removed first.
59+
# by_method: "terms" will assess significance for each term, sequentially; "margin" will assess the marginal effects of the terms.
60+
if (is.null(to_rm)) {
61+
clean_md <- md[!(is.na(md[, variable]) | md[, variable] == ""), ]
62+
} else {
63+
clean_md <- md[!(is.na(md[, variable]) | md[, variable] == "" | md[, variable] %in% to_rm), ]
64+
}
65+
clean_idx = rownames(clean_md)
66+
clean_mat <- mat[rownames(mat) %in% clean_idx, ]
67+
if (is.null(covariables)) {
68+
est <- eval(substitute(adonis2(mat ~ cat, data = md, permutations = nper, by = by_method), list(cat = as.name(variable))))
69+
} else {
70+
mat_char <- deparse(substitute(mat))
71+
str1 <- paste0(c(variable, paste0(covariables, collapse = " + ")), collapse = " + ")
72+
str2 <- paste0(c(mat_char, str1), collapse = " ~ ")
73+
est <- vegan::adonis2(eval(parse(text = str2)), data = md, permutations = nper, by = by_method)
74+
}
75+
est
76+
}

0 commit comments

Comments
 (0)