Integrating multi-omics underlying Down syndrome with OmicsPLS
Exercises
Said el Bouhaddani and Jeanine Houwing-Duistermaat
These are exercises for the OmicsPLS short course. There are several questions throughout the text, and the corresponding R code to answer the question is given after the question. The answer key contains all output of each code block as well as brief answers to the questions. Note that some questions don't have a unique answer, but the idea should be clear.
In this part, we consider a data integration approach to link the information in methylation and glycomics data. This 'joint information' is then related to Down syndrome. In the exercises, we work with a subset of the methylation data, restricted to chromosome 21.
A flexible data integration approach for two heterogeneous datasets is O2PLS, which is available in the OmicsPLS package. We will use OmicsPLS to select the most important genes corresponding to the methylation sites. This mapping can be found in the CpG_groups object, where each CpG (methylation) site has one or more associated genes. The number of methylation sites is found by running length(CpG_groups), and the number of genes is found with length(unique(CpG_groups)).
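As a minimal sketch of these two counts, the toy vector below stands in for CpG_groups. The cg IDs, gene symbols and the ";" separator for sites with several genes are assumptions for illustration. Note that length(unique(...)) counts distinct annotation strings, so splitting multi-gene entries first gives the number of distinct gene symbols.

```r
# Toy stand-in for CpG_groups (hypothetical cg IDs and gene symbols)
toy_groups <- c(cg001 = "APP", cg002 = "APP", cg003 = "SOD1;SCAF4",
                cg004 = "DYRK1A", cg005 = "SCAF4", cg006 = "SOD1")

n_sites   <- length(toy_groups)           # number of CpG sites
n_strings <- length(unique(toy_groups))   # distinct annotation strings
# Split multi-gene entries to count distinct gene symbols instead
n_genes   <- length(unique(unlist(strsplit(toy_groups, ";", fixed = TRUE))))

c(sites = n_sites, strings = n_strings, genes = n_genes)
```

Whether the two gene counts differ in the real data depends on how many sites carry more than one gene.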
We need several packages for data handling, fitting and visualizing the results. Run this code to see which are not yet installed. All packages can be installed with install.packages, except disgenet2r, which has a separate install command shown below.
req_pack <- c("MASS", "parallel", "tidyverse", "magrittr",
              "OmicsPLS", "httr", "disgenet2r", "GGally")
## Packages from req_pack that are not yet installed
missing_pack <- setdiff(req_pack, rownames(installed.packages()))
if (length(missing_pack) > 0) {
  cat("\nThe following packages are missing:\n")
  print(missing_pack)
} else cat("\nNo packages missing.\n")
library(MASS) # statistical tools, such as lda
library(parallel) # parallel computing
library(tidyverse) # dataset & viz tools
library(magrittr) # pipe %>% operators
library(OmicsPLS) # data integration toolkit
## Also needed but not loaded
# install.packages("httr")
# install.packages("GGally")
# remotes::install_bitbucket("ibi_group/disgenet2r")
The datasets are found in the DownSyndrome.RData file. We work with a subset of the methylation data measured only on chromosome 21. A simple load statement should load them in your workspace. The str function can be used to get a first impression of the data objects.
load("DownSyndrome.RData")
## str gives an overview of all kinds of objects
cat("Methylation data:\n")
str(methylation)
cat("\nGlycomics data:\n")
str(glycomics)
cat("\nCpG mapping to gene:\n")
str(CpG_groups)
Before any analysis can be performed, you should consider calculating some descriptives about the data.
:::: {.bluebox .question data-latex=""} Exercises. Plot boxplots of (a subset of) the data columns. Also describe the demographics: case-controls, age and sex distributions. Are there any remarkable observations? ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
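One possible way to produce such descriptives is sketched below on simulated stand-in data. The object names toy_meth and toy_clin and all values are made up; only the group labels (DS, SB, MA) and the column names group, age and sex follow the exercise. With the real data, replace the toy objects by the loaded ones.

```r
set.seed(1)
# Simulated stand-ins (assumption): 30 samples, 5 features, three groups
toy_meth <- matrix(rnorm(30 * 5), nrow = 30,
                   dimnames = list(NULL, paste0("cg", 1:5)))
toy_clin <- data.frame(
  group = factor(rep(c("DS", "SB", "MA"), each = 10)),
  age   = round(runif(30, 10, 60)),
  sex   = factor(sample(c("F", "M"), 30, replace = TRUE))
)

# Boxplots of a subset of the columns
boxplot(toy_meth[, 1:5], las = 2, main = "Toy methylation columns")

# Demographics: group sizes, sex by group, age summary by group
group_counts <- table(toy_clin$group)
sex_by_group <- table(toy_clin$group, toy_clin$sex)
age_by_group <- tapply(toy_clin$age, toy_clin$group, summary)
group_counts
sex_by_group
age_by_group
```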
To apply OmicsPLS, we first need to decide on the number of components to retain. A cross-validation is usually performed. In the cross-validation, a grid is specified as well as the number of folds. If desired, you can use multiple cores to speed up the calculations. On a Windows machine, this requires copying the data matrices to each parallel process, so keep an eye on memory usage.
Note that there are other ways besides cross-validation, such as the scree plot (click here for more info).
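To make the search space concrete, the sketch below builds a full grid of candidate component numbers (1 to 5 joint, 0 to 10 methylation-specific, 0 to 9 glycomics-specific) and assigns samples to 10 folds. The sample size n is a made-up value, and the sketch only illustrates the bookkeeping; the cross-validation function in OmicsPLS may search the grid more economically than a full loop over all combinations.

```r
set.seed(1)
# Candidate numbers of joint (a) and specific (ax, ay) components
grid <- expand.grid(a = 1:5, ax = 0:10, ay = 0:9)
n_combinations <- nrow(grid)   # 5 * 11 * 10 combinations

# Assign n samples to 10 folds of (nearly) equal size
n <- 47                        # hypothetical sample size
nr_folds <- 10
folds <- sample(rep(seq_len(nr_folds), length.out = n))
fold_sizes <- table(folds)

n_combinations
fold_sizes
```

Each candidate combination is fitted once per fold, which is why a large grid quickly becomes expensive and why running on multiple cores can help.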
:::: {.bluebox .question data-latex=""} Perform a cross-validation for the number of joint and specific components. What is the optimal number of components for each part? ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
crossval_o2m_adjR2(X = methylation, Y = glycomics,
a = 1:5, ax = 0:10, ay = 0:9, nr_folds = 10, nr_cores = 1)
## -> 4 2 6
## Code to run a scree plot is
# par(mfrow=c(1,3))
# plot(svd(crossprod(methylation,glycomics),0,0)$d^2 %>%
# (function(e) e/sum(e)), main='Joint Scree plot')
# plot(svd(tcrossprod(methylation),0,0)$d %>% (function(e) e/sum(e)),
# main="Methylation Scree plot")
# plot(svd(crossprod(glycomics),0,0)$d %>% (function(e) e/sum(e)),
# main="Glycomics Scree plot")
# par(mfrow=c(1,1))
# ## -> 3 5 1
r <- 4; rx <- 2; ry <- 6
We fit O2PLS to the methylation and glycomics data, and calculate the variance explained by the joint and specific parts.
fit <- o2m(methylation, glycomics, r, rx, ry)
summary(fit) # prints the variation explained by the joint and specific parts
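For reference, a sketch of the decomposition that O2PLS fits, written in generic notation (the symbols are chosen here for illustration and are not taken from the package output). With $X$ the methylation matrix and $Y$ the glycomics matrix:

$$
\begin{aligned}
X &= T W^\top + T_\perp P_\perp^\top + E,\\
Y &= U C^\top + U_\perp Q_\perp^\top + F.
\end{aligned}
$$

Here $T$ and $U$ are the joint scores (r columns each), $W$ and $C$ the joint loadings, $T_\perp$ and $U_\perp$ the specific scores (rx and ry columns, respectively), and $E$ and $F$ the residuals. The proportion of variation in $X$ explained by the joint part can then be written as

$$
R^2_{X,\text{joint}} = \frac{\lVert T W^\top \rVert_F^2}{\lVert X \rVert_F^2},
$$

and analogously for the specific part and for $Y$.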
Next, we inspect the loadings. Each of the 3322 loading values in the methylation parts represents a CpG site indicated by a cg ID. For the glycomics parts, we have 10 glycan peaks/IDs.
Each label is a cg ID or glycan ID, and the axes represent the respective components.
:::: {.bluebox .question data-latex=""} Give an interpretation of these results. Which features have highest loadings, in which components? Which glycan and methylation features have the highest covariance according to the plot? Click "zoom" in RStudio if the labels don't fit on the screen. ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
plot(fit,loading_name = "Yj",i=1,j=2,label = "col") + theme_bw()
plot(fit,loading_name = "Xj",i=1,j=2,label = "col") + theme_bw()
Next, we investigate if these joint components are associated with Down syndrome. We first look at the scatterplot of the scores, colored by DS.
:::: {.bluebox .question data-latex=""} Give an interpretation. Are the joint scores able to separate Down syndrome? ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
## Boxplots of the joint methylation scores per component, by group
data.frame(Group = ClinicalVars$group, JPC = scores(fit, "Xjoint")) %>%
  pivot_longer(-Group,
               names_to = "Comp",
               values_to = "Scores") %>%
  ggplot(aes(x=Comp, y=Scores, col=Group)) +
  geom_boxplot() + xlab("Component") + ylab("Methylation scores") +
  theme_bw()
## Same for the joint glycomics scores
data.frame(Group = ClinicalVars$group, JPC = scores(fit, "Yjoint")) %>%
  pivot_longer(-Group,
               names_to = "Comp",
               values_to = "Scores") %>%
  ggplot(aes(x=Comp, y=Scores, col=Group)) +
  geom_boxplot() + xlab("Component") + ylab("Glycomics scores") +
  theme_bw()
## Another (fancy) approach is to run the following for multiple plots in one go
## (assumed data argument: joint X and Y scores side by side)
GGally::ggpairs(data.frame(scores(fit, "Xjoint"), scores(fit, "Yjoint")),
  aes(col = ClinicalVars$group), progress = FALSE,
  title = "Joint X and Y components against each other") + theme_bw()
We perform a logistic regression with the Down syndrome status as outcome, and the joint methylation scores as covariates. We exclude the mothers for now.
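A minimal sketch of such a model on simulated data, mainly to show how R picks the reference category. All variable names and values are made up; only the group labels DS, SB and MA follow the exercise. For a factor outcome, glm with a binomial family treats the first factor level as the reference ("failure"), so it is worth checking levels() before interpreting the signs of the coefficients.

```r
set.seed(1)
n <- 60
toy <- data.frame(
  outc   = factor(rep(c("DS", "SB", "MA"), each = n / 3)),
  score1 = rnorm(n),
  age    = round(runif(n, 10, 60))
)
# Shift the scores of the DS group so that there is something to detect
toy$score1[toy$outc == "DS"] <- toy$score1[toy$outc == "DS"] + 1

# Drop the mothers and the now-unused factor level
toy_sub <- droplevels(subset(toy, outc != "MA"))
ref_level <- levels(toy_sub$outc)[1]   # "DS": levels are sorted alphabetically

fit_glm <- glm(outc ~ score1 + age, data = toy_sub, family = "binomial")
ref_level
coef(summary(fit_glm))
```

With DS as reference, the coefficients describe the log-odds of belonging to the SB group. Use relevel() on the outcome if the other direction is easier to interpret.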
:::: {.bluebox .question data-latex=""} Which group category is the reference? Are there joint scores that are significantly associated with Down syndrome? Are the p-values correctly interpretable in this case? Why (not)? ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
glm_datmat <- data.frame(JPC=scores(fit, "Xjoint"),
  outc = ClinicalVars$group, age=ClinicalVars$age, sex=ClinicalVars$sex)
## Fit on all groups except the mothers (MA) and print the coefficient table
glm(outc ~ ., data = glm_datmat %>% filter(outc != "MA"), family = "binomial") %>%
  summary()
We saw that joint methylation component one seemed to be significantly associated with Down syndrome: the mean scores differed significantly between DS and SB. Of interest are the genes corresponding to the top CpG sites: do these genes represent some biological pathway? To this end, we use String-DB to cluster the top genes. Although there is an R package for String-DB, we are going to use the String-DB website. On the website, click "multiple proteins". The input there is the list of top genes.
Although determining a threshold to select the number of 'top' CpG sites is not straightforward, we are going to select 200 based on earlier analysis of these data. We also need to map from cg ID to gene ID.
top_cg <- order(loadings(fit, subset=1)^2,decreasing = TRUE)
gene_list <- CpG_groups[top_cg[1:200]]
gene_list %<>% paste0(collapse = ";") %>%
str_split(";") %>% unlist %>% unique
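The same selection logic on a toy example, assuming made-up loading values and gene annotations: rank the sites by squared loading, keep the top k, then split multi-gene entries and remove duplicates. Base R functions are used here in place of the tidyverse helpers.

```r
# Hypothetical loadings for six CpG sites and their gene annotation
toy_load <- c(cg1 = 0.10, cg2 = -0.70, cg3 = 0.05,
              cg4 = 0.40, cg5 = -0.20, cg6 = 0.60)
toy_groups <- c(cg1 = "APP", cg2 = "SOD1;SCAF4", cg3 = "DYRK1A",
                cg4 = "SOD1", cg5 = "APP", cg6 = "RUNX1")

# Squaring makes large negative and large positive loadings count equally
top_idx <- order(toy_load^2, decreasing = TRUE)
k <- 3
top_sites <- names(toy_load)[top_idx[1:k]]

# Map sites to genes, split on ";" and remove duplicates
top_genes <- unique(unlist(strsplit(toy_groups[top_sites], ";", fixed = TRUE)))
top_sites
top_genes
```

Because several sites can map to the same gene, the resulting gene list is usually shorter than the number of selected sites.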
Copy-paste the top genes into the String-DB website.
:::: {.bluebox .question data-latex=""} Is there any remarkable clustering visible? Go to the analysis tab, is there any significant enrichment? ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
# gene_list %>% cat(sep="\n")
You can also use the DisGeNet R package to perform Disease-gene enrichment. If you cannot install the package, the output is given below.
:::: {.bluebox .question data-latex=""} Which disease clusters are most significant? What does this say about the top genes based on integrating methylation and glycomics data? ::::
:::: {.blueboxx data-latex=""} Answers (Click here)
### Bonus, run this code to perform DisGeNet enrichment
# disgenet_api_key <- "271e054761763b144a97872b059fd573186bdd9f"
httr::timeout(4e9) # if needed to give curl more time
DGN_DE <- disgenet2r::disease_enrichment(gene_list, database = "ALL")
DGN_DE@qresult[1:10,
  c("Description", "FDR", "Ratio", "BgRatio")]
                       Description          FDR Ratio   BgRatio
236                  Down Syndrome 4.983930e-35 38/65 766/21666
2736  Complete Trisomy 21 Syndrome 2.709284e-34 36/65 669/21666
2320 DOWN SYNDROME CRITICAL REGION 9.293346e-19 13/65  57/21666
1853        Chromosome 21 monosomy 1.849628e-07  5/65  13/21666
2468      Alzheimer disease type 1 1.452115e-05  3/65   3/21666
169            Cognition Disorders 8.819130e-05 12/65 607/21666
375           Hirschsprung Disease 8.819130e-05 10/65 384/21666
548             Mental Retardation 1.006005e-04 11/65 505/21666
207             Presenile dementia 2.703029e-03 11/65 718/21666
303             Fragile X Syndrome 4.152165e-03  6/65 194/21666
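To see how the Ratio and BgRatio columns relate to significance, the sketch below recomputes an over-representation p-value for the Down Syndrome row with a one-sided hypergeometric test. Two things are assumptions here: that the denominators 65 and 21666 are the annotated input genes and the background genes, and that this test resembles the one used by disgenet2r. The result is a raw p-value, so it will not equal the FDR column, which is additionally corrected for multiple testing.

```r
# Counts taken from the Down Syndrome row of the table
k <- 38      # input genes annotated to the term      (Ratio   = 38/65)
n <- 65      # input genes considered
K <- 766     # background genes annotated to the term (BgRatio = 766/21666)
N <- 21666   # background genes in total

# Fold enrichment: observed fraction over background fraction
fold <- (k / n) / (K / N)

# P(X >= k) when drawing n genes from N, of which K carry the term
p_raw <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
fold
p_raw
```

A fold enrichment far above 1 combined with a very small p-value indicates that the term is strongly over-represented among the top genes.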
BONUS: You can also combine the String-DB and DisGeNet analyses by making an interaction network of the genes in a particular disease term. Below is the code to print the genes that are in the Down Syndrome disease term. This list can be copy-pasted into String-DB and analyzed.
DGN_DownS <- DGN_DE@qresult %>%
filter(Description == "Down Syndrome") %>%
pull(shared_symbol) %>% str_split(";") %>% unlist
# cat(DGN_DownS, sep="\n")