update tutorial update validationPlot

daniellyz · May 9, 2018 · 8ba1370 · 8ba1370
1 parent f0bc850
commit 8ba1370
Show file tree

Hide file tree

Showing 12 changed files with 68 additions and 6 deletions.
diff --git a/R/MetICA_extract_model.R b/R/MetICA_extract_model.R
@@ -10,6 +10,7 @@
 #' \itemize{
 #'  \item{"S"}{Scores of MetICA components. Data matrix contains n rows (samples) and ics columns (components)}
 #'  \item{"A"}{Loadings of MetICA components. Data matrix contains p rows (metabolic features) and ics columns (components)}
+#'  \item{"A1"}{Normalized loadings of MetICA components. Data matrix contains p rows (metabolic features) and ics columns (components)}
 #'  \item{"eval$kurtosis"}{Vector of length ics containing Kurtosis measure of each MetICA component.}
 #'  \item{"eval$cluster_size"}{Number of IPCA estimates in each corresponding cluster.}
 #'  \item{"eval$divergence_index"}{Divergence indexes of clusters.}
@@ -75,6 +76,9 @@ MetICA_extract_model<-function(M1,ics,tops=ics){
   if (!is.null(tops)){
     par(mfrow=c(1,1))
 
+    total_number2=total_number[order(total_number,decreasing = T)[1:tops]]
+    barplot(total_number2,font = 2,ylab="Total number of estimates", cex.lab = 1.3)
+
     kurtosis2=kurtosis[order(kurtosis,decreasing = T)[1:tops]]
     barplot(kurtosis2,font = 2,ylab="Component kurtosis", cex.lab = 1.3)
 

diff --git a/README.md b/README.md
@@ -22,11 +22,13 @@ install_github("daniellyz/MetICA2")
 library(MetICA)
 ```
 
-## Check the function manuals before using
+## Check the function manuals before starting
+
 ```R
 help(MetICA)
 help(validationPlot)
-
+help(MetICA_extract_model)
+```
 
 ## An example of data analysis using MetICA
 
@@ -52,14 +54,69 @@ X = t(X) # Transpose the data since MetICA accepts samples x variables matrices
 ### Run MetICA simulations:
 
 ```{r}
-# Begin a MetICA simulation with 2000 estimated components in total. The samples are not time-dependent, so trend = FALSE. Numbers of clusters are evaluated between 2 and 10:
-M1=MetICA(X,pcs = 10,max_iter = 400,boot.prop = 0.3,max.cluster = 10,trends = F)
-# Users can confirm the number of pcs used for denoising if they think enough variance is explained, they can modify the number of pcs at this moment as well:
+# Begin a MetICA simulation with 2000 estimated components in total. The samples are not time-dependent, so trend = FALSE. Numbers of clusters are evaluated between 2 and 15:
+M1=MetICA(X,pcs = 10,max_iter = 400,boot.prop = 0.3,max.cluster = 15,trends = F)
 ```
+The function will display the percentage of variance explained based on the number of pcs chosen. User can modify this value: 
+
 ![choose](inst/Launch_MetICA.JPG)
 
+### Some plots to decide the number of MetICA components:
+
 ```{r}
-# Validation plot to decide the number of clusters
 results=validationPlot(M1)
+```
+* Left plot: Clusters that only contain few random estimates are likely generated by chance and might not represent algorithm convergence. Small size clusters (fewer than 20 estimates) appeared at 7 components.  
+* Right plot: Kurtosis of cluster centers (extracted MetICA components). High kurtosis components (bigger than 1) usually have heavy residuals. Here such component appeared at 11 components, and it can be informative especially when you are investigating outliers.
+
+![choose](inst/Validation1.JPG)
 
+* Left plot: Higher percentage of bootstrapping-generated estimates indicate the biological validity of the cluster. Cluster with higher than 45\% bootstrapping estimates are considered "safe" (in the green zone). In this example, all components were in the safe zone. 
+* Right plot: The distance between centers of bootstrapping and without-bootstrapping estimates is measured. A distance smaller than 0.8 (in the green zone) means that the cluster is stable towards bootstrapping, and can be considered reliable. We observe here that an unreliable cluster appeared at ics = 6, and another appeared at ics = 12.
+
+![choose](inst/Validation2.JPG)
+
+* Left plot: A geometric index is used to measure how compact is each cluster. The green zone shows a compact cluster and a good algorithm convergence. In this example, almost all clusters were in the green zone from 12 clusters on.
+* Right plot: If users cannot decide the optimal number of components (clusters) based on previous plots, we recommend a global dispersion index called R index (please set cluster_index = T to allow the calculation). Number of clusters with lowest R index indicate the best clustering quality, here ics = 8.   
+
+![choose](inst/Validation3.JPG)
+
+### Extract the MetICA model
+
+```{r}
+# According to validation, we chose ics = 8 as optimal number of components.
+M2=MetICA_extract_model(M1,ics = 8)
+```
+The function orders the extracted components using different criteria:
+
+* Left plot: number of estimates used to generate each component. IC3 here represents a cluster with very few estimates, so it should not be prioitised for biological interpretations.
+* Right plot: kurtosis of each component based on component scores. IC8 has higher kurtosis and might be informative.
+
+![choose](inst/Order1.jpg)
+
+* Left plot: compactness of each cluster. IC1, IC8, IC2 and IC7 showed the best convergence of algorithm.
+* Right plot: stability of each cluster towards bootstrapping. IC6, IC8, IC1, IC7 and IC2 seem to be more reliable. 
+
+![choose](inst/Order2.jpg)
+
+### Biological interpretation
+
+```{r}
+library(ade4)
+# Ploting 5th and 8th MetICA components:
+s.class(M2$S[,c(1,8)], yeast_metabolome$strains,cellipse=0,cpoint=0,clabel=1.5,add.p=F,grid=F) 
+```
+Similar to the visualization of PCA scores, MetICA also allows the comparison of metabolic profiles. The following figure compares the metabolic profiles of 15 yeast strains (Biological replicates of the same strain is connected)  
+
+![choose](inst/score.png)
+
+If the separation on the first component matched with previous knowledges about yeast strains (e.g. phenotype separation), the variables (mass features) that have high loadings on this component might be potential biomarkers. To extract the top 100: 
+
+```{r}
+top100=order(M2$A1[,1],decreasing=T)[1:100]
+top100_loading=M2$A1[top100,1]
+cbind(yeast_metabolome$features[top100,],Loadings=top100_loading)
+```
+### Collaborators
 
+![choose](inst/Logo.jpg)
diff --git a/data/yeast_metabolome.rda b/data/yeast_metabolome.rda
diff --git a/inst/Launch_MetICA.JPG b/inst/Launch_MetICA.JPG
diff --git a/inst/Logo.jpg b/inst/Logo.jpg
diff --git a/inst/Order1.jpg b/inst/Order1.jpg
diff --git a/inst/Order2.jpg b/inst/Order2.jpg
diff --git a/inst/Validation1.JPG b/inst/Validation1.JPG
diff --git a/inst/Validation2.JPG b/inst/Validation2.JPG
diff --git a/inst/Validation3.JPG b/inst/Validation3.JPG
diff --git a/inst/score.png b/inst/score.png
diff --git a/man/MetICA_extract_model.Rd b/man/MetICA_extract_model.Rd