% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/MetICA.R
\name{MetICA}
\alias{MetICA}
\title{MetICA simulations on metabolomics data}
\usage{
MetICA(X, pcs = 15, max_iter = 400, boot.prop = 0.3, max.cluster = 20,
trends = T, verbose = T)
}
\arguments{
\item{X}{A numeric matrix obtained from metabolomics experiments. Its dimensions should be n (samples) × p (metabolic features); the data may be centered or not. Missing values are not allowed. Normalized data is not recommended for MetICA.}
\item{pcs}{Number of principal components used to whiten the data before ICA, and also the number of components estimated in each IPCA run. It should be at least 3. Its value can be modified after the function is launched and the percentage of variance explained has been calculated. We recommend that PCA whitening keep at least 80 percent of the total variance.}
\item{max_iter}{Number of IPCA iterations. It should be at least 50 to provide reliable results. More than 500 runs can lead to long computation times. To avoid memory issues, the total number of estimates (pcs × max_iter) must be under 25 000.}
\item{boot.prop}{Proportion of samples replaced in bootstrap iterations (when X is resampled). It should not exceed 0.4.}
\item{max.cluster}{The number of clusters in HCA of estimated components is evaluated from 2 to max.cluster. Its value can be modified in the function if one cluster contains fewer than 30 estimates.}
\item{trends}{Boolean variable. TRUE if your observations are time-dependent (e.g. blood samples taken over a period of time from a patient).}
\item{verbose}{Boolean variable. If TRUE, a message is printed when each stage of the algorithm completes.}
}
\value{
A model (a list object) that contains results from each stage of the simulation
\itemize{
\item{Stage1$X0 Original metabolomics data matrix X.}
\item{Stage1$S Estimated scores from IPCA runs. The matrix contains n rows (samples) and pcs × max_iter columns (estimated components).}
\item{Stage1$A Estimated loadings from IPCA runs. The matrix contains p rows (metabolic features) and pcs × max_iter columns (estimated components).}
\item{Stage1$boot_id Vector of length pcs × max_iter indicating whether each estimate comes from a bootstrap or a non-bootstrap iteration.}
\item{Stage1$type_id Vector of length pcs × max_iter indicating whether each estimate comes from a deflation or a parallel iteration.}
\item{Stage2$clusterObj An object of class hclust from the HCA of estimated components.}
\item{Stage2$trends Boolean value defined by the user.}
\item{Stage3$max.cluster Maximal number of clusters evaluated.}
\item{Stage3$S_history List object of length max.cluster. Each element of the list (from the 2nd element on) is a data matrix with n rows (samples) and nb columns (number of clusters) that contains the center of each generated cluster.}
\item{Stage3$A_history List object of length max.cluster. Each element of the list is a data matrix with p rows (metabolic features) and nb columns (number of clusters) that contains the center of each generated cluster.}
\item{Stage3$kurt_history List object of length max.cluster. Each element of the list is a numeric vector that describes the kurtosis of each cluster center obtained.}
\item{Stage3$tn_history Number of estimates in each cluster. }
\item{Stage3$bn_history Proportion of bootstrap estimates in each cluster.}
\item{Stage3$dispersion_history Dispersion index of each cluster. Lower index (closer to 0) indicates a compact cluster thus a good algorithm convergence.}
\item{Stage3$boot_eval Bootstrap stability index of each cluster. A lower index (closer to 0) means that bootstrap and non-bootstrap estimates differ little, so the cluster is stable under bootstrapping.}
}
}
\description{
The main function for MetICA simulation on a samples × variables (n × p) metabolomics data matrix.
}
\details{
MetICA is a three-stage algorithm:
\itemize{
\item{Stage_1 We perform multiple IPCA iterations on the data matrix X. IPCA, from the mixOmics package, combines the advantages of both PCA and ICA and is therefore well suited to large biological datasets. For all IPCA iterations, the initial un-mixing matrices are randomly generated from Gaussian distributions. For half of the simulations, the initial data matrix X is re-sampled through bootstrapping. We use parallel-based FastICA for the first half of the IPCA runs and deflation-based FastICA for the rest.}
\item{Stage_2 Estimated components from all IPCA runs are then submitted to hierarchical cluster analysis (HCA). The Spearman distance metric is used since the objective of MetICA is to group estimated components that have similar shapes or trends.}
\item{Stage_3 One general difficulty of ICA in metabolomics data analysis is the selection of the number of components. In MetICA, the problem becomes the choice of the number of clusters, since the components generated are the geometric centers of clusters. As the number of clusters in HCA is increased, geometric indexes of each cluster are calculated to reflect the convergence of the IPCA algorithm. The stability of clusters against bootstrapping is evaluated, as well as the kurtosis of cluster centers.}
}
}
\examples{
data(bacteria_peptides)
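# A hedged sketch, not part of MetICA itself: one way to preview how much
# variance a candidate value of pcs would retain, using base R prcomp().
# MetICA reports its own percentage of explained variance after launch,
# and its internal calculation may differ from this one.
pcs_candidate <- 20  # hypothetical value, matching the MetICA call below
pca <- prcomp(bacteria_peptides$X, center = TRUE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
cum_var[pcs_candidate]  # the argument description recommends at least 0.8
# The total number of estimates (pcs x max_iter) must stay under 25 000:
pcs_candidate * 100 < 25000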
# Perform 100 IPCA simulations on centered metabolomics data:
M1 <- MetICA(bacteria_peptides$X, pcs = 20, max_iter = 100, boot.prop = 0.3, max.cluster = 40, trends = TRUE)
# Generate validation plots along with geometric index calculation to help decide number of clusters
validationPlot(M1)
# According to the validation, we now choose 10 components:
M2 <- MetICA_extract_model(M1, 10, tops = 7)
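# A brief sketch of inspecting the returned model, assuming the list layout
# described in the Value section (Stage1, Stage2, Stage3):
dim(M1$Stage1$S)             # n rows, pcs x max_iter columns per the Value section
dim(M1$Stage1$A)             # p rows, pcs x max_iter columns per the Value section
class(M1$Stage2$clusterObj)  # documented as an object of class hclust
M1$Stage3$max.cluster        # maximal number of clusters evaluated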
}
\references{
A. Hyvarinen and E. Oja, Independent Component Analysis: Algorithms and Applications, Neural Networks (2000) vol. 13 no. 4-5

Fangzhou Yao, Jeff Coquery and Kim-Anh Le Cao, Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets, BMC Bioinformatics (2012) vol. 13 no. 24

Youzhong Liu, Kirill Smirnov, Marianna Lucio, Regis D. Gougeon, Herve Alexandre and Philippe Schmitt-Kopplin, MetICA: independent component analysis for high-resolution mass-spectrometry based non-targeted metabolomics, BMC Bioinformatics (2016) vol. 17 no. 114
}
\author{
Youzhong Liu, \email{[email protected]}
}