%% comment line1
%% comment line2
\documentclass{article}
\usepackage{float}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{geometry}
\usepackage{pdfpages}
\usepackage{amsmath}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\setcounter{secnumdepth}{2}
\setcounter{tocdepth}{2}
\usepackage{url}
\usepackage[unicode=true,pdfusetitle,
bookmarks=true,bookmarksnumbered=true,bookmarksopen=true,bookmarksopenlevel=2,
breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false]
{hyperref}
\usepackage[nottoc]{tocbibind}
\hypersetup{
pdfstartview={XYZ null null 1}}
\usepackage{breakurl}
\renewcommand*\contentsname{Outline}
\begin{document}
<<setup, include=FALSE, cache=FALSE>>=
library(knitr)
# set global chunk options
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(formatR.arrow=TRUE,width=90)
@
\title{ClusterID-based consensus clustering}
\author{Maxime Tarabichi}
\maketitle
\section*{Description}
\subsection*{Consensus cluster assignment}
Let us use \(m\) methods \(M_{1,\dotsc, m}\) to cluster \(v\) single
nucleotide variants \(V_{1,\dotsc, v}\). \\
Each method \(M_i\) assigns to each variant \(V_j\) one
cluster ID \[A(M_i,V_j)\]\\
For each variant \(V_j\), we write the vector of
assignments\\
\[\vec{VA}(V_j)=[A(M_1,V_j), A(M_2, V_j), \dotsc, A(M_m, V_j)]\] \\ \\
Two mutations \(V_a\) and \(V_b\) sharing the same vector of assignments, i.e.
\(\vec{VA}(V_a)=\vec{VA}(V_b)\), have been assigned to the same cluster by all
methods. \\
Across all mutations, there are \(u\) unique vectors
of assignment \(\vec{UVA}_{1\dotsc u}\), shared by \(N_{1\dotsc u}\)
mutations, respectively. \\
For each unique vector of assignment \(\vec{UVA}_k\), we
compute a distance \(d_{kl}\) to every \(\vec{UVA}_l\).
That distance is defined as the sum of two terms
\[d_{kl}=d_1(k,l)+d_2(k,l)\]with:\\
\[d_1(k,l)=C\,\max(N_{1\dotsc u})\sum_{i=1}^{m}\mathbf{1}\!\left[\vec{UVA}_{k,i}\neq\vec{UVA}_{l,i}\right]\]
\[d_2(k,l)=-\max(N_k,N_l)\]
\(C\) is set to 10 and ensures that \(d_1\) (the number of methods that disagree
on the assignment) prevails over \(d_2\) (the number of mutations in a cluster).
Next we perform hierarchical clustering on the distance matrix
\(D_{u\times u}=(d_{kl})\) using Ward's criterion and cut the tree to
obtain a desired number of clusters of \(\vec{UVA}\). This method thus takes
the final number of consensus clusters as input.\\
Finally, given the unique vectors of assignment grouped into each consensus cluster \(c\),
every mutation \(V_i\) whose \(\vec{VA}(V_i)\) belongs to cluster \(c\) is
assigned to \(c\). \\
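To make the two distance terms concrete, here is a symbolic illustration with \(m=6\) methods (the vectors are the ones used in the Example section below): if \(\vec{UVA}_k=(1,1,1,1,4,4)\) and \(\vec{UVA}_l=(2,3,3,3,1,1)\), the two vectors disagree for all six methods, so
\[d_{kl}=6\,C\,\max(N_{1\dotsc u})-\max(N_k,N_l),\]
whereas for two vectors that agree for every method the first term vanishes and the distance reduces to \(-\max(N_k,N_l)\), the smallest possible value.\\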
\subsection*{Optimal K}
To obtain the optimal number of clusters K, we derive consensus matrices for
\(K\in[2,\max(K_{methods})]\). If a majority of methods find K=1,
we set K=1. Otherwise, for each value of K, we repeat the consensus
clustering procedure 100 times, after independent sampling with
replacement from the methods. We then obtain a consensus matrix for
each value of K, giving the proportion of times each pair of mutations was
clustered together. We take as optimal K the value yielding the matrix with
minimal PAC \cite{senbabaoglu_critical_2014}.\\
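Concretely, writing \(F_K\) for the empirical cumulative distribution function of the entries of the consensus matrix obtained for \(K\) clusters, the PAC is the fraction of mutation pairs that are co-clustered an ambiguous proportion of the time:
\[\mathrm{PAC}(K)=F_K(x_2)-F_K(x_1),\qquad x_1=0.1,\ x_2=0.9,\]
with the thresholds \(x_1\) and \(x_2\) as in \cite{senbabaoglu_critical_2014}.\\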
\subsection*{Consensus CCF}
The CCFs of each consensus cluster can then be calculated by
summarising the CCFs of the individual mutations assigned to that
cluster. We take the weighted average of these CCFs, with the weight
of each mutation proportional to its number of aligned reads.\\
This last step requires computing a consensus CCF for each
mutation. We rescale the CCF from each method using the consensus purity
(\(CCF_{rescaled}=\frac{purity_{method}}{purity_{consensus}}CCF_{method}\)) and
take the median across methods. \\
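As a minimal sketch of this per-mutation rescaling (the per-method CCFs and purities below are made-up illustrative values, not outputs of the pipeline):
<<ccfRescaleSketch, eval=T, size="scriptsize">>=
## illustrative per-method CCFs and purities for one mutation (made-up values)
ccfByMethod <- c(method1=0.95, method2=1.02, method3=0.90)
purityByMethod <- c(method1=0.60, method2=0.65, method3=0.55)
purityConsensus <- 0.60
## rescale each method's CCF by its purity relative to the consensus purity
ccfRescaled <- (purityByMethod/purityConsensus)*ccfByMethod
## consensus CCF of the mutation: median across methods
median(ccfRescaled)
@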
\newpage
\section*{Example}
Let's generate simple dummy data for illustration purposes.
<<generateData, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
## #####################################
## generate data
## list with cluster assignments of each method
allA <- list(method1=c(rep(1,1000),rep(2,1000)),
method2=c(rep(1,800),rep(2,900), rep(3,300)),
method3=c(rep(1,800),rep(2,600), rep(3,600)),
method4=c(rep(1,1200),rep(2,600), rep(3,200)),
method5=c(rep(4,200),rep(3,500), rep(2,700),rep(1,600)),
method6=c(rep(4,600),rep(3,800), rep(2,400),rep(1,200))
)
allA <- lapply(allA,function(x) {names(x) <- paste("mutation",1:2000,sep="");return(x);})
sapply(allA,head)
## simulate a dummy underlying ccf
cov <- rpois(700,70)
cov2 <- rpois(600,70)
ccfs <- c(sort(rbinom(700, cov, .5)*2/cov,decreasing=T),
sort(rbinom(700, cov, .33)*2/cov,decreasing=T),
sort(rbinom(600, cov, .1)*2/cov2,decreasing=T))
## #####################################
## plot ccfs
hist(ccfs,100)
abline(v=c(1,.66,.2),lwd=10,col=rgb(.5,0,0,.6))
## #####################################
@
This is a dummy (not necessarily realistic) example to illustrate how
the method would treat the typical input, i.e. a list of vectors of
cluster assignments (=cluster IDs).
We start with the cluster assignments from 6 methods for the same
2,000 mutations. In this example, the truth has three clusters
of 700, 700, and 600 mutations, respectively. For illustration purposes, their
cancer cell fractions (CCFs) are modelled as binomials around 100\%, 66\% and 20\% CCF,
respectively. However, only the cluster assignments are actually used for
consensus clustering.\\
In this example the methods find 2, 3, 3, 3, 4 and 4 clusters, respectively, and
differ slightly in their cluster assignments. Unlike the first four methods,
the last two methods assign the highest cluster ID to the lowest-CCF
cluster. Because the cluster-ID consensus clustering is not based
on the CCF, this has no impact on the final consensus clustering.\\
The first two mutations are assigned to cluster 1, cluster 1, cluster 1,
cluster 1, cluster 4 and cluster 4, respectively, by the methods, and therefore their
vector of assignments is 1-1-1-1-4-4.
The last mutation is assigned to cluster 2, cluster 3, cluster 3, cluster 3,
cluster 1 and cluster 1, respectively, and its vector of
assignments is 2-3-3-3-1-1.\\
The consensus clustering distance between the first two mutations is
minimal (d=0 disagreements), since they have been assigned to the same cluster by all methods.
The distance between the first two
mutations and the last mutation is maximal (d=6 disagreements), since all methods have
assigned them to different clusters.\\
This distance between mutations is used to cluster all mutations into an
optimal number K of consensus clusters. K is chosen to minimise the
Proportion of Ambiguous Clusters (PAC)
\cite{senbabaoglu_critical_2014} of the consensus matrices
corresponding to K.\\
Let's first cluster mutations based on their distances for a
given number of clusters K=4.\\
<<clusterData, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
## similarity between two vectors of assignments:
## number of methods agreeing between two "-"-separated assignment strings
votedist <- function(g1,g2)
{
    g1 <- suppressWarnings(as.numeric(strsplit(g1,split="-")[[1]]))
    g2 <- suppressWarnings(as.numeric(strsplit(g2,split="-")[[1]]))
    sum(g1==g2,na.rm=TRUE)
}
## distance matrix between pairs of unique vectors of assignments
votedist.matrix <- function(allgs,nMethods)
{
m <- matrix(0,length(allgs),length(allgs))
for(i in 1:(length(allgs)-1))
for(j in (i+1):length(allgs))
{
m[i,j] <- votedist(allgs[i],allgs[j])
m[j,i] <- votedist(allgs[i],allgs[j])
}
rownames(m) <- allgs
colnames(m) <- allgs
nMethods-m
}
## fast consensus clustering using hclust with Ward linkage (method="ward.D")
fastConsensusClustering <- function(allA,
nbClusters,
keepnames=NULL)
{
nMut <- length(allA[[1]])
nMethods <- length(allA)
matA <- t(sapply(allA,function(x) x))
clsA <- NULL
for(i in 1:nrow(matA))
clsA <- paste(clsA,matA[i,],sep=if(i==1) "" else "-")
vuA <- sort(table(clsA),decreasing=T)
distF <- as.dist(votedist.matrix(names(vuA),nMethods)*max(vuA)*10)
hc <- hclust(distF,method="ward.D")
lClusts <- lapply(nbClusters,function(nC) cutree(hc,k=nC))
return(lClusts)
}
## run and print results
lResultClusters <- fastConsensusClustering(allA,nbClusters=4)
print(lResultClusters)
@
We can see from the printed results that each mutation has been
grouped together with all other mutations sharing the same vector of
assignments.
We are actually clustering unique vectors of assignments rather than
individual mutations. This proves very useful when clustering large
numbers of mutations, as the time complexity of the method depends not
on the number of mutations but on the number of unique vectors of assignments.\\
The clusters assigned are not necessarily ordered with respect to the
underlying CCFs (see example below).
<<getCCF, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
## compute consensus ccfs
clsA <- NULL
for(i in 1:length(allA))
clsA <- paste(clsA,allA[[i]],sep=if(i==1) "" else "-")
ccClusters <- unique(lResultClusters[[1]])
ccClusters
meanCCFs <- sapply(ccClusters,
function(x)
mean(ccfs[lResultClusters[[1]][clsA]==x]))
## print ccfs
print(meanCCFs)
@
Let's try with K=3 clusters.
<<optimalK2, eval=T, size="scriptsize", fig.width=6, fig.height=4, out.width='1.0\\linewidth'>>=
lResultClusters <- fastConsensusClustering(allA,nbClusters=3)
## final consensus clusters with K=3
print(lResultClusters)
clsA <- NULL
for(i in 1:length(allA))
clsA <- paste(clsA,allA[[i]],sep=if(i==1) "" else "-")
ccClusters <- unique(lResultClusters[[1]])
ccClusters
meanCCFs <- sapply(ccClusters,
function(x)
mean(ccfs[lResultClusters[[1]][clsA]==x]))
## final CCFs with K=3
print(meanCCFs)
@
To obtain the optimal number of clusters \(optK\), we derive consensus matrices for
\(optK\in[2,\max(K_{methods})]\). If a majority of methods find \(K=1\), we
set \(optK=1\). Otherwise, for each value of \(optK\), we repeat the consensus
clustering procedure 100 times, after independent sampling with
replacement from the methods. We then obtain one consensus matrix \(CM_K\) for
each value of K, giving the proportion of times each pair of mutations was
co-clustered. We take \(optK=\operatorname{arg\,min}_K \mathrm{PAC}(CM_K)\) \cite{senbabaoglu_critical_2014}.\\
<<findOptimalK, eval=T, size="scriptsize", fig.width=6, fig.height=4, out.width='1.0\\linewidth'>>=
dyn.load("scoringlite.so")
makeHardAss <- function(v)
{
matrix(.C("hardass",
as.integer(length(v)),
as.integer(v),
as.integer(rep(0,length(v)*length(v))))[[3]],
length(v),length(v))
}
CCM <- function(ccm, allCC,ii,jj,repeats,i,rn)
{
for(j in 1:repeats)
{
v <- allCC[[ii]][[jj]][[j]][[i]][rn]
if(any(is.na(v)))
    v[is.na(v)] <- (max(v,na.rm=TRUE)+1):(max(v,na.rm=TRUE)+sum(is.na(v)))
ccm <- ccm+makeHardAss(v)
gc()
}
ccm
}
chooseOptimalK <- function(ccm,maxK)
{
## from Dr. Yasin Şenbabaoğlu:
## shenbaba.weebly.com/blog/how-to-use-the-pac-measure-in-consensus-clustering
Kvec = 2:maxK
x1 = 0.1; x2 = 0.9 ## thresholds defining the intermediate sub-interval
PAC = rep(NA,length(Kvec))
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
M = ccm[[i]]
Fn = ecdf(M[lower.tri(M)])
PAC[i-1] = Fn(x2) - Fn(x1)
}
## The optimal K
optK = Kvec[which.min(PAC)]
list(optK=optK,PAC=PAC)
}
votedist.matrix.nbmut <- function(allgs,tA)
{
m <- matrix(NA,length(allgs),length(allgs))
for(i in 1:(length(allgs)-1))
for(j in (i+1):length(allgs))
m[j,i] <- m[i,j] <- -max(tA[allgs[i]],tA[allgs[j]])
rownames(m) <- allgs
colnames(m) <- allgs
m
}
fastConsensusClustering <- function(allA,
nbClusters,
keepnames)
{
nMut <- length(allA[[1]])
nMethods <- length(allA)
matA <- t(sapply(allA,function(x) x))
clsA <- NULL
for(i in 1:nrow(matA))
clsA <- paste(clsA,matA[i,],sep=if(i==1) "" else "-")
vuA <- sort(table(clsA),decreasing=T)
dist1 <- votedist.matrix(names(vuA),nMethods)*max(vuA)*10
dist2 <- votedist.matrix.nbmut(names(vuA),vuA)
distF <- as.dist(dist1+dist2)
hc <- hclust(distF,method="ward.D")
lClusts <- lapply(nbClusters,function(nC) cutree(hc,k=nC))
return(lapply(lClusts,function(x){
clusts <- x[clsA]
names(clusts) <- keepnames
clusts
}))
}
consensusMatrix <- function(allA,
pMethods=c(0.5,0.75,0.8,1,1.1,1.2),
pFeatures=c(0.5,0.75,0.8,1,1.1,1.2),
repeats=2)
{
nbClustersMethods <-sapply(allA,function(x) length(unique(x[!is.na(x)])))
if(median(nbClustersMethods)==1) return(1)
nbClusters <- max(nbClustersMethods)
nbMethods <- length(allA)
nbFeatures <- length(allA[[1]])
timeCC <- system.time(allCC <- lapply(pMethods,function(x)
lapply(pFeatures,function(y)
lapply(1:repeats,function(smp)
{
keepMethod <- sample(1:nbMethods,round(nbMethods*x),rep=T)
keepFeatures <- sample(1:nbFeatures,round(nbFeatures*y),rep=T)
lAa <- lapply(keepMethod,function(a)
{
allA[[a]][keepFeatures]
})
fastConsensusClustering(lAa,
1:nbClusters,
keepnames=paste("snv",
keepFeatures,
sep=":"))
}))))
clsA <- paste("snv",1:length(allA[[1]]),sep=":")
l <- length(clsA)
ccm <- list()
for(i in 2:nbClusters)
{
ccm[[i]] <- matrix(0,l,l)
rownames(ccm[[i]]) <- colnames(ccm[[i]]) <- clsA
for(ii in 1:length(pMethods))
{
for(jj in 1:length(pFeatures))
{
ccm[[i]] <- CCM(ccm[[i]],allCC,ii,jj,repeats,i,clsA)
}
}
}
K <- chooseOptimalK(lapply(ccm,function(x)
{
x/(length(pMethods)*length(pFeatures)*repeats)
}),
nbClusters)
list(K=K,ccm=ccm,timeCC=timeCC)
}
ccm <- consensusMatrix(allA,pMethods=1,pFeatures=1,repeats=100)
str(ccm)
plotCCM <- function(ccm,...)
{
par(mar=c(0,0,3,0))
plot(0,0,xaxt="n",yaxt="n",frame=F,xlab="",ylab="",col=rgb(0,0,0,0),
xlim=c(0,1),ylim=c(0,1),...)
rasterImage(ccm/100,0,0,1,1)
}
par(mfcol=c(2,2))
par(mar=c(0,0,3,0))
plot.new()
tmp_null <- lapply(2:4,function(x) plotCCM(ccm$ccm[[x]],
main=if(x==ccm$K$optK)
paste0("optimal K=",x,
" PAC=",signif(ccm$K$PAC[x-1],2))
else paste0("K=",x,
" PAC=",signif(ccm$K$PAC[x-1],2))))
@
For this dummy example we find that the PAC is minimal for K=4, which
is reflected in the consensus matrices (see heatmaps).
Once the optimal K is defined, we can either run
fastConsensusClustering with K=optK or use the consensus
matrix to cluster mutations together. We go for the latter.
<<findClustsOptimalK, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
findClusts <- function(ccm,nbClust)
{
hc <- hclust(as.dist(100-ccm),method="ward.D")
clusts <- cutree(hc,k=nbClust)
}
finalClusts <- findClusts(ccm$ccm[[ccm$K$optK]],ccm$K$optK)
table(finalClusts)
@
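The consensus CCF of each final cluster can then be obtained as the read-weighted average of its mutations' CCFs. A minimal sketch, reusing the simulated coverages from above as stand-ins for the numbers of aligned reads:
<<finalCCFs, eval=T, size="scriptsize">>=
## per-mutation read counts: reuse the simulated coverages as a stand-in
reads <- c(cov, cov, cov2)
## read-weighted average CCF per final consensus cluster
finalCCFs <- sapply(sort(unique(finalClusts)), function(k)
    weighted.mean(ccfs[finalClusts==k], w=reads[finalClusts==k]))
print(finalCCFs)
@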
\section*{Conclusion}
This method takes, for each input method, the cluster assignments
(=cluster IDs) and CCFs of all mutations, and returns the consensus cluster
assignments, the corresponding optimal number of clusters and their
consensus CCFs.
<<sessionInfo, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
sessionInfo()
@
\bibliographystyle{unsrt}
\bibliography{bibCC}
\end{document}