%% comment line1
%% comment line2
\documentclass{article}
\usepackage{float}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{geometry}
\usepackage{pdfpages}
\usepackage{amsmath}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\setcounter{secnumdepth}{2}
\setcounter{tocdepth}{2}
\usepackage{url}
\usepackage[unicode=true,pdfusetitle,
bookmarks=true,bookmarksnumbered=true,bookmarksopen=true,bookmarksopenlevel=2,
breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false]
{hyperref}
\usepackage[nottoc]{tocbibind}
\hypersetup{
pdfstartview={XYZ null null 1}}
\usepackage{breakurl}
\renewcommand*\contentsname{Outline}
\begin{document}
<<setup, include=FALSE, cache=FALSE>>=
library(knitr)
# set global chunk options
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(formatR.arrow=TRUE,width=90)
@
\title{ClusterID-based consensus clustering}
\author{Maxime Tarabichi}
\maketitle
\section*{Description}
\subsection*{Consensus cluster assignment}
Let us use \(m\) methods \(M_{1,\dotsc, m}\) to cluster \(v\) single
nucleotide variants \(V_{1,\dotsc, v}\). \\
Each method \(M_i\) assigns to each variant \(V_j\) one
cluster ID \[A(M_i,V_j)\]\\
For each variant \(V_j\), we write the vector of
assignments\\
\[\vec{VA}(V_j)=[A(M_1,V_j), A(M_2, V_j), \dotsc, A(M_m, V_j)]\] \\ \\
Two mutations \(V_a\) and \(V_b\) sharing the same vector of assignments, i.e.
\(\vec{VA}(V_a)=\vec{VA}(V_b)\), have been assigned to the same cluster by all
methods. \\
Across all mutations, there are \(u\) unique vectors
of assignment \(\vec{UVA}_{1\dotsc u}\), shared by \(N_{1\dotsc u}\)
mutations, respectively. \\
For each unique vector of assignment \(\vec{UVA}_k\), we
compute a distance \(d_{kl}\) to every \(\vec{UVA}_l\).
That distance is defined as the sum of two terms
\[d_{kl}=d_1(k,l)+d_2(k,l)\]with:\\
\[d_1(k,l)=C\,\max(N_{1\dotsc u})\sum_{i=1}^{m}\mathbf{1}\!\left[\vec{UVA}_{k,i}\neq\vec{UVA}_{l,i}\right]\]
\[d_2(k,l)=-\max(N_k,N_l)\]
\(C\) is set to 10 and ensures that \(d_1\) (the number of methods that disagree
on the assignment) prevails over \(d_2\) (the number of mutations in a cluster).
Next we perform hierarchical clustering on the distance matrix
\(D_{u\times u}=(d_{kl})\) using Ward's criterion and cut the tree to
obtain a desired number of clusters of \(\vec{UVA}\). This method thus takes
the final number of consensus clusters as input.\\
Finally, given the unique vectors of assignment grouped into each consensus cluster \(c\),
every mutation \(V_i\) whose \(\vec{VA}(V_i)\) belongs to cluster \(c\) is
assigned to \(c\). \\
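To make the two distance terms concrete, here is a symbolic illustration with \(m=6\) methods (the vectors are the ones used in the Example section below): if \(\vec{UVA}_k=(1,1,1,1,4,4)\) and \(\vec{UVA}_l=(2,3,3,3,1,1)\), the two vectors disagree for all six methods, so
\[d_{kl}=6\,C\,\max(N_{1\dotsc u})-\max(N_k,N_l),\]
whereas for two vectors that agree for every method the first term vanishes and the distance reduces to \(-\max(N_k,N_l)\), the smallest possible value.\\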
\subsection*{Optimal K}
To obtain the optimal number of clusters K, we derive consensus matrices for
\(K\in[2,\max(K_{methods})]\). If a majority of methods find K=1,
we set K=1. Otherwise, for each value of K, we repeat the consensus
clustering procedure 100 times, after independent sampling with
replacement from the methods. We then obtain a consensus matrix for
each value of K, giving the proportion of times each pair of mutations was
clustered together. We take as optimal K the value yielding the matrix with
minimal PAC \cite{senbabaoglu_critical_2014}.\\
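Concretely, writing \(F_K\) for the empirical cumulative distribution function of the entries of the consensus matrix obtained for \(K\) clusters, the PAC is the fraction of mutation pairs that are co-clustered an ambiguous proportion of the time:
\[\mathrm{PAC}(K)=F_K(x_2)-F_K(x_1),\qquad x_1=0.1,\ x_2=0.9,\]
with the thresholds \(x_1\) and \(x_2\) as in \cite{senbabaoglu_critical_2014}.\\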
\subsection*{Consensus CCF}
The CCFs of each consensus cluster can then be calculated by
summarising the CCFs of the individual mutations assigned to that
cluster. We take the weighted average of these CCFs, with the weight
of each mutation proportional to its number of aligned reads.\\
This last step requires computing a consensus CCF for each
mutation. We rescale the CCF from each method using the consensus purity
(\(CCF_{rescaled}=\frac{purity_{method}}{purity_{consensus}}CCF_{method}\)) and
take the median across methods. \\
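As a minimal sketch of this per-mutation rescaling (the per-method CCFs and purities below are made-up illustrative values, not outputs of the pipeline):
<<ccfRescaleSketch, eval=T, size="scriptsize">>=
## illustrative per-method CCFs and purities for one mutation (made-up values)
ccfByMethod <- c(method1=0.95, method2=1.02, method3=0.90)
purityByMethod <- c(method1=0.60, method2=0.65, method3=0.55)
purityConsensus <- 0.60
## rescale each method's CCF by its purity relative to the consensus purity
ccfRescaled <- (purityByMethod/purityConsensus)*ccfByMethod
## consensus CCF of the mutation: median across methods
median(ccfRescaled)
@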
\newpage
\section*{Example}
Let's generate simple dummy data for illustration purposes.
<<generateData, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
## #####################################
## generate data
## list with cluster assignments of each method
allA <- list(method1=c(rep(1,1000),rep(2,1000)),
method2=c(rep(1,800),rep(2,900), rep(3,300)),
method3=c(rep(1,800),rep(2,600), rep(3,600)),
method4=c(rep(1,1200),rep(2,600), rep(3,200)),
method5=c(rep(4,200),rep(3,500), rep(2,700),rep(1,600)),
method6=c(rep(4,600),rep(3,800), rep(2,400),rep(1,200))
)
allA <- lapply(allA,function(x) {names(x) <- paste("mutation",1:2000,sep="");return(x);})
sapply(allA,head)
## simulate a dummy underlying ccf
cov <- rpois(700,70)
cov2 <- rpois(600,70)
ccfs <- c(sort(rbinom(700, cov, .5)*2/cov,decreasing=T),
sort(rbinom(700, cov, .33)*2/cov,decreasing=T),
sort(rbinom(600, cov, .1)*2/cov2,decreasing=T))
## #####################################
## plot ccfs
hist(ccfs,100)
abline(v=c(1,.66,.2),lwd=10,col=rgb(.5,0,0,.6))
## #####################################
@
This is a dummy (not necessarily realistic) example to illustrate how
the method would treat the typical input, i.e. a list of vectors of
cluster assignments (=cluster IDs).
We start with the cluster assignments from 6 methods for the same
2,000 mutations. In this example, the truth has three clusters
of 700, 700, and 600 mutations, respectively. For illustration purposes, their
cancer cell fractions (CCFs) are modelled as binomials around 100\%, 66\% and 20\% CCF,
respectively. However, only the cluster assignments are actually used for
consensus clustering.\\
In this example the methods find 2, 3, 3, 3, 4 and 4 clusters, respectively, and
differ slightly in their cluster assignments. Unlike the first four methods,
the last two methods assign the highest cluster ID to the lowest-CCF
cluster. Because the cluster-ID consensus clustering is not based
on the CCF, this has no impact on the final consensus clustering.\\
The first two mutations are assigned to cluster 1, cluster 1, cluster 1,
cluster 1, cluster 4 and cluster 4, respectively, by the methods, and therefore their
vector of assignments is 1-1-1-1-4-4.
The last mutation is assigned to cluster 2, cluster 3, cluster 3, cluster 3,
cluster 1 and cluster 1, respectively, and its vector of
assignments is 2-3-3-3-1-1.\\
The consensus clustering distance between the first two mutations is
minimal (d=0 disagreements), since they have been assigned to the same cluster by all methods.
The distance between the first two
mutations and the last mutation is maximal (d=6 disagreements), since all methods have
assigned them to different clusters.\\
This distance between mutations is used to cluster all mutations into an
optimal number K of consensus clusters. K is chosen to minimise the
Proportion of Ambiguous Clusters (PAC)
\cite{senbabaoglu_critical_2014} of the consensus matrices
corresponding to K.\\
Let's first cluster mutations based on their distances for a
given number of clusters K=4.\\
<<clusterData, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
## similarity between two vectors of assignments:
## number of methods agreeing between two "-"-separated assignment strings
votedist <- function(g1,g2)
{
    g1 <- suppressWarnings(as.numeric(strsplit(g1,split="-")[[1]]))
    g2 <- suppressWarnings(as.numeric(strsplit(g2,split="-")[[1]]))
    sum(g1==g2,na.rm=TRUE)
}
## distance matrix between pairs of unique vectors of assignments
votedist.matrix <- function(allgs,nMethods)
{
m <- matrix(0,length(allgs),length(allgs))
for(i in 1:(length(allgs)-1))
for(j in (i+1):length(allgs))
{
m[i,j] <- votedist(allgs[i],allgs[j])
m[j,i] <- votedist(allgs[i],allgs[j])
}
rownames(m) <- allgs
colnames(m) <- allgs
nMethods-m
}
## fast consensus clustering using hclust with Ward linkage (method="ward.D")
fastConsensusClustering <- function(allA,
nbClusters,
keepnames=NULL)
{
nMut <- length(allA[[1]])
nMethods <- length(allA)
matA <- t(sapply(allA,function(x) x))
clsA <- NULL
for(i in 1:nrow(matA))
clsA <- paste(clsA,matA[i,],sep=if(i==1) "" else "-")
vuA <- sort(table(clsA),decreasing=T)
distF <- as.dist(votedist.matrix(names(vuA),nMethods)*max(vuA)*10)
hc <- hclust(distF,method="ward.D")
lClusts <- lapply(nbClusters,function(nC) cutree(hc,k=nC))
return(lClusts)
}
## run and print results
lResultClusters <- fastConsensusClustering(allA,nbClusters=4)
print(lResultClusters)
@
We can see from the printed results that each mutation has been
grouped together with all other mutations sharing the same vector of
assignments.
We are actually clustering unique vectors of assignments rather than
individual mutations. This proves very useful when clustering large
numbers of mutations, as the time complexity of the method depends not
on the number of mutations but on the number of unique vectors of assignments.\\
The clusters assigned are not necessarily ordered with respect to the
underlying CCFs (see example below).
<<getCCF, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
## compute consensus ccfs
clsA <- NULL
for(i in 1:length(allA))
clsA <- paste(clsA,allA[[i]],sep=if(i==1) "" else "-")
ccClusters <- unique(lResultClusters[[1]])
ccClusters
meanCCFs <- sapply(ccClusters,
function(x)
mean(ccfs[lResultClusters[[1]][clsA]==x]))
## print ccfs
print(meanCCFs)
@
Let's try with K=3 clusters.
<<optimalK2, eval=T, size="scriptsize", fig.width=6, fig.height=4, out.width='1.0\\linewidth'>>=
lResultClusters <- fastConsensusClustering(allA,nbClusters=3)
## final consensus clusters with K=3
print(lResultClusters)
clsA <- NULL
for(i in 1:length(allA))
clsA <- paste(clsA,allA[[i]],sep=if(i==1) "" else "-")
ccClusters <- unique(lResultClusters[[1]])
ccClusters
meanCCFs <- sapply(ccClusters,
function(x)
mean(ccfs[lResultClusters[[1]][clsA]==x]))
## final CCFs with K=3
print(meanCCFs)
@
To obtain the optimal number of clusters \(optK\), we derive consensus matrices for
\(optK\in[2,\max(K_{methods})]\). If a majority of methods find \(K=1\), we
set \(optK=1\). Otherwise, for each value of \(optK\), we repeat the consensus
clustering procedure 100 times, after independent sampling with
replacement from the methods. We then obtain one consensus matrix \(CM_K\) for
each value of K, giving the proportion of times each pair of mutations was
co-clustered. We take \(optK=\operatorname{arg\,min}_K \mathrm{PAC}(CM_K)\) \cite{senbabaoglu_critical_2014}.\\
<<findOptimalK, eval=T, size="scriptsize", fig.width=6, fig.height=4, out.width='1.0\\linewidth'>>=
dyn.load("scoringlite.so")
makeHardAss <- function(v)
{
matrix(.C("hardass",
as.integer(length(v)),
as.integer(v),
as.integer(rep(0,length(v)*length(v))))[[3]],
length(v),length(v))
}
CCM <- function(ccm, allCC,ii,jj,repeats,i,rn)
{
for(j in 1:repeats)
{
v <- allCC[[ii]][[jj]][[j]][[i]][rn]
if(any(is.na(v)))
    v[is.na(v)] <- (max(v,na.rm=TRUE)+1):(max(v,na.rm=TRUE)+sum(is.na(v)))
ccm <- ccm+makeHardAss(v)
gc()
}
ccm
}
chooseOptimalK <- function(ccm,maxK)
{
## from Dr. Yasin Şenbabaoğlu:
## shenbaba.weebly.com/blog/how-to-use-the-pac-measure-in-consensus-clustering
Kvec = 2:maxK
x1 = 0.1; x2 = 0.9 ## thresholds defining the intermediate sub-interval
PAC = rep(NA,length(Kvec))
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
M = ccm[[i]]
Fn = ecdf(M[lower.tri(M)])
PAC[i-1] = Fn(x2) - Fn(x1)
}
## The optimal K
optK = Kvec[which.min(PAC)]
list(optK=optK,PAC=PAC)
}
votedist.matrix.nbmut <- function(allgs,tA)
{
m <- matrix(NA,length(allgs),length(allgs))
for(i in 1:(length(allgs)-1))
for(j in (i+1):length(allgs))
m[j,i] <- m[i,j] <- -max(tA[allgs[i]],tA[allgs[j]])
rownames(m) <- allgs
colnames(m) <- allgs
m
}
fastConsensusClustering <- function(allA,
nbClusters,
keepnames)
{
nMut <- length(allA[[1]])
nMethods <- length(allA)
matA <- t(sapply(allA,function(x) x))
clsA <- NULL
for(i in 1:nrow(matA))
clsA <- paste(clsA,matA[i,],sep=if(i==1) "" else "-")
vuA <- sort(table(clsA),decreasing=T)
dist1 <- votedist.matrix(names(vuA),nMethods)*max(vuA)*10
dist2 <- votedist.matrix.nbmut(names(vuA),vuA)
distF <- as.dist(dist1+dist2)
hc <- hclust(distF,method="ward.D")
lClusts <- lapply(nbClusters,function(nC) cutree(hc,k=nC))
return(lapply(lClusts,function(x){
clusts <- x[clsA]
names(clusts) <- keepnames
clusts
}))
}
consensusMatrix <- function(allA,
pMethods=c(0.5,0.75,0.8,1,1.1,1.2),
pFeatures=c(0.5,0.75,0.8,1,1.1,1.2),
repeats=2)
{
nbClustersMethods <-sapply(allA,function(x) length(unique(x[!is.na(x)])))
if(median(nbClustersMethods)==1) return(1)
nbClusters <- max(nbClustersMethods)
nbMethods <- length(allA)
nbFeatures <- length(allA[[1]])
timeCC <- system.time(allCC <- lapply(pMethods,function(x)
lapply(pFeatures,function(y)
lapply(1:repeats,function(smp)
{
keepMethod <- sample(1:nbMethods,round(nbMethods*x),rep=T)
keepFeatures <- sample(1:nbFeatures,round(nbFeatures*y),rep=T)
lAa <- lapply(keepMethod,function(a)
{
allA[[a]][keepFeatures]
})
fastConsensusClustering(lAa,
1:nbClusters,
keepnames=paste("snv",
keepFeatures,
sep=":"))
}))))
clsA <- paste("snv",1:length(allA[[1]]),sep=":")
l <- length(clsA)
ccm <- list()
for(i in 2:nbClusters)
{
ccm[[i]] <- matrix(0,l,l)
rownames(ccm[[i]]) <- colnames(ccm[[i]]) <- clsA
for(ii in 1:length(pMethods))
{
for(jj in 1:length(pFeatures))
{
ccm[[i]] <- CCM(ccm[[i]],allCC,ii,jj,repeats,i,clsA)
}
}
}
K <- chooseOptimalK(lapply(ccm,function(x)
{
x/(length(pMethods)*length(pFeatures)*repeats)
}),
nbClusters)
list(K=K,ccm=ccm,timeCC=timeCC)
}
ccm <- consensusMatrix(allA,pMethods=1,pFeatures=1,repeats=100)
str(ccm)
plotCCM <- function(ccm,...)
{
par(mar=c(0,0,3,0))
plot(0,0,xaxt="n",yaxt="n",frame=F,xlab="",ylab="",col=rgb(0,0,0,0),
xlim=c(0,1),ylim=c(0,1),...)
rasterImage(ccm/100,0,0,1,1)
}
par(mfcol=c(2,2))
par(mar=c(0,0,3,0))
plot.new()
tmp_null <- lapply(2:4,function(x) plotCCM(ccm$ccm[[x]],
main=if(x==ccm$K$optK)
paste0("optimal K=",x,
" PAC=",signif(ccm$K$PAC[x-1],2))
else paste0("K=",x,
" PAC=",signif(ccm$K$PAC[x-1],2))))
@
For this dummy example we find that the PAC is minimal for K=4, which
is reflected in the consensus matrices (see heatmaps).
Once the optimal K is defined, we can either run
fastConsensusClustering with K=optK or use the consensus
matrix to cluster mutations together. We go for the latter.
<<findClustsOptimalK, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
findClusts <- function(ccm,nbClust)
{
hc <- hclust(as.dist(100-ccm),method="ward.D")
clusts <- cutree(hc,k=nbClust)
}
finalClusts <- findClusts(ccm$ccm[[ccm$K$optK]],ccm$K$optK)
table(finalClusts)
@
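The consensus CCF of each final cluster can then be obtained as the read-weighted average of its mutations' CCFs. A minimal sketch, reusing the simulated coverages from above as stand-ins for the numbers of aligned reads:
<<finalCCFs, eval=T, size="scriptsize">>=
## per-mutation read counts: reuse the simulated coverages as a stand-in
reads <- c(cov, cov, cov2)
## read-weighted average CCF per final consensus cluster
finalCCFs <- sapply(sort(unique(finalClusts)), function(k)
    weighted.mean(ccfs[finalClusts==k], w=reads[finalClusts==k]))
print(finalCCFs)
@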
\section*{Conclusion}
This method takes, for each input method, the cluster assignments
(=cluster IDs) and CCFs of all mutations, and returns the consensus cluster
assignments, the corresponding optimal number of clusters and their
consensus CCFs.
<<sessionInfo, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'>>=
sessionInfo()
@
\bibliographystyle{unsrt}
\bibliography{bibCC}
\end{document}