---
author:
- Maxime Tarabichi
bibliography:
- 'bibCC.bib'
title: 'ClusterID-based consensus clustering'
---
```{r setup, include=FALSE, cache=FALSE}
library(knitr)
# set global chunk options
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(formatR.arrow=TRUE,width=90)
```
# Description
This document describes the concepts behind cluster-ID-based consensus clustering (CICC), illustrated on dummy data. The current code has evolved from the version shown here. Figures are available in the PDF in the description folder.
## Consensus cluster assignment
Let us use $m$ methods $M_{1\dotsc m}$ to cluster $v$ single nucleotide
variants $V_{1\dotsc v}$.\
Each method $M_i$ assigns to each variant $V_j$ one cluster ID
$$A(M_i,V_j)$$\
For each variant $V_j$, we write the vector of assignment\
$$\vec{VA}(V_j)=[A(M_1,V_j), A(M_2, V_j), \dotsc, A(M_m, V_j)]$$\
\
Two mutations $V_a$ and $V_b$ sharing the same vector of assignments,
i.e. $\vec{VA}(V_a)=\vec{VA}(V_b)$, have been assigned to the same
cluster by all methods.\
Across all mutations, there will be $u$ unique vectors of assignment
$\vec{UVA}_{1\dotsc u}$, shared by $N_{1\dotsc u}$ mutations,
respectively.\
For each unique vector of assignment $\vec{UVA}_k$, we compute a
distance $d_{kl}$ to every $\vec{UVA}_l$. That distance is defined as
the sum of two terms $$d_{kl}=d_1(k,l)+d_2(k,l)$$ with:\
$$d_1(k,l)=C\,\max(N_{1\dotsc u})\sum_{i=1}^{m}\mathbf{1}\!\left[\vec{UVA}_k[i]\neq\vec{UVA}_l[i]\right]$$
$$d_2(k,l)=-\max(N_k,N_l)$$ C is set to 10 and ensures that $d_1$
(disagreement in method assignments) prevails over $d_2$ (number of
mutations in a cluster). Next we perform hierarchical clustering on the
$u \times u$ distance matrix $D=(d_{kl})$ using Ward's criterion and cut
the tree to obtain the desired number of clusters of $\vec{UVA}$. This
method thus takes the final number of consensus clusters as input.\
Finally, given the set of unique vectors of assignment $\vec{UVA}s_c$ in
each consensus cluster $c$, mutations $V_i$ for which
$\vec{VA}(V_i)\in \vec{UVA}s_c$ are assigned to $c$.\
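As a minimal sketch of this distance on two toy vectors of assignment (the values of `uva_k`, `uva_l`, `N` and `C` below are illustrative, not part of the package):
```{r distanceSketch, eval=FALSE, size="scriptsize"}
## minimal sketch of d_kl = d1 + d2 for two unique vectors of assignment
uva_k <- c(1,1,1,1,4,4)      # assignments of unique vector k across 6 methods
uva_l <- c(2,3,3,3,1,1)      # assignments of unique vector l across the same methods
N     <- c(800,700,500)      # hypothetical numbers of mutations per unique vector
C     <- 10
d1 <- C*max(N)*sum(uva_k!=uva_l)  # disagreement in method assignments, scaled
d2 <- -max(N[1],N[2])             # -max(N_k,N_l); here N[1]=N_k and N[2]=N_l
d  <- d1+d2
```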
## Optimal K
To obtain the optimal number of clusters K, we derive consensus matrices
for $K\in[2,\max(K_{methods})]$. If a majority of methods report K=1, we
set K=1. Otherwise, for each value of K, we repeat the consensus
clustering procedure 100 times, after independent sampling with
replacement from the methods. We then obtain a consensus matrix for each
value of K, corresponding to the proportion of times each pair of
mutations was clustered together. We take the optimal K as the one
yielding the matrix with minimal PAC [@senbabaoglu_critical_2014].\
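As a point of reference, the PAC of a consensus matrix with entries in [0,1] can be sketched as below; this mirrors the `chooseOptimalK` function used in the example further down.
```{r pacSketch, eval=FALSE, size="scriptsize"}
## minimal sketch: PAC of a consensus matrix M (entries in [0,1])
pac <- function(M, x1=0.1, x2=0.9)
{
    Fn <- ecdf(M[lower.tri(M)])  # ECDF of the pairwise consensus values
    Fn(x2)-Fn(x1)                # proportion of pairs in the ambiguous interval
}
```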
## Consensus CCF
The CCF of each consensus cluster can then be calculated by summarising
the CCFs of the individual mutations assigned to that cluster. We take
the weighted average of these CCFs, with the weight of each mutation
proportional to its number of aligned reads.\
This last step requires computing a consensus CCF for each mutation. We
rescale the CCF from each method using the consensus purity
($CCF_{rescaled}=\frac{purity_{method}}{purity_{consensus}}CCF_{method}$)
and take the median across methods.\
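A minimal sketch of this step, assuming hypothetical inputs `ccf_methods` (a mutations x methods matrix of per-method CCFs), `purity_methods`, `purity_consensus` and `nReads` (per-mutation read counts), none of which are part of the dummy example below:
```{r consensusCCFSketch, eval=FALSE, size="scriptsize"}
## rescale each method's CCFs to the consensus purity, then take the per-mutation median
ccf_rescaled  <- sweep(ccf_methods, 2, purity_methods/purity_consensus, "*")
ccf_consensus <- apply(ccf_rescaled, 1, median, na.rm=TRUE)
## consensus cluster CCF: read-count-weighted average over the mutations it contains
clusterCCF <- function(ccf_consensus, nReads, inCluster)
    weighted.mean(ccf_consensus[inCluster], w=nReads[inCluster])
```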
# Example
Let's generate simple dummy data for illustration purposes.
```{r generateData, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'}
## #####################################
## generate dummy data
## list with the cluster assignments of each method
allA <- list(method1=c(rep(1,1000),rep(2,1000)),
             method2=c(rep(1,800),rep(2,900),rep(3,300)),
             method3=c(rep(1,800),rep(2,600),rep(3,600)),
             method4=c(rep(1,1200),rep(2,600),rep(3,200)),
             method5=c(rep(4,200),rep(3,500),rep(2,700),rep(1,600)),
             method6=c(rep(4,600),rep(3,800),rep(2,400),rep(1,200)))
allA <- lapply(allA,function(x) {names(x) <- paste("mutation",1:2000,sep=""); x})
sapply(allA,head)
## simulate a dummy underlying CCF: three clusters around 100%, 66% and 20% CCF
cov <- rpois(700,70)
cov2 <- rpois(600,70)
ccfs <- c(sort(rbinom(700, cov, .5)*2/cov, decreasing=T),
          sort(rbinom(700, cov, .33)*2/cov, decreasing=T),
          sort(rbinom(600, cov2, .1)*2/cov2, decreasing=T))
## #####################################
## plot the CCFs with the true cluster locations
hist(ccfs,100)
abline(v=c(1,.66,.2),lwd=10,col=rgb(.5,0,0,.6))
## #####################################
```
This is a dummy (not necessarily realistic) example to illustrate how
the method treats the typical input, i.e. a list of vectors of cluster
assignments (cluster IDs). We start with the cluster assignments from 6
methods for the same 2,000 mutations. In this example, the truth has
three clusters of 700, 700, and 600 mutations, respectively. For
illustration purposes, their cancer cell fractions (CCFs) are modelled
as binomials around 100%, 66% and 20% CCF, respectively. However, only
the cluster assignments are actually used for consensus clustering.\
In this example the methods find 2, 3, 3, 3, 4 and 4 clusters,
respectively, and each differs slightly in its cluster assignments.
Unlike the first four methods, the last two methods have assigned the
highest cluster ID to the lowest-CCF cluster. Because the cluster-ID
consensus clustering is not based on the CCF, this has no impact on the
final consensus clustering.\
The first two mutations are assigned to cluster1, cluster1, cluster1,
cluster1, cluster4 and cluster4, respectively, by each of the methods,
and therefore their vector of assignments is defined as 1-1-1-1-4-4. The
last mutation is assigned to cluster2, cluster3, cluster3, cluster3,
cluster1 and cluster1, respectively, and its vector of assignments is
defined as 2-3-3-3-1-1.\
The consensus clustering distance between the first two mutations is
minimal (d=0), since they have been assigned to the same cluster by all
methods. The consensus clustering distance between these first two
mutations and the last mutation is maximal (d=6), since all methods have
assigned them to different clusters.\
This distance between mutations is used to cluster all mutations into an
optimal number K of consensus clusters. K is chosen to minimise the
proportion of ambiguous clustering (PAC) [@senbabaoglu_critical_2014] of
the corresponding consensus matrices.\
Let's first cluster mutations based on their distances for a given
number of clusters, K=4.\
```{r clusterData, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'}
## similarity between two vectors of assignments:
## number of methods agreeing on the cluster ID
votedist <- function(g1,g2)
{
    suppressWarnings({
        g1 <- as.numeric(strsplit(g1,split="-")[[1]])
        g2 <- as.numeric(strsplit(g2,split="-")[[1]])
        sum(g1==g2,na.rm=T)
    })
}
## distance matrix between pairs of unique vectors of assignments
votedist.matrix <- function(allgs,nMethods)
{
    m <- matrix(0,length(allgs),length(allgs))
    for(i in 1:(length(allgs)-1))
        for(j in (i+1):length(allgs))
        {
            m[i,j] <- votedist(allgs[i],allgs[j])
            m[j,i] <- m[i,j]
        }
    rownames(m) <- allgs
    colnames(m) <- allgs
    nMethods-m
}
## fast consensus clustering using hclust with Ward's criterion
fastConsensusClustering <- function(allA,
                                    nbClusters,
                                    keepnames=NULL)
{
    nMut <- length(allA[[1]])
    nMethods <- length(allA)
    matA <- t(sapply(allA,function(x) x))
    ## build the vector of assignments of each mutation, e.g. "1-1-1-1-4-4"
    clsA <- NULL
    for(i in 1:nrow(matA))
        clsA <- paste(clsA,matA[i,],sep=if(i==1) "" else "-")
    ## unique vectors of assignments and their numbers of mutations
    vuA <- sort(table(clsA),decreasing=T)
    ## distance d1 only: scaled number of disagreeing methods
    distF <- as.dist(votedist.matrix(names(vuA),nMethods)*max(vuA)*10)
    hc <- hclust(distF,method="ward.D")
    lClusts <- lapply(nbClusters,function(nC) cutree(hc,k=nC))
    return(lClusts)
}
## run and print results
lResultClusters <- fastConsensusClustering(allA,nbClusters=4)
print(lResultClusters)
```
We can see after printing the results that each mutation has been
grouped together with all other mutations sharing the same vector of
assignments. We are in fact clustering unique vectors of assignments
rather than individual mutations. This proves very useful when
clustering large numbers of mutations, as the time complexity of the
method does not depend on the number of mutations but on the number of
unique vectors of assignments, as the quick check below illustrates.\
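```{r uniqueVAcheck, eval=FALSE, size="scriptsize"}
## quick check on the dummy data: the 2,000 mutations collapse to only a few
## unique vectors of assignments, which is what the hierarchical clustering operates on
clsA <- apply(sapply(allA, function(x) x), 1, paste, collapse="-")
length(clsA)          # 2,000 mutations
length(unique(clsA))  # far fewer unique vectors of assignments
```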
The clusters assigned are not necessarily ordered with respect to the
underlying CCFs (see example below).
```{r getCCF, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'}
## compute consensus CCFs
## rebuild the vector of assignments of each mutation
clsA <- NULL
for(i in 1:length(allA))
    clsA <- paste(clsA,allA[[i]],sep=if(i==1) "" else "-")
## consensus cluster IDs
ccClusters <- unique(lResultClusters[[1]])
ccClusters
## mean CCF of the mutations assigned to each consensus cluster
meanCCFs <- sapply(ccClusters,
                   function(x)
                       mean(ccfs[lResultClusters[[1]][clsA]==x]))
## print CCFs
print(meanCCFs)
```
Let's try with K=3 clusters.
```{r optimalK2, eval=T, size="scriptsize", fig.width=6, fig.height=4, out.width='1.0\\linewidth'}
lResultClusters <- fastConsensusClustering(allA,nbClusters=3)
## consensus clusters with K=3
print(lResultClusters)
clsA <- NULL
for(i in 1:length(allA))
    clsA <- paste(clsA,allA[[i]],sep=if(i==1) "" else "-")
ccClusters <- unique(lResultClusters[[1]])
ccClusters
meanCCFs <- sapply(ccClusters,
                   function(x)
                       mean(ccfs[lResultClusters[[1]][clsA]==x]))
## consensus CCFs with K=3
print(meanCCFs)
```
To obtain the optimal number of clusters $optK$, we derive consensus
matrices for $optK\in[2,\max(K_{methods})]$. If a majority of methods
have $K=1$, we set $optK=1$. Otherwise, for each value of $optK$, we
repeat the consensus clustering procedure 100 times, after independent
sampling with replacement from the methods. We then obtain one consensus
matrix $CM(K)$ for each value of K, corresponding to the proportion of
times the mutations were co-clustered. We take
$optK=\operatorname{arg\,min}_K \mathrm{PAC}(CM(K))$
[@senbabaoglu_critical_2014].\
```{r findOptimalK, eval=T, size="scriptsize", fig.width=6, fig.height=4, out.width='1.0\\linewidth'}
## load the compiled helper (scoringlite.so must be available in the working directory)
dyn.load("scoringlite.so")
## build the hard-assignment co-clustering matrix of a cluster-assignment vector v
makeHardAss <- function(v)
{
    matrix(.C("hardass",
              as.integer(length(v)),
              as.integer(v),
              as.integer(rep(0,length(v)*length(v))))[[3]],
           length(v),length(v))
}
## accumulate the co-clustering counts over the repeats for a given K (index i)
CCM <- function(ccm,allCC,ii,jj,repeats,i,rn)
{
    for(j in 1:repeats)
    {
        v <- allCC[[ii]][[jj]][[j]][[i]][rn]
        ## mutations not sampled in this repeat get their own singleton clusters
        v[is.na(v)] <- (max(v,na.rm=T)+1):(max(v,na.rm=T)+sum(is.na(v)))
        ccm <- ccm+makeHardAss(v)
        gc()
    }
    ccm
}
## PAC-based choice of the optimal K
chooseOptimalK <- function(ccm,maxK)
{
    ## from Dr. Yasin Şenbabaoğlu:
    ## shenbaba.weebly.com/blog/how-to-use-the-pac-measure-in-consensus-clustering
    Kvec = 2:maxK
    x1 = 0.1; x2 = 0.9 ## thresholds defining the intermediate sub-interval
    PAC = rep(NA,length(Kvec))
    names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
    for(i in Kvec){
        M = ccm[[i]]
        Fn = ecdf(M[lower.tri(M)])
        PAC[i-1] = Fn(x2) - Fn(x1)
    }
    ## the optimal K
    optK = Kvec[which.min(PAC)]
    list(optK=optK,PAC=PAC)
}
## distance d2: minus the larger of the two clusters' mutation counts
votedist.matrix.nbmut <- function(allgs,tA)
{
    m <- matrix(NA,length(allgs),length(allgs))
    for(i in 1:(length(allgs)-1))
        for(j in (i+1):length(allgs))
            m[j,i] <- m[i,j] <- -max(tA[allgs[i]],tA[allgs[j]])
    rownames(m) <- allgs
    colnames(m) <- allgs
    m
}
## full consensus clustering: d = d1 + d2, returning per-mutation assignments
fastConsensusClustering <- function(allA,
                                    nbClusters,
                                    keepnames)
{
    nMut <- length(allA[[1]])
    nMethods <- length(allA)
    matA <- t(sapply(allA,function(x) x))
    clsA <- NULL
    for(i in 1:nrow(matA))
        clsA <- paste(clsA,matA[i,],sep=if(i==1) "" else "-")
    vuA <- sort(table(clsA),decreasing=T)
    dist1 <- votedist.matrix(names(vuA),nMethods)*max(vuA)*10
    dist2 <- votedist.matrix.nbmut(names(vuA),vuA)
    distF <- as.dist(dist1+dist2)
    hc <- hclust(distF,method="ward.D")
    lClusts <- lapply(nbClusters,function(nC) cutree(hc,k=nC))
    return(lapply(lClusts,function(x){
        clusts <- x[clsA]
        names(clusts) <- keepnames
        clusts
    }))
}
## build consensus matrices for K in 2:max(K_methods) and choose the optimal K by PAC
consensusMatrix <- function(allA,
                            pMethods=c(0.5,0.75,0.8,1,1.1,1.2),
                            pFeatures=c(0.5,0.75,0.8,1,1.1,1.2),
                            repeats=2)
{
    nbClustersMethods <- sapply(allA,function(x) length(unique(x[!is.na(x)])))
    ## if a majority of methods report a single cluster, set K=1
    if(median(nbClustersMethods)==1) return(1)
    nbClusters <- max(nbClustersMethods)
    nbMethods <- length(allA)
    nbFeatures <- length(allA[[1]])
    ## repeated consensus clustering after sampling methods (and features) with replacement
    timeCC <- system.time(allCC <- lapply(pMethods,function(x)
        lapply(pFeatures,function(y)
            lapply(1:repeats,function(smp)
            {
                keepMethod <- sample(1:nbMethods,round(nbMethods*x),rep=T)
                keepFeatures <- sample(1:nbFeatures,round(nbFeatures*y),rep=T)
                lAa <- lapply(keepMethod,function(a)
                {
                    allA[[a]][keepFeatures]
                })
                fastConsensusClustering(lAa,
                                        1:nbClusters,
                                        keepnames=paste("snv",
                                                        keepFeatures,
                                                        sep=":"))
            }))))
    clsA <- paste("snv",1:length(allA[[1]]),sep=":")
    l <- length(clsA)
    ## co-clustering count matrix for each K
    ccm <- list()
    for(i in 2:nbClusters)
    {
        ccm[[i]] <- matrix(0,l,l)
        rownames(ccm[[i]]) <- colnames(ccm[[i]]) <- clsA
        for(ii in 1:length(pMethods))
        {
            for(jj in 1:length(pFeatures))
            {
                ccm[[i]] <- CCM(ccm[[i]],allCC,ii,jj,repeats,i,clsA)
            }
        }
    }
    ## normalise counts to proportions and choose the K minimising the PAC
    K <- chooseOptimalK(lapply(ccm,function(x)
    {
        x/(length(pMethods)*length(pFeatures)*repeats)
    }),
    nbClusters)
    list(K=K,ccm=ccm,timeCC=timeCC)
}
ccm <- consensusMatrix(allA,pMethods=1,pFeatures=1,repeats=100)
str(ccm)
## plot a consensus matrix as a heatmap (counts divided by the 100 repeats)
plotCCM <- function(ccm,...)
{
    par(mar=c(0,0,3,0))
    plot(0,0,xaxt="n",yaxt="n",frame=F,xlab="",ylab="",col=rgb(0,0,0,0),
         xlim=c(0,1),ylim=c(0,1),...)
    rasterImage(ccm/100,0,0,1,1)
}
par(mfcol=c(2,2))
par(mar=c(0,0,3,0))
plot.new()
tmp_null <- lapply(2:4,function(x) plotCCM(ccm$ccm[[x]],
                                           main=if(x==ccm$K$optK)
                                               paste0("optimal K=",x,
                                                      " PAC=",signif(ccm$K$PAC[x-1],2))
                                           else paste0("K=",x,
                                                       " PAC=",signif(ccm$K$PAC[x-1],2))))
```
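The `makeHardAss` helper above relies on the compiled `scoringlite.so` library. If that shared object is unavailable, a pure-R equivalent could be sketched as follows, under the assumption that `hardass` builds a binary co-clustering indicator matrix:
```{r hardAssFallback, eval=FALSE, size="scriptsize"}
## pure-R sketch of a hard-assignment co-clustering matrix:
## entry (i,j) is 1 if mutations i and j share a cluster ID, 0 otherwise
makeHardAssR <- function(v) outer(v, v, "==") * 1L
```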
For this dummy example we find that the PAC is minimal for K=4, which is
reflected in the consensus matrices (see heatmaps).
Once the optimal K is determined, we can either run
`fastConsensusClustering` with K set to the optimal K, or use the
consensus matrix to cluster mutations together. We go for the latter.
```{r findClustsOptimalK, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'}
## cluster mutations from the consensus matrix (counts out of 100 repeats)
findClusts <- function(ccm,nbClust)
{
    hc <- hclust(as.dist(100-ccm),method="ward.D")
    cutree(hc,k=nbClust)
}
finalClusts <- findClusts(ccm$ccm[[ccm$K$optK]],ccm$K$optK)
table(finalClusts)
```
# Conclusion
This method takes the cluster assignments (cluster IDs) and CCFs of all
mutations from each method, and returns the consensus cluster
assignments, the corresponding optimal number of clusters, and their
consensus CCFs.
```{r sessionInfo, eval=T, size="scriptsize", fig.width=10, fig.height=4, out.width='1.0\\linewidth'}
sessionInfo()
```
\bibliographystyle{unsrt}