---
title: "AluNet Supplementary"
author: "Brandt Bessell, Xuyuan Zhang, Sicheng Qian"
date: "2023-12-08"
---
This section walks through installing our package and replicating our benchmarking results. First, set your working directory:
```{r, results='hide', message=FALSE, warning=FALSE}
# change this!
setwd("C:/Users/bbessell/OneDrive - Michigan Medicine/Documents/Courses/BIOSTAT615/test_package")
#setwd("C:/Users/bbessell/Projects/GitHub/AluNet")
```
To run our package, first install it along with the other dependencies we will use for benchmarking:
```{r, results='hide', message=FALSE, warning=FALSE}
if (!require("igraph")) {
install.packages("igraph")
}
if (!require("microbenchmark")) {
install.packages("microbenchmark")
}
if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("babessell1/AluNet/AluNet")
# if the above doesn't work, try
#url <- "https://github.com/babessell1/AluNet/raw/main/AluNet_1.0.tar.gz"
#download.file(url, "AluNet_1.0.tar.gz")
#install.packages("AluNet_1.0.tar.gz",repo=NULL)
```
Then we load it, along with igraph and microbenchmark, so we can benchmark against igraph's implementation, the current gold standard.
```{r, results='hide', message=FALSE, warning=FALSE}
library("igraph")
library("microbenchmark")
library("AluNet")
```
First, we check against several toy models. Note that the RNG differs between Leiden implementations, so we run several seeds for each to show that they behave similarly.
To do this we define a function to run igraph's Leiden algorithm and plot the output; we define it here because it is not part of our package.
```{r}
runIgraph <- function(gdf, iterations, gamma, theta) {
  ldc <- cluster_leiden(
    gdf,
    resolution_parameter = gamma,
    objective_function = "CPM",
    weights = g_list[["weight"]],  # note: g_list is read from the calling environment
    beta = theta,
    n_iterations = iterations,
    vertex_weights = rep(1, length(unique(c(g_list[["to"]], g_list[["from"]]))))
  )
  return(ldc)
}
plotIgraph <- function(gdf, ldc) {
  # map communities to a rainbow colormap
  num_communities <- length(unique(ldc$membership))
  community_colors <- rainbow(num_communities)
  colors <- community_colors[as.factor(ldc$membership)]
  plot(
    gdf,
    layout = layout_with_fr(gdf),
    vertex.color = colors,   # color nodes based on community
    vertex.label = V(gdf)$community,
    edge.arrow.size = 0,     # set arrow size to 0 to remove arrows
    main = "iGraph's C-based clustering"
  )
  # add a legend for community colors
  legend("topright", legend = unique(ldc$membership),
         col = community_colors, pch = 19,
         title = "Communities", cex = 0.8)
}
```
Our first toy model is "noperfectmatching", which has well-defined communities except for the central node, which can belong to any of them depending on the starting conditions. This model is simple and should converge in a single iteration, even though we set a maximum of 10.
We set the resolution parameter, gamma, to 0.1, which should be small enough to identify granular communities and make it easier to compare our accuracy against igraph. We set theta to 0.01, the middle of the range recommended by the authors of the Leiden algorithm, which should give a moderate degree of randomness in the refinement step. Since the toy models come from the igraph package, we also convert them into the format our function expects: a list of connections with "to", "from", and "weight" vectors.
```{r}
toy = "noperfectmatching"
iterations <- 10
gamma <- 0.1   # resolution parameter, gamma > 0
theta <- 0.01  # recommended range: 0.005 to 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
```
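As a quick sanity check (illustrative, not part of the original workflow), we can assert that the edge list has the structure runLeiden expects: three parallel vectors named "to", "from", and "weight" of equal length.

```{r}
# illustrative sanity check on the edge-list format
stopifnot(
  all(c("to", "from", "weight") %in% names(g_list)),
  length(g_list$to) == length(g_list$from),
  length(g_list$to) == length(g_list$weight)
)
```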
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Next we benchmark the model to show that igraph, which is developed and maintained by a large team of computer scientists, is much more optimized than our implementation.
```{r, results='hide', message=FALSE, warning=FALSE}
if (exists(".Random.seed")) rm(.Random.seed)
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 1, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 1, gamma, theta),
times=50
)
```
```{r}
# Print the result
print(rbind(mb_ours, mb_igraph))
```
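To summarize the performance gap as a single number, one can compare median runtimes; this is a sketch using microbenchmark's summary() with a fixed unit so the two objects are directly comparable.

```{r}
# sketch: ratio of median runtimes (ours vs. igraph), both in milliseconds
med_ours <- summary(mb_ours, unit = "ms")$median
med_igraph <- summary(mb_igraph, unit = "ms")$median
med_ours / med_igraph
```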
Next we try a larger, more complex toy model to show how igraph's implementation scales relative to ours: about 4:1 here, compared to 5:1 above, for a graph roughly twice as big.
```{r, results='hide', message=FALSE, warning=FALSE}
toy = "Zachary"
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
if (exists(".Random.seed")) rm(.Random.seed)
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 3, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 3, gamma, theta),
times=50
)
```
```{r}
print(rbind(mb_ours, mb_igraph))
```
Now, let's increase the iteration number to 10 so that the partition can converge, and try several random seeds to see if the sets of partitions produced by each method overlap.
```{r}
iterations <- 10
gamma <- .1 # gamma > 0
theta <- 0.01 # good theta is 0.005 < 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Notice these partitions are slightly different. This graph is more complicated and prone to RNG-driven variation, which cannot be equalized by setting the same seed because the implementations differ. We can rerun the igraph implementation with seed 2 instead of 1 to obtain the same partition.
```{r}
gdf <- make_graph(toy)
set.seed(2)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Let's see if we consistently produce similar community configurations.
```{r}
seed <- 3
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(3)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 4
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(4)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 5
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(5)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
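Beyond eyeballing the plots, partition similarity can be quantified with igraph's compare(). As an illustrative sketch, this computes the adjusted Rand index between the igraph partitions from two of the seeds above; values near 1 indicate near-identical community structure.

```{r}
# sketch: adjusted Rand index between igraph partitions from two seeds
set.seed(4)
m1 <- runIgraph(gdf, iterations, gamma, theta)$membership
set.seed(5)
m2 <- runIgraph(gdf, iterations, gamma, theta)$membership
compare(m1, m2, method = "adjusted.rand")
```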
Increasing the resolution parameter, gamma, should decrease the size of the communities; let's see if that happens.
```{r}
gamma <- 0.5
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
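To see the trend more systematically, here is a sketch that sweeps gamma and counts the resulting igraph communities using the runIgraph helper defined above; the count should grow as gamma increases.

```{r}
# sketch: community count as a function of the resolution parameter
for (g_res in c(0.1, 0.5, 1.0)) {
  set.seed(1)
  res <- runIgraph(gdf, iterations, g_res, theta)
  cat("gamma =", g_res, "->", length(unique(res$membership)), "communities\n")
}
```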
Let's create a temporary directory and make sure we have a few more dependencies needed to download and process the data and plot our results. To download the processed data directly, skip to the next section.
```{r, eval=FALSE}
if (!require("R.utils")) {
install.packages("R.utils")
}
library("R.utils")
if (!require("RColorBrewer")) {
install.packages("RColorBrewer")
}
library(RColorBrewer)
if (!require("dplyr")) {
install.packages("dplyr")
}
library(dplyr)
if (!require("stringr")) {
install.packages("stringr")
}
library(stringr)
dir.create("temp")
```
To show its utility on biological data, we first download some Fit-Hi-C data [@Lee2023]:
```{r, eval=FALSE}
file_path <- paste0(getwd(), "/temp")
download_hic <- function(file_path){
  file_name <- "hic_data.txt.gz"
  url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE148434&format=file&file=GSE148434%5FNS%5Fall%2E5kb%2E1e%2D2%2EFitHiC%2Eall%2Etxt%2Egz"
  dest <- file.path(file_path, file_name)
  download.file(url, dest, mode = "wb")
  # decompress next to the .gz instead of calling setwd(), which would
  # change the working directory for the rest of the notebook
  gunzip(dest, remove = FALSE)
}
download_hic(file_path)
```
Then we download the Dfam mobile element annotations [@Storer2021] (be careful: the file is very large). We place it in a folder in the current working directory:
```{r, eval=FALSE}
download_dfam <- function(file_path){
  url <- "https://www.dfam.org/releases/Dfam_3.8/annotations/hg38/hg38.hits.gz"
  download_directory <- paste(file_path, "uncleaned_data", sep = "/")
  # create the directory if it does not already exist
  if (!file.exists(download_directory)) {
    dir.create(download_directory)
  }
  file_name <- "hg38.hits.gz"
  hg38_file <- paste(download_directory, file_name, sep = "/")
  download.file(url, hg38_file, mode = "wb")
  gunzip(hg38_file, remove = FALSE)
}
download_dfam(file_path)
```
Then we download some example ATAC-seq data for the same tissue and disease context [@Corces2020]:
```{r, eval=FALSE}
download_atac <- function(file_path){
  # readxl is needed here but was not installed above
  if (!require("readxl")) {
    install.packages("readxl")
  }
  library(readxl)
  url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-020-00721-x/MediaObjects/41588_2020_721_MOESM4_ESM.xlsx"
  download_directory <- paste(file_path, "uncleaned_data", sep = "/")
  # create the directory if it does not already exist
  if (!file.exists(download_directory)) {
    dir.create(download_directory)
  }
  file_name <- "atac_data.xlsx"
  atac_file <- paste(download_directory, file_name, sep = "/")
  download.file(url, atac_file, mode = "wb")
  # read the Excel file and save it as CSV
  atac_data <- read_excel(atac_file)
  output_csv_file <- paste(download_directory, "atac_data.csv", sep = "/")
  write.csv(atac_data, output_csv_file, row.names = FALSE)
}
download_atac(file_path)
```
Then we run several helper functions to convert each dataset into a data frame:
```{r, eval=FALSE}
data_dir <- paste0(getwd(), "/temp")
returned_path <- clean_alu.R(
paste0(data_dir, "/hic_data.txt"),
paste0(data_dir, "/uncleaned_data/hg38.hits"),
data_dir
)
```
Then we use the files we made to generate the edges:
```{r, eval=FALSE}
cleaning_dir <- clean_alu(paste0(data_dir, "/highest_probability"), data_dir)
merge_hic_with_alu(data_dir, cleaning_dir)
```
```{r}
dataframe <- read.csv("edges_data_frame.csv")
dataframe$group <- NULL
list <- as.list(dataframe)
```
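Before clustering, it can help to confirm the loaded edge list matches the to/from/weight convention used earlier; a quick illustrative check (assuming the cleaned file follows that convention):

```{r}
# illustrative: inspect the columns runLeiden will receive
str(list)
stopifnot(all(c("to", "from", "weight") %in% names(list)))
```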
```{r}
result <- runLeiden(list, 1, 1, 0.01, 1)
```
If you don't want to download the large datasets, you can instead download our processed data and run our implementation of the Leiden algorithm on it.
```{r}
url <- "https://github.com/babessell1/AluNet/raw/main/final_data.csv"
download.file(url, "final_data.csv")
dataframe_ <- read.csv("final_data.csv")
dataframe_$group <- NULL
list_ <- as.list(dataframe_)
```
Then we run the Leiden algorithm.
```{r}
result <- runLeiden(list_, 10, 1, 0.01, 1)
```
Plot the results and, optionally, save them as a high-DPI PNG by uncommenting the png() and dev.off() lines.
```{r}
#png("alu_plot.png", width = 1000, height = 1000, res = 300)
plotLeiden(result, TRUE)
#dev.off()
```