---
title: "Weighted Leiden Algorithm for Modeling Alu-Mediated Enhancer-Promoter Interaction Networks"
author:
- Brandt Bessell ([email protected])
- Xuyuan Zhang ([email protected])
- Sicheng Qian ([email protected])
output:
  pdf_document: default
  html_document:
    df_print: kable
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(kableExtra)
```
# Code Availability
All code for this project is available at:
https://github.com/babessell1/AluNet
# Introduction
Alu elements are transposable elements (TEs) that hijack LINE1-encoded retrotransposition proteins to copy and paste themselves into genomic DNA.[@Doucet2015] Discoveries that transposable element polymorphisms in the human population can modulate gene expression[@Wang2017] and that there is active evolutionary pressure toward enhancers paired with long-range interactions with promoters[@Su2014] have challenged earlier conceptions that they were "junk" DNA.[@Pray2008] This year, substantial evidence has been provided suggesting that a complementary sequence motif found in Alu elements of the same family can induce RNA polymerase II-transcribed RNA-RNA interactions in cases where one Alu sequence is embedded in a promoter upstream antisense transcript and another Alu sequence is embedded in an enhancer transcript.[@Liang2023] Based on this mechanism, it has been speculated that mostly disjoint enhancer promoter interaction networks (EPRI) may exist that can be rapidly rewired, with implications in evolutionary and disease contexts,[@Wen2023] and that the inherent RNA polymerase III transcription of Alu elements, essential for their retrotransposition, could yield similar RNA-RNA based dynamics for elements not embedded in enhancers or promoters. Moreover, it has been demonstrated with FISH that LINE1 and Alu elements each colocalize exclusively with their specific subfamilies in cellular compartments.[@Lu2021] For reasons that will not be discussed in detail in this report, we hypothesize that understanding the dynamics of these Alu communities could be relevant to one group member's research into neurodegenerative disease mechanisms.
# Problem Description
With a lack of RIC-seq data to pull from to study the phenomenon, but an abundance of collaborators able to provide Pore-C targeted capture data, we propose that a heuristic community detection approach could be useful for identifying candidate loci for target capture. The algorithm we settled on is the Leiden algorithm,[@Traag2019] due to specific improvements over the Louvain algorithm[@Blondel2008] that are discussed in detail in the algorithm description section. Another reason we selected the Leiden algorithm is that it is necessarily highly iterative and thus provides a good use case for Rcpp, the course topic of most interest to the group members who wanted to learn C++, which is important in their subfield. There is also a practical reason for rewriting the algorithm in C++. igraph[@Csardi2006] is written in C and maintained by a large team of computer scientists working in the field of network analysis, and it is unrealistic to believe we could come close to competing with its computational speed within the timeframe of this project. However, it is difficult to add low-level customized features to their C-based implementation because of the inflexibility that results from a lack of encapsulation and the innate challenges of manual memory management. Rewriting the algorithm in object-oriented C++ will potentially let one group member extend the algorithm's functionality, for example with on-the-fly checking for intact RNA polymerase III/II binding domains, more complex weighting schemes using local sequence alignment and expected bindable transcript lengths, and dynamic storage of relevant properties for (and results of) these calculations within partition, community, and graph objects. Extending the Leiden algorithm with these features is beyond the scope of this class and is reserved for potential future projects. Instead, the goal of this project was to get the algorithm working in C++ as a foundation for that future work. Our second aim was to use our implementation to demonstrate the Leiden algorithm's viability in identifying candidate chromatin-modulating Alu communities with a crude weighting scheme incorporating publicly available Hi-C data[@Lee2023], ATAC-seq data[@Corces2020], and the Dfam transposable element database[@Storer2021].
# Package Features
Our package includes two main user-callable functions. The first, createAluGraph, creates our weighted adjacency matrix by taking the Hadamard product of DNA-fragment contact probabilities from Fit Hi-C callsets, a boolean matrix of whether both fragments contain accessible chromatin (peaks) based on the ATAC-seq data, and counts of the expected number of potential simultaneous Alu-mediated RNA-RNA interactions (assuming each Alu has a functional RNA polymerase III binding site and that the Alus are far enough apart that there is only one Alu sequence per transcript). It outputs the adjacency matrix as a list of connections, which is typically more memory efficient. Several helper functions for creating the input matrices for createAluGraph are included as well. The second primary user-callable function is runLeiden, which is exported from C++. It takes the output of createAluGraph, or any other R list of connections with the items "to", "from", and "weight". runLeiden interfaces the Optimizer class with R, allowing the algorithm to be called as an R function. The Optimizer class contains utilities for running each step of the Leiden algorithm as defined by the original paper;[@Traag2019] the Partition class contains utilities for creating and modifying partitions; the Community class contains utilities for creating and modifying communities; and the Graph class contains utilities for creating and modifying nodes and edges in a graph. Lastly, the package contains a function plotLeiden to plot the results of our algorithm, which requires graph.data.frame from igraph.[@Csardi2006]
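For orientation, here is a minimal usage sketch (the edge list below is made up for illustration; with real data it would come from createAluGraph, whose inputs are the matrices produced by the helper functions described above):
```{r, eval=FALSE}
# made-up toy edge list; with real data, use createAluGraph(...) instead
edges <- list(
  to     = c(1, 2, 3, 1),
  from   = c(2, 3, 4, 4),
  weight = c(1.0, 0.5, 0.8, 0.2)
)
# runLeiden(edge_list, iterations, gamma, theta, seed)
result <- runLeiden(edges, 10, 0.1, 0.01, 1)
plotLeiden(result)
```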
# Algorithm Description
The following two sections are based on [@Traag2019].
## Louvain Algorithm
The Louvain algorithm has historically been a popular community detection algorithm. However, it has a significant drawback: it can produce poorly connected communities. In extreme cases, the communities identified by the Louvain algorithm might even be disconnected, a problem that becomes more pronounced as the number of iterations grows. To overcome these limitations, the Leiden algorithm was proposed.[@Traag2019] Since the Leiden algorithm builds off of the Louvain algorithm, we will first introduce the Louvain algorithm and then discuss the improvements Traag et al. made in the Leiden algorithm.
To ensure clarity in the upcoming discussion of the algorithms, it is essential to establish some key terminology:

- $E$ (edge set): the connections between pairs of nodes $u$ and $v$ in the network.
- $V$ (node set): all nodes in the network, each denoted $v$ and associated with a set of incident edges.
- $G$ (graph): all nodes together with their connecting edges, $G = (V, E)$.
- $C$ (community): a subset of the nodes of $G$ together with the edges among them.
- $P$ (partition): a collection of communities that together cover all nodes of the graph.
- $H$ (quality): a measure used to evaluate the effectiveness or suitability of a particular partition or community structure within the graph.
In community detection algorithms, the goal is often to optimize some measure of "quality," a quantitative representation of how community-like the identified communities are. The Louvain algorithm historically functioned by optimizing modularity, but other, arguably better metrics were developed between the Louvain and Leiden algorithms' inceptions, including the Constant Potts Model (CPM),
$$
H = \sum_{c} \left[ e_c - \gamma \binom{n_c}{2} \right],
$$
which attempts to maximize the number of internal edges in communities while ensuring these communities remain relatively small. In other words, CPM attempts to optimize the density of the communities it labels as communities. A crucial component of this optimization is the resolution parameter, $\gamma$, which determines the threshold for the density of connections both within and outside a community. Consider a community $c$ with $e_c$ internal edges and $n_c$ nodes, divided into two separate communities, $r$ and $s$, with $e_{r \leftrightarrow s}$ denoting the number of links between them. The ratio $\frac{e_{r \leftrightarrow s}}{n_r n_s}$ is the density of links between $r$ and $s$, and the split is advantageous exactly when $\frac{e_{r \leftrightarrow s}}{n_r n_s} < \gamma$. Therefore, the algorithm seeks to keep the link density between different communities lower than $\gamma$, while maintaining a higher density within the communities. This gives a clear and practical interpretation of the $\gamma$ hyperparameter, guiding the formation and division of communities in the network.
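To see why this interpretation holds (a one-line check from the CPM definition above): splitting $c$ into $r$ and $s$ leaves $e_r + e_s = e_c - e_{r \leftrightarrow s}$ internal edges, and $\binom{n_c}{2} = \binom{n_r}{2} + \binom{n_s}{2} + n_r n_s$, so the change in quality from the split is
$$
\Delta H = -e_{r \leftrightarrow s} + \gamma \, n_r n_s,
$$
which is positive exactly when the between-community density is below $\gamma$.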
The Louvain algorithm optimizes community labeling through two subroutines: (1) local moving of nodes, and (2) network aggregation. In the first moving phase, nodes are initialized as singletons (members of their own community) and are systematically moved into other communities until the partition quality cannot be improved anymore. The moving phase knows when to stop because the nodes are all marked as stable when they are moved, and marked as unstable when one of their neighbors moves and they are not in the same community as the new community their neighbor was moved to. Following this, the second phase involves constructing an aggregated network based on the newly formed partitions. Here, each community becomes a singular node within this aggregated network. These two processes repeat iteratively until there is no further improvement in the quality function. A significant drawback of the Louvain algorithm is its tendency to create poorly connected or even disconnected communities. In such cases, parts of a community are only interconnected through paths that extend outside their community. This issue often arises when a node, serving as a bridge within its original community, is moved to a different community, leading to a disconnection in the former. Surprisingly, other nodes in the disconnected community might not shift to new communities, as they could still be strongly linked within their original community despite its disconnection. Intriguingly, repeated iterations of the algorithm tend to exacerbate the issue, rather than resolving it. This reveals a critical limitation in the Louvain algorithm’s ability to maintain cohesive and connected communities throughout its optimization process.
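To make the aggregation phase concrete, here is a small sketch of how communities collapse into nodes of the aggregated network (our illustration in plain R, not the package code; it assumes an integer edge matrix, per-edge weights, and a community label per node):
```{r, eval=FALSE}
# each community becomes one node; parallel edges collapse by summing weights,
# and within-community edges become self-loops
aggregate_graph <- function(edges, weights, comm) {
  to   <- comm[edges[, 1]]
  from <- comm[edges[, 2]]
  # canonical endpoint order so (r, s) and (s, r) collapse together
  agg <- aggregate(weights,
                   by = list(to = pmin(to, from), from = pmax(to, from)),
                   FUN = sum)
  names(agg)[3] <- "weight"
  agg
}

# toy example: four nodes in communities {1, 2} and {3, 4}
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
aggregate_graph(edges, weights = rep(1, 4), comm = c(1, 1, 2, 2))
```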
## Leiden Algorithm
The Leiden algorithm involves three distinct phases, two of which are largely the same as in the Louvain algorithm: (1) the local moving of nodes, (2) the refinement of the partition, and (3) the aggregation of the network based on the refined partition.
A significant difference between the Leiden and Louvain algorithms lies in their approach to the local moving of nodes. Contrary to the Louvain algorithm, which iteratively visits all nodes in a network until no further node movement increases the quality function, the Leiden algorithm adopts a more efficient strategy: in its fast local move procedure, only nodes whose neighborhood has changed are visited. The fast local move procedure can be summarized as follows.
First, we initialize a queue containing all nodes of the network in random order, along with several vectors that track changes made during this step, most notably a boolean vector marking each node as stable or unstable. We then pop each node $u$ from the front of the queue and, if $u$ is not already alone in its own community, also consider an empty community as a candidate destination. For each neighboring (or empty) community $C^\prime$, we compute the change in partition quality that would result from moving $u$ from its current community into $C^\prime$, which under CPM can be efficiently calculated as
\[
\Delta H = e(u, C^\prime) - \gamma \cdot w_u \cdot w_{C^\prime},
\]
where $e(u, C^\prime)$ is the total weight of the edges between $u$ and the nodes of $C^\prime$, $w_u$ is the node weight of $u$, and $w_{C^\prime}$ is the total node weight of the community $C^\prime$ that $u$ is being moved to.
We select as the destination the community yielding the highest delta quality. Regardless of whether or not the node moves to a new community here, the node is marked as stable. However, if the node does move, any neighbor that is not already queued and not already in the community it moved to is marked as unstable and added to the end of the queue. This process terminates when the queue is empty and all nodes are marked as stable.
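The sketch below mirrors this queue logic in plain R (our illustration with unit node weights and the CPM gain above; not the package's C++ code):
```{r, eval=FALSE}
fast_local_move <- function(edges, n, gamma = 0.1) {
  # build an adjacency list from the integer edge matrix
  adj <- vector("list", n)
  for (i in seq_len(nrow(edges))) {
    a <- edges[i, 1]; b <- edges[i, 2]
    adj[[a]] <- c(adj[[a]], b)
    adj[[b]] <- c(adj[[b]], a)
  }
  comm <- seq_len(n)      # start from singleton communities
  queue <- sample(n)      # visit nodes in random order
  in_queue <- rep(TRUE, n)
  while (length(queue) > 0) {
    u <- queue[1]; queue <- queue[-1]; in_queue[u] <- FALSE
    nbrs <- adj[[u]]
    # candidates: current community, neighbors' communities, and an empty one
    cand <- unique(c(comm[u], comm[nbrs], max(comm) + 1L))
    # CPM gain of residing in community cc: e(u, cc) - gamma * |cc \ {u}|
    gain <- sapply(cand, function(cc) {
      sum(comm[nbrs] == cc) - gamma * (sum(comm == cc) - (comm[u] == cc))
    })
    best <- cand[which.max(gain)]
    if (best != comm[u]) {
      comm[u] <- best
      # re-queue neighbors left outside the new community
      requeue <- unique(nbrs[comm[nbrs] != best & !in_queue[nbrs]])
      queue <- c(queue, requeue)
      in_queue[requeue] <- TRUE
    }
  }
  comm
}

# toy run on Zachary's karate club, using igraph only to fetch the edges
ej <- as_edgelist(make_graph("Zachary"))
table(fast_local_move(ej, n = max(ej), gamma = 0.1))
```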
The Leiden algorithm introduces a second subroutine between the moving and aggregation subroutines, called the refinement phase. This phase develops a more granular refined partition, $P_{refined}$, in which each community of the partition $P$ identified by the preceding move-nodes phase can consist of multiple subcommunities. To obtain $P_{refined}$, we first initialize it as a singleton partition, where each node forms its own community. The algorithm then performs local mergers within this partition: nodes that are alone in their community in $P_{refined}$ can be merged into a different community, but these mergers are restricted to occur only within each community of the original partition $P$. Furthermore, a node is only merged into a community in $P_{refined}$ if both the node and the community exhibit sufficient connectivity to their respective communities in $P$. If multiple quality-improving mergers are identified, one is selected with probability weighted by its delta quality improvement, rather than always selecting the best one; this helps prevent the partition from converging to a local optimum, another benefit of the Leiden approach over the Louvain.
As a result of this refinement phase, communities in $P$ are often, but not always, divided into multiple smaller communities in $P_{refined}$. These changes open up more possibilities for the Leiden algorithm to identify high-quality partitions.
An important aspect of the refinement phase is the node merging strategy. Unlike a purely greedy approach, nodes are not necessarily merged with the community that offers the largest increase in the quality function. Instead, a node may be merged with any community that yields an increase in the quality function, with the probability of a community being selected proportional to the magnitude of the increase it offers. This randomization effectively mitigates the risk of the optimization becoming trapped in local optima. By excluding mergers that would decrease the quality function, the refinement phase is also rendered more efficient. Even with such mergers excluded, the optimal partition can still be uncovered: by a theorem in [@Traag2019], for any graph $G = (V, E)$ there exists a sequence of moves, each non-decreasing in quality, that reaches an optimal partition $P^*$ in $n - |P^*|$ steps, where $n$ is the number of nodes in the graph and $|P^*|$ is the number of communities in the optimal partition.
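For concreteness, here is a short sketch of this randomized selection as we read it in [@Traag2019] (the candidate deltas below are made up; theta is the same randomness parameter passed to runLeiden):
```{r, eval=FALSE}
# only mergers that do not decrease quality are eligible; among those,
# larger improvements are exponentially more likely to be chosen
theta <- 0.01
delta_H <- c(0.020, 0.005, -0.010, 0.012)  # assumed quality change per candidate
eligible <- which(delta_H >= 0)
probs <- exp(delta_H[eligible] / theta)    # Pr(choice) proportional to exp(dH / theta)
chosen <- sample(eligible, 1, prob = probs)
```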
The aggregate network in the Leiden algorithm is then constructed based on this refined partition $P_{refined}$, although the initial partition for the aggregate network is still based on $P$, similar to the Louvain algorithm. The three subroutines iterate for a specified number of times, or until convergence.
## Weight Calculation
The biological intuition for our weight calculation is that we are looking for open, transcribable chromatin containing complementary Alu elements, with evidence of chromatin modulation. We therefore construct boolean edge weights where $A$ is 1 if both DNA fragments are open (accessible) and 0 otherwise. Our Alu element matrix holds the expected maximum number of possible simultaneous RNA-RNA bonds between two complementary Alu elements, assuming all Alu elements in the same family are not mutated beyond the ability to be transcribed or to remain sufficiently complementary to one another. The counts $N$ represent the number of Alu elements within a certain window around the DNA fragments being considered. This provides an expectation of the potential connections or interactions influenced by the presence of these elements.
For any two DNA fragments, $i$ and $j$, and a specific Alu family, $k$, the edge weight $e_{i,j}$ is the minimum of the boolean values $a_i$ and $a_j$, multiplied by the sum over families of the minimum number of Alu elements of family $k$ in the two fragments, times the fragment contact probability $h_{i,j}$:
$$
e_{i,j} = \min(a_i, a_j) \cdot \left( \sum_{k} \min(n_{k,i}, n_{k,j}) \right) \cdot h_{i,j}
$$
where $a_i$ is a boolean value indicating whether a window coincides with accessible chromatin for fragment $i$, $n_{k,i}$ is the number of Alu elements of family $k$ in fragment $i$, and $h_{i,j}$ is the contact probability between fragments $i$ and $j$. We can represent this as a matrix operation:
$$
E = A \odot N \odot H
$$
Where E is the weighted adjacency matrix, A is the chromatin accessibility matrix, N is the Alu element matrix, and H is the Hi-C contact probability matrix.
The combination of these three datasets (Hi-C contact probabilities, ATAC-seq chromatin accessibility, and Alu element counts) constructs a crude weighted graph in which fragments are more likely to cluster with one another if they contain potentially chromatin-modulating Alus.
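A toy numeric illustration of this construction (the matrices below are made up; our real pipeline builds them from the Fit Hi-C, ATAC-seq, and Dfam files as described in the supplement):
```{r, eval=FALSE}
set.seed(615)
n_frag <- 4
# Hi-C contact probabilities: symmetric, no self-contacts
H <- matrix(runif(n_frag^2), n_frag)
H <- (H + t(H)) / 2; diag(H) <- 0
# fragment accessibility (ATAC peaks): 1 only if both fragments are open
a <- c(1, 1, 0, 1)
A <- outer(a, a, pmin)
# Alu counts for two families; N_ij = sum_k min(n_ki, n_kj)
n_alu <- matrix(rpois(n_frag * 2, 2), n_frag)
N <- matrix(0, n_frag, n_frag)
for (k in 1:2) N <- N + outer(n_alu[, k], n_alu[, k], pmin)
# weighted adjacency: E = A (Hadamard) N (Hadamard) H
E <- A * N * H
E
```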
# Evaluation
To evaluate our implementation of the Leiden algorithm, we compare it to the gold standard of network analysis software, igraph,[@Csardi2006] using several toy models including "noperfectmatching" and "Zachary." For the more complex toy model, Zachary, we try several random seeds to account for randomness and show that the two implementations generate overlapping sets of partitions. We also test each function with several increasing resolution hyperparameters to demonstrate the expected behavior: the number of communities increases until, eventually, the resolution is too high to identify any communities. To evaluate efficiency, we used microbenchmark[@Mersmann2023] to show, as expected, that the igraph implementation is orders of magnitude faster than our implementation and scales better going from the 16-node graph "noperfectmatching" to the 34-node graph "Zachary", suggesting that our implementation should be optimized further before we start adding the planned features.
```{r, echo=FALSE}
table1 <- data.frame(
c("Ours", "igraph"),
c("501276.3", "147.5"),
c("501276.3", "147.5")
)
colnames(table1) <- c("Implementation",
"16-Node Graph\nAverage Time [microseconds]",
"34-Node Graph\nAverage Time [microseconds]")
knitr::kable(
table1,
caption="Table 1: Benchmarking Results comparing our implementation speed to igraph", format="html", align=c('c', 'c', 'c')
) %>% kable_styling(
full_width = F, # This line is for applying default Bootstrap theme
bootstrap_options = c("striped", "hover", "condensed", "responsive")
) %>% row_spec(
0,
bold = T,
color = "white",
background = "gray"
)
```
To see whether our weighting scheme with the Leiden implementation provides sensible results on our biological data, and runs fast enough to be useful at the scale of data it was designed for, we create weights from chromosome 21 using createAluGraph and cluster them with our runLeiden implementation. As expected, our plot shows that AluJ elements are not likely to be as involved in chromatin remodeling via RNA-RNA interactions. This is sensible because AluJ is a very old family whose copies are less likely to be transcribed and less likely to remain complementary to one another due to accumulated mutations. A plot of these communities is shown in Figure 1.
```{r, echo=FALSE}
knitr::include_graphics("./alu_plot.png")
```
All relevant code to run the benchmarking and generate the above plot can be found in the supplementary notebook.
# Contributions
Brandt Bessell came up with the project idea, contributed to developing, debugging and testing all parts of the Leiden implementation, converted it to a documented and installable package, made the presentation slides and helped write the project proposal and project update. He wrote the introduction, the problem description, package features, evaluation, references, and supplementary sections of the report. He helped write and proofread the algorithm description of the report. He benchmarked the algorithm on toy models. He consulted the matrix weight calculation team and did proofreading.
Xuyuan Zhang contributed to writing, debugging and testing the Leiden implementation, wrote the functions and documentation for creating the input weights as well as plotting the outputs of our Leiden implementation. He helped test those functions on our biological data. He helped discuss the presentation, helped write the project proposal and update and helped write the weight calculation section of the final report.
Sicheng Qian helped write some of the functions for creating the input weights. He helped write the project proposal, helped write the Leiden algorithm description section of the final report, and created the .bib file for the markdown.
# Supplemental
This section walks through installing our package and replicating our benchmarking results. First, set your working directory.
```{r, results='hide', message=FALSE, warning=FALSE}
# change this!
setwd("C:/Users/bbessell/OneDrive - Michigan Medicine/Documents/Courses/BIOSTAT615/test_package")
#setwd("C:/Users/bbessell/Projects/GitHub/AluNet")
```
To run our package, first install it along with some other dependencies we will use for benchmarking.
```{r, results='hide', message=FALSE, warning=FALSE}
if (!require("igraph")) {
install.packages("igraph")
}
if (!require("microbenchmark")) {
install.packages("microbenchmark")
}
if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("babessell1/AluNet/AluNet")
# if the above doesn't work, try
#url <- "https://github.com/babessell1/AluNet/raw/main/AluNet_1.0.tar.gz"
#download.file(url, "AluNet_1.0.tar.gz")
#install.packages("AluNet_1.0.tar.gz",repo=NULL)
```
Then we will load it, igraph, and microbenchmark to benchmark against the current gold standard of clustering algorithm implementations.
```{r, results='hide', message=FALSE, warning=FALSE}
library("igraph")
library("microbenchmark")
library("AluNet")
```
First, we check against several toy models. Note that the RNG is different for each implementation of Leiden, so we run several different seeds for each one to show that they function similarly.
To do this, we define a function to run igraph's Leiden algorithm and plot the output; we define it here since it is not part of our package.
```{r}
runIgraph <- function(gdf, iterations, gamma, theta) {
  # note: edge weights are read from g_list in the calling environment,
  # so g_list must be defined before this function is called
  ldc <- cluster_leiden(
    gdf,
    resolution_parameter = gamma,
    objective_function = "CPM",
    weights = g_list[["weight"]],
    beta = theta,
    n_iterations = iterations,
    # unit vertex weights, one per unique node, to match our implementation
    vertex_weights = rep(1, length(unique(c(g_list[["to"]], g_list[["from"]]))))
  )
  return(ldc)
}
plotIgraph <- function(gdf, ldc) {
color_palette <- heat.colors(100)
# create colormap for the communities
get_community_colors <- function(num_communities) {
rainbow(num_communities)
}
# map communities to the colormap
num_communities <- length(unique(ldc$membership))
community_colors <- get_community_colors(num_communities)
colors <- community_colors[as.factor(ldc$membership)]
plot(
gdf,
layout = layout_with_fr(gdf),
vertex.color = colors, # Color nodes based on community
#edge.color = color_palette[g_list$weight], # Map edge color from cold to hot
vertex.label = V(gdf)$community,
edge.arrow.size = 0, # Set arrow size to 0 to remove arrows
main = "iGraph's C-based clustering"
)
# Add a legend for community colors
legend("topright", legend=unique(ldc$membership),
col = community_colors, pch = 19,
title = "Communities", cex = 0.8)
}
```
Our first toy model is "noperfectmatching", which has very well-defined communities except for the central node, which can belong to any of them depending on the starting conditions. This is a simple model and thus should converge in only a single iteration, despite us setting a maximum of 10.
We define our resolution parameter, gamma, as 0.1, which should be small enough to identify the more granular communities that make it easier to compare our accuracy to igraph's. We set theta to 0.01 because it is the middle of the range recommended by the authors of the Leiden algorithm, which should result in a moderate degree of randomness in the refinement step. Since the toy models come from the igraph package, we also have to convert them into a format interpretable by our function: a list of connections with items "to", "from", and "weight".
```{r}
toy = "noperfectmatching"
iterations <- 10
gamma <- .1 # resolution parameter; gamma > 0
theta <- 0.01 # randomness parameter; recommended range is 0.005 to 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Next we benchmark the model to show that igraph, which is developed and maintained by a very large team of computer scientists, is much more optimized than our implementation.
```{r, results='hide', message=FALSE, warning=FALSE}
if (exists(".Random.seed")) rm(.Random.seed)  # reset RNG state before benchmarking
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 1, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 1, gamma, theta),
times=50
)
```
```{r}
# Print the result
print(rbind(mb_ours, mb_igraph))
```
Next we try a larger, more complex toy model to also show that igraph's implementation scales better than ours (4:1 versus 5:1) for a graph that is roughly twice as big.
```{r, results='hide', message=FALSE, warning=FALSE}
toy = "Zachary"
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
if (exists(".Random.seed")) rm(.Random.seed)  # reset RNG state before benchmarking
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 3, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 3, gamma, theta),
times=50
)
```
```{r}
print(rbind(mb_ours, mb_igraph))
```
Now, let's increase the iteration number to 10 so that the partition can converge, and try several random seeds to see if the sets of partitions produced by each method overlap.
```{r}
iterations <- 10
gamma <- .1 # resolution parameter; gamma > 0
theta <- 0.01 # randomness parameter; recommended range is 0.005 to 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Notice these partitions are slightly different. Since this graph is a bit more complicated and prone to RNG-based variation, which cannot be equalized by setting the same seed due to differences in implementation, we can rerun the igraph implementation with 2 as our seed instead of 1 to get the same partition.
```{r}
gdf <- make_graph(toy)
set.seed(2)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Let's see if we consistently produce similar community configurations.
```{r}
seed <- 3
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(3)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 4
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(4)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 5
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(5)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Increasing the resolution parameter, gamma, should decrease the size of the communities; let's see if that happens.
```{r}
gamma <- 0.5
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Let's create a temporary directory and make sure we have a few more dependencies needed to download and process the data and to plot our results. To use our already-processed data instead, skip ahead to the final steps below.
```{r, eval=FALSE}
if (!require("R.utils")) {
install.packages("R.utils")
}
library("R.utils")
if (!require("RColorBrewer")) {
install.packages("RColorBrewer")
}
library(RColorBrewer)
if (!require("dplyr")) {
install.packages("dplyr")
}
library(dplyr)
if (!require("stringr")) {
install.packages("stringr")
}
library(stringr)
dir.create("temp")
```
To show its utility on biological data, we first download some Fit Hi-C data.[@Lee2023]
```{r, eval=FALSE}
file_path <- paste0(getwd(), "/temp")
download_hic <- function(file_path){
  library(R.utils)
  file_name <- "hic_data.txt.gz"
  url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE148434&format=file&file=GSE148434%5FNS%5Fall%2E5kb%2E1e%2D2%2EFitHiC%2Eall%2Etxt%2Egz"
  gz_file <- paste(file_path, file_name, sep = "/")
  download.file(url, gz_file, mode = "wb")
  # decompress next to the download instead of changing the working directory
  gunzip(gz_file, remove = FALSE)
}
download_hic(file_path)
```
Then we download the Dfam mobile element annotations[@Storer2021] (be careful, the file is very large). We place it in a folder in the current working directory.
```{r, eval=FALSE}
download_dfam <- function(file_path){
url <- "https://www.dfam.org/releases/Dfam_3.8/annotations/hg38/hg38.hits.gz"
download_directory <- paste(file_path, "uncleaned_data", sep = "/")
# Create the directory
if (!file.exists(download_directory)) {
dir.create(download_directory)
}
file_name <- "hg38.hits.gz"
hg38_file <- paste(download_directory, file_name, sep = "/")
download.file(url, hg38_file, mode = "wb")
gunzip(hg38_file, remove=FALSE)
}
download_dfam(file_path)
```
Then we download some example ATAC-seq data for the same tissue and disease context [@Corces2020]
```{r, eval=FALSE}
download_atac <- function(file_path){
library(readxl)
url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-020-00721-x/MediaObjects/41588_2020_721_MOESM4_ESM.xlsx"
download_directory <- paste(file_path, "uncleaned_data", sep = "/")
# Create the directory
if (!file.exists(download_directory)) {
dir.create(download_directory)
}
file_name <- "atac_data.xlsx"
atac_file <- paste(download_directory, file_name, sep = "/")
download.file(url, atac_file, mode = "wb")
# Read the Excel file
atac_data <- read_excel(atac_file)
# Specify the name of the output CSV file
output_csv_name <- "atac_data.csv"
output_csv_file <- paste(download_directory, output_csv_name, sep = "/")
# Save as CSV
write.csv(atac_data, output_csv_file, row.names = FALSE)
}
download_atac(file_path)
```
Then we run several helper functions to convert each dataset into a data frame (matrix):
```{r, eval=FALSE}
data_dir <- paste0(getwd(), "/temp")
returned_path <- clean_alu.R(
paste0(data_dir, "/hic_data.txt"),
paste0(data_dir, "/uncleaned_data/hg38.hits"),
data_dir
)
```
Then we use the files we made to generate the edges:
```{r, eval=FALSE}
cleaning_dir <- clean_alu(paste0(data_dir, "/highest_probability"), data_dir)
merge_hic_with_alu(data_dir, cleaning_dir)
```
```{r, eval=FALSE}
dataframe <- read.csv("edges_data_frame.csv")
dataframe$group <- NULL
edge_list <- as.list(dataframe)  # avoid masking base::list
```
```{r, eval=FALSE}
result <- runLeiden(edge_list, 1, 1, 0.01, 1)
```
If you don't want to download the large datasets, you can download our processed data instead to run our implementation of the Leiden algorithm on it.
```{r}
url <- "https://github.com/babessell1/AluNet/raw/main/final_data.csv"
download.file(url, "final_data.csv")
dataframe_ <- read.csv("final_data.csv")
dataframe_$group <- NULL
list_ <- as.list(dataframe_)
```
Then we run the Leiden algorithm.
```{r}
result <- runLeiden(list_, 10, 1, 0.01, 1)
```
Plot the results; uncomment the png and dev.off lines to also save a high-dpi PNG.
```{r}
#png("alu_plot.png", width = 1000, height = 1000, res = 300)
plotLeiden(result, TRUE)
#dev.off()
```
# References