---
title: "Weighted Leiden Algorithm for Modeling Alu-Mediated Enhancer-Promoter Interaction Networks"
author:
- Brandt Bessell ([email protected])
- Xuyuan Zhang ([email protected])
- Sicheng Qian ([email protected])
output:
  pdf_document: default
  html_document:
    df_print: kable
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(kableExtra)
```
# Code Availability
All code for this project is available at:
https://github.com/babessell1/AluNet
# Introduction
Alu elements are transposable elements (TEs) that hijack LINE1-encoded retrotransposition proteins to copy and paste themselves into genomic DNA.[@Doucet2015] Discoveries that transposable element polymorphisms in the human population can modulate gene expression[@Wang2017] and that there is active evolutionary pressure toward enhancers paired with long-range interactions with promoters[@Su2014] have challenged earlier conceptions that they were "junk" DNA.[@Pray2008] This year, substantial evidence has been provided suggesting that a complementary sequence motif found in Alu elements of the same family can induce RNA polymerase II-transcribed RNA-RNA interactions in cases where one Alu sequence is embedded in a promoter upstream antisense transcript and another Alu sequence is embedded in an enhancer transcript.[@Liang2023] Based on this mechanism, it has been speculated that mostly disjoint enhancer promoter interaction networks (EPRI) may exist that can be rapidly rewired, with implications in evolutionary and disease contexts,[@Wen2023] and that the inherent RNA polymerase III transcription of Alu elements, essential for their retrotransposition, could yield similar RNA-RNA based dynamics for elements not embedded in enhancers or promoters. Moreover, it has been demonstrated with FISH that LINE1 and Alu elements each colocalize exclusively with their specific subfamilies in cellular compartments.[@Lu2021] For reasons that will not be discussed in detail in this report, we hypothesize that understanding the dynamics of these Alu communities could be relevant to one group member's research into neurodegenerative disease mechanisms.
# Problem Description
With a lack of RIC-seq data to pull from to study the phenomenon, but an abundance of collaborators able to provide Pore-C targeted capture data, we propose that a heuristic community detection approach could be useful for identifying candidate loci for target capture. The algorithm we settled on is the Leiden algorithm,[@Traag2019] due to specific improvements over the Louvain algorithm[@Blondel2008] that are discussed in detail in the algorithm description section. Another reason we selected the Leiden algorithm is that it is necessarily highly iterative and thus provides a good use case for Rcpp, the course topic of most interest to the group members who wanted to learn C++, which is important in their subfield. There is also a practical reason for rewriting the algorithm in C++. igraph[@Csardi2006] is written in C and maintained by a large team of computer scientists working in the field of network analysis, and it is unrealistic to believe we could come close to competing with its computational speed within the timeframe of this project. However, it is difficult to add low-level customized features to their C-based implementation because of the inflexibility that results from a lack of encapsulation and the innate challenges of manual memory management. Rewriting the algorithm in object-oriented C++ will potentially let one group member extend the algorithm's functionality, for example with on-the-fly checking for intact RNA polymerase III/II binding domains, more complex weighting schemes using local sequence alignment and expected bindable transcript lengths, and dynamic storage of relevant properties for (and results of) these calculations within partition, community, and graph objects. Extending the Leiden algorithm with these features is beyond the scope of this class and is reserved for potential future projects. Instead, the goal of this project was to get the algorithm working in C++ as a foundation for that future work. Our second aim was to use our implementation to demonstrate the Leiden algorithm's viability in identifying candidate chromatin-modulating Alu communities with a crude weighting scheme incorporating publicly available Hi-C data[@Lee2023], ATAC-seq data[@Corces2020], and the Dfam transposable element database[@Storer2021].
# Package Features
Our package includes two main user-callable functions. The first, createAluGraph, creates our weighted adjacency matrix by taking the Hadamard product of DNA-fragment contact probabilities from Fit Hi-C callsets, a boolean matrix of whether both fragments contain accessible chromatin (peaks) based on the ATAC-seq data, and counts of the expected number of potential simultaneous Alu-mediated RNA-RNA interactions (assuming each Alu has a functional RNA polymerase III binding site and that the Alus are far enough apart that there is only one Alu sequence per transcript). It outputs the adjacency matrix as a list of connections, which is typically more memory efficient. Several helper functions for creating the input matrices for createAluGraph are included as well. The second primary user-callable function is runLeiden, which is exported from C++. It takes the output of createAluGraph, or any other R list of connections with the items "to", "from", and "weight". runLeiden interfaces the Optimizer class with R, allowing the algorithm to be called as an R function. The Optimizer class contains utilities for running each step of the Leiden algorithm as defined by the original paper;[@Traag2019] the Partition class contains utilities for creating and modifying partitions; the Community class contains utilities for creating and modifying communities; and the Graph class contains utilities for creating and modifying nodes and edges in a graph. Lastly, the package contains a function plotLeiden to plot the results of our algorithm, which requires graph.data.frame from igraph.[@Csardi2006]
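For orientation, here is a minimal usage sketch (the edge list below is made up for illustration; with real data it would come from createAluGraph, whose inputs are the matrices produced by the helper functions described above):
```{r, eval=FALSE}
# made-up toy edge list; with real data, use createAluGraph(...) instead
edges <- list(
  to     = c(1, 2, 3, 1),
  from   = c(2, 3, 4, 4),
  weight = c(1.0, 0.5, 0.8, 0.2)
)
# runLeiden(edge_list, iterations, gamma, theta, seed)
result <- runLeiden(edges, 10, 0.1, 0.01, 1)
plotLeiden(result)
```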
# Algorithm Description
The following two sections are based on [@Traag2019].
## Louvain Algorithm
The Louvain algorithm has historically been a popular community detection algorithm. However, it has a significant drawback: it can produce poorly connected communities. In extreme cases, the communities identified by the Louvain algorithm might even be disconnected, a problem that becomes more pronounced as the number of iterations grows. To overcome these limitations, the Leiden algorithm was proposed.[@Traag2019] Since the Leiden algorithm builds off of the Louvain algorithm, we will first introduce the Louvain algorithm and then discuss the improvements Traag et al. made in the Leiden algorithm.
To ensure clarity in the upcoming discussion of the algorithms, it is essential to establish some key terminology:

- $E$ (edge set): the connections between pairs of nodes $u$ and $v$ in the network.
- $V$ (node set): all nodes in the network, each denoted $v$ and associated with a set of incident edges.
- $G$ (graph): all nodes together with their connecting edges, $G = (V, E)$.
- $C$ (community): a subset of the nodes of $G$ together with the edges among them.
- $P$ (partition): a collection of communities that together cover all nodes of the graph.
- $H$ (quality): a measure used to evaluate the effectiveness or suitability of a particular partition or community structure within the graph.
In community detection algorithms, the goal is often to optimize some measure of "quality," a quantitative representation of how community-like the identified communities are. The Louvain algorithm historically functioned by optimizing modularity, but other, arguably better metrics were developed between the Louvain and Leiden algorithms' inceptions, including the Constant Potts Model (CPM),
$$
H = \sum_{c} \left[ e_c - \gamma \binom{n_c}{2} \right],
$$
which attempts to maximize the number of internal edges in communities while ensuring these communities remain relatively small. In other words, CPM attempts to optimize the density of the communities it labels as communities. A crucial component of this optimization is the resolution parameter, $\gamma$, which determines the threshold for the density of connections both within and outside a community. Consider a community $c$ with $e_c$ internal edges and $n_c$ nodes, divided into two separate communities, $r$ and $s$, with $e_{r \leftrightarrow s}$ denoting the number of links between them. The ratio $\frac{e_{r \leftrightarrow s}}{n_r n_s}$ is the density of links between $r$ and $s$, and the split is advantageous exactly when $\frac{e_{r \leftrightarrow s}}{n_r n_s} < \gamma$. Therefore, the algorithm seeks to keep the link density between different communities lower than $\gamma$, while maintaining a higher density within the communities. This gives a clear and practical interpretation of the $\gamma$ hyperparameter, guiding the formation and division of communities in the network.
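To see why this interpretation holds (a one-line check from the CPM definition above): splitting $c$ into $r$ and $s$ leaves $e_r + e_s = e_c - e_{r \leftrightarrow s}$ internal edges, and $\binom{n_c}{2} = \binom{n_r}{2} + \binom{n_s}{2} + n_r n_s$, so the change in quality from the split is
$$
\Delta H = -e_{r \leftrightarrow s} + \gamma \, n_r n_s,
$$
which is positive exactly when the between-community density is below $\gamma$.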
The Louvain algorithm optimizes community labeling through two subroutines: (1) local moving of nodes, and (2) network aggregation. In the first moving phase, nodes are initialized as singletons (members of their own community) and are systematically moved into other communities until the partition quality cannot be improved anymore. The moving phase knows when to stop because the nodes are all marked as stable when they are moved, and marked as unstable when one of their neighbors moves and they are not in the same community as the new community their neighbor was moved to. Following this, the second phase involves constructing an aggregated network based on the newly formed partitions. Here, each community becomes a singular node within this aggregated network. These two processes repeat iteratively until there is no further improvement in the quality function. A significant drawback of the Louvain algorithm is its tendency to create poorly connected or even disconnected communities. In such cases, parts of a community are only interconnected through paths that extend outside their community. This issue often arises when a node, serving as a bridge within its original community, is moved to a different community, leading to a disconnection in the former. Surprisingly, other nodes in the disconnected community might not shift to new communities, as they could still be strongly linked within their original community despite its disconnection. Intriguingly, repeated iterations of the algorithm tend to exacerbate the issue, rather than resolving it. This reveals a critical limitation in the Louvain algorithm’s ability to maintain cohesive and connected communities throughout its optimization process.
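To make the aggregation phase concrete, here is a small sketch of how communities collapse into nodes of the aggregated network (our illustration in plain R, not the package code; it assumes an integer edge matrix, per-edge weights, and a community label per node):
```{r, eval=FALSE}
# each community becomes one node; parallel edges collapse by summing weights,
# and within-community edges become self-loops
aggregate_graph <- function(edges, weights, comm) {
  to   <- comm[edges[, 1]]
  from <- comm[edges[, 2]]
  # canonical endpoint order so (r, s) and (s, r) collapse together
  agg <- aggregate(weights,
                   by = list(to = pmin(to, from), from = pmax(to, from)),
                   FUN = sum)
  names(agg)[3] <- "weight"
  agg
}

# toy example: four nodes in communities {1, 2} and {3, 4}
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
aggregate_graph(edges, weights = rep(1, 4), comm = c(1, 1, 2, 2))
```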
## Leiden Algorithm
The Leiden algorithm involves three distinct phases, two of which are largely the same as in the Louvain algorithm: (1) the local moving of nodes, (2) the refinement of the partition, and (3) the aggregation of the network based on the refined partition.
A significant difference between the Leiden and Louvain algorithms lies in their approach to the local moving of nodes. Contrary to the Louvain algorithm, which iteratively visits all nodes in a network until no further node movement increases the quality function, the Leiden algorithm adopts a more efficient strategy: in its fast local move procedure, only nodes whose neighborhood has changed are visited. The fast local move procedure can be summarized as follows.
First, we initialize a queue containing all nodes of the network in random order, along with several vectors that track changes made during this step, most notably a boolean vector marking each node as stable or unstable. We then pop each node $u$ from the front of the queue and, if $u$ is not already alone in its own community, also consider an empty community as a candidate destination. For each neighboring (or empty) community $C^\prime$, we compute the change in partition quality that would result from moving $u$ from its current community into $C^\prime$, which under CPM can be efficiently calculated as
\[
\Delta H = e(u, C^\prime) - \gamma \cdot w_u \cdot w_{C^\prime},
\]
where $e(u, C^\prime)$ is the total weight of the edges between $u$ and the nodes of $C^\prime$, $w_u$ is the node weight of $u$, and $w_{C^\prime}$ is the total node weight of the community $C^\prime$ that $u$ is being moved to.
We select as the destination the community yielding the highest delta quality. Regardless of whether or not the node moves to a new community here, the node is marked as stable. However, if the node does move, any neighbor that is not already queued and not already in the community it moved to is marked as unstable and added to the end of the queue. This process terminates when the queue is empty and all nodes are marked as stable.
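The sketch below mirrors this queue logic in plain R (our illustration with unit node weights and the CPM gain above; not the package's C++ code):
```{r, eval=FALSE}
fast_local_move <- function(edges, n, gamma = 0.1) {
  # build an adjacency list from the integer edge matrix
  adj <- vector("list", n)
  for (i in seq_len(nrow(edges))) {
    a <- edges[i, 1]; b <- edges[i, 2]
    adj[[a]] <- c(adj[[a]], b)
    adj[[b]] <- c(adj[[b]], a)
  }
  comm <- seq_len(n)      # start from singleton communities
  queue <- sample(n)      # visit nodes in random order
  in_queue <- rep(TRUE, n)
  while (length(queue) > 0) {
    u <- queue[1]; queue <- queue[-1]; in_queue[u] <- FALSE
    nbrs <- adj[[u]]
    # candidates: current community, neighbors' communities, and an empty one
    cand <- unique(c(comm[u], comm[nbrs], max(comm) + 1L))
    # CPM gain of residing in community cc: e(u, cc) - gamma * |cc \ {u}|
    gain <- sapply(cand, function(cc) {
      sum(comm[nbrs] == cc) - gamma * (sum(comm == cc) - (comm[u] == cc))
    })
    best <- cand[which.max(gain)]
    if (best != comm[u]) {
      comm[u] <- best
      # re-queue neighbors left outside the new community
      requeue <- unique(nbrs[comm[nbrs] != best & !in_queue[nbrs]])
      queue <- c(queue, requeue)
      in_queue[requeue] <- TRUE
    }
  }
  comm
}

# toy run on Zachary's karate club, using igraph only to fetch the edges
ej <- as_edgelist(make_graph("Zachary"))
table(fast_local_move(ej, n = max(ej), gamma = 0.1))
```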
The Leiden algorithm introduces a second subroutine between the moving and aggregation subroutines, called the refinement phase. This phase develops a more granular refined partition, $P_{refined}$, in which each community of the partition $P$ identified by the preceding move-nodes phase can consist of multiple subcommunities. To obtain $P_{refined}$, we first initialize it as a singleton partition, where each node forms its own community. The algorithm then performs local mergers within this partition: nodes that are alone in their community in $P_{refined}$ can be merged into a different community, but these mergers are restricted to occur only within each community of the original partition $P$. Furthermore, a node is only merged into a community in $P_{refined}$ if both the node and the community exhibit sufficient connectivity to their respective communities in $P$. If multiple quality-improving mergers are identified, one is selected with probability weighted by its delta quality improvement, rather than always selecting the best one; this helps prevent the partition from converging to a local optimum, another benefit of the Leiden approach over the Louvain.
As a result of this refinement phase, communities in $P$ are often, but not always, divided into multiple smaller communities in $P_{refined}$. These changes open up more possibilities for the Leiden algorithm to identify high-quality partitions.
An important aspect of the refinement phase is the node merging strategy. Unlike a purely greedy approach, nodes are not necessarily merged with the community that offers the largest increase in the quality function. Instead, a node may be merged with any community that yields an increase in the quality function, with the probability of a community being selected proportional to the magnitude of the increase it offers. This randomization effectively mitigates the risk of the optimization becoming trapped in local optima. By excluding mergers that would decrease the quality function, the refinement phase is also rendered more efficient. Even with such mergers excluded, the optimal partition can still be uncovered: by a theorem in [@Traag2019], for any graph $G = (V, E)$ there exists a sequence of moves, each non-decreasing in quality, that reaches an optimal partition $P^*$ in $n - |P^*|$ steps, where $n$ is the number of nodes in the graph and $|P^*|$ is the number of communities in the optimal partition.
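For concreteness, here is a short sketch of this randomized selection as we read it in [@Traag2019] (the candidate deltas below are made up; theta is the same randomness parameter passed to runLeiden):
```{r, eval=FALSE}
# only mergers that do not decrease quality are eligible; among those,
# larger improvements are exponentially more likely to be chosen
theta <- 0.01
delta_H <- c(0.020, 0.005, -0.010, 0.012)  # assumed quality change per candidate
eligible <- which(delta_H >= 0)
probs <- exp(delta_H[eligible] / theta)    # Pr(choice) proportional to exp(dH / theta)
chosen <- sample(eligible, 1, prob = probs)
```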
The aggregate network in the Leiden algorithm is then constructed based on this refined partition $P_{refined}$, although the initial partition for the aggregate network is still based on $P$, similar to the Louvain algorithm. The three subroutines iterate for a specified number of times, or until convergence.
## Weight Calculation
The biological intuition for our weight calculation is that we are looking for open, transcribable chromatin containing complementary Alu elements, with evidence of chromatin modulation. We therefore construct boolean edge weights where $A$ is 1 if both DNA fragments are open (accessible) and 0 otherwise. Our Alu element matrix holds the expected maximum number of possible simultaneous RNA-RNA bonds between two complementary Alu elements, assuming all Alu elements in the same family are not mutated beyond the ability to be transcribed or to remain sufficiently complementary to one another. The counts $N$ represent the number of Alu elements within a certain window around the DNA fragments being considered. This provides an expectation of the potential connections or interactions influenced by the presence of these elements.
For any two DNA fragments, $i$ and $j$, and a specific Alu family, $k$, the edge weight $e_{i,j}$ is the minimum of the boolean values $a_i$ and $a_j$, multiplied by the sum over families of the minimum number of Alu elements of family $k$ in the two fragments, times the fragment contact probability $h_{i,j}$:
$$
e_{i,j} = \min(a_i, a_j) \cdot \left( \sum_{k} \min(n_{k,i}, n_{k,j}) \right) \cdot h_{i,j}
$$
where $a_i$ is a boolean value indicating whether a window coincides with accessible chromatin for fragment $i$, $n_{k,i}$ is the number of Alu elements of family $k$ in fragment $i$, and $h_{i,j}$ is the contact probability between fragments $i$ and $j$. We can represent this as a matrix operation:
$$
E = A \odot N \odot H
$$
Where E is the weighted adjacency matrix, A is the chromatin accessibility matrix, N is the Alu element matrix, and H is the Hi-C contact probability matrix.
The combination of these three datasets (Hi-C contact probabilities, ATAC-seq chromatin accessibility, and Alu element counts) constructs a crude weighted graph in which fragments are more likely to cluster with one another if they contain potentially chromatin-modulating Alus.
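A toy numeric illustration of this construction (the matrices below are made up; our real pipeline builds them from the Fit Hi-C, ATAC-seq, and Dfam files as described in the supplement):
```{r, eval=FALSE}
set.seed(615)
n_frag <- 4
# Hi-C contact probabilities: symmetric, no self-contacts
H <- matrix(runif(n_frag^2), n_frag)
H <- (H + t(H)) / 2; diag(H) <- 0
# fragment accessibility (ATAC peaks): 1 only if both fragments are open
a <- c(1, 1, 0, 1)
A <- outer(a, a, pmin)
# Alu counts for two families; N_ij = sum_k min(n_ki, n_kj)
n_alu <- matrix(rpois(n_frag * 2, 2), n_frag)
N <- matrix(0, n_frag, n_frag)
for (k in 1:2) N <- N + outer(n_alu[, k], n_alu[, k], pmin)
# weighted adjacency: E = A (Hadamard) N (Hadamard) H
E <- A * N * H
E
```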
# Evaluation
To evaluate our implementation of the Leiden algorithm, we compare it to the gold standard of network analysis software, igraph,[@Csardi2006] using several toy models including "noperfectmatching" and "Zachary." For the more complex toy model, Zachary, we try several random seeds to account for randomness and show that the two implementations generate overlapping sets of partitions. We also test each function with several increasing resolution hyperparameters to demonstrate the expected behavior: the number of communities increases until, eventually, the resolution is too high to identify any communities. To evaluate efficiency, we used microbenchmark[@Mersmann2023] to show, as expected, that the igraph implementation is orders of magnitude faster than our implementation and scales better going from the 16-node graph "noperfectmatching" to the 34-node graph "Zachary", suggesting that our implementation should be optimized further before we start adding the planned features.
```{r, echo=FALSE}
table1 <- data.frame(
c("Ours", "igraph"),
c("501276.3", "147.5"),
c("501276.3", "147.5")
)
colnames(table1) <- c("Implementation",
"16-Node Graph\nAverage Time [microseconds]",
"34-Node Graph\nAverage Time [microseconds]")
knitr::kable(
table1,
caption="Table 1: Benchmarking Results comparing our implementation speed to igraph", format="html", align=c('c', 'c', 'c')
) %>% kable_styling(
full_width = F, # This line is for applying default Bootstrap theme
bootstrap_options = c("striped", "hover", "condensed", "responsive")
) %>% row_spec(
0,
bold = T,
color = "white",
background = "gray"
)
```
To see whether our weighting scheme with the Leiden implementation provides sensible results on our biological data, and runs fast enough to be useful at the scale of data it was designed for, we create weights from chromosome 21 using createAluGraph and cluster them with our runLeiden implementation. As expected, our plot shows that AluJ elements are not likely to be as involved in chromatin remodeling via RNA-RNA interactions. This is sensible because AluJ is a very old family whose copies are less likely to be transcribed and less likely to remain complementary to one another due to accumulated mutations. A plot of these communities is shown in Figure 1.
```{r, echo=FALSE}
knitr::include_graphics("./alu_plot.png")
```
All relevant code to run the benchmarking and generate the above plot can be found in the supplementary notebook.
# Contributions
Brandt Bessell came up with the project idea, contributed to developing, debugging and testing all parts of the Leiden implementation, converted it to a documented and installable package, made the presentation slides and helped write the project proposal and project update. He wrote the introduction, the problem description, package features, evaluation, references, and supplementary sections of the report. He helped write and proofread the algorithm description of the report. He benchmarked the algorithm on toy models. He consulted the matrix weight calculation team and did proofreading.
Xuyuan Zhang contributed to writing, debugging and testing the Leiden implementation, wrote the functions and documentation for creating the input weights as well as plotting the outputs of our Leiden implementation. He helped test those functions on our biological data. He helped discuss the presentation, helped write the project proposal and update and helped write the weight calculation section of the final report.
Sicheng Qian helped write some of the functions for creating the input weights. He helped write the project proposal, helped write the Leiden algorithm description section of the final report, and created the .bib file for the markdown.
# Supplemental
This section walks through installing our package and replicating our benchmarking results. First, set your working directory.
```{r, results='hide', message=FALSE, warning=FALSE}
# change this!
setwd("C:/Users/bbessell/OneDrive - Michigan Medicine/Documents/Courses/BIOSTAT615/test_package")
#setwd("C:/Users/bbessell/Projects/GitHub/AluNet")
```
To run our package, first install it along with some other dependencies we will use for benchmarking.
```{r, results='hide', message=FALSE, warning=FALSE}
if (!require("igraph")) {
install.packages("igraph")
}
if (!require("microbenchmark")) {
install.packages("microbenchmark")
}
if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("babessell1/AluNet/AluNet")
# if the above doesn't work, try
#url <- "https://github.com/babessell1/AluNet/raw/main/AluNet_1.0.tar.gz"
#download.file(url, "AluNet_1.0.tar.gz")
#install.packages("AluNet_1.0.tar.gz",repo=NULL)
```
Then we will load it, igraph, and microbenchmark to benchmark against the current gold standard of clustering algorithm implementations.
```{r, results='hide', message=FALSE, warning=FALSE}
library("igraph")
library("microbenchmark")
library("AluNet")
```
First, we check against several toy models. Note that the RNG is different for each implementation of Leiden, so we run several different seeds for each one to show that they function similarly.
To do this, we define a function to run igraph's Leiden algorithm and plot the output; we define it here since it is not part of our package.
```{r}
runIgraph <- function(gdf, iterations, gamma, theta) {
  # note: edge weights are read from g_list in the calling environment,
  # so g_list must be defined before this function is called
  ldc <- cluster_leiden(
    gdf,
    resolution_parameter = gamma,
    objective_function = "CPM",
    weights = g_list[["weight"]],
    beta = theta,
    n_iterations = iterations,
    # unit vertex weights, one per unique node, to match our implementation
    vertex_weights = rep(1, length(unique(c(g_list[["to"]], g_list[["from"]]))))
  )
  return(ldc)
}
plotIgraph <- function(gdf, ldc) {
color_palette <- heat.colors(100)
# create colormap for the communities
get_community_colors <- function(num_communities) {
rainbow(num_communities)
}
# map communities to the colormap
num_communities <- length(unique(ldc$membership))
community_colors <- get_community_colors(num_communities)
colors <- community_colors[as.factor(ldc$membership)]
plot(
gdf,
layout = layout_with_fr(gdf),
vertex.color = colors, # Color nodes based on community
#edge.color = color_palette[g_list$weight], # Map edge color from cold to hot
vertex.label = V(gdf)$community,
edge.arrow.size = 0, # Set arrow size to 0 to remove arrows
main = "iGraph's C-based clustering"
)
# Add a legend for community colors
legend("topright", legend=unique(ldc$membership),
col = community_colors, pch = 19,
title = "Communities", cex = 0.8)
}
```
Our first toy model is "noperfectmatching", which has very well-defined communities except for the central node, which can belong to any of them depending on the starting conditions. This is a simple model and thus should converge in only a single iteration, despite us setting a maximum of 10.
We define our resolution parameter, gamma, as 0.1, which should be small enough to identify the more granular communities that make it easier to compare our accuracy to igraph's. We set theta to 0.01 because it is the middle of the range recommended by the authors of the Leiden algorithm, which should result in a moderate degree of randomness in the refinement step. Since the toy models come from the igraph package, we also have to convert them into a format interpretable by our function: a list of connections with items "to", "from", and "weight".
```{r}
toy = "noperfectmatching"
iterations <- 10
gamma <- .1 # resolution parameter; gamma > 0
theta <- 0.01 # randomness parameter; recommended range is 0.005 to 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Next we benchmark the model to show that igraph, which is developed and maintained by a very large team of computer scientists, is much more optimized than our implementation.
```{r, results='hide', message=FALSE, warning=FALSE}
if (exists(".Random.seed")) rm(.Random.seed)  # reset RNG state before benchmarking
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 1, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 1, gamma, theta),
times=50
)
```
```{r}
# Print the result
print(rbind(mb_ours, mb_igraph))
```
Next we try a larger, more complex toy model to also show that igraph's implementation scales better than ours (4:1 versus 5:1) for a graph that is roughly twice as big.
```{r, results='hide', message=FALSE, warning=FALSE}
toy = "Zachary"
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
if (exists(".Random.seed")) rm(.Random.seed)  # reset RNG state before benchmarking
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 3, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 3, gamma, theta),
times=50
)
```
```{r}
print(rbind(mb_ours, mb_igraph))
```
Now, let's increase the iteration number to 10 so that the partition can converge, and try several random seeds to see if the sets of partitions produced by each method overlap.
```{r}
iterations <- 10
gamma <- .1 # resolution parameter; gamma > 0
theta <- 0.01 # randomness parameter; recommended range is 0.005 to 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Notice these partitions are slightly different. Since this graph is a bit more complicated and prone to RNG-based variation, which cannot be equalized by setting the same seed due to differences in implementation, we can rerun the igraph implementation with 2 as our seed instead of 1 to get the same partition.
```{r}
gdf <- make_graph(toy)
set.seed(2)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Let's see if we consistently produce similar community configurations.
```{r}
seed <- 3
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(3)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 4
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(4)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 5
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(5)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Increasing the resolution parameter, gamma, should decrease the size of the communities; let's see if that happens.
```{r}
gamma <- 0.5
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Let's create a temporary directory and make sure we have a few more dependencies needed to download and process the data and to plot our results. To use our already-processed data instead, skip ahead to the final steps below.
```{r, eval=FALSE}
if (!require("R.utils")) {
install.packages("R.utils")
}
library("R.utils")
if (!require("RColorBrewer")) {
install.packages("RColorBrewer")
}
library(RColorBrewer)
if (!require("dplyr")) {
install.packages("dplyr")
}
library(dplyr)
if (!require("stringr")) {
install.packages("stringr")
}
library(stringr)
dir.create("temp")
```
To show its utility on biological data, we first download some Fit Hi-C data.[@Lee2023]
```{r, eval=FALSE}
file_path <- paste0(getwd(), "/temp")
download_hic <- function(file_path){
  library(R.utils)
  file_name <- "hic_data.txt.gz"
  url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE148434&format=file&file=GSE148434%5FNS%5Fall%2E5kb%2E1e%2D2%2EFitHiC%2Eall%2Etxt%2Egz"
  gz_file <- paste(file_path, file_name, sep = "/")
  download.file(url, gz_file, mode = "wb")
  # decompress next to the download instead of changing the working directory
  gunzip(gz_file, remove = FALSE)
}
download_hic(file_path)
```
Then we download the Dfam mobile element annotations[@Storer2021] (be careful, the file is very large). We place it in a folder in the current working directory.
```{r, eval=FALSE}
download_dfam <- function(file_path){
url <- "https://www.dfam.org/releases/Dfam_3.8/annotations/hg38/hg38.hits.gz"
download_directory <- paste(file_path, "uncleaned_data", sep = "/")
# Create the directory
if (!file.exists(download_directory)) {
dir.create(download_directory)
}
file_name <- "hg38.hits.gz"
hg38_file <- paste(download_directory, file_name, sep = "/")
download.file(url, hg38_file, mode = "wb")
gunzip(hg38_file, remove=FALSE)
}
download_dfam(file_path)
```
Then we download some example ATAC-seq data for the same tissue and disease context [@Corces2020]
```{r, eval=FALSE}
download_atac <- function(file_path){
library(readxl)
url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-020-00721-x/MediaObjects/41588_2020_721_MOESM4_ESM.xlsx"
download_directory <- paste(file_path, "uncleaned_data", sep = "/")
# Create the directory
if (!file.exists(download_directory)) {
dir.create(download_directory)
}
file_name <- "atac_data.xlsx"
atac_file <- paste(download_directory, file_name, sep = "/")
download.file(url, atac_file, mode = "wb")
# Read the Excel file
atac_data <- read_excel(atac_file)
# Specify the name of the output CSV file
output_csv_name <- "atac_data.csv"
output_csv_file <- paste(download_directory, output_csv_name, sep = "/")
# Save as CSV
write.csv(atac_data, output_csv_file, row.names = FALSE)
}
download_atac(file_path)
```
Then we run several helper functions to convert each dataset into a data frame (matrix):
```{r, eval=FALSE}
data_dir <- paste0(getwd(), "/temp")
returned_path <- clean_alu.R(
paste0(data_dir, "/hic_data.txt"),
paste0(data_dir, "/uncleaned_data/hg38.hits"),
data_dir
)
```
Then we use the files we made to generate the edges:
```{r, eval=FALSE}
cleaning_dir <- clean_alu(paste0(data_dir, "/highest_probability"), data_dir)
merge_hic_with_alu(data_dir, cleaning_dir)
```
```{r, eval=FALSE}
dataframe <- read.csv("edges_data_frame.csv")
dataframe$group <- NULL
edge_list <- as.list(dataframe)  # avoid masking base::list
```
```{r, eval=FALSE}
result <- runLeiden(edge_list, 1, 1, 0.01, 1)
```
If you don't want to download the large datasets, you can download our processed data instead to run our implementation of the Leiden algorithm on it.
```{r}
url <- "https://github.com/babessell1/AluNet/raw/main/final_data.csv"
download.file(url, "final_data.csv")
dataframe_ <- read.csv("final_data.csv")
dataframe_$group <- NULL
list_ <- as.list(dataframe_)
```
Then we run the Leiden algorithm.
```{r}
result <- runLeiden(list_, 10, 1, 0.01, 1)
```
Plot the results; uncomment the png and dev.off lines to also save a high-dpi PNG.
```{r}
#png("alu_plot.png", width = 1000, height = 1000, res = 300)
plotLeiden(result, TRUE)
#dev.off()
```
# References