---
title: "AluNet Supplementary"
author: "Brandt Bessell, Xuyuan Zhang, Sicheng Qian"
date: "2023-12-08"
---
This section walks through installing our package and replicating our benchmarking results. First, set your working directory:
```{r, results='hide', message=FALSE, warning=FALSE}
# change this!
setwd("C:/Users/bbessell/OneDrive - Michigan Medicine/Documents/Courses/BIOSTAT615/test_package")
#setwd("C:/Users/bbessell/Projects/GitHub/AluNet")
```
To run our package, first install it along with the other dependencies we will use for benchmarking:
```{r, results='hide', message=FALSE, warning=FALSE}
if (!require("igraph")) {
install.packages("igraph")
}
if (!require("microbenchmark")) {
install.packages("microbenchmark")
}
if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("babessell1/AluNet/AluNet")
# if the above doesn't work, try
#url <- "https://github.com/babessell1/AluNet/raw/main/AluNet_1.0.tar.gz"
#download.file(url, "AluNet_1.0.tar.gz")
#install.packages("AluNet_1.0.tar.gz",repo=NULL)
```
Then we load it, along with igraph and microbenchmark, so we can benchmark against igraph's implementation, the current gold standard.
```{r, results='hide', message=FALSE, warning=FALSE}
library("igraph")
library("microbenchmark")
library("AluNet")
```
First, we check against several toy models. Note that the RNG differs between Leiden implementations, so we run several seeds for each to show that they behave similarly.
To do this we define a function to run igraph's Leiden algorithm and plot the output; we define it here because it is not part of our package.
```{r}
runIgraph <- function(gdf, iterations, gamma, theta) {
  ldc <- cluster_leiden(
    gdf,
    resolution_parameter = gamma,
    objective_function = "CPM",
    weights = g_list[["weight"]],  # note: g_list is read from the calling environment
    beta = theta,
    n_iterations = iterations,
    vertex_weights = rep(1, length(unique(c(g_list[["to"]], g_list[["from"]]))))
  )
  return(ldc)
}
plotIgraph <- function(gdf, ldc) {
  # map communities to a rainbow colormap
  num_communities <- length(unique(ldc$membership))
  community_colors <- rainbow(num_communities)
  colors <- community_colors[as.factor(ldc$membership)]
  plot(
    gdf,
    layout = layout_with_fr(gdf),
    vertex.color = colors,   # color nodes based on community
    vertex.label = V(gdf)$community,
    edge.arrow.size = 0,     # set arrow size to 0 to remove arrows
    main = "iGraph's C-based clustering"
  )
  # add a legend for community colors
  legend("topright", legend = unique(ldc$membership),
         col = community_colors, pch = 19,
         title = "Communities", cex = 0.8)
}
```
Our first toy model is "noperfectmatching", which has well-defined communities except for the central node, which can belong to any of them depending on the starting conditions. This model is simple and should converge in a single iteration, even though we set a maximum of 10.
We set the resolution parameter, gamma, to 0.1, which should be small enough to identify granular communities and make it easier to compare our accuracy against igraph. We set theta to 0.01, the middle of the range recommended by the authors of the Leiden algorithm, which should give a moderate degree of randomness in the refinement step. Since the toy models come from the igraph package, we also convert them into the format our function expects: a list of connections with "to", "from", and "weight" vectors.
```{r}
toy = "noperfectmatching"
iterations <- 10
gamma <- 0.1   # resolution parameter, gamma > 0
theta <- 0.01  # recommended range: 0.005 to 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
```
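As a quick sanity check (illustrative, not part of the original workflow), we can assert that the edge list has the structure runLeiden expects: three parallel vectors named "to", "from", and "weight" of equal length.

```{r}
# illustrative sanity check on the edge-list format
stopifnot(
  all(c("to", "from", "weight") %in% names(g_list)),
  length(g_list$to) == length(g_list$from),
  length(g_list$to) == length(g_list$weight)
)
```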
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Next we benchmark the model to show that igraph, which is developed and maintained by a large team of computer scientists, is much more optimized than our implementation.
```{r, results='hide', message=FALSE, warning=FALSE}
if (exists(".Random.seed")) rm(.Random.seed)
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 1, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 1, gamma, theta),
times=50
)
```
```{r}
# Print the result
print(rbind(mb_ours, mb_igraph))
```
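To summarize the performance gap as a single number, one can compare median runtimes; this is a sketch using microbenchmark's summary() with a fixed unit so the two objects are directly comparable.

```{r}
# sketch: ratio of median runtimes (ours vs. igraph), both in milliseconds
med_ours <- summary(mb_ours, unit = "ms")$median
med_igraph <- summary(mb_igraph, unit = "ms")$median
med_ours / med_igraph
```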
Next we try a larger, more complex toy model to show how igraph's implementation scales relative to ours: about 4:1 here, compared to 5:1 above, for a graph roughly twice as big.
```{r, results='hide', message=FALSE, warning=FALSE}
toy = "Zachary"
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
if (exists(".Random.seed")) rm(.Random.seed)
# Use microbenchmark to time the function
mb_ours <- microbenchmark(
runLeiden(g_list, 3, gamma, theta),
times=50
)
mb_igraph <- microbenchmark(
runIgraph(gdf, 3, gamma, theta),
times=50
)
```
```{r}
print(rbind(mb_ours, mb_igraph))
```
Now, let's increase the iteration number to 10 so that the partition can converge, and try several random seeds to see if the sets of partitions produced by each method overlap.
```{r}
iterations <- 10
gamma <- .1 # gamma > 0
theta <- 0.01 # good theta is 0.005 < 0.05
g <- as_edgelist(make_graph(toy))
g_list <- list(
to=g[,1],from=g[,2],
weight=rep(1, length(g[,1])) # all 1
)
gdf <- make_graph(toy)
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Notice these partitions are slightly different. This graph is more complicated and prone to RNG-driven variation, which cannot be equalized by setting the same seed because the implementations differ. We can rerun the igraph implementation with seed 2 instead of 1 to obtain the same partition.
```{r}
gdf <- make_graph(toy)
set.seed(2)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
Let's see if we consistently produce similar community configurations.
```{r}
seed <- 3
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(3)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 4
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(4)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
```{r}
seed <- 5
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(5)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
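Beyond eyeballing the plots, partition similarity can be quantified with igraph's compare(). As an illustrative sketch, this computes the adjusted Rand index between the igraph partitions from two of the seeds above; values near 1 indicate near-identical community structure.

```{r}
# sketch: adjusted Rand index between igraph partitions from two seeds
set.seed(4)
m1 <- runIgraph(gdf, iterations, gamma, theta)$membership
set.seed(5)
m2 <- runIgraph(gdf, iterations, gamma, theta)$membership
compare(m1, m2, method = "adjusted.rand")
```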
Increasing the resolution parameter, gamma, should decrease the size of the communities; let's see if that happens.
```{r}
gamma <- 0.5
```
```{r}
seed <- 1
result <- runLeiden(g_list, iterations, gamma, theta, seed)
plotLeiden(result)
```
```{r}
gdf <- make_graph(toy)
set.seed(1)
igraph_result <- runIgraph(gdf, iterations, gamma, theta)
plotIgraph(gdf, igraph_result)
```
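To see the trend more systematically, here is a sketch that sweeps gamma and counts the resulting igraph communities using the runIgraph helper defined above; the count should grow as gamma increases.

```{r}
# sketch: community count as a function of the resolution parameter
for (g_res in c(0.1, 0.5, 1.0)) {
  set.seed(1)
  res <- runIgraph(gdf, iterations, g_res, theta)
  cat("gamma =", g_res, "->", length(unique(res$membership)), "communities\n")
}
```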
Let's create a temporary directory and make sure we have a few more dependencies needed to download and process the data and plot our results. To download the processed data directly, skip to the next section.
```{r, eval=FALSE}
if (!require("R.utils")) {
install.packages("R.utils")
}
library("R.utils")
if (!require("RColorBrewer")) {
install.packages("RColorBrewer")
}
library(RColorBrewer)
if (!require("dplyr")) {
install.packages("dplyr")
}
library(dplyr)
if (!require("stringr")) {
install.packages("stringr")
}
library(stringr)
dir.create("temp")
```
To show its utility on biological data, we first download some Fit-Hi-C data [@Lee2023]:
```{r, eval=FALSE}
file_path <- paste0(getwd(), "/temp")
download_hic <- function(file_path){
  file_name <- "hic_data.txt.gz"
  url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE148434&format=file&file=GSE148434%5FNS%5Fall%2E5kb%2E1e%2D2%2EFitHiC%2Eall%2Etxt%2Egz"
  dest <- file.path(file_path, file_name)
  download.file(url, dest, mode = "wb")
  # decompress next to the .gz instead of calling setwd(), which would
  # change the working directory for the rest of the notebook
  gunzip(dest, remove = FALSE)
}
download_hic(file_path)
```
Then we download the Dfam mobile element annotations [@Storer2021] (be careful: the file is very large). We place it in a folder in the current working directory:
```{r, eval=FALSE}
download_dfam <- function(file_path){
  url <- "https://www.dfam.org/releases/Dfam_3.8/annotations/hg38/hg38.hits.gz"
  download_directory <- paste(file_path, "uncleaned_data", sep = "/")
  # create the directory if it does not already exist
  if (!file.exists(download_directory)) {
    dir.create(download_directory)
  }
  file_name <- "hg38.hits.gz"
  hg38_file <- paste(download_directory, file_name, sep = "/")
  download.file(url, hg38_file, mode = "wb")
  gunzip(hg38_file, remove = FALSE)
}
download_dfam(file_path)
```
Then we download some example ATAC-seq data for the same tissue and disease context [@Corces2020]:
```{r, eval=FALSE}
download_atac <- function(file_path){
  # readxl is needed here but was not installed above
  if (!require("readxl")) {
    install.packages("readxl")
  }
  library(readxl)
  url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-020-00721-x/MediaObjects/41588_2020_721_MOESM4_ESM.xlsx"
  download_directory <- paste(file_path, "uncleaned_data", sep = "/")
  # create the directory if it does not already exist
  if (!file.exists(download_directory)) {
    dir.create(download_directory)
  }
  file_name <- "atac_data.xlsx"
  atac_file <- paste(download_directory, file_name, sep = "/")
  download.file(url, atac_file, mode = "wb")
  # read the Excel file and save it as CSV
  atac_data <- read_excel(atac_file)
  output_csv_file <- paste(download_directory, "atac_data.csv", sep = "/")
  write.csv(atac_data, output_csv_file, row.names = FALSE)
}
download_atac(file_path)
```
Then we run several helper functions to convert each dataset into a data frame:
```{r, eval=FALSE}
data_dir <- paste0(getwd(), "/temp")
returned_path <- clean_alu.R(
paste0(data_dir, "/hic_data.txt"),
paste0(data_dir, "/uncleaned_data/hg38.hits"),
data_dir
)
```
Then we use the files we made to generate the edges:
```{r, eval=FALSE}
cleaning_dir <- clean_alu(paste0(data_dir, "/highest_probability"), data_dir)
merge_hic_with_alu(data_dir, cleaning_dir)
```
```{r}
dataframe <- read.csv("edges_data_frame.csv")
dataframe$group <- NULL
list <- as.list(dataframe)
```
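Before clustering, it can help to confirm the loaded edge list matches the to/from/weight convention used earlier; a quick illustrative check (assuming the cleaned file follows that convention):

```{r}
# illustrative: inspect the columns runLeiden will receive
str(list)
stopifnot(all(c("to", "from", "weight") %in% names(list)))
```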
```{r}
result <- runLeiden(list, 1, 1, 0.01, 1)
```
If you don't want to download the large datasets, you can instead download our processed data and run our implementation of the Leiden algorithm on it.
```{r}
url <- "https://github.com/babessell1/AluNet/raw/main/final_data.csv"
download.file(url, "final_data.csv")
dataframe_ <- read.csv("final_data.csv")
dataframe_$group <- NULL
list_ <- as.list(dataframe_)
```
Then we run the Leiden algorithm.
```{r}
result <- runLeiden(list_, 10, 1, 0.01, 1)
```
Plot the results and, optionally, save them as a high-DPI PNG by uncommenting the png() and dev.off() lines.
```{r}
#png("alu_plot.png", width = 1000, height = 1000, res = 300)
plotLeiden(result, TRUE)
#dev.off()
```