notebook.Rmd

---
title: "R Notebook"
output: html_notebook
---

# SimpleGeneExpression
Programs to quantify expression of transcripts from public datasets


### Project team 
   * [Jose V Die](https://github.com/jdieramon)
   * [Moamen Elmassry](https://github.com/MoamenElmassry)
   * [Kim LeBlanc](https://github.com/kleblanc5909) 
   * [Ben Busby](https://github.com/DCGenomics)


<br>

This tutorial is based on the analysis of the Magic-BLAST table output. Please, read the 
[documentation](https://ncbi.github.io/magicblast/) to understand how Magic-BLAST 
works.   

The first three rows of the typical magic-BLAST output file contains some info 
related to the software version, the command line used and the fields description: 
  
![](figures/magic_header.png)  
  
<br>
We may want to remove those rows with the command line :   

```{}
tail -n +4 my_magic > my_magic
```

Here, the starting point is a dataset made of 24 auxin receptor factors (ARF) genes from the chickpea genome. Code to identify those accessions is available from the [GeneHummus](https://github.com/NCBI-Hackathons/GeneHummus) repository. We will study the frequency of the 24 ARF genes in root tissues of two genotypes under drought stress and control conditions across 4 SRA libraries:   

* [SRR5927129](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927129). Susceptible_Control  
* [SRR5927130](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927130). Susceptible_Drought  
* [SRR5927133](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927133). Tolerant_Control  
* [SRR5927134](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927134). Tolerant_Drought  

First, load the functions needed for the analysis. 


Load functions
```{r message=FALSE}
source("functions/expression.R")
```


First we load the `my_tabledemo.rda` object. This object contains the first 500,000 
lines of the magic-BLAST table output. The original table has been cut for 
illustrative purposes. After loading the object into the environment, the `my.magic` 
dataset is available. 

```{r}
load("data/my_tabledemo.rda")
```


Next,  we want to name the columns. 
```{r}
names(my.magic) <- c("query.acc", "reference.acc", "identity", "not.used", "not.used.1",
              "not.used.2", "query.start", "query.end", "reference.start", 
              "reference.end", "not.used.3", "not.used.4", "score", "query.strand",
              "reference.strand", "query.length", "BTOP", "num.placements", 
              "not.used.5", "compartment", "left.overhang", "right.overhang", 
              "mate.reference", "mate.ref..start", "composite.score")    
```

Check the dataset dimensions.
```{r}
dim(my.magic)
```

Before analyzing the data, we have to apply a number of filters to tidy the dataset. 

## Filter
### Filter 1 : alignement length score 
First, check the query lengths (length reads ). We can see that the whole reads dataset is 125 bp length. 
```{r qlengths}
table(my.magic$query.length)
```

Now, we check the alignement length scores:  

```{r plot_alignlengths}
plot(density(my.magic$score))
```

**Filter1** involves filtering by high alignement length score. 
Here we use at least 120 bp as an argument for the function `by_score`. 

```{r byscore, cache  = TRUE}
df_filtered <- by_score(my.magic, 120) 
plot(density(df_filtered$score), col = "red")
```


### Filter 2 : identity 
We want high scores for the alignements but also alignements with high identities. 
This chunk plots the identity distribution for the filtered data : 
```{r}
plot(density(df_filtered$identity))
```

**Filter2** involves filtering by high identities. Here we use at least 99% as 
an argument for the function `by_identity`
```{r by_identity, cache = TRUE}
df_filtered <- by_identity(df_filtered, 99)
plot(density(df_filtered$identity), col = "red")
```

The function `tidy_query.acc`creates a new dataset with the number of sequence 
counts from the filtered data.  
```{r}
df_filtered <- tidy_query.acc(df_filtered)
# Create a tibble object
df_filtered <- as_data_frame(df_filtered)
```


### Plot of Counts
```{r cache = TRUE}
hist(df_filtered$Count)
```

### Plot Counts ~ SRA run
```{r}
boxplot(df_filtered$Count ~ df_filtered$Query, 
        outline = FALSE, 
        main = "Sequence Counts per run", 
        col = c("grey80", "grey50", "grey80", "grey50"))
```
  
  
### Optional filter:   
Genes with presence at least in *n* SRA libraries    

```{r genes4SRA}
df_filtered %>% 
  group_by(Reference) %>% 
  summarize(Total = n()) %>%
  arrange(Total)
```
Filter by genes present in all 4 libraries. 

```{r genes4SRAfilter}
# total number of occurrences for each gene
mydf <- df_filtered %>% 
  filter(Reference != "LOC101514738" )

mydf
```

Now, add a column with the SRA run size with the `gatherSize` function 

```{r runsize, cache = TRUE}
mydf <- mutate(mydf, Runsize = gatherSize(mydf))
```

Look at head and tail of `mydf`

```{r}
mydf %>% head()

```

```{r}
mydf %>% tail()
```

### Normalized counts

```{r normCounts, cache = TRUE}
mydf_norm <- mydf %>% 
  mutate(norm_Counts = Count*mean(Runsize)/Runsize) %>%
  select(Query, Reference, norm_Counts)

mydf_norm

```

## Analysis of normalized counts
**Boxplot of gene counts distribution**  

```{r countsboxplot}
mydf_norm %>% 
  ggplot(aes(x =  reorder(Reference, norm_Counts), y = norm_Counts)) + 
  geom_boxplot()

```
Dataset of normalized **counts per gene and SRA library**
```{r cache = TRUE}
library(tidyr)
bySRA <-  spread(mydf_norm, Query, norm_Counts)
bySRA


```

**Density plots of normalized counts**  

```{r density, cache = TRUE}
plot(density(bySRA$SRR5927129), col = 1, main = 'Counts', ylim = c(0, 0.00105))
lines(density(bySRA$SRR5927130), col = 2)
lines(density(bySRA$SRR5927133), col = 3)
lines(density(bySRA$SRR5927134), col = 4)
```


**Scatter-plot**, comparing counts of two samples to each other
### Example1
```{r scatterControl, cache  = TRUE}
ggplot(bySRA, aes(x = SRR5927129, y = SRR5927133)) +
  geom_point() +
  ggtitle("Control roots") + 
  xlab("Tolerant plants") + 
  ylab("Susceptible plants") + 
  geom_smooth(method = "lm")
```

### Example2
```{r scatterTolerant, cache  = TRUE}
ggplot(bySRA, aes(x = SRR5927134, y = SRR5927133)) +
  geom_point() +
  ggtitle("Roots", "Tolerant plants") + 
  xlab("Control") + 
  ylab("Drought") + 
  geom_smooth(method = "lm")

```