compute-overlaps.Rmd

# Compute overlaps

Objective: compute overlaps and summary statistics between two sets of
genomic ranges. In particular, suppose we want to compute the mean
genomic extent (distance from left-most to right-most basepair) of 
genes overlapping a set of query ranges.

We move on from the "classroom example" by seeing how we compute
overlaps when the features are in genomic space. We will use *GRanges*
in the Bioconductor package *GenomicRanges* [@Lawrence2013]
to represent the features and *plyranges* [@Lee2019]
to compute the overlaps, similarly to how we used
*dplyr* to compute the overlaps in the previous analysis. So
data.frame is to *dplyr* as GRanges is to *plyranges*.

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```

```{r message=FALSE}
library(plyranges)
```

Note the structure of the *GRanges* object. We can create a *GRanges*
from a data.frame by specifying two of: `start`, `end`, or `width`.

```{r}
df <- data.frame(
  seqnames="chr1",
  start=1 + c(34e6,36e6,36.6e6),
  width=c(2e5,2e5,1e5),
  strand=c("+","-","-"),
  range_id=factor(c("foo","bar","boo")))
r <- as_granges(df)
r
```

In case you haven't seen this before, *GRanges* objects have specific
functions to pull out information. See `?GRanges` for details.

```{r}
length(r)
seqnames(r)
strand(r)
```

Let's use *plyranges* to find the genes that overlap a region of
interest.

Typically, it is prefered to use Ensembl or GENCODE gene annotation, 
the latter of which can by obtained from *AnnotationHub*. Ensembl gene
annotation can be easily manipulated with the *ensembldb*
package [@rainer2019].
Provided is some example code in an unevaluated code chunk:

```{r eval=FALSE}
library(ensembldb)
edb <- ... # obtain from AnnotationHub or from GTF file
g <- genes(edb)
```

Here, we will work with a static TxDb that is distributed as an
annotation package in Bioconductor. We use this TxDb because it is
an older genome release (hg19) that matches some ranges we will work
with later, but generally it is recommended to use a recent (and
versioned) Ensembl or GENCODE gene annotation.

```{r message=FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
```

The same function can be used to extract the gene ranges:

```{r}
g <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
g <- keepStandardChromosomes(g, pruning.mode="coarse")
g
```

Now we are ready to test for overlaps. A left join gives us all the
overlaps for ranges on the left side (here `r`). If a range on the
left has no overlaps it appears with `NA` for the metadata columns of
the right side ranges. If a range on the left side has multiple
overlaps with the right side, it will appear multiple times in the
output. 

Keeping reading below on how to deal with this, if it is desired to
have statistics on per-range overlaps.

```{r}
r %>% join_overlap_left(g)
```

If we want to exclude the zero matches cases, we can use an inner
join:

```{r}
r %>% join_overlap_inner(g)
```

:::: {.note}
What about other types of overlap?  We can specify a distance, called
a "gap", if we want to also find when ranges are *near* each other, up
to a maximum allowed distance. We can also specify the minimum amount
of overlapping basepairs. Every overlap function in plyranges has
arguments `maxgap` and `minoverlap`. For example, to see if ranges are
50kb from each other, we would specify `maxgap=5e4`. If we want to
know if ranges are 50kb from a particular endpoint of another set of
ranges, for example TSS, we could perform the operations `anchor_5p()`
followed by `mutate(width=1)`, before overlapping the sets.
::::

We can also perform summarization by columns either in the `r` or the
`g` object:

```{r}
r %>% join_overlap_inner(g) %>%
  group_by(range_id) %>%
  summarize(count=n())
```

This is giving us the same information as the following:

```{r}
r %>% count_overlaps(g)
```

Which can be added to the range data with a `mutate` call:

```{r}
r %>% mutate(overlaps = count_overlaps(., g))
```

:::: {.note}
Up to this point, we could have used R's native pipe `|>` in place of
the *magrittr* pipe `%>%`. However the above line won't work with R's
native pipe, even if we use the underscore placeholder `_` that goes
with R's native pipe, instead of the dot placeholder `.`. This is
because R's native pipe doesn't allow the placeholder to be within a
nested expression (`count_overlaps` called before `mutate`). In
addition, R's native pipe only allows use of the placeholder once;
sometimes it is convenient to access the incoming object more than
once.
::::

If we don't care about multiple overlaps, but just want a binary 
variable that records if there was one or more overlaps or not,
we can ask if the count of overlaps is greater than 0:

```{r}
r %>% mutate(overlaps_any = count_overlaps(., g) > 0)
```

If we want to keep the information about the gene ranges, we swap the
order of the ranges in the command:

```{r}
g %>% join_overlap_inner(r)
```

If we want strand specific overlaps, we can add `_directed`:

```{r}
g %>% join_overlap_inner_directed(r)
```

By turning the join around, we have access to the genomic range 
information about the genes. Now we can compute, e.g. the average 
genomic extent of the genes (first base to last base), per overlapping
range.

```{r}
g %>% join_overlap_inner_directed(r) %>%
  group_by(range_id) %>%
  summarize(count=n(), mean_width=mean(width))
```

What about `"foo"`? We need to add a `complete()` call to account for
the fact that we are missing those overlaps after the join. We need to
call the function explicitly from the *tidyr* package but by not
loading the package we can avoid some function name conflicts with
*plyranges*. Also we need to convert to tibble (explanation
follows).

```{r}
library(tibble)
g %>% join_overlap_inner_directed(r) %>%
  group_by(range_id) %>%
  summarize(count=n(), mean_width=mean(width)) %>%
  as_tibble() %>%
  tidyr::complete(range_id, fill=list(count=0))
```

Why did we have to convert to tibble before running `complete()`?
This is because metadata columns of *GRanges* objects are in a format
called *DataFrame*, on which the *tidyr* / *dplyr* functions don't
know how to operate. 

To access metadata columns of a *GRanges* object, you can use any
of these paradigms:

```{r}
mcols(r)
mcols(r)$range_id
r$range_id # this works also
mcols(r)[["range_id"]] # for programmatic access
```

But if you want to work on them in with *tidyr* / *dplyr*, you need to
first convert to tibble (or data.frame):

```{r}
mcols(r) %>% as_tibble()
```

:::: {.note}
Reduce instead of summarize:
Above when we used `group_by` and `summarize` we lost the original
range data. Another option, to preserve the range data, is to use
the function `reduce_ranges` within groups that we define (which can
be `_directed` or not). 
::::

If we want to preserve the range information
for the `r` object, we can start with `r` and proceed to join, group
by, and reduce within groups.  For an example of `reduce_ranges` used
in the context of a genomic data analysis see @Lee2020.
In order to compute on the gene widths, we have to add that as a
metadata column within the join. To keep the no-gene-overlapping
ranges in `r`, we can count when the gene ID is not `NA`.

```{r}
r %>%
  join_overlap_left_directed(g %>% mutate(gene_width=width)) %>%
  group_by(range_id) %>%
  reduce_ranges(count=sum(!is.na(gene_id)),
                mean_width=mean(gene_width))
```

Hopefully, you've seen that there are many routes to compute the types
of statistics of interest. The best way to decide which one to use is
to think first: what do I want the final output to look like, and do I
need to keep track of non-overlapping ranges? This will help dictate
the way you set up your code, whether a `join` or a `mutate` to just
tally a column of overlaps, etc., and whether a `complete` call is
needed to fill in missing levels at the end of the analysis.