-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpromoter-marks.Rmd
169 lines (138 loc) · 5.36 KB
/
promoter-marks.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# Promoter marks
Objective: determine if tissue-specific promoter marks (e.g. H3K27ac)
are often near genes that are expressed in a tissue-specific manner.
We will load expression data from the GTEx project [@gtex], which gives
median expression in TPM for each tissue. We will use H3K27ac ChIP-seq
data from the ENCODE project [@encode].
```{r message=FALSE}
library(tidyr)
file <- "data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz"
gtex <- read.delim(file, skip=2)
```
We select two tissues, bladder and kidney, and convert the data from a
wide format into a tidy format.
```{r}
tissues <- gtex %>% dplyr::select(Name, Bladder, Kidney...Cortex) %>%
dplyr::rename(gene = Name, Kidney = Kidney...Cortex) %>%
dplyr::mutate(gene = sub("\\..*","",gene)) %>%
pivot_longer(!gene, names_to="tissue", values_to="tpm")
```
Now define two vectors of genes that are specific to bladder and kidney:
```{r}
bladder_expr <- tissues %>%
dplyr::filter(tissue == "Bladder" & tpm > 10) %>%
dplyr::pull(gene)
kidney_expr <- tissues %>%
dplyr::filter(tissue == "Kidney" & tpm > 10) %>%
dplyr::pull(gene)
int <- intersect(bladder_expr, kidney_expr)
bladder_expr <- setdiff(bladder_expr, int)
kidney_expr <- setdiff(kidney_expr, int)
# save(bladder_expr, kidney_expr, file="data/bladder_kidney_expr.rda")
```
Next, use an existing TxDb to locate these genes in the genomes. While
we usually recommend to use GENCODE genes for human analysis, because
the ENCODE chromatin modification peak files on AnnotationHub are in
hg19, we use the UCSC hg19 genes here for simplicity of the code:
```{r message=FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
```
Add the ENSEMBL ID and pull out the two tissue-specific sets.
```{r}
g <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
```
```{r message=FALSE}
library(plyranges)
g <- g %>% mutate(ensembl = mapIds(org.Hs.eg.db, gene_id, "ENSEMBL", "ENTREZID"))
bladder_g <- g %>% filter(ensembl %in% bladder_expr)
kidney_g <- g %>% filter(ensembl %in% kidney_expr)
```
Finally we combine the two sets with `bind_ranges`, and we change
the feature size from the whole gene extent (the range from the
leftmost exon to rightmost exon), to just the TSS, using `anchor_5p`
and `mutate`.
```{r}
tss <- bind_ranges(bladder=bladder_g,
kidney=kidney_g,
.id="gtissue") %>%
anchor_5p() %>%
mutate(width=1)
```
Now we will obtain the H3K27ac peak sets:
```{r eval=FALSE}
library(AnnotationHub)
ah <- AnnotationHub()
# query(ah, c("Homo sapiens", "bladder", "H3K27ac", "narrowPeak"))
bladder_pks <- ah[["AH44180"]]
# query(ah, c("Homo sapiens", "kidney", "H3K27ac", "narrowPeak"))
kidney_pks <- ah[["AH43443"]]
save(bladder_pks, kidney_pks, file="data/peaks.rda")
```
```{r echo=FALSE}
load("data/peaks.rda")
```
We download these and scale so they have the same 90% quantile of `signalValue`.
```{r}
ninety <- function(x) quantile(x, .9, names=FALSE)
bladder_pks <- bladder_pks %>%
mutate(signal = signalValue / ninety(signalValue))
kidney_pks <- kidney_pks %>%
mutate(signal = signalValue / ninety(signalValue))
```
Combine the peaks from bladder and kidney, filter to those with < 0.1%
FDR, and center the peak on the summit (the `peak` column gives the
shift from the left side to the summit).
```{r}
pks <- bind_ranges(bladder=bladder_pks,
kidney=kidney_pks,
.id="ptissue") %>%
filter(qValue > 3, width <= 1000) %>%
mutate(start = start + peak) %>%
select(-peak) %>%
mutate(width = 1)
```
Finally, once we have two tidy range sets, we can perform the analysis
by a join, followed by two lines that take care of multiple overlaps,
followed by two lines that give us our tallies of interest.
It appears that tissue-specific peaks are enriched near the
tissue-specific genes for both bladder and kidney.
```{r}
tss %>%
join_overlap_left(pks, maxgap=500) %>%
group_by(ptissue) %>% # within peak tissue...
filter(!duplicated(gene_id)) %>% # ...just take the first overlap per gene
group_by(gtissue, ptissue) %>%
summarize(count = n())
```
The above number could also be found with four `countOverlaps` calls,
by considering all four pairs of overlaps of the two sets of genes and
peaks.
Another way to avoid counting overlaps more than once per gene is to
use the *plyranges* function, `n_distinct()`:
```{r}
tss %>%
join_overlap_left(pks, maxgap=500) %>%
group_by(gtissue, ptissue) %>%
summarize(count = n_distinct(gene_id))
```
If we want more information per gene, e.g. suppose we want to compute
the average signal per gene of peaks nearby, we need to group twice,
once also by gene ID, and the second time integrating over gene
ID. While here we add a few more lines of code, performing such an
operation with base Bioconductor functions would require adding code
to perform the loops, adding many intermediate variables to store
results, etc.
```{r}
tss %>%
join_overlap_left(pks, maxgap=500) %>%
group_by(gtissue, ptissue, gene_id) %>% # need per gene stats
summarize(num_overlaps = n(), signal = mean(signal)) %>%
as_tibble() %>% # DataFrame to tibble for further processing
group_by(gtissue, ptissue) %>%
summarize(sum_any_overlaps = sum(num_overlaps > 0),
mean_signal=mean(signal))
```
What's wrong with this analysis?
* We didn't figure out the expressed promoter, we just looked at the
left or rightmost isoform (for + or - strand genes, respectively).