-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
330 lines (259 loc) · 16.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# CTCF
[](https://lifecycle.r-lib.org/articles/stages.html#stable)
<!-- [](https://bioconductor.org/checkResults/release/bioc-LATEST/CTCF) [](https://github.com/mdozmorov/CTCF/actions/workflows/R-CMD-check-bioc.yaml) -->
`r BiocStyle::Biocpkg('CTCF')` defines an AnnotationHub resource representing genomic coordinates
of [FIMO](https://meme-suite.org/meme/doc/fimo.html)-predicted CTCF binding
sites for human and mouse genomes, including the
[Telomere-to-Telomere](https://github.com/marbl/CHM13) and [mm39](http://genomeref.blogspot.com/2020/07/grcm39-new-mouse-reference-genome.html)
genome assemblies. It also includes experimentally defined CTCF-bound
cis-regulatory elements from [ENCODE SCREEN](https://screen.encodeproject.org/).
TL;DR - for human hg38 genome assembly, use `hg38.MA0139.1.RData` ("AH104729").
For mouse mm10 genome assembly, use `mm10.MA0139.1.RData` ("AH104755").
For [ENCODE SCREEN](https://screen.encodeproject.org/) data, use
`hg38.SCREEN.GRCh38_CTCF.RData` ("AH104730") or `mm10.SCREEN.mm10_CTCF.RData`
("AH104756") objects.
The CTCF GRanges are named as `<assembly>.<Database>`. The FIMO-predicted data
includes extra columns with motif name, score, p-value, q-value, and the motif
sequence.
<!--
**Please, note that the updated CTCF objects will be available in
Bioconductor/AnnotationHub 3.16.** To test the following code, use the
`bioconductor::devel` Docker image. Run:
```{bash eval=FALSE}
docker run -e PASSWORD=password -p 8787:8787 -d --rm -v $(pwd):/home/rstudio bioconductor/bioconductor_docker:devel
```
Open http://localhost:8787 and login using `rstudio/password` credentials.
-->
## Installation instructions
[Install](https://www.bioconductor.org/install/#install-R) the latest
release of R, then get the latest version of Bioconductor by starting R and
entering the commands:
```{r 'install1', eval = TRUE, message=FALSE, warning=FALSE}
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install(version = "3.16")
```
Then, install additional packages using the following code:
```{r 'install2', eval = TRUE, message=FALSE, warning=FALSE}
# BiocManager::install("AnnotationHub", update = FALSE)
# BiocManager::install("GenomicRanges", update = FALSE)
# BiocManager::install("plyranges", update = FALSE)
```
## Example
```{r get-data}
suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
query_data <- subset(ah, preparerclass == "CTCF")
# Explore the AnnotationHub object
query_data
# Get the list of data providers
query_data$dataprovider %>% table()
```
We can find CTCF sites identified using JASPAR 2022 database in hg38 human genome
```{r}
subset(query_data, species == "Homo sapiens" &
genome == "hg38" &
dataprovider == "JASPAR 2022")
# Same for mm10 mouse genome
# subset(query_data, species == "Mus musculus" & genome == "mm10" & dataprovider == "JASPAR 2022")
```
The `hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2` object contains
CTCF sites detected using the
[all three CTCF PWMs](https://jaspar.genereg.net/search?q=CTCF&collection=all&tax_group=all&tax_id=9606&type=all&class=all&family=all&version=all). To retrieve, we'll use:
```{r}
# hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2
CTCF_hg38_all <- query_data[["AH104727"]]
CTCF_hg38_all
```
The `hg38.MA0139.1` object contains CTCF sites detected using the most
popular [MA0139.1](https://jaspar.genereg.net/matrix/MA0139.1/) CTCF PWM.
To retrieve:
```{r}
# hg38.MA0139.1
CTCF_hg38 <- query_data[["AH104729"]]
CTCF_hg38
```
It is always advisable to sort GRanges objects and keep standard chromsomes:
```{r}
suppressMessages(library(plyranges))
CTCF_hg38_all <- CTCF_hg38_all %>% keepStandardChromosomes() %>% sort()
CTCF_hg38 <- CTCF_hg38 %>% keepStandardChromosomes() %>% sort()
```
Save the data in a BED file, if needed.
```{r eval=FALSE}
# Note that rtracklayer::import and rtracklayer::export perform unexplained
# start coordinate conversion, likely related to 0- and 1-based coordinate
# system. We recommend converting GRanges to a data frame and save tab-separated
write.table(CTCF_hg38_all %>% sort() %>% as.data.frame(),
file = "CTCF_hg38_all.bed",
sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(CTCF_hg38 %>% sort() %>% as.data.frame(),
file = "CTCF_hg38.bed",
sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
```
Create an [IGV](https://software.broadinstitute.org/software/igv/)
XML session file out of the saved BED files using the `r BiocStyle::Biocpkg('tracktables')`
package. See `vignette("tracktables", package = "tracktables")` for more details.
```{r eval=FALSE}
library(tracktables) # BiocManager::install("tracktables")
# Sample sheet metadata
SampleSheet <- data.frame(SampleName = c("CTCF all", "CTCF MA0139.1"),
Description = c("All CTCF matrices from JASPAR2022",
"MA0139.1 CTCF matrix from JASPAR2022"))
# File sheet linking files with sample names
FileSheet <- data.frame(SampleName = c("CTCF all", "CTCF MA0139.1"),
bigwig = c(NA, NA),
interval = c("CTCF_hg38_all.bed", "CTCF_hg38.bed"),
bam = c(NA, NA))
# Creating an IGV session XML file
MakeIGVSession(SampleSheet, FileSheet,
igvdirectory = getwd(), "CTCF_from_JASPAR2022", "hg38")
```
Note that the [FIMO](https://meme-suite.org/meme/doc/fimo.html) tool detects
CTCF binding sites using the 1e-4 p-value threshold by default (the more significant
p-value corresponds to the more confidently detected CTCF motif). We found that
this threshold may be too permissive. Using the
[ENCODE SCREEN](https://screen.encodeproject.org/) database as ground truth,
we found 1e-6 as the optimal threshold providing approximately 80% true positive
rate. However, less significant CTCF motifs may be cell type-specific or have
weaker CTCF binding and therefore be missed by conventional peak callers.
If cell type-specific CTCF binding is of interest, we recommend exploring
less significant CTCF sites.
```{r echo=FALSE}
knitr::include_graphics("man/figures/Figure_human_pvalues_threshold.png")
```
To filter the GRanges object and keep high-confidence CTCF sites, use:
```{r}
# Check length before filtering
print(paste("Number of CTCF motifs at the default 1e-4 threshold:", length(CTCF_hg38)))
# Filter and check length after filtering
CTCF_hg38_filtered <- CTCF_hg38 %>% plyranges::filter(pvalue < 1e-6)
print(paste("Number of CTCF motifs at the 1e-6 threshold:", length(CTCF_hg38_filtered)))
# Similarly, filter
CTCF_hg38_all_filtered <- CTCF_hg38_all %>% plyranges::filter(pvalue < 1e-6)
```
Given some databases provide multiple CTCF PWMs, one CTCF site may be detected
multiple times resulting in overlapping CTCF sites. For example, the proportion
of overlapping CTCF sites in the `CTCF_hg38_all_filtered` object containing CTCF
sites detected by three matrices nearly 40%:
```{r}
# Proportion of overlapping enrtries
tmp <- findOverlaps(CTCF_hg38_all, CTCF_hg38_all)
prop_overlap <- sort(table(queryHits(tmp)) %>% table(), decreasing = TRUE)
sum(prop_overlap[which(names(prop_overlap) != "1")]) / length(CTCF_hg38_all)
```
The proportion of overlapping CTCF sites in the `CTCF_hg38_filtered` object
containing CTCF sites detected by the [MA0139.1](https://jaspar.genereg.net/matrix/MA0139.1/)
matrix is less than 2.5%
```{r}
tmp <- findOverlaps(CTCF_hg38, CTCF_hg38)
prop_overlap <- sort(table(queryHits(tmp)) %>% table(), decreasing = TRUE)
sum(prop_overlap[which(names(prop_overlap) != "1")]) / length(CTCF_hg38)
```
Reducing them (merging overlapping CTCF sites), combined with 1E-6 cutoff
filtering, yields the number of CTCF sites comparable to previously reported.
```{r}
print(paste("Number of CTCF_hg38 motifs at the 1e-6 threshold AND reduced:", length(CTCF_hg38_filtered %>% reduce())))
print(paste("Number of CTCF_hg38_all motifs at the 1e-6 threshold AND reduced:", length(CTCF_hg38_all_filtered %>% reduce())))
```
However, regulatory elements with CTCF proteins co-occupying adjacent/overlapping
CTCF binding motifs were shown to be functionally and structurally different from
those with single CTCF motifs. We provide non-reduced CTCF data and advise
considering overlap of CTCF sites depending on the study's goal.
# liftOver of CTCF coordinates
As genome assemblies for model organisms continue to improve, CTCF sites for
previous genome assemblies become obsolete. Typically, the actual genome
sequence changes little, leading to changes in genomic coordinates.
The [liftOver](https://genome.ucsc.edu/cgi-bin/hgLiftOver)
method allows for conversion of genomic coordinates between genome assemblies.
Some carefully curated CTCF sites are available only for older genome assemblies.
Examples include the data from [CTCFBSDB](https://insulatordb.uthsc.edu/),
available for hg18 and mm8 genome assemblies.
To investigate whether liftOver of CTCF sites from older genome assemblies is
a viable option, we tested for overlap between CTCF sites directly detected in
specific genome assemblies with those lifted over. We detected CTCF sites using
the [MA0139.1](https://jaspar.genereg.net/matrix/MA0139.1/) PWM from JASPAR 2022
database in hg18, hg19, hg38, and T2T genome assemblies and converted their
genomic coordinates using the corresponding liftOver chains
([download_liftOver.sh](https://github.com/dozmorovlab/CTCF.dev/blob/main/scripts/download_liftOver.sh)
and [convert_liftOver.sh](https://github.com/dozmorovlab/CTCF.dev/blob/main/scripts/convert_liftOver.sh)
scripts). We observed high Jaccard overlap among CTCF sites detected in the
original genome assemblies or lifted over.
**Jaccard overlaps among CTCF binding sites detected in the original and liftOver human genome assemblies.** CTCF sites were detected using JASPAR 2022 MA0139.1 PWM. The correlogram was clustered using Euclidean distance and Ward.D clustering . White-red gradient indicate low-to-high Jaccard overlaps. Jaccard values are shown in the corresponding cells.
```{r echo=FALSE}
knitr::include_graphics("man/figures/Figure_liftOverJaccard.png")
```
Our results suggest that liftOver is a viable alternative to obtain CTCF genomic
annotations for different genome assemblies. We provide
[CTCFBSDB](https://insulatordb.uthsc.edu/) data converted to hg19 and hg38
genome assemblies.
# CTCF Position Weight Matrices
**CTCF PWM information.** "Motif" - individual motif IDs or the total number of motifs per database; "Length" - motif length of the range of lengths; "URL" - direct links to motif pages. Jaspar, Hocomoco, Jolma 2013 PWMs were downloaded from the [MEME database](https://meme-suite.org/meme/doc/download.html).
```{r echo=FALSE}
mtx <- read.csv("man/tables/Table_PWMs.csv")
knitr::kable(mtx)
```
**CTCF motif logos.** PWMs from (A) MEME, (B) CTCFBSDB, (C) CIS-BP human, and (D) CIS-BP mouse databases. Clustering and alignment of motifs was performed using the `r `BiocStyle::Biocpkg("motifStack")` R package.
```{r echo=FALSE}
knitr::include_graphics("man/figures/Supplementary_Figure_1.png")
```
See [inst/scripts/make-data.R](inst/scripts/make-data.R) how the CTCF GRanges
objects were created.
# CTCF predicted and experimental data
**Predefined CTCF binding data.** "Database" - source of data; "Number" - number of binding sites; "Assembly" - genome assembly; "URL" - direct link to data download.
```{r echo=FALSE}
mtx <- read.csv("man/tables/Table_Predefined.csv")
knitr::kable(mtx)
```
# All GRanges objects included in the package
**Summary of CTCF binding data provided in the package.** CTCF sites for each genome assembly and PWM combination were detected using FIMO. "ID" - object names formatted as `<assembly>.<database name>`; "Assembly" - genome assembly, T2T - telomere to telomere (GCA_009914755.4) genome assembly; "All (p-value threshold Xe-Y)" - the total number of CTCF binding sites in the corresponding BED file at the Xe-Y threshold; "Non-overlapping (p-value threshold Xe-Y)" - number of non-overlapping CTCF binding sites (overlapping regions are merged) at the Xe-Y threshold.
```{r echo=FALSE}
mtx <- read.csv("man/tables/Table_log.csv")
knitr::kable(mtx)
```
# Future development
- High-resolution CTCF footprinting, K562 cell line. CTCF MNase HiChIP data, long fragments correspond to nucleosome-protected DNA whereas short fragments arise from CTCF-protected DNA. [FactorFinder](https://github.com/aryeelab/cohesin_extrusion_reproducibility) tool for near-base-pair resolution CTCF detection, uses the distribution of short fragments around a CTCF binding site. Benchmarking against CTCF motif location, ChIP-seq peaks, loop anchors. A model of topological regulation by partially extruded CTCF loops. Data to be available. <details>
<summary>Paper</summary>
Sept, Corriene E., Y. Esther Tak, Christian G. Cerda-Smith, Haley M. Hutchinson, Viraat Goel, Marco Blanchette, Mital S. Bhakta, et al. “High-Resolution CTCF Footprinting Reveals Impact of Chromatin State on Cohesin Extrusion Dynamics.” Preprint. Genomics, October 23, 2023. https://doi.org/10.1101/2023.10.20.563340.
</details>
- Divergent CTCF sites are enriched at boundaries, convergent CTCF sites mark the interior of TADs, short loops in the 5-100kb range. Good intro about TADs. CTCF orientation is not linked to the direction of transcription. Definition of eight orientation patterns. Supplementary data: Table S1 - CTCF coordinates with directionality, Table S3 - consensus CTCF coordinates. <details>
<summary>Paper</summary>
Nanni, Luca, Stefano Ceri, and Colin Logie. "Spatial patterns of CTCF sites define the anatomy of TADs and their boundaries." Genome biology 21 (2020): 1-25. https://doi.org/10.1186/s13059-020-02108-x
</summary>
## Citation
Below is the citation output from using `citation('CTCF')` in R. Please
run this yourself to check for any updates on how to cite __CTCF__.
```{r 'citation', eval = requireNamespace('CTCF')}
print(citation("CTCF"), bibtex = TRUE)
```
<!--
Please note that the `r BiocStyle::Biocpkg('CTCF')` was only made possible thanks to many other R and
bioinformatics software authors, which are cited either in the vignettes and/or
the paper(s) describing this package.
## Additional references
- Marina-Zárate, Ester, Ana Rodríguez-Ronchel, Manuel J. Gómez, Fátima Sánchez-Cabo, and Almudena R. Ramiro. "Low affinity CTCF binding drives transcriptional regulation whereas high affinity binding encompasses architectural functions." iScience (2023). https://doi.org/10.1016/j.isci.2023.106106
## Development tools
* Continuous code testing is possible thanks to [GitHub actions](https://www.tidyverse.org/blog/2020/04/usethis-1-6-0/) through `r BiocStyle::CRANpkg('usethis')`, `r BiocStyle::CRANpkg('remotes')`, and `r BiocStyle::CRANpkg('rcmdcheck')` customized to use [Bioconductor's docker containers](https://www.bioconductor.org/help/docker/) and `r BiocStyle::Biocpkg('BiocCheck')`.
* Code coverage assessment is possible thanks to [codecov](https://codecov.io/gh) and `r BiocStyle::CRANpkg('covr')`.
* The [documentation website](http://mdozmorov.github.io/CTCF) is automatically updated thanks to `r BiocStyle::CRANpkg('pkgdown')`.
* The code is styled automatically thanks to `r BiocStyle::CRANpkg('styler')`.
* The documentation is formatted thanks to `r BiocStyle::CRANpkg('devtools')` and `r BiocStyle::CRANpkg('roxygen2')`.
For more details, check the `dev` directory.
-->
This package was developed using `r BiocStyle::Biocpkg('biocthis')`.
## Code of Conduct
Please note that the CTCF project is released with a [Contributor Code of Conduct](http://bioconductor.org/about/code-of-conduct/). By contributing to this project, you agree to abide by its terms.