Commit ad391c3

support proteomics
1 parent a76a704 commit ad391c3

9 files changed

+344
-20
lines changed

DESCRIPTION

Lines changed: 10 additions & 6 deletions
@@ -1,16 +1,17 @@
11
Package: ggcoverage
22
Type: Package
3-
Title: Visualize Genome Coverage with Various Annotations
4-
Version: 1.1.0
3+
Title: Visualize Genome/Protein Coverage with Various Annotations
4+
Version: 1.2.0
55
Authors@R:
66
person(given = "Yabing",
77
family = "Song",
88
role = c("aut", "cre"),
99
email = "[email protected]")
1010
Maintainer: Yabing Song <[email protected]>
11-
Description: The goal of 'ggcoverage' is to simplify the process of visualizing genome coverage. It contains functions to
12-
load data from BAM, BigWig or BedGraph files, create genome coverage plot, add various annotations to
13-
the coverage plot, including base and amino acid annotation, GC annotation, gene annotation, transcript annotation, ideogram annotation and peak annotation.
11+
Description: The goal of 'ggcoverage' is to simplify the process of visualizing genome/protein coverage. It contains functions to
12+
load data from BAM, BigWig, BedGraph or txt/xlsx files, create genome/protein coverage plot, add various annotations to
13+
the coverage plot, including base and amino acid annotation, GC annotation, gene annotation, transcript annotation, ideogram annotation,
14+
peak annotation, contact map annotation, link annotation and protein feature annotation.
1415
License: MIT + file LICENSE
1516
Encoding: UTF-8
1617
RoxygenNote: 7.1.1
@@ -45,7 +46,10 @@ Imports:
4546
ggforce,
4647
HiCBricks,
4748
ggpattern,
48-
BiocParallel
49+
BiocParallel,
50+
openxlsx,
51+
stringr,
52+
ggpp
4953
Suggests:
5054
rmarkdown,
5155
knitr,

NEWS.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
1+
# ggcoverage 1.2.0
2+
## Major changes
3+
* Support protein coverage and annotation plots (`ggprotein`, `geom_protein`).
4+
5+
-------------
6+
17
# ggcoverage 1.1.0
28
## Major changes
39
* Mark SNV with twill (add twill to position with SNV), star (add star mark to position with SNV), and highlight (position without SNV is grey).

R/geom_protein.R

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
#' Layer for Protein Coverage Plot.
2+
#'
3+
#' @param coverage.file Exported protein coverage file, should be an Excel (xlsx) file.
4+
#' @param fasta.file Input reference protein fasta file.
5+
#' @param protein.id The protein ID of exported coverage file. This should be unique and in \code{fasta.file}.
6+
#' @param XCorr.threshold The cross-correlation threshold. Default: 2.
7+
#' @param confidence The confidence level. Default: High.
8+
#' @param contaminant Whether to remove contaminant peptides. Default: NULL (not remove).
9+
#' @param remove.na Logical value, whether to remove NA value in Abundance column. Default: TRUE.
10+
#' @param color The fill color of coverage plot. Default: grey.
11+
#' @param mark.bare Logical value, whether to mark region where Abundance is zero or NA. Default: TRUE.
12+
#' @param mark.color The color used for the marked region. Default: red.
13+
#' @param mark.alpha The transparency used for the marked region. Default: 0.5.
14+
#' @param show.table Logical value, whether to show coverage summary table. Default: TRUE.
15+
#' @param table.position The position of the coverage summary table, choose from right_top, left_top, left_bottom, right_bottom.
16+
#' Default: right_top.
17+
#' @param table.size The font size of coverage summary table. Default: 4.
18+
#' @param table.color The font color of coverage summary table. Default: black.
19+
#' @param range.size The label size of range text, used when \code{range.position} is in. Default: 3.
20+
#' @param range.position The position of y axis range, chosen from in (move y axis in the plot) and
21+
#' out (normal y axis). Default: in.
22+
#'
23+
#' @return A ggplot2 object.
24+
#' @importFrom openxlsx read.xlsx
25+
#' @importFrom magrittr %>%
26+
#' @importFrom dplyr filter group_by summarise arrange
27+
#' @importFrom rlang .data
28+
#' @importFrom Biostrings readAAStringSet
29+
#' @importFrom stringr str_locate
30+
#' @importFrom GenomicRanges reduce GRanges setdiff
31+
#' @importFrom IRanges IRanges
32+
#' @importFrom ggplot2 ggplot geom_rect geom_text aes aes_string scale_x_continuous
33+
#' @importFrom ggpp annotate
34+
#' @importFrom scales scientific
35+
#' @export
36+
#'
37+
#' @examples
38+
#' # library(ggplot2)
39+
#' # library(ggcoverage)
40+
#' # coverage.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage")
41+
#' # fasta.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage")
42+
#' # protein.id = "sp|
43+
#' # ggplot() +
44+
#' # geom_protein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id)
45+
geom_protein = function(coverage.file, fasta.file, protein.id, XCorr.threshold = 2,
46+
confidence = "High", contaminant = NULL, remove.na = TRUE,
47+
color = "grey", mark.bare = TRUE, mark.color = "red", mark.alpha = 0.5,
48+
show.table = TRUE, table.position = c("right_top", "left_top", "left_bottom", "right_bottom"),
49+
table.size = 4, table.color = "black", range.size = 3, range.position = c("in", "out")){
50+
# check parameters
51+
table.position <- match.arg(arg = table.position)
52+
range.position <- match.arg(arg = range.position)
53+
54+
# load coverage dataframe
55+
coverage.df = openxlsx::read.xlsx(coverage.file)
56+
# remove suffix and prefix string
57+
coverage.df$Annotated.Sequence = gsub(pattern = ".*\\.(.*)\\..*", replacement = "\\1", x = coverage.df$Annotated.Sequence)
58+
# filter coverage according to confidence
59+
if(!is.null(confidence)){
60+
coverage.df = coverage.df[coverage.df[, "Confidence"] == confidence, ]
61+
}
62+
# filter coverage according to contaminant
63+
if(!is.null(contaminant)){
64+
coverage.df = coverage.df[coverage.df[, "Contaminant"] == contaminant, ]
65+
}
66+
# filter coverage according to cross-correlation
67+
if(!is.null(XCorr.threshold)){
68+
xcorr.index = grep(pattern = "XCorr", x = colnames(coverage.df))
69+
coverage.df = coverage.df[coverage.df[, xcorr.index] >= XCorr.threshold, ]
70+
}
71+
# get abundance cols
72+
abundance.col = grep(pattern = "Abundance", x = colnames(coverage.df), value = TRUE)
73+
# remove na abundance
74+
if(remove.na){
75+
coverage.df = coverage.df %>% dplyr::filter(!is.na(.data[[abundance.col]]))
76+
}
77+
# sum abundance of duplicated Annotated.Sequence
78+
coverage.df = coverage.df %>%
79+
dplyr::group_by(.data[["Annotated.Sequence"]]) %>%
80+
dplyr::summarise(Abundance = sum(.data[[abundance.col]])) %>%
81+
as.data.frame()
82+
colnames(coverage.df) = c("peptide", "abundance")
83+
# check the coverage dataframe
84+
if(nrow(coverage.df) == 0){
85+
stop("There is no valid peptide, please check!")
86+
}
87+
88+
# load protein fasta
89+
aa.set = Biostrings::readAAStringSet(fasta.file)
90+
protein.index = which(names(aa.set) == protein.id)
91+
if(length(protein.index) == 1){
92+
aa.set.used = aa.set[protein.index]
93+
aa.seq.used = paste(aa.set.used)
94+
}else if(length(protein.index) > 1){
95+
stop("Please check the protein.id you provided, there is more than one in provided fasta file!")
96+
}else{
97+
stop("Please check the protein.id you provided, it can't be found in provided fasta file!")
98+
}
99+
100+
# get the region
101+
aa.anno.region = sapply(coverage.df$peptide, function(x){
102+
stringr::str_locate(pattern =x, aa.seq.used)
103+
}) %>% t() %>% as.data.frame()
104+
colnames(aa.anno.region) = c("start", "end")
105+
106+
# merge
107+
coverage.final = merge(coverage.df, aa.anno.region, by.x = "peptide", by.y = 0, all.x = TRUE)
108+
coverage.final = coverage.final %>% dplyr::arrange(.data[["start"]], .data[["end"]])
109+
110+
# get coverage positions
111+
coverage.pos =
112+
GenomicRanges::reduce(GenomicRanges::GRanges(protein.id, IRanges::IRanges(coverage.final$start, coverage.final$end))) %>%
113+
as.data.frame()
114+
coverage.pos$strand = NULL
115+
colnames(coverage.pos) = c("ProteinID", "start", "end", "width")
116+
coverage.pos$Type = "covered"
117+
# get coverage rate
118+
coverage.rate = round(sum(coverage.pos$width)*100/nchar(aa.seq.used), 2)
119+
# non-cover position
120+
non.coverage.pos =
121+
GenomicRanges::setdiff(GenomicRanges::GRanges(protein.id, IRanges::IRanges(1, nchar(aa.seq.used))),
122+
GenomicRanges::GRanges(protein.id, IRanges::IRanges(coverage.final$start, coverage.final$end))) %>%
123+
as.data.frame()
124+
non.coverage.pos$strand = NULL
125+
colnames(non.coverage.pos) = c("ProteinID", "start", "end", "width")
126+
non.coverage.pos$Type = "bare"
127+
# coverage summary
128+
coverage.summary = rbind(coverage.pos, non.coverage.pos) %>% as.data.frame()
129+
130+
# coverage rect
131+
coverage.rect = geom_rect(data = coverage.final, mapping = aes_string(xmin = "start", xmax = "end",
132+
ymin = "0", ymax = "abundance"),
133+
show.legend = FALSE, fill = color)
134+
plot.ele <- list(coverage.rect)
135+
# mark bare
136+
if(mark.bare){
137+
bare.rect = geom_rect(data = non.coverage.pos, mapping = aes_string(xmin = "start", xmax = "end",
138+
ymin = "0", ymax = "Inf"),
139+
show.legend = FALSE, fill = mark.color, alpha = mark.alpha)
140+
plot.ele <- append(plot.ele, bare.rect)
141+
}
142+
# summary table
143+
if(show.table){
144+
# table position
145+
if(table.position == "left_top"){
146+
table.x = 0
147+
table.y = max(coverage.final[ , "abundance"])
148+
}else if(table.position == "right_top"){
149+
table.x = nchar(aa.seq.used)
150+
table.y = max(coverage.final[ , "abundance"])
151+
}else if(table.position == "left_bottom"){
152+
table.x = 0
153+
table.y = 0
154+
}else if(table.position == "right_bottom"){
155+
table.x = nchar(aa.seq.used)
156+
table.y = 0
157+
}
158+
summary.table = ggpp::annotate(geom = "table", label = list(coverage.summary), x= table.x, y=table.y,
159+
color = table.color, size = table.size)
160+
plot.ele <- append(plot.ele, summary.table)
161+
}
162+
# range position
163+
if (range.position == "in") {
164+
# prepare range
165+
max.abundance = CeilingNumber(max(coverage.final$abundance))
166+
abundance.range = data.frame(label = paste0("[0, ", scales::scientific(max.abundance, digits = 2), "]"))
167+
range.text = geom_text(
168+
data = abundance.range,
169+
mapping = aes(x = -Inf, y = Inf, label = label),
170+
hjust = 0,
171+
vjust = 1.5,
172+
size = range.size
173+
)
174+
plot.ele <- append(plot.ele, range.text)
175+
}
176+
# change x scale
177+
plot.ele <- append(plot.ele, scale_x_continuous(limits = c(1, nchar(aa.seq.used)), expand = c(0, 0)))
178+
return(plot.ele)
179+
}
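
The coverage bookkeeping in `geom_protein` (locate each peptide in the reference sequence, merge the hits, and report the covered percentage plus the bare regions) can be illustrated with a small base-R sketch. This is not the package's implementation: it swaps the `GenomicRanges::reduce` / `setdiff` calls for a logical mask and `rle()`, and the sequence and peptides are made-up toy data.

```r
# Toy reference sequence and peptides (made-up data, not shipped with the package).
protein_seq <- "MKWVTFISLLLLFSSAYSRGVFRR"
peptides <- c("KWVTF", "TFISL", "SRGVF", "XXXX")  # the last peptide does not occur

# Locate each peptide as a literal string; regexpr() returns -1 when there is no match.
starts <- vapply(
  peptides,
  function(p) as.integer(regexpr(p, protein_seq, fixed = TRUE)[1]),
  integer(1)
)
ends <- starts + nchar(peptides) - 1L
found <- starts > 0

# Mark every residue that is covered by at least one located peptide.
covered <- logical(nchar(protein_seq))
for (i in which(found)) {
  covered[starts[i]:ends[i]] <- TRUE
}

# Collapse the mask into runs: overlapping peptides end up in one "covered" run,
# and the gaps between them become "bare" runs.
runs <- rle(covered)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1
coverage_summary <- data.frame(
  start = run_start,
  end = run_end,
  width = runs$lengths,
  Type = ifelse(runs$values, "covered", "bare")
)

# Percentage of residues covered, rounded as in geom_protein.
coverage_rate <- round(sum(covered) * 100 / length(covered), 2)
print(coverage_summary)
print(coverage_rate)  # 54.17
```

Unlike this sketch, the function passes the located positions straight to `IRanges::IRanges`, so a peptide that cannot be found in the sequence (an `NA` from `stringr::str_locate`) is worth checking for before plotting.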

R/ggprotein.R

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
#' Create Mass Spectrometry Protein Coverage Plot.
2+
#'
3+
#' @param coverage.file Exported protein coverage file, should be an Excel (xlsx) file.
4+
#' @param fasta.file Input reference protein fasta file.
5+
#' @param protein.id The protein ID of exported coverage file. This should be unique and in \code{fasta.file}.
6+
#' @param XCorr.threshold The cross-correlation threshold. Default: 2.
7+
#' @param confidence The confidence level. Default: High.
8+
#' @param contaminant Whether to remove contaminant peptides. Default: NULL (not remove).
9+
#' @param remove.na Logical value, whether to remove NA value in Abundance column. Default: TRUE.
10+
#' @param color The fill color of coverage plot. Default: grey.
11+
#' @param mark.bare Logical value, whether to mark region where Abundance is zero or NA. Default: TRUE.
12+
#' @param mark.color The color used for the marked region. Default: red.
13+
#' @param mark.alpha The transparency used for the marked region. Default: 0.5.
14+
#' @param show.table Logical value, whether to show coverage summary table. Default: TRUE.
15+
#' @param table.position The position of the coverage summary table, choose from right_top, left_top, left_bottom, right_bottom.
16+
#' Default: right_top.
17+
#' @param table.size The font size of coverage summary table. Default: 4.
18+
#' @param table.color The font color of coverage summary table. Default: black.
19+
#' @param range.size The label size of range text, used when \code{range.position} is in. Default: 3.
20+
#' @param range.position The position of y axis range, chosen from in (move y axis in the plot) and
21+
#' out (normal y axis). Default: in.
22+
#'
23+
#' @return A ggplot2 object.
24+
#' @importFrom openxlsx read.xlsx
25+
#' @importFrom magrittr %>%
26+
#' @importFrom dplyr filter group_by summarise arrange
27+
#' @importFrom rlang .data
28+
#' @importFrom Biostrings readAAStringSet
29+
#' @importFrom stringr str_locate
30+
#' @importFrom GenomicRanges reduce GRanges setdiff
31+
#' @importFrom IRanges IRanges
32+
#' @importFrom ggplot2 ggplot geom_rect geom_text aes aes_string scale_x_continuous theme_classic theme
33+
#' element_blank annotate rel scale_y_continuous expansion
34+
#' @importFrom ggpp annotate
35+
#' @importFrom scales scientific
36+
#' @export
37+
#'
38+
#' @examples
39+
#' # library(ggcoverage)
40+
#' # coverage.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage")
41+
#' # fasta.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage")
42+
#' # protein.id = "sp|P02769|ALBU_BOVIN"
43+
#' # ggprotein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id)
44+
ggprotein = function(coverage.file, fasta.file, protein.id, XCorr.threshold = 2,
45+
confidence = "High", contaminant = NULL, remove.na = TRUE,
46+
color = "grey", mark.bare = TRUE, mark.color = "red", mark.alpha = 0.5,
47+
show.table = TRUE, table.position = c("right_top", "left_top", "left_bottom", "right_bottom"),
48+
table.size = 4, table.color = "black", range.size = 3, range.position = c("in", "out"), plot.space = 0.2){
49+
# check parameters
50+
table.position <- match.arg(arg = table.position)
51+
range.position <- match.arg(arg = range.position)
52+
53+
# ms protein plot
54+
protein.plot = ggplot() +
55+
geom_protein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id,
56+
XCorr.threshold = XCorr.threshold, confidence = confidence, contaminant = contaminant,
57+
remove.na = remove.na, color = color, mark.bare = mark.bare, mark.color = mark.color,
58+
mark.alpha = mark.alpha, show.table = show.table, table.position = table.position,
59+
table.size = table.size, table.color = table.color, range.size = range.size, range.position = range.position)
60+
61+
# add theme
62+
if (range.position == "in") {
63+
protein.plot +
64+
theme_protein()
65+
} else if (range.position == "out") {
66+
protein.plot +
67+
theme_protein2()
68+
}
69+
}
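
`ggprotein` forwards `XCorr.threshold`, `confidence` and `remove.na` to `geom_protein`, which uses them to filter the peptide table before summing abundances per peptide. The sketch below walks through that filtering on a toy data frame. The column names (`Confidence`, `XCorr`, `Abundance`) and all values are assumptions for illustration, and base-R `aggregate()` stands in for the `dplyr::group_by` / `summarise` pipeline used in the function.

```r
# Toy peptide table; column names and values are assumptions, not real export data.
peptide_table <- data.frame(
  Annotated.Sequence = c("[K].LVNELTEFAK.[T]", "[K].LVNELTEFAK.[T]",
                         "[R].YLYEIAR.[R]", "[K].AEFVEVTK.[L]"),
  Confidence = c("High", "High", "Low", "High"),
  XCorr = c(3.1, 2.4, 4.0, 1.2),
  Abundance = c(1e6, 5e5, 2e6, NA),
  stringsAsFactors = FALSE
)

# Strip the flanking residues, keeping only the text between the two dots.
peptide_table$Annotated.Sequence <- gsub(
  pattern = ".*\\.(.*)\\..*", replacement = "\\1",
  x = peptide_table$Annotated.Sequence
)

# Apply the three filters with the function's default settings.
keep <- peptide_table$Confidence == "High" &
  peptide_table$XCorr >= 2 &
  !is.na(peptide_table$Abundance)
filtered <- peptide_table[keep, ]

# Sum the abundance of duplicated peptide sequences.
summed <- aggregate(Abundance ~ Annotated.Sequence, data = filtered, FUN = sum)
colnames(summed) <- c("peptide", "abundance")
print(summed)
```

With these toy values only the two `LVNELTEFAK` rows pass all filters, so the result is a single peptide whose abundance is the sum of both rows.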

R/theme_ggcoverage.R

Lines changed: 54 additions & 2 deletions
@@ -5,7 +5,7 @@
55
#'
66
#' @return List of layers.
77
#' @importFrom ggplot2 theme_classic theme unit element_blank annotate rel scale_y_continuous expansion
8-
#' scale_x_continuous coord_cartesian
8+
#' scale_x_continuous
99
#' @importFrom scales comma
1010
#' @export
1111
#'
@@ -32,7 +32,7 @@ theme_coverage <- function(space = 0.2) {
3232
#'
3333
#' @return List of layers.
3434
#' @importFrom ggplot2 scale_y_continuous expansion theme_classic theme unit element_blank annotate rel
35-
#' scale_x_continuous coord_cartesian
35+
#' scale_x_continuous
3636
#' @importFrom scales comma
3737
#' @export
3838
#'
@@ -471,3 +471,55 @@ theme_cnv <- function(x.range, margin.len) {
471471
coord_cartesian(xlim = x.range)
472472
)
473473
}
474+
475+
# theme for ggprotein: used when range.position is "in"
476+
#' Theme for geom_protein.
477+
#'
478+
#' @return List of layers.
479+
#' @importFrom ggplot2 theme_classic theme unit element_blank annotate rel scale_y_continuous expansion
480+
#' scale_x_continuous
481+
#' @importFrom scales comma
482+
#' @export
483+
#'
484+
theme_protein <- function() {
485+
list(
486+
theme_classic(),
487+
theme(
488+
axis.line.y = element_blank(),
489+
axis.ticks.y = element_blank(),
490+
axis.text.y = element_blank(),
491+
axis.title = element_blank()
492+
),
493+
annotate("segment", x = -Inf, xend = Inf, y = -Inf, yend = -Inf, size = rel(1)),
494+
scale_y_continuous(expand = expansion(mult = c(0)))
495+
)
496+
}
497+
498+
# theme for ggprotein: used when range.position is "out"
499+
#' Theme for geom_protein.
500+
#'
501+
#' @return List of layers.
502+
#' @importFrom ggplot2 scale_y_continuous expansion theme_classic theme unit element_blank annotate rel
503+
#' scale_x_continuous
504+
#' @importFrom scales comma scientific
505+
#' @export
506+
#'
507+
theme_protein2 <- function() {
508+
list(
509+
scale_y_continuous(
510+
limits = ~ c(0, CeilingNumber(max(.x))),
511+
breaks = ~ .x[2],
512+
expand = expansion(mult = c(0)),
513+
labels = function(x) format(x, scientific = TRUE, digits = 2)
514+
),
515+
theme_classic(),
516+
theme(
517+
axis.title = element_blank()
518+
),
519+
annotate("segment", x = -Inf, xend = Inf, y = -Inf, yend = -Inf, size = rel(1))
520+
)
521+
}
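
The y scale in `theme_protein2` derives its limits, its single break and its label from the data maximum. A minimal sketch of that arithmetic follows. `ceiling_sig` is a hypothetical stand-in for the package's `CeilingNumber` helper, whose body is not part of this diff, so its rounding rule (round up to two significant digits) is an assumption.

```r
# Hypothetical stand-in for CeilingNumber(): round x up to `digits` significant digits.
ceiling_sig <- function(x, digits = 2) {
  if (x == 0) return(0)
  magnitude <- 10^(floor(log10(abs(x))) - digits + 1)
  ceiling(x / magnitude) * magnitude
}

# Made-up abundance values standing in for the plotted data.
abundance <- c(0, 45210, 123456)

# Continuous scale limits are a length-two vector: lower bound 0, rounded-up maximum.
y_limits <- c(0, ceiling_sig(max(abundance)))

# The only break sits at the upper limit and is labelled in scientific notation.
y_break <- y_limits[2]
y_label <- format(y_break, scientific = TRUE, digits = 2)

print(y_limits)
print(y_label)  # "1.3e+05"
```

Keeping the limits vector at exactly two elements matters here, since any extra argument placed inside `c()` becomes a third element rather than an option of the rounding helper.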
522+
523+
524+
525+

README.Rmd

Lines changed: 5 additions & 4 deletions
@@ -14,7 +14,7 @@ knitr::opts_chunk$set(
1414
)
1515
```
1616

17-
# ggcoverage - Visualize and annotate genome coverage with ggplot2
17+
# ggcoverage - Visualize and annotate omics coverage with ggplot2
1818

1919
<img src = "man/figures/ggcoverage.png" align = "right" width = "200"/>
2020

@@ -23,10 +23,10 @@ knitr::opts_chunk$set(
2323
[![CODE_SIZE](https://img.shields.io/github/languages/code-size/showteeth/ggcoverage.svg)](https://github.com/showteeth/ggcoverage)
2424

2525
## Introduction
26-
The goal of `ggcoverage` is simplify the process of visualizing genome coverage. It contains three main parts:
26+
The goal of `ggcoverage` is to simplify the process of visualizing omics coverage. It contains three main parts:
2727

28-
* **Load the data**: `ggcoverage` can load BAM, BigWig (.bw), BedGraph files from various NGS data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, et al.
29-
* **Create genome coverage plot**
28+
* **Load the data**: `ggcoverage` can load BAM, BigWig (.bw), BedGraph, and txt/xlsx files from various omics data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, proteomics, etc.
29+
* **Create omics coverage plot**
3030
* **Add annotations**: `ggcoverage` supports six different annotations:
3131
* **base and amino acid annotation**: Visualize genome coverage at single-nucleotide level with bases and amino acids.
3232
* **GC annotation**: Visualize genome coverage with GC content
@@ -37,6 +37,7 @@ knitr::opts_chunk$set(
3737
* **peak annotation**: Visualize genome coverage and peak identified
3838
* **contact map annotation**: Visualize genome coverage with Hi-C contact map
3939
* **link annotation**: Visualize genome coverage with contacts
40+
* **protein feature annotation**: Visualize protein coverage with features
4041

4142
`ggcoverage` utilizes `ggplot2` plotting system, so its usage is **ggplot2-style**!
4243
