support proteomics
showteeth committed May 24, 2023
1 parent a76a704 commit ad391c3
Showing 9 changed files with 344 additions and 20 deletions.
16 changes: 10 additions & 6 deletions DESCRIPTION
@@ -1,16 +1,17 @@
Package: ggcoverage
Type: Package
Title: Visualize Genome Coverage with Various Annotations
Version: 1.1.0
Title: Visualize Genome/Protein Coverage with Various Annotations
Version: 1.2.0
Authors@R:
person(given = "Yabing",
family = "Song",
role = c("aut", "cre"),
email = "[email protected]")
Maintainer: Yabing Song <[email protected]>
Description: The goal of 'ggcoverage' is to simplify the process of visualizing genome coverage. It contains functions to
load data from BAM, BigWig or BedGraph files, create genome coverage plot, add various annotations to
the coverage plot, including base and amino acid annotation, GC annotation, gene annotation, transcript annotation, ideogram annotation and peak annotation.
Description: The goal of 'ggcoverage' is to simplify the process of visualizing genome/protein coverage. It contains functions to
load data from BAM, BigWig, BedGraph or txt/xlsx files, create genome/protein coverage plot, add various annotations to
the coverage plot, including base and amino acid annotation, GC annotation, gene annotation, transcript annotation, ideogram annotation,
peak annotation, contact map annotation, link annotation and protein feature annotation.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.1.1
@@ -45,7 +46,10 @@ Imports:
ggforce,
HiCBricks,
ggpattern,
BiocParallel
BiocParallel,
openxlsx,
stringr,
ggpp
Suggests:
rmarkdown,
knitr,
6 changes: 6 additions & 0 deletions NEWS.md
@@ -1,3 +1,9 @@
# ggcoverage 1.2.0
## Major changes
* Support protein coverage and annotation plots (`ggprotein`, `geom_protein`).

-------------

# ggcoverage 1.1.0
## Major changes
* Mark SNV with twill (add twill to position with SNV), star (add star mark to position with SNV), and highlight (position without SNV is grey).
179 changes: 179 additions & 0 deletions R/geom_protein.R
@@ -0,0 +1,179 @@
#' Layer for Protein Coverage Plot.
#'
#' @param coverage.file Exported protein coverage file, should be in Excel (xlsx) format.
#' @param fasta.file Input reference protein fasta file.
#' @param protein.id The protein ID of exported coverage file. This should be unique and in \code{fasta.file}.
#' @param XCorr.threshold The cross-correlation threshold. Default: 2.
#' @param confidence The confidence level. Default: High.
#' @param contaminant Whether to remove contaminant peptides. Default: NULL (not remove).
#' @param remove.na Logical value, whether to remove NA value in Abundance column. Default: TRUE.
#' @param color The fill color of coverage plot. Default: grey.
#' @param mark.bare Logical value, whether to mark region where Abundance is zero or NA. Default: TRUE.
#' @param mark.color The color used for the marked region. Default: red.
#' @param mark.alpha The transparency used for the marked region. Default: 0.5.
#' @param show.table Logical value, whether to show coverage summary table. Default: TRUE.
#' @param table.position The position of the coverage summary table, choose from right_top, left_top, left_bottom, right_bottom.
#' Default: right_top.
#' @param table.size The font size of coverage summary table. Default: 4.
#' @param table.color The font color of coverage summary table. Default: black.
#' @param range.size The label size of range text, used when \code{range.position} is in. Default: 3.
#' @param range.position The position of y axis range, chosen from in (move y axis in the plot) and
#' out (normal y axis). Default: in.
#'
#' @return A list of ggplot2 layers.
#' @importFrom openxlsx read.xlsx
#' @importFrom magrittr %>%
#' @importFrom dplyr filter group_by summarise arrange
#' @importFrom rlang .data
#' @importFrom Biostrings readAAStringSet
#' @importFrom stringr str_locate
#' @importFrom GenomicRanges reduce GRanges setdiff
#' @importFrom IRanges IRanges
#' @importFrom ggplot2 ggplot geom_rect geom_text aes aes_string scale_x_continuous
#' @importFrom ggpp annotate
#' @importFrom scales scientific
#' @export
#'
#' @examples
#' # library(ggplot2)
#' # library(ggcoverage)
#' # coverage.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage")
#' # fasta.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage")
#' # protein.id = "sp|
#' # ggplot() +
#' # geom_protein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id)
geom_protein = function(coverage.file, fasta.file, protein.id, XCorr.threshold = 2,
confidence = "High", contaminant = NULL, remove.na = TRUE,
color = "grey", mark.bare = TRUE, mark.color = "red", mark.alpha = 0.5,
show.table = TRUE, table.position = c("right_top", "left_top", "left_bottom", "right_bottom"),
table.size = 4, table.color = "black", range.size = 3, range.position = c("in", "out")){
# check parameters
table.position <- match.arg(arg = table.position)
range.position <- match.arg(arg = range.position)

# load coverage dataframe
coverage.df = openxlsx::read.xlsx(coverage.file)
# remove suffix and prefix string
coverage.df$Annotated.Sequence = gsub(pattern = ".*\\.(.*)\\..*", replacement = "\\1", x = coverage.df$Annotated.Sequence)
# filter coverage according to confidence
if(!is.null(confidence)){
coverage.df = coverage.df[coverage.df[, "Confidence"] == confidence, ]
}
# filter coverage according to contaminant
if(!is.null(contaminant)){
coverage.df = coverage.df[coverage.df[, "Contaminant"] == contaminant, ]
}
# filter coverage according to cross-correlation
if(!is.null(XCorr.threshold)){
xcorr.index = grep(pattern = "XCorr", x = colnames(coverage.df))
coverage.df = coverage.df[coverage.df[, xcorr.index] >= XCorr.threshold, ]
}
# get abundance cols
abundance.col = grep(pattern = "Abundance", x = colnames(coverage.df), value = TRUE)
# remove na abundance
if(remove.na){
coverage.df = coverage.df %>% dplyr::filter(!is.na(.data[[abundance.col]]))
}
# sum abundance of duplicated Annotated.Sequence
coverage.df = coverage.df %>%
dplyr::group_by(.data[["Annotated.Sequence"]]) %>%
dplyr::summarise(Abundance = sum(.data[[abundance.col]])) %>%
as.data.frame()
colnames(coverage.df) = c("peptide", "abundance")
# check the coverage dataframe
if(nrow(coverage.df) == 0){
stop("There is no valid peptide, please check!")
}

# load protein fasta
aa.set = Biostrings::readAAStringSet(fasta.file)
protein.index = which(names(aa.set) == protein.id)
if(length(protein.index) == 1){
aa.set.used = aa.set[protein.index]
aa.seq.used = paste(aa.set.used)
}else if(length(protein.index) > 1){
stop("Please check the protein.id you provided, there is more than one in provided fasta file!")
}else{
stop("Please check the protein.id you provided, it can't be found in provided fasta file!")
}

# get the region
aa.anno.region = sapply(coverage.df$peptide, function(x){
stringr::str_locate(pattern =x, aa.seq.used)
}) %>% t() %>% as.data.frame()
colnames(aa.anno.region) = c("start", "end")

# merge
coverage.final = merge(coverage.df, aa.anno.region, by.x = "peptide", by.y = 0, all.x = TRUE)
coverage.final = coverage.final %>% dplyr::arrange(.data[["start"]], .data[["end"]])

# get coverage positions
coverage.pos =
GenomicRanges::reduce(GenomicRanges::GRanges(protein.id, IRanges::IRanges(coverage.final$start, coverage.final$end))) %>%
as.data.frame()
coverage.pos$strand = NULL
colnames(coverage.pos) = c("ProteinID", "start", "end", "width")
coverage.pos$Type = "covered"
# get coverage rate
coverage.rate = round(sum(coverage.pos$width)*100/nchar(aa.seq.used), 2)
# non-cover position
non.coverage.pos =
GenomicRanges::setdiff(GenomicRanges::GRanges(protein.id, IRanges::IRanges(1, nchar(aa.seq.used))),
GenomicRanges::GRanges(protein.id, IRanges::IRanges(coverage.final$start, coverage.final$end))) %>%
as.data.frame()
non.coverage.pos$strand = NULL
colnames(non.coverage.pos) = c("ProteinID", "start", "end", "width")
non.coverage.pos$Type = "bare"
# coverage summary
coverage.summary = rbind(coverage.pos, non.coverage.pos) %>% as.data.frame()

# coverage rect
coverage.rect = geom_rect(data = coverage.final, mapping = aes_string(xmin = "start", xmax = "end",
ymin = "0", ymax = "abundance"),
show.legend = FALSE, fill = color)
plot.ele <- list(coverage.rect)
# mark bare
if(mark.bare){
bare.rect = geom_rect(data = non.coverage.pos, mapping = aes_string(xmin = "start", xmax = "end",
ymin = "0", ymax = "Inf"),
show.legend = FALSE, fill = mark.color, alpha = mark.alpha)
plot.ele <- append(plot.ele, bare.rect)
}
# summary table
if(show.table){
# table position
if(table.position == "left_top"){
table.x = 0
table.y = max(coverage.final[ , "abundance"])
}else if(table.position == "right_top"){
table.x = nchar(aa.seq.used)
table.y = max(coverage.final[ , "abundance"])
}else if(table.position == "left_bottom"){
table.x = 0
table.y = 0
}else if(table.position == "right_bottom"){
table.x = nchar(aa.seq.used)
table.y = 0
}
summary.table = ggpp::annotate(geom = "table", label = list(coverage.summary), x= table.x, y=table.y,
color = table.color, size = table.size)
plot.ele <- append(plot.ele, summary.table)
}
# range position
if (range.position == "in") {
# prepare range
max.abundance = CeilingNumber(max(coverage.final$abundance))
abundance.range = data.frame(label = paste0("[0, ", scales::scientific(max.abundance, digits = 2), "]"))
range.text = geom_text(
data = abundance.range,
mapping = aes(x = -Inf, y = Inf, label = label),
hjust = 0,
vjust = 1.5,
size = range.size
)
plot.ele <- append(plot.ele, range.text)
}
# change x scale
plot.ele <- append(plot.ele, scale_x_continuous(limits = c(1, nchar(aa.seq.used)), expand = c(0, 0)))
return(plot.ele)
}
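
The core of `geom_protein` is: strip the flanking residues from `Annotated.Sequence`, locate each peptide in the reference sequence, and report the share of residues covered. The block below is a minimal base-R sketch of those steps, not the package's implementation: it swaps `stringr::str_locate` and `GenomicRanges::reduce` for `regexpr` and a logical mask, and the toy sequence, the peptides, and the `[M].PEPTIDE.[F]` input style are assumptions made up for illustration.

```r
# Toy inputs (hypothetical; not taken from the package's example data).
aa.seq <- "MKWVTFISLLLLFSSAYSRG"
annotated <- c("[M].KWVT.[F]", "[W].VTFIS.[L]", "[S].SAYS.[R]")

# Step 1: keep only the text between the two dots (same regex as geom_protein).
peptides <- gsub(pattern = ".*\\.(.*)\\..*", replacement = "\\1", x = annotated)

# Step 2: first match position of each peptide; regexpr returns -1 when absent.
starts <- vapply(peptides, function(p) as.integer(regexpr(p, aa.seq, fixed = TRUE)),
                 integer(1), USE.NAMES = FALSE)
stopifnot(all(starts > 0))
ends <- starts + nchar(peptides) - 1L

# Step 3: mark covered residues; overlapping peptides are counted once.
covered <- logical(nchar(aa.seq))
for (i in seq_along(starts)) covered[starts[i]:ends[i]] <- TRUE
coverage.rate <- round(sum(covered) * 100 / nchar(aa.seq), 2)

# Step 4: runs of FALSE are the "bare" regions that mark.bare would highlight.
runs <- rle(covered)
run.end <- cumsum(runs$lengths)
run.start <- run.end - runs$lengths + 1L
bare <- data.frame(start = run.start[!runs$values], end = run.end[!runs$values])

print(coverage.rate)
print(bare)
```

For this toy input, 11 of the 20 residues are covered, so `coverage.rate` is 55 and the bare regions are 1-1, 9-14 and 19-20. The explicit `stopifnot` on `starts` is a choice of the sketch: it makes a peptide that is missing from the sequence fail early instead of producing invalid ranges.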
69 changes: 69 additions & 0 deletions R/ggprotein.R
@@ -0,0 +1,69 @@
#' Create Mass Spectrometry Protein Coverage Plot.
#'
#' @param coverage.file Exported protein coverage file, should be in Excel (xlsx) format.
#' @param fasta.file Input reference protein fasta file.
#' @param protein.id The protein ID of exported coverage file. This should be unique and in \code{fasta.file}.
#' @param XCorr.threshold The cross-correlation threshold. Default: 2.
#' @param confidence The confidence level. Default: High.
#' @param contaminant Whether to remove contaminant peptides. Default: NULL (not remove).
#' @param remove.na Logical value, whether to remove NA value in Abundance column. Default: TRUE.
#' @param color The fill color of coverage plot. Default: grey.
#' @param mark.bare Logical value, whether to mark region where Abundance is zero or NA. Default: TRUE.
#' @param mark.color The color used for the marked region. Default: red.
#' @param mark.alpha The transparency used for the marked region. Default: 0.5.
#' @param show.table Logical value, whether to show coverage summary table. Default: TRUE.
#' @param table.position The position of the coverage summary table, choose from right_top, left_top, left_bottom, right_bottom.
#' Default: right_top.
#' @param table.size The font size of coverage summary table. Default: 4.
#' @param table.color The font color of coverage summary table. Default: black.
#' @param range.size The label size of range text, used when \code{range.position} is in. Default: 3.
#' @param range.position The position of y axis range, chosen from in (move y axis in the plot) and
#' out (normal y axis). Default: in.
#'
#' @return A ggplot2 object.
#' @importFrom openxlsx read.xlsx
#' @importFrom magrittr %>%
#' @importFrom dplyr filter group_by summarise arrange
#' @importFrom rlang .data
#' @importFrom Biostrings readAAStringSet
#' @importFrom stringr str_locate
#' @importFrom GenomicRanges reduce GRanges setdiff
#' @importFrom IRanges IRanges
#' @importFrom ggplot2 ggplot geom_rect geom_text aes aes_string scale_x_continuous theme_classic theme
#' element_blank annotate rel scale_y_continuous expansion
#' @importFrom ggpp annotate
#' @importFrom scales scientific
#' @export
#'
#' @examples
#' # library(ggcoverage)
#' # coverage.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage")
#' # fasta.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage")
#' # protein.id = "sp|P02769|ALBU_BOVIN"
#' # ggprotein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id)
ggprotein = function(coverage.file, fasta.file, protein.id, XCorr.threshold = 2,
confidence = "High", contaminant = NULL, remove.na = TRUE,
color = "grey", mark.bare = TRUE, mark.color = "red", mark.alpha = 0.5,
show.table = TRUE, table.position = c("right_top", "left_top", "left_bottom", "right_bottom"),
table.size = 4, table.color = "black", range.size = 3, range.position = c("in", "out"), plot.space = 0.2){
# check parameters
table.position <- match.arg(arg = table.position)
range.position <- match.arg(arg = range.position)

# ms protein plot
protein.plot = ggplot() +
geom_protein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id,
XCorr.threshold = XCorr.threshold, confidence = confidence, contaminant = contaminant,
remove.na = remove.na, color = color, mark.bare = mark.bare, mark.color = mark.color,
mark.alpha = mark.alpha, show.table = show.table, table.position = table.position,
table.size = table.size, table.color = table.color, range.size = range.size, range.position = range.position)

# add theme
if (range.position == "in") {
protein.plot +
theme_protein()
} else if (range.position == "out") {
protein.plot +
theme_protein2()
}
}
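
Both `geom_protein` and `ggprotein` declare `table.position` and `range.position` with a vector of choices as the default and then call `match.arg`. A small sketch of that base-R idiom follows; `pick_position` is a hypothetical helper written only for this illustration and is not part of the package.

```r
# Hypothetical helper mirroring the signature style used above.
pick_position <- function(table.position = c("right_top", "left_top",
                                             "left_bottom", "right_bottom")) {
  match.arg(arg = table.position)
}

default.pos <- pick_position()               # no argument: first choice is used
chosen.pos <- pick_position("left_bottom")   # a valid choice is returned as-is
# A value outside the choices raises an error, caught here for demonstration.
invalid <- tryCatch(pick_position("center"), error = function(e) "error")

print(c(default.pos, chosen.pos, invalid))
```

This is why calling `ggprotein` without `table.position` yields `right_top`, and why an unsupported position fails before any file is read.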
56 changes: 54 additions & 2 deletions R/theme_ggcoverage.R
@@ -5,7 +5,7 @@
#'
#' @return List of layers.
#' @importFrom ggplot2 theme_classic theme unit element_blank annotate rel scale_y_continuous expansion
#' scale_x_continuous coord_cartesian
#' scale_x_continuous
#' @importFrom scales comma
#' @export
#'
@@ -32,7 +32,7 @@ theme_coverage <- function(space = 0.2) {
#'
#' @return List of layers.
#' @importFrom ggplot2 scale_y_continuous expansion theme_classic theme unit element_blank annotate rel
#' scale_x_continuous coord_cartesian
#' scale_x_continuous
#' @importFrom scales comma
#' @export
#'
@@ -471,3 +471,55 @@ theme_cnv <- function(x.range, margin.len) {
coord_cartesian(xlim = x.range)
)
}

# theme for ggprotein: used when range.position is "in"
#' Theme for geom_protein.
#'
#' @return List of layers.
#' @importFrom ggplot2 theme_classic theme unit element_blank annotate rel scale_y_continuous expansion
#' scale_x_continuous
#' @importFrom scales comma
#' @export
#'
theme_protein <- function() {
list(
theme_classic(),
theme(
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title = element_blank()
),
annotate("segment", x = -Inf, xend = Inf, y = -Inf, yend = -Inf, size = rel(1)),
scale_y_continuous(expand = expansion(mult = c(0)))
)
}

# theme for ggprotein: used when range.position is "out"
#' Theme for geom_protein.
#'
#' @return List of layers.
#' @importFrom ggplot2 scale_y_continuous expansion theme_classic theme unit element_blank annotate rel
#' scale_x_continuous
#' @importFrom scales comma scientific
#' @export
#'
theme_protein2 <- function() {
list(
scale_y_continuous(
limits = ~ c(0, CeilingNumber(max(.x))),
breaks = ~ .x[2],
expand = expansion(mult = c(0)),
labels = function(x) format(x, scientific = TRUE, digits = 2)
),
theme_classic(),
theme(
axis.title = element_blank()
),
annotate("segment", x = -Inf, xend = Inf, y = -Inf, yend = -Inf, size = rel(1))
)
}
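
The y axis in `theme_protein2` and the in-plot range text in `geom_protein` both render the abundance ceiling in scientific notation. The sketch below shows what that formatting produces for a single value; `max.abundance` is a hypothetical number, and base `format()` stands in for `scales::scientific` so the example has no dependencies.

```r
# Hypothetical abundance ceiling; real values come from the coverage file.
max.abundance <- 1234567

# Same call as the `labels` function in theme_protein2.
axis.label <- format(max.abundance, scientific = TRUE, digits = 2)

# Bracketed range text in the style built by geom_protein for range.position = "in"
# (format() is used here instead of scales::scientific to stay dependency-free).
range.label <- paste0("[0, ", axis.label, "]")

print(range.label)
```

With this input the axis label is `1.2e+06` and the range text is `[0, 1.2e+06]`.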




9 changes: 5 additions & 4 deletions README.Rmd
@@ -14,7 +14,7 @@ knitr::opts_chunk$set(
)
```

# ggcoverage - Visualize and annotate genome coverage with ggplot2
# ggcoverage - Visualize and annotate omics coverage with ggplot2

<img src = "man/figures/ggcoverage.png" align = "right" width = "200"/>

@@ -23,10 +23,10 @@ knitr::opts_chunk$set(
[![CODE_SIZE](https://img.shields.io/github/languages/code-size/showteeth/ggcoverage.svg)](https://github.com/showteeth/ggcoverage)

## Introduction
The goal of `ggcoverage` is simplify the process of visualizing genome coverage. It contains three main parts:
The goal of `ggcoverage` is to simplify the process of visualizing omics coverage. It contains three main parts:

* **Load the data**: `ggcoverage` can load BAM, BigWig (.bw), BedGraph files from various NGS data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, et al.
* **Create genome coverage plot**
* **Load the data**: `ggcoverage` can load BAM, BigWig (.bw), BedGraph, and txt/xlsx files from various omics data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, proteomics, etc.
* **Create omics coverage plot**
* **Add annotations**: `ggcoverage` supports six different annotations:
* **base and amino acid annotation**: Visualize genome coverage at single-nucleotide level with bases and amino acids.
* **GC annotation**: Visualize genome coverage with GC content
@@ -37,6 +37,7 @@ knitr::opts_chunk$set(
* **peak annotation**: Visualize genome coverage and peak identified
* **contact map annotation**: Visualize genome coverage with Hi-C contact map
* **link annotation**: Visualize genome coverage with contacts
* **protein feature annotation**: Visualize protein coverage with features

`ggcoverage` utilizes `ggplot2` plotting system, so its usage is **ggplot2-style**!
