support proteomics

showteeth · showteeth · commit ad391c31b2ca · 2023-05-24T14:01:30.000+08:00
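
The core of `geom_protein` is locating each peptide in the reference protein sequence, merging the matched ranges, and reporting the covered percentage. The snippet below is a minimal base-R sketch of that idea, not the code in this commit: `peptide_coverage` is a hypothetical helper, it uses `regexpr(fixed = TRUE)` in place of `stringr::str_locate` and a manual merge in place of `GenomicRanges::reduce`, and it assumes that peptides with no match should simply be dropped.

```r
# Hypothetical helper (not part of ggcoverage): base-R sketch of the coverage logic.
peptide_coverage <- function(protein.seq, peptides) {
  # first exact match of each peptide; fixed = TRUE avoids regex interpretation
  starts <- vapply(peptides, function(p) regexpr(p, protein.seq, fixed = TRUE)[1],
                   integer(1), USE.NAMES = FALSE)
  found <- starts > 0 # regexpr returns -1 when there is no match
  starts <- starts[found]
  ends <- starts + nchar(peptides[found]) - 1L
  # sort by start, then end
  ord <- order(starts, ends)
  starts <- starts[ord]
  ends <- ends[ord]
  # merge overlapping or directly adjacent ranges
  m.start <- integer(0)
  m.end <- integer(0)
  for (i in seq_along(starts)) {
    n <- length(m.start)
    if (n > 0 && starts[i] <= m.end[n] + 1L) {
      m.end[n] <- max(m.end[n], ends[i])
    } else {
      m.start <- c(m.start, starts[i])
      m.end <- c(m.end, ends[i])
    }
  }
  covered <- data.frame(start = m.start, end = m.end, width = m.end - m.start + 1L)
  # percentage of residues covered by at least one peptide
  rate <- round(sum(covered$width) * 100 / nchar(protein.seq), 2)
  list(covered = covered, rate = rate)
}

# toy sequence and peptides, for illustration only
res <- peptide_coverage("MKWVTFISLLLLFSSAYSRG", c("KWVT", "VTFIS", "SAYS", "XXXX"))
print(res$covered)
print(res$rate)
```

In this toy input the first two peptides overlap and collapse into one covered range, while the unmatched peptide is ignored.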
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,16 +1,17 @@
 Package: ggcoverage
 Type: Package
-Title: Visualize Genome Coverage with Various Annotations
-Version: 1.1.0
+Title: Visualize Genome/Protein Coverage with Various Annotations
+Version: 1.2.0
 Authors@R: 
     person(given = "Yabing",
            family = "Song",
            role = c("aut", "cre"),
            email = "songyb0519@gmail.com")
 Maintainer: Yabing Song <songyb0519@gmail.com>
-Description: The goal of 'ggcoverage' is to simplify the process of visualizing genome coverage. It contains functions to 
-    load data from BAM, BigWig or BedGraph files, create genome coverage plot, add various annotations to 
-    the coverage plot, including base and amino acid annotation, GC annotation, gene annotation, transcript annotation, ideogram annotation and peak annotation.
+Description: The goal of 'ggcoverage' is to simplify the process of visualizing genome/protein coverage. It contains functions to 
+    load data from BAM, BigWig, BedGraph or txt/xlsx files, create genome/protein coverage plot, add various annotations to 
+    the coverage plot, including base and amino acid annotation, GC annotation, gene annotation, transcript annotation, ideogram annotation,
+    peak annotation, contact map annotation, link annotation and protein feature annotation.
 License: MIT + file LICENSE
 Encoding: UTF-8
 RoxygenNote: 7.1.1
@@ -45,7 +46,10 @@ Imports:
     ggforce,
     HiCBricks,
     ggpattern,
-    BiocParallel
+    BiocParallel,
+    openxlsx,
+    stringr,
+    ggpp
 Suggests: 
     rmarkdown,
     knitr,
diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,9 @@
+# ggcoverage 1.2.0
+## Major changes
+* Support protein coverage and annotation plots (`ggprotein`, `geom_protein`).
+
+-------------
+
 # ggcoverage 1.1.0
 ## Major changes
 * Mark SNV with twill (add twill to position with SNV), star (add star mark to position with SNV), and highlight (position without SNV is grey).
diff --git a/R/geom_protein.R b/R/geom_protein.R
@@ -0,0 +1,179 @@
+#' Layer for Protein Coverage Plot.
+#'
+#' @param coverage.file Exported protein coverage file, should be in Excel (xlsx) format.
+#' @param fasta.file Input reference protein fasta file.
+#' @param protein.id The protein ID of exported coverage file. This should be unique and in \code{fasta.file}.
+#' @param XCorr.threshold The cross-correlation threshold. Default: 2.
+#' @param confidence The confidence level. Default: High.
+#' @param contaminant Whether to remove contaminant peptides. Default: NULL (not remove).
+#' @param remove.na Logical value, whether to remove NA value in Abundance column. Default: TRUE.
+#' @param color The fill color of coverage plot. Default: grey.
+#' @param mark.bare Logical value, whether to mark region where Abundance is zero or NA. Default: TRUE.
+#' @param mark.color The color used for the marked region. Default: red.
+#' @param mark.alpha The transparency used for the marked region. Default: 0.5.
+#' @param show.table Logical value, whether to show coverage summary table. Default: TRUE.
+#' @param table.position The position of the coverage summary table, chosen from right_top, left_top, left_bottom, right_bottom.
+#' Default: right_top.
+#' @param table.size The font size of coverage summary table. Default: 4.
+#' @param table.color The font color of coverage summary table. Default: black.
+#' @param range.size The label size of range text, used when \code{range.position} is in. Default: 3.
+#' @param range.position The position of y axis range, chosen from in (move y axis in the plot) and
+#' out (normal y axis). Default: in.
+#'
+#' @return List of layers.
+#' @importFrom openxlsx read.xlsx
+#' @importFrom magrittr %>%
+#' @importFrom dplyr filter group_by summarise arrange
+#' @importFrom rlang .data
+#' @importFrom Biostrings readAAStringSet
+#' @importFrom stringr str_locate
+#' @importFrom GenomicRanges reduce GRanges setdiff
+#' @importFrom IRanges IRanges
+#' @importFrom ggplot2 ggplot geom_rect geom_text aes aes_string scale_x_continuous
+#' @importFrom ggpp annotate
+#' @importFrom scales scientific
+#' @export
+#'
+#' @examples
+#' # library(ggplot2)
+#' # library(ggcoverage)
+#' # coverage.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage")
+#' # fasta.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage")
+#' # protein.id = "sp|
+#' # ggplot() +
+#' #     geom_protein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id)
+geom_protein = function(coverage.file, fasta.file, protein.id, XCorr.threshold = 2,
+                        confidence = "High", contaminant = NULL, remove.na = TRUE,
+                        color = "grey", mark.bare = TRUE, mark.color = "red", mark.alpha = 0.5,
+                        show.table = TRUE, table.position = c("right_top", "left_top", "left_bottom", "right_bottom"),
+                        table.size = 4, table.color = "black", range.size = 3, range.position = c("in", "out")){
+  # check parameters
+  table.position <- match.arg(arg = table.position)
+  range.position <- match.arg(arg = range.position)
+
+  # load coverage dataframe
+  coverage.df = openxlsx::read.xlsx(coverage.file)
+  # remove suffix and prefix string
+  coverage.df$Annotated.Sequence = gsub(pattern = ".*\\.(.*)\\..*", replacement = "\\1", x = coverage.df$Annotated.Sequence)
+  # filter coverage according to confidence
+  if(!is.null(confidence)){
+    coverage.df = coverage.df[coverage.df[, "Confidence"] == confidence, ]
+  }
+  # filter coverage according to contaminant
+  if(!is.null(contaminant)){
+    coverage.df = coverage.df[coverage.df[, "Contaminant"] == contaminant, ]
+  }
+  # filter coverage according to cross-correlation
+  if(!is.null(XCorr.threshold)){
+    xcorr.index = grep(pattern = "XCorr", x = colnames(coverage.df))
+    coverage.df = coverage.df[coverage.df[, xcorr.index] >= XCorr.threshold, ]
+  }
+  # get abundance cols
+  abundance.col = grep(pattern = "Abundance", x = colnames(coverage.df), value = TRUE)
+  # remove na abundance
+  if(remove.na){
+    coverage.df = coverage.df %>% dplyr::filter(!is.na(.data[[abundance.col]]))
+  }
+  # sum abundance of duplicated Annotated.Sequence
+  coverage.df = coverage.df %>%
+    dplyr::group_by(.data[["Annotated.Sequence"]]) %>%
+    dplyr::summarise(Abundance = sum(.data[[abundance.col]])) %>%
+    as.data.frame()
+  colnames(coverage.df) = c("peptide", "abundance")
+  # check the coverage dataframe
+  if(nrow(coverage.df) == 0){
+    stop("There is no valid peptide, please check!")
+  }
+
+  # load protein fasta
+  aa.set = Biostrings::readAAStringSet(fasta.file)
+  protein.index = which(names(aa.set) == protein.id)
+  if(length(protein.index) == 1){
+    aa.set.used = aa.set[protein.index]
+    aa.seq.used = paste(aa.set.used)
+  }else if(length(protein.index) > 1){
+    stop("Please check the protein.id you provided, there is more than one in provided fasta file!")
+  }else{
+    stop("Please check the protein.id you provided, it can't be found in provided fasta file!")
+  }
+
+  # get the region
+  aa.anno.region = sapply(coverage.df$peptide, function(x){
+    stringr::str_locate(pattern =x, aa.seq.used)
+  }) %>% t() %>% as.data.frame()
+  colnames(aa.anno.region) = c("start", "end")
+
+  # merge
+  coverage.final = merge(coverage.df, aa.anno.region, by.x = "peptide", by.y = 0, all.x = TRUE)
+  coverage.final = coverage.final %>% dplyr::arrange(.data[["start"]], .data[["end"]])
+
+  # get coverage positions
+  coverage.pos =
+    GenomicRanges::reduce(GenomicRanges::GRanges(protein.id, IRanges::IRanges(coverage.final$start, coverage.final$end))) %>%
+    as.data.frame()
+  coverage.pos$strand = NULL
+  colnames(coverage.pos) = c("ProteinID", "start", "end", "width")
+  coverage.pos$Type = "covered"
+  # get coverage rate
+  coverage.rate = round(sum(coverage.pos$width)*100/nchar(aa.seq.used), 2)
+  # non-cover position
+  non.coverage.pos =
+    GenomicRanges::setdiff(GenomicRanges::GRanges(protein.id, IRanges::IRanges(1, nchar(aa.seq.used))),
+                           GenomicRanges::GRanges(protein.id, IRanges::IRanges(coverage.final$start, coverage.final$end))) %>%
+    as.data.frame()
+  non.coverage.pos$strand = NULL
+  colnames(non.coverage.pos) = c("ProteinID", "start", "end", "width")
+  non.coverage.pos$Type = "bare"
+  # coverage summary
+  coverage.summary = rbind(coverage.pos, non.coverage.pos) %>% as.data.frame()
+
+  # coverage rect
+  coverage.rect = geom_rect(data = coverage.final, mapping = aes_string(xmin = "start", xmax = "end",
+                                                                        ymin = "0", ymax = "abundance"),
+                            show.legend = FALSE, fill = color)
+  plot.ele <- list(coverage.rect)
+  # mark bare
+  if(mark.bare){
+    bare.rect = geom_rect(data = non.coverage.pos, mapping = aes_string(xmin = "start", xmax = "end",
+                                                                        ymin = "0", ymax = "Inf"),
+                          show.legend = FALSE, fill = mark.color, alpha = mark.alpha)
+    plot.ele <- append(plot.ele, bare.rect)
+  }
+  # summary table
+  if(show.table){
+    # table position
+    if(table.position == "left_top"){
+      table.x = 0
+      table.y = max(coverage.final[ , "abundance"])
+    }else if(table.position == "right_top"){
+      table.x = nchar(aa.seq.used)
+      table.y = max(coverage.final[ , "abundance"])
+    }else if(table.position == "left_bottom"){
+      table.x = 0
+      table.y = 0
+    }else if(table.position == "right_bottom"){
+      table.x = nchar(aa.seq.used)
+      table.y = 0
+    }
+    summary.table = ggpp::annotate(geom = "table", label = list(coverage.summary), x= table.x, y=table.y,
+                                   color = table.color, size = table.size)
+    plot.ele <- append(plot.ele, summary.table)
+  }
+  # range position
+  if (range.position == "in") {
+    # prepare range
+    max.abundance = CeilingNumber(max(coverage.final$abundance))
+    abundance.range = data.frame(label = paste0("[0, ", scales::scientific(max.abundance, digits = 2), "]"))
+    range.text = geom_text(
+      data = abundance.range,
+      mapping = aes(x = -Inf, y = Inf, label = label),
+      hjust = 0,
+      vjust = 1.5,
+      size = range.size
+    )
+    plot.ele <- append(plot.ele, range.text)
+  }
+  # change x scale
+  plot.ele <- append(plot.ele, scale_x_continuous(limits = c(1, nchar(aa.seq.used)), expand = c(0, 0)))
+  return(plot.ele)
+}
diff --git a/R/ggprotein.R b/R/ggprotein.R
@@ -0,0 +1,69 @@
+#' Create Mass Spectrometry Protein Coverage Plot.
+#'
+#' @param coverage.file Exported protein coverage file, should be in Excel (xlsx) format.
+#' @param fasta.file Input reference protein fasta file.
+#' @param protein.id The protein ID of exported coverage file. This should be unique and in \code{fasta.file}.
+#' @param XCorr.threshold The cross-correlation threshold. Default: 2.
+#' @param confidence The confidence level. Default: High.
+#' @param contaminant Whether to remove contaminant peptides. Default: NULL (not remove).
+#' @param remove.na Logical value, whether to remove NA value in Abundance column. Default: TRUE.
+#' @param color The fill color of coverage plot. Default: grey.
+#' @param mark.bare Logical value, whether to mark region where Abundance is zero or NA. Default: TRUE.
+#' @param mark.color The color used for the marked region. Default: red.
+#' @param mark.alpha The transparency used for the marked region. Default: 0.5.
+#' @param show.table Logical value, whether to show coverage summary table. Default: TRUE.
+#' @param table.position The position of the coverage summary table, chosen from right_top, left_top, left_bottom, right_bottom.
+#' Default: right_top.
+#' @param table.size The font size of coverage summary table. Default: 4.
+#' @param table.color The font color of coverage summary table. Default: black.
+#' @param range.size The label size of range text, used when \code{range.position} is in. Default: 3.
+#' @param range.position The position of y axis range, chosen from in (move y axis in the plot) and
+#' out (normal y axis). Default: in.
+#'
+#' @return A ggplot2 object.
+#' @importFrom openxlsx read.xlsx
+#' @importFrom magrittr %>%
+#' @importFrom dplyr filter group_by summarise arrange
+#' @importFrom rlang .data
+#' @importFrom Biostrings readAAStringSet
+#' @importFrom stringr str_locate
+#' @importFrom GenomicRanges reduce GRanges setdiff
+#' @importFrom IRanges IRanges
+#' @importFrom ggplot2 ggplot geom_rect geom_text aes aes_string scale_x_continuous theme_classic theme
+#' element_blank annotate rel scale_y_continuous expansion
+#' @importFrom ggpp annotate
+#' @importFrom scales scientific
+#' @export
+#'
+#' @examples
+#' # library(ggcoverage)
+#' # coverage.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage")
+#' # fasta.file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage")
+#' # protein.id = "sp|P02769|ALBU_BOVIN"
+#' # ggprotein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id)
+ggprotein = function(coverage.file, fasta.file, protein.id, XCorr.threshold = 2,
+                     confidence = "High", contaminant = NULL, remove.na = TRUE,
+                     color = "grey", mark.bare = TRUE, mark.color = "red", mark.alpha = 0.5,
+                     show.table = TRUE, table.position = c("right_top", "left_top", "left_bottom", "right_bottom"),
+                     table.size = 4, table.color = "black", range.size = 3, range.position = c("in", "out"), plot.space = 0.2){
+  # check parameters
+  table.position <- match.arg(arg = table.position)
+  range.position <- match.arg(arg = range.position)
+
+  # ms protein plot
+  protein.plot = ggplot() +
+    geom_protein(coverage.file = coverage.file, fasta.file = fasta.file, protein.id = protein.id,
+                 XCorr.threshold = XCorr.threshold, confidence = confidence, contaminant = contaminant,
+                 remove.na = remove.na, color = color, mark.bare = mark.bare, mark.color = mark.color,
+                 mark.alpha = mark.alpha, show.table = show.table, table.position = table.position,
+                 table.size = table.size, table.color = table.color, range.size = range.size, range.position = range.position)
+
+  # add theme
+  if (range.position == "in") {
+    protein.plot +
+      theme_protein()
+  } else if (range.position == "out") {
+    protein.plot +
+      theme_protein2()
+  }
+}
diff --git a/R/theme_ggcoverage.R b/R/theme_ggcoverage.R
@@ -5,7 +5,7 @@
 #'
 #' @return List of layers.
 #' @importFrom ggplot2 theme_classic theme unit element_blank annotate rel scale_y_continuous expansion
-#' scale_x_continuous coord_cartesian
+#' scale_x_continuous
 #' @importFrom scales comma
 #' @export
 #'
@@ -32,7 +32,7 @@ theme_coverage <- function(space = 0.2) {
 #'
 #' @return List of layers.
 #' @importFrom ggplot2 scale_y_continuous expansion theme_classic theme unit element_blank annotate rel
-#' scale_x_continuous coord_cartesian
+#' scale_x_continuous
 #' @importFrom scales comma
 #' @export
 #'
@@ -471,3 +471,55 @@ theme_cnv <- function(x.range, margin.len) {
     coord_cartesian(xlim = x.range)
   )
 }
+
+# theme for ggprotein: used when range.position is "in"
+#' Theme for geom_protein.
+#'
+#' @return List of layers.
+#' @importFrom ggplot2 theme_classic theme unit element_blank annotate rel scale_y_continuous expansion
+#' scale_x_continuous
+#' @importFrom scales comma
+#' @export
+#'
+theme_protein <- function() {
+  list(
+    theme_classic(),
+    theme(
+      axis.line.y = element_blank(),
+      axis.ticks.y = element_blank(),
+      axis.text.y = element_blank(),
+      axis.title = element_blank()
+    ),
+    annotate("segment", x = -Inf, xend = Inf, y = -Inf, yend = -Inf, size = rel(1)),
+    scale_y_continuous(expand = expansion(mult = c(0)))
+  )
+}
+
+# theme for ggprotein: used when range.position is "out"
+#' Theme for geom_protein.
+#'
+#' @return List of layers.
+#' @importFrom ggplot2 scale_y_continuous expansion theme_classic theme unit element_blank annotate rel
+#' scale_x_continuous
+#' @importFrom scales comma scientific
+#' @export
+#'
+theme_protein2 <- function() {
+  list(
+    scale_y_continuous(
+      limits = ~ c(0, CeilingNumber(max(.x))),
+      breaks = ~ .x[2],
+      expand = expansion(mult = c(0)),
+      labels = function(x) format(x, scientific = TRUE, digits = 2)
+    ),
+    theme_classic(),
+    theme(
+      axis.title = element_blank()
+    ),
+    annotate("segment", x = -Inf, xend = Inf, y = -Inf, yend = -Inf, size = rel(1))
+  )
+}
+
+
+
+
diff --git a/README.Rmd b/README.Rmd
@@ -14,7 +14,7 @@ knitr::opts_chunk$set(
 )
 ```
 
-# ggcoverage - Visualize and annotate genome coverage with ggplot2
+# ggcoverage - Visualize and annotate omics coverage with ggplot2
 
 <img src = "man/figures/ggcoverage.png" align = "right" width = "200"/>
 
@@ -23,10 +23,10 @@ knitr::opts_chunk$set(
 [![CODE_SIZE](https://img.shields.io/github/languages/code-size/showteeth/ggcoverage.svg)](https://github.com/showteeth/ggcoverage)
 
 ## Introduction
-  The goal of `ggcoverage` is simplify the process of visualizing genome coverage. It contains three main parts:
+  The goal of `ggcoverage` is to simplify the process of visualizing omics coverage. It contains three main parts:
 
-* **Load the data**: `ggcoverage` can load BAM, BigWig (.bw), BedGraph files from various NGS data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, et al.
-* **Create genome coverage plot**
+* **Load the data**: `ggcoverage` can load BAM, BigWig (.bw), BedGraph, txt/xlsx files from various omics data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, proteomics, etc.
+* **Create omics coverage plot**
 * **Add annotations**: `ggcoverage` supports six different annotations:
   * **base and amino acid annotation**: Visualize genome coverage at single-nucleotide level with bases and amino acids.
   * **GC annotation**: Visualize genome coverage with GC content
@@ -37,6 +37,7 @@ knitr::opts_chunk$set(
   * **peak annotation**: Visualize genome coverage and peak identified
   * **contact map annotation**: Visualize genome coverage with Hi-C contact map
   * **link annotation**: Visualize genome coverage with contacts
+  * **protein feature annotation**: Visualize protein coverage with features
 
 `ggcoverage` utilizes `ggplot2` plotting system, so its usage is **ggplot2-style**!
 
diff --git a/inst/extdata/Proteomics/MS_BSA_coverage.fasta b/inst/extdata/Proteomics/MS_BSA_coverage.fasta
diff --git a/inst/extdata/Proteomics/MS_BSA_coverage.xlsx b/inst/extdata/Proteomics/MS_BSA_coverage.xlsx
diff --git a/vignettes/ggcoverage.Rmd b/vignettes/ggcoverage.Rmd