scPROTEIN

scPROTEIN is a deep contrastive learning framework for single-cell proteomics embedding.

Overview

For the datasets provided with raw peptide intensities, scPROTEIN stage1 estimates the uncertainty of peptide quantification and aggregates the peptide content to the protein level in an uncertainty-guided manner.
Taking the protein-level abundance matrix as input, scPROTEIN stage2 aims to alleviate data missingness, denoise the protein data, remove batch effects in a unified framework, and encode single-cell proteomic-specific embeddings. These embeddings can be applied to a variety of downstream tasks.

Setup

Users can directly download the scPROTEIN package using pip:

pip install scprotein

If for some reason this doesn't work on your device, you can also install scPROTEIN with the provided .whl file.

pip install docs/scprotein-0.1.1-py3-none-any.whl

You can check if scPROTEIN package has been successfully installed via the following command:

python3 -c "import scprotein"

Key Functions

For stage 1

You can utilize the functions of stage1 from the scPROTEIN python package as:

# Incorporate all functions of stage1:
from scprotein.peptide_uncertainty_estimation import * 

# or you can import specific function of stage1, for instance:
from scprotein.peptide_uncertainty_estimation import peptide_encode

load_peptide(data_path)

- Function:

Load the input peptide-level file, and then extract the peptide sequences, peptide-level data along with other meta information.

- Parameters:

data_path (str): Data path to load the peptide-level file.

- Returns:

peptides (list): Peptide sequences.
proteins (list): Protein names.
Y_label (array): Peptide-level abundance matrix (peptide*cell).
cell_list (list): The list containing the index of each cell.
num_cells (int): Number of total cells.

peptide_encode(peptides)

- Function:

This function takes as input peptide sequences composed of amino acids. It returns the corresponding one-hot encoding data matrix and the total number of different amino acid types.

- Parameters:

peptides (list): The input peptide sequences.

- Returns:

peptide_onehot_padding (array): One-hot encoding matrix for peptide sequences.
num_amino_acid (int): The number of different amino acid types.

peptide_CNN(num_amino_acid, max_pool_size, hidden_dim, output_dim, conv_layers, dropout_rate, kernel_nums, kernel_size)

- Function:

This function defines the Heteroscedastic regression model of scPROTEIN stage1 for peptide uncertainty estimation.

- Parameters:

num_amino_acid (int): The number of different amino acid types.
max_pool_size (int): The size of the sliding window in the max-pooling operation.
hidden_dim (int): The hidden dimension in the fully-connected layer.
output_dim (int): Output dimension of the Heteroscedastic regression model, which is twice the number of cells (each cell has a $\mu$ and a $\sigma$).
conv_layers (int): Number of convolutional layers.
dropout_rate (float): Dropout rate.
kernel_nums (int): Number of kernels in each convolutional block.
kernel_size (int): Kernel size of each convolutional block.

- Returns:

model (object): The defined Heteroscedastic regression model object.

scPROTEIN_stage1_learning(model, peptide_onehot_padding, Y_label, learning_rate, weight_decay, split_percentage, num_epochs, batch_size)

- Function:

This function constructs the framework for scPROTEIN stage1 training and prediction.

- Parameters:

model (object): Defined Heteroscedastic regression model object of stage1.
peptide_onehot_padding (array): One-hot encoding matrix for the input peptide sequences.
Y_label (array): Peptide-level abundance matrix (peptide*cell).
split_percentage (float): Split percentage of data.
learning_rate (float): Learning rate for the Adam optimizer.
weight_decay (float): Weight decay for the Adam optimizer.
num_epochs (int): Number of epochs for training stage1. We empirically set 90 to strike a balance between achieving convergence and reducing training time.
batch_size (int): Batch size for mini-batch training.

- Returns:

scPROTEIN_stage1 (object): The scPROTEIN stage1 object. The functions of scPROTEIN_stage1 are as follows:
- scPROTEIN_stage1.train(): Perform scPROTEIN stage1 training.
- scPROTEIN_stage1.uncertainty_generation(): Generate the estimated peptide uncertainty based on the trained stage1 model.

load_sc_proteomic_features(stage1)

- Function:

This function specifies whether to use stage1 and loads the single-cell protein-level data matrix.

- Parameters:

stage1 (bool): This parameter indicates if scPROTEIN starts from stage1. True represents generating protein-level data using stage1 in the uncertainty-guided manner, and False denotes directly learning from protein-level data.

- Returns:

proteins (list): Protein names.
cells (list): The list containing the index of each cell.
features (array): Single-cell proteomics data matrix.

For stage 2

Users can utilize the functions of stage2 from the fully-fledged scPROTEIN python package as:

# Incorporate all functions of stage2:
from scprotein import *

# or you can import specific function of stage2, for instance:
from scprotein import Encoder

graph_generation(features, threshold, feature_preprocess)

- Function:

This function constructs the cell graph based on the protein feature matrix.

- Parameters:

features (array): Single-cell proteomics data matrix.
threshold (float): Threshold for graph construction.
feature_preprocess (bool): Feature preprocessing.

- Returns:

graph_data (torch_geometric data object): The graph data in torch_geometric format, consisting of edges and node features.

Encoder(input_features, num_hidden, activation, num_layers)

- Function:

Construct the graph encoder for embedding learning.

- Parameters:

input_features (int): Dimension of the input feature matrix (usually the number of proteins).
num_hidden (int): Hidden dimension in the graph encoder.
activation (str): The type of non-linear activation function.
num_layers (int): Number of layers in the graph encoder.

- Returns:

encoder (PyTorch module): The defined graph encoder module.

Model(encoder, num_hidden, num_proj_hidden, tau)

- Function:

This function establishes the scPROTEIN stage2 model, consisting of a graph encoder, projection head, and loss calculation.

- Parameters:

encoder (PyTorch module): Defined graph encoder.
num_hidden (int): Hidden dimension in the graph encoder.
num_proj_hidden (int): Hidden dimension of the projection head.
tau (float): Temperature coefficient.

- Returns:

model (PyTorch module): The defined scPROTEIN stage2 model.

scPROTEIN_learning(model, device, data, drop_feature_rate_1, drop_feature_rate_2, drop_edge_rate_1, drop_edge_rate_2, learning_rate, weight_decay, num_protos, topology_denoising, num_epochs, alpha, num_changed_edges, seed)

- Function:

This function constructs the framework of scPROTEIN stage2 training and prediction.

- Parameters:

model (PyTorch module): Defined scPROTEIN stage2 model.
device (str): Running device.
data (torch_geometric data): The defined graph data in torch_geometric format, consisting of edges and node features.
drop_feature_rate_1 (float): Dropedge rate for augmentation view1.
drop_feature_rate_2 (float): Dropedge rate for augmentation view2.
drop_edge_rate_1 (float): Feature masking rate for augmentation view1.
drop_edge_rate_2 (float): Feature masking rate for augmentation view2.
learning_rate (float): Learning rate for Adam optimizer.
weight_decay (float): Weight decay for Adam optimizer.
num_protos (int): Number of prototypes.
topology_denoising (bool): Indicator of if using the topology denoising.
num_epochs (int): Number of epochs for training stage2. We empirically set 200 to strike a balance between achieving convergence and reducing training time.
alpha (float): Balance factor in the loss function.
num_changed_edges (int): Number of added/removed edges in topology denoising.
seed (int): Random seed.

- Returns:

scPROTEIN object for stage2. The functions of scPROTEIN are as follows:
- scPROTEIN.train(): Conduct training of scPROTEIN stage2.
- scPROTEIN.embedding_generation(): Generate the cell representation matrix based on the trained stage2 model.

integrate_sc_proteomic_features(dataset1, dataset2)

- Function:

This function prepares for integrating different single-cell proteomics datasets.

- Parameters:

dataset1 (h5ad format): First dataset for integration.
dataset2 (h5ad format): Second dataset for integration.

- Returns:

batch_label (array): The batch label which indicates the source of each cell.
cell_type_with_dataname (list): Cell type of each cell, along with the dataset name.
cell_type_label (array): Discrete cell type labels.
overlap_cell_type_label (list): The overlap cell type(s) of both integrated datasets.
features_concat (array): The combination of single-cell proteomics data from both integrated datasets, using the overlap proteins.

integration_visualization(cell_type_with_dataname, embedding)

- Function:

This function generates a 2D visualization plot of the data integration result. Users can customize the default cell type names and colors based on the used datasets.

- Parameters:

cell_type_with_dataname (list): Cell type of each cell, along with the dataset name. This can be obtained using the integrate_sc_proteomic_features function.
embedding (array): Learned cell representation matrix (cell * embedding).

- Returns:

A 2D visualization plot of the integration result, colored by cell types.

rank_proteins_and_volcano_plot(adata)

- Function:

This function identifies the top 5 upregulated proteins and generates a volcano plot for clinical proteomics data analysis.

- Parameters:

adata (object): The anndata object of the single-cell proteomics dataset. It should include protein rank information used for characterizing groups. This information can be generated in advance, e.g., using the sc.tl.rank_genes_groups function in Scanpy.

- Returns:

The top 5 upregulated proteins.
A volcano plot for differential protein analysis.

Questions

If you have a question about using scPROTEIN, you can post an issue or reach us by email(nkuweili@mail.nankai.edu.cn).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!