scPROTEIN is a deep contrastive learning framework for single-cell proteomics embedding.
- For the datasets provided with raw peptide intensities, scPROTEIN stage1 estimates the uncertainty of peptide quantification and aggregates the peptide content to the protein level in an uncertainty-guided manner.
- Taking the protein-level abundance matrix as input, scPROTEIN stage2 aims to alleviate data missingness, denoise the protein data, remove batch effects in a unified framework, and encode single-cell proteomic-specific embeddings. These embeddings can be applied to a variety of downstream tasks.
Users can install the scPROTEIN package directly with pip:
pip install scprotein
If this does not work on your device, you can instead install scPROTEIN from the provided .whl file:
pip install docs/scprotein-0.1.1-py3-none-any.whl
You can check whether the scPROTEIN package has been installed successfully with the following command:
python3 -c "import scprotein"
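The same check can be done from inside Python without raising an error when the package is missing. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of scPROTEIN.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    status = "installed" if is_installed("scprotein") else "not found"
    print(f"scprotein: {status}")
```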
You can use the stage1 functions from the scPROTEIN Python package as follows:
# Import all stage1 functions:
from scprotein.peptide_uncertainty_estimation import *
# or import a specific stage1 function, for instance:
from scprotein.peptide_uncertainty_estimation import peptide_encode
load_peptide(data_path)
- Function:
Load the input peptide-level file, and then extract the peptide sequences, peptide-level data along with other meta information.
- Parameters:
data_path
(str): Data path to load the peptide-level file.
- Returns:
peptides
(list): Peptide sequences.
proteins
(list): Protein names.
Y_label
(array): Peptide-level abundance matrix (peptide*cell).
cell_list
(list): The list containing the index of each cell.
num_cells
(int): Number of total cells.
peptide_encode(peptides)
- Function:
This function takes as input peptide sequences composed of amino acids. It returns the corresponding one-hot encoding data matrix and the total number of different amino acid types.
- Parameters:
peptides
(list): The input peptide sequences.
- Returns:
peptide_onehot_padding
(array): One-hot encoding matrix for peptide sequences.
num_amino_acid
(int): The number of different amino acid types.
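To make the idea of a padded one-hot encoding concrete, here is a minimal pure-Python sketch. It is not scPROTEIN's implementation: the function name `one_hot_pad`, the alphabet being derived from the observed residues, and zero-padding to the longest peptide are all assumptions made for illustration.

```python
def one_hot_pad(peptides):
    """One-hot encode peptide strings and zero-pad them to a common length.

    Returns (encoded, num_amino_acid), where encoded[i][j] is the one-hot
    vector of residue j in peptide i, or an all-zero vector for padding.
    """
    alphabet = sorted(set("".join(peptides)))  # amino acids seen in the input
    index = {aa: k for k, aa in enumerate(alphabet)}
    max_len = max(len(p) for p in peptides)

    encoded = []
    for pep in peptides:
        rows = []
        for pos in range(max_len):
            vec = [0] * len(alphabet)
            if pos < len(pep):  # positions past the end stay all-zero
                vec[index[pep[pos]]] = 1
            rows.append(vec)
        encoded.append(rows)
    return encoded, len(alphabet)


encoded, num_amino_acid = one_hot_pad(["ACD", "AC"])
print(num_amino_acid)  # 3
print(encoded[1])      # [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
```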
peptide_CNN(num_amino_acid, max_pool_size, hidden_dim, output_dim, conv_layers, dropout_rate, kernel_nums, kernel_size)
- Function:
This function defines the Heteroscedastic regression model of scPROTEIN stage1 for peptide uncertainty estimation.
- Parameters:
num_amino_acid
(int): The number of different amino acid types.
max_pool_size
(int): The size of the sliding window in the max-pooling operation.
hidden_dim
(int): The hidden dimension in the fully-connected layer.
output_dim
(int): Output dimension of the Heteroscedastic regression model, which is twice the number of cells (each cell has a $\mu$ and a $\sigma$).
conv_layers
(int): Number of convolutional layers.
dropout_rate
(float): Dropout rate.
kernel_nums
(int): Number of kernels in each convolutional block.
kernel_size
(int): Kernel size of each convolutional block.
- Returns:
model
(object): The defined Heteroscedastic regression model object.
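The sketch below illustrates why the output dimension is twice the number of cells: each cell gets a mean and an uncertainty, which can then be scored with a Gaussian negative log-likelihood, a loss commonly used for heteroscedastic regression. The output layout (means first, then log-variances) and the exact loss are assumptions for illustration; the loss scPROTEIN actually uses is not stated here, and the function names are hypothetical.

```python
import math


def split_mu_sigma(output, num_cells):
    """Split a length-2*num_cells output into per-cell means and std devs.

    Assumed layout: first half holds mu, second half holds log(sigma^2).
    """
    assert len(output) == 2 * num_cells
    mu = output[:num_cells]
    sigma = [math.exp(0.5 * log_var) for log_var in output[num_cells:]]
    return mu, sigma


def gaussian_nll(y, mu, sigma):
    """Mean Gaussian negative log-likelihood over cells."""
    total = 0.0
    for y_i, m_i, s_i in zip(y, mu, sigma):
        total += 0.5 * math.log(2 * math.pi * s_i ** 2)
        total += (y_i - m_i) ** 2 / (2 * s_i ** 2)
    return total / len(y)


mu, sigma = split_mu_sigma([1.0, 2.0, 0.0, 0.0], num_cells=2)
print(mu, sigma)  # [1.0, 2.0] [1.0, 1.0]
```

A larger predicted sigma reduces the penalty for a large residual but pays a log-variance cost, which is what lets such a model express per-peptide uncertainty.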
scPROTEIN_stage1_learning(model, peptide_onehot_padding, Y_label, learning_rate, weight_decay, split_percentage, num_epochs, batch_size)
- Function:
This function constructs the framework for scPROTEIN stage1 training and prediction.
- Parameters:
model
(object): Defined Heteroscedastic regression model object of stage1.
peptide_onehot_padding
(array): One-hot encoding matrix for the input peptide sequences.
Y_label
(array): Peptide-level abundance matrix (peptide*cell).
split_percentage
(float): Split percentage of data.
learning_rate
(float): Learning rate for the Adam optimizer.
weight_decay
(float): Weight decay for the Adam optimizer.
num_epochs
(int): Number of epochs for training stage1. We empirically set this to 90 to strike a balance between achieving convergence and reducing training time.
batch_size
(int): Batch size for mini-batch training.
- Returns:
scPROTEIN_stage1
(object): The scPROTEIN stage1 object. The functions of scPROTEIN_stage1 are as follows:
- scPROTEIN_stage1.train(): Perform scPROTEIN stage1 training.
- scPROTEIN_stage1.uncertainty_generation(): Generate the estimated peptide uncertainty based on the trained stage1 model.
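The stage1 functions described above can be chained roughly as follows. This is a sketch, not a tested recipe: the wrapper `run_stage1` and its two settings tuples are hypothetical, arguments are passed positionally in the order documented above, and it assumes that `load_peptide` and `peptide_encode` return their values in the listed order and that `uncertainty_generation()` returns the estimate. Hyperparameter values are left to the caller.

```python
def run_stage1(data_path, cnn_settings, training_settings):
    """Chain the documented stage1 calls and return the uncertainty estimate.

    cnn_settings: (max_pool_size, hidden_dim, conv_layers, dropout_rate,
                   kernel_nums, kernel_size)
    training_settings: (learning_rate, weight_decay, split_percentage,
                        num_epochs, batch_size)
    Both tuples are hypothetical containers chosen for this sketch.
    """
    # Imported lazily so this file can be loaded without scprotein installed.
    from scprotein.peptide_uncertainty_estimation import (
        load_peptide,
        peptide_encode,
        peptide_CNN,
        scPROTEIN_stage1_learning,
    )

    # Assumption: values come back in the order listed under "Returns".
    peptides, proteins, Y_label, cell_list, num_cells = load_peptide(data_path)
    peptide_onehot_padding, num_amino_acid = peptide_encode(peptides)

    (max_pool_size, hidden_dim, conv_layers,
     dropout_rate, kernel_nums, kernel_size) = cnn_settings
    model = peptide_CNN(
        num_amino_acid, max_pool_size, hidden_dim,
        2 * num_cells,  # output_dim: one mu and one sigma per cell
        conv_layers, dropout_rate, kernel_nums, kernel_size,
    )

    stage1 = scPROTEIN_stage1_learning(
        model, peptide_onehot_padding, Y_label, *training_settings
    )
    stage1.train()
    # Assumption: uncertainty_generation() returns the estimated uncertainty.
    return stage1.uncertainty_generation()
```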
load_sc_proteomic_features(stage1)
- Function:
This function specifies whether to use stage1 and loads the single-cell protein-level data matrix.
- Parameters:
stage1
(bool): This parameter indicates whether scPROTEIN starts from stage1. True represents generating protein-level data using stage1 in the uncertainty-guided manner, and False denotes directly learning from protein-level data.
- Returns:
proteins
(list): Protein names.
cells
(list): The list containing the index of each cell.
features
(array): Single-cell proteomics data matrix.
Users can use the stage2 functions from the scPROTEIN Python package as follows:
# Import all stage2 functions:
from scprotein import *
# or import a specific stage2 function, for instance:
from scprotein import Encoder
graph_generation(features, threshold, feature_preprocess)
- Function:
This function constructs the cell graph based on the protein feature matrix.
- Parameters:
features
(array): Single-cell proteomics data matrix.
threshold
(float): Threshold for graph construction.
feature_preprocess
(bool): Feature preprocessing.
- Returns:
graph_data
(torch_geometric data object): The graph data in torch_geometric format, consisting of edges and node features.
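As an illustration of threshold-based graph construction, the sketch below connects two cells when the similarity of their protein profiles exceeds a threshold. The similarity measure (Pearson correlation), the strict `>` comparison, and the function names are assumptions for this example; the measure used by `graph_generation` is not stated here, and the real function returns a torch_geometric object rather than a plain edge list.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return cov / norm if norm > 0 else 0.0


def build_edges(features, threshold):
    """Return undirected edges (i, j), i < j, between sufficiently similar cells."""
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if pearson(features[i], features[j]) > threshold:
                edges.append((i, j))
    return edges


# Toy cell-by-protein matrix (made-up values).
features = [
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 6.2],
    [3.0, 1.0, 0.5],
]
print(build_edges(features, threshold=0.5))  # [(0, 1)]
```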
Encoder(input_features, num_hidden, activation, num_layers)
- Function:
Construct the graph encoder for embedding learning.
- Parameters:
input_features
(int): Dimension of the input feature matrix (usually the number of proteins).
num_hidden
(int): Hidden dimension in the graph encoder.
activation
(str): The type of non-linear activation function.
num_layers
(int): Number of layers in the graph encoder.
- Returns:
encoder
(PyTorch module): The defined graph encoder module.
Model(encoder, num_hidden, num_proj_hidden, tau)
- Function:
This function establishes the scPROTEIN stage2 model, consisting of a graph encoder, projection head, and loss calculation.
- Parameters:
encoder
(PyTorch module): Defined graph encoder.
num_hidden
(int): Hidden dimension in the graph encoder.
num_proj_hidden
(int): Hidden dimension of the projection head.
tau
(float): Temperature coefficient.
- Returns:
model
(PyTorch module): The defined scPROTEIN stage2 model.
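To show what a temperature coefficient such as `tau` does in a contrastive loss, here is a simplified InfoNCE-style sketch in pure Python. It is an assumption that the stage2 loss has this general form; this version uses only cross-view negatives and is not scPROTEIN's loss. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(view1, view2, tau):
    """Mean InfoNCE loss: node i in view1 should match node i in view2."""
    total = 0.0
    for i, u in enumerate(view1):
        logits = [cosine(u, v) / tau for v in view2]  # smaller tau sharpens
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - logits[i]
    return total / len(view1)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
# Matching views give a lower loss than mismatched ones.
print(info_nce(aligned, aligned, tau=0.5) < info_nce(aligned, swapped, tau=0.5))  # True
```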
scPROTEIN_learning(model, device, data, drop_feature_rate_1, drop_feature_rate_2, drop_edge_rate_1, drop_edge_rate_2, learning_rate, weight_decay, num_protos, topology_denoising, num_epochs, alpha, num_changed_edges, seed)
- Function:
This function constructs the framework of scPROTEIN stage2 training and prediction.
- Parameters:
model
(PyTorch module): Defined scPROTEIN stage2 model.
device
(str): Running device.
data
(torch_geometric data): The defined graph data in torch_geometric format, consisting of edges and node features.
drop_feature_rate_1
(float): Feature masking rate for augmentation view1.
drop_feature_rate_2
(float): Feature masking rate for augmentation view2.
drop_edge_rate_1
(float): Edge dropping rate for augmentation view1.
drop_edge_rate_2
(float): Edge dropping rate for augmentation view2.
learning_rate
(float): Learning rate for Adam optimizer.
weight_decay
(float): Weight decay for Adam optimizer.
num_protos
(int): Number of prototypes.
topology_denoising
(bool): Whether to use topology denoising.
num_epochs
(int): Number of epochs for training stage2. We empirically set this to 200 to strike a balance between achieving convergence and reducing training time.
alpha
(float): Balance factor in the loss function.
num_changed_edges
(int): Number of added/removed edges in topology denoising.
seed
(int): Random seed.
- Returns:
scPROTEIN
(object): The scPROTEIN object for stage2. The functions of scPROTEIN are as follows:
- scPROTEIN.train(): Conduct training of scPROTEIN stage2.
- scPROTEIN.embedding_generation(): Generate the cell representation matrix based on the trained stage2 model.
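The two augmentation types controlled by the drop_feature_rate and drop_edge_rate parameters can be pictured with the following pure-Python sketch. It assumes that feature masking zeroes whole feature columns and that edges are dropped independently, which is a common scheme in graph contrastive learning but is not confirmed here as scPROTEIN's exact procedure. The function names and the toy values are hypothetical.

```python
import random


def mask_features(features, drop_rate, rng):
    """Zero each feature column with probability drop_rate (same mask for all cells)."""
    num_features = len(features[0])
    keep = [rng.random() >= drop_rate for _ in range(num_features)]
    return [[v if k else 0.0 for v, k in zip(row, keep)] for row in features]


def drop_edges(edges, drop_rate, rng):
    """Remove each edge independently with probability drop_rate."""
    return [edge for edge in edges if rng.random() >= drop_rate]


# Build two differently perturbed views of the same toy graph.
rng = random.Random(0)  # fixed seed for reproducibility
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
edges = [(0, 1)]
view1 = (mask_features(features, 0.2, rng), drop_edges(edges, 0.2, rng))
view2 = (mask_features(features, 0.4, rng), drop_edges(edges, 0.4, rng))
```

Using different rates for the two views gives the contrastive objective two distinct perturbations of the same cells to align.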
integrate_sc_proteomic_features(dataset1, dataset2)
- Function:
This function prepares for integrating different single-cell proteomics datasets.
- Parameters:
dataset1
(h5ad format): First dataset for integration.
dataset2
(h5ad format): Second dataset for integration.
- Returns:
batch_label
(array): The batch label which indicates the source of each cell.
cell_type_with_dataname
(list): Cell type of each cell, along with the dataset name.
cell_type_label
(array): Discrete cell type labels.
overlap_cell_type_label
(list): The overlap cell type(s) of both integrated datasets.
features_concat
(array): The combination of single-cell proteomics data from both integrated datasets, using the overlap proteins.
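The idea of combining two datasets on their overlapping proteins, with a batch label per cell, can be sketched in plain Python as follows. This works on lists rather than h5ad files and is only an illustration; the function name `concat_on_shared_proteins`, the 0/1 batch encoding, and the toy values are assumptions.

```python
def concat_on_shared_proteins(proteins1, features1, proteins2, features2):
    """Stack two cell-by-protein matrices using only the proteins they share.

    Returns (shared_proteins, features_concat, batch_label), where
    batch_label[i] is 0 for cells from dataset 1 and 1 for dataset 2.
    """
    col1 = {p: k for k, p in enumerate(proteins1)}
    col2 = {p: k for k, p in enumerate(proteins2)}
    shared = [p for p in proteins1 if p in col2]  # keep dataset-1 order
    rows1 = [[row[col1[p]] for p in shared] for row in features1]
    rows2 = [[row[col2[p]] for p in shared] for row in features2]
    batch_label = [0] * len(rows1) + [1] * len(rows2)
    return shared, rows1 + rows2, batch_label


shared, features_concat, batch_label = concat_on_shared_proteins(
    ["P1", "P2", "P3"], [[1.0, 2.0, 3.0]],
    ["P3", "P1"], [[30.0, 10.0], [31.0, 11.0]],
)
print(shared)           # ['P1', 'P3']
print(features_concat)  # [[1.0, 3.0], [10.0, 30.0], [11.0, 31.0]]
print(batch_label)      # [0, 1, 1]
```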
integration_visualization(cell_type_with_dataname, embedding)
- Function:
This function generates a 2D visualization plot of the data integration result. Users can customize the default cell type names and colors based on the used datasets.
- Parameters:
cell_type_with_dataname
(list): Cell type of each cell, along with the dataset name. This can be obtained using the integrate_sc_proteomic_features function.
embedding
(array): Learned cell representation matrix (cell * embedding).
- Returns:
- A 2D visualization plot of the integration result, colored by cell types.
rank_proteins_and_volcano_plot(adata)
- Function:
This function identifies the top 5 upregulated proteins and generates a volcano plot for clinical proteomics data analysis.
- Parameters:
adata
(object): The anndata object of the single-cell proteomics dataset. It should include protein rank information used for characterizing groups. This information can be generated in advance, e.g., using the sc.tl.rank_genes_groups function in Scanpy.
- Returns:
- The top 5 upregulated proteins.
- A volcano plot for differential protein analysis.
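For readers unfamiliar with volcano plots, the sketch below computes the usual coordinates (log2 fold change on the x-axis, -log10 p-value on the y-axis) and picks top upregulated proteins from a list of per-protein statistics. The ranking criterion (largest positive fold change among proteins below a p-value cutoff), the cutoff itself, the function names, and the toy numbers are assumptions for illustration; they are not taken from `rank_proteins_and_volcano_plot`.

```python
import math


def volcano_points(stats):
    """Map (protein, log2_fold_change, p_value) to (protein, x, y) with y = -log10(p)."""
    return [(name, lfc, -math.log10(p)) for name, lfc, p in stats]


def top_upregulated(stats, k=5, p_cutoff=0.05):
    """Return up to k proteins with the largest positive fold change and p < p_cutoff."""
    significant = [s for s in stats if s[1] > 0 and s[2] < p_cutoff]
    ranked = sorted(significant, key=lambda s: s[1], reverse=True)
    return [name for name, _, _ in ranked[:k]]


# Toy statistics (made-up values).
stats = [("A", 2.0, 0.001), ("B", -1.5, 0.01), ("C", 0.5, 0.2), ("D", 1.0, 0.04)]
print(top_upregulated(stats, k=2))  # ['A', 'D']
```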
If you have a question about using scPROTEIN, you can post an issue or reach us by email ([email protected]).