| -# scCello |
| 1 | +<div align="center"> |
| 2 | + |
| 3 | +# scCello: Cell-ontology Guided Transcriptome Foundation Model |
| 4 | + |
| 5 | +[PyTorch](https://pytorch.org/get-started/locally/) |
| 6 | +[arXiv](https://arxiv.org/abs/2408.12373) |
| 7 | +[Hugging Face](https://huggingface.co/collections/katarinayuan/sccello-67a01b6841f3658ba443c58a) |
| 8 | + |
| 9 | + |
| 10 | +</div> |
| 11 | + |
| 12 | +PyG implementation of [scCello], a cell-ontology guided transcriptome foundation model (TFM) for single-cell RNA-seq data. Authored by [Xinyu Yuan] and [Zhihao Zhan]. |
| 13 | + |
| 14 | +[Xinyu Yuan]: https://github.com/KatarinaYuan |
| 15 | +[Zhihao Zhan]: https://github.com/zhan8855 |
| 16 | +[scCello]: https://github.com/DeepGraphLearning/scCello |
| 17 | + |
| 18 | +## Overview ## |
| 19 | + |
| 20 | +scCello enhances transcriptome foundation models (TFMs) by integrating cell ontology graphs into pre-training, addressing the limitation of treating cells as independent entities. By incorporating two cell-level objectives, **cell-type coherence loss** and **ontology alignment loss**, scCello demonstrates superior or competitive generalization and transferability over existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, predicting cell-type-specific marker genes, and predicting cancer drug responses. |
| 21 | + |
| 22 | +This repository is based on PyTorch 2.0 and Python 3.9. |
| 23 | + |
| 24 | + |
| 25 | + |
| 26 | +Table of contents: |
| 27 | +* [Features](#features) |
| 28 | +* [Updates](#updates) |
| 29 | +* [Installation](#installation) |
| 30 | +* [Download](#download) |
| 31 | + * [Model checkpoints](#model-checkpoints) |
| 32 | + * [Pre-training and downstream datasets](#pre-training-and-downstream-datasets) |
| 33 | + * [Example h5ad data](#example-h5ad-data) |
| 34 | +* [Usage](#usage) |
| 35 | + * [h5ad data format transformation](#h5ad-data-format-transformation) |
| 36 | + * [Downstream generalization](#downstream-generalization) |
| 37 | + * [Cell type clustering & batch integration](#cell-type-clustering--batch-integration) |
| 38 | + * [Cell type classification](#cell-type-classification) |
| 39 | + * [Novel cell type classification](#novel-cell-type-classification) |
| 40 | + * [Downstream transferability](#downstream-transferability) |
| 41 | + * [Marker gene prediction](#marker-gene-prediction) |
| 42 | + * [Cancer drug response prediction](#cancer-drug-response-prediction) |
| 43 | + * [Pre-training](#pre-training) |
| 44 | +* [Citation](#citation) |
| 45 | + |
| 46 | +## Features ## |
| 47 | +* **Cell-type Specific Learning**: Utilizes cell-type coherence loss to learn specific gene expression patterns relevant to each cell type. |
| 48 | +* **Ontology-aware Modeling**: Employs ontology alignment loss to understand and preserve the hierarchical relationships among different cell types. |
| 49 | +* **Large-scale Pre-training**: Trained on over 22 million cells from the CellxGene database, ensuring robust and generalizable models. |
| 50 | +* **Advanced Generalization and Transferability**: Demonstrates superior performance on various biologically significant tasks such as identifying novel cell types and predicting cell-type-specific marker genes. |
| 51 | + |
| 52 | + |
| 53 | + |
| 54 | +## Updates |
| 55 | +* **Feb 5th, 2025**: scCello code released! |
| 56 | +* **Oct 1st, 2024**: scCello got accepted at NeurIPS 2024! |
| 57 | +* **Aug 22nd, 2024**: scCello preprint released on arXiv! |
| 58 | + |
| 59 | +## Installation ## |
| 60 | + |
| 61 | +You may install the dependencies via the following bash command. |
| 62 | + |
| 63 | +```bash |
| 64 | +conda install pytorch==2.0.1 pytorch-cuda=11.7 -c pytorch -c nvidia |
| 65 | +pip install transformers[torch] |
| 66 | +pip install easydict |
| 67 | +pip install psutil |
| 68 | +pip install wandb |
| 69 | +pip install pytz |
| 70 | +pip install ipdb |
| 71 | +pip install pandas |
| 72 | +pip install datasets |
| 73 | +pip install torchmetrics |
| 74 | +pip install rdflib |
| 75 | +pip install hickle |
| 76 | +pip install anndata |
| 77 | +pip install scikit-learn |
| 78 | +pip install scanpy |
| 79 | +pip install scib |
| 80 | +conda install -c conda-forge cupy |
| 81 | +conda install rapidsai::cuml |
| 82 | +conda install -c rapidsai -c conda-forge -c nvidia cugraph |
| 83 | +``` |
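After installation, a quick check that the packages are importable can catch a missing dependency before a long run. The sketch below is not part of the scCello codebase: it uses only the Python standard library, and the module names are assumed to be the usual import names of the packages installed above (for example, `sklearn` for `scikit-learn` and `torch` for `pytorch`).

```python
from importlib.util import find_spec

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = [
    "torch", "transformers", "easydict", "psutil", "wandb", "pytz", "ipdb",
    "pandas", "datasets", "torchmetrics", "rdflib", "hickle", "anndata",
    "sklearn", "scanpy", "scib", "cupy", "cuml", "cugraph",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy GPU packages; it does not verify that the installed versions are compatible.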
| 84 | + |
| 85 | + |
| 86 | +## Download ## |
| 87 | +### Model Checkpoints ### |
| 88 | + |
| 89 | +Quick start guide to load scCello checkpoint: |
| 90 | +* for zero-shot inference tasks |
| 91 | +```python |
| 92 | +from sccello.src.model_prototype_contrastive import PrototypeContrastiveForMaskedLM |
| 93 | +
|
| 94 | +model = PrototypeContrastiveForMaskedLM.from_pretrained("katarinayuan/scCello-zeroshot", output_hidden_states=True) |
| 95 | +``` |
| 96 | + |
| 97 | +* for linear probing tasks (see details in sccello/script/run_cell_type_classification.py) |
| 98 | +```python |
| 99 | +from sccello.src.model_prototype_contrastive import PrototypeContrastiveForSequenceClassification |
| 100 | +
|
| 101 | +model_kwargs = { |
| 102 | + "num_labels": NUM_LABELS, # number of labels for classification |
| 103 | + "total_logging_steps": training_cfg["logging_steps"] * training_args.gradient_accumulation_steps, |
| 104 | +} |
| 105 | +
|
| 106 | +model = PrototypeContrastiveForSequenceClassification.from_pretrained("katarinayuan/scCello-zeroshot", **model_kwargs) |
| 107 | +``` |
| 108 | +### Pre-training and Downstream Datasets ### |
| 109 | +For downstream tasks, in-distribution (ID) data $D^{id}$ and out-of-distribution (OOD) data across cell types $\{D_i^{ct}\}_{i\in\{1,2\}}$, tissues $\{D_i^{ts}\}_{i\in\{1,2\}}$ and donors $\{D_i^{dn}\}_{i\in\{1,2\}}$ are summarized below (see App. B for data preprocessing details). |
| 110 | + |
| 111 | + |
| 112 | +```python |
| 113 | +# Note: some datasets are extremely large. To change the data caching directory (default: "~/.cache/huggingface/datasets/"), run the following shell command before starting Python: |
| 114 | +# export HF_DATASETS_CACHE="/path/to/another/directory/datasets" |
| 115 | +from datasets import load_dataset |
| 116 | +from sccello.src.utils import data_loading |
| 117 | +
|
| 118 | +# pre-training data & D^{id} |
| 119 | +train_dataset = load_dataset("katarinayuan/scCello_pretrain_unsplitted")["train"] |
| 120 | +train_dataset, indist_test_data = train_dataset.train_test_split(test_size=0.001, seed=237) # seed used in scCello |
| 121 | +
|
| 122 | +# D_1^{ct} & D_2^{ct} |
| 123 | +d1_ct, d2_ct = data_loading.get_fracdata("celltype", "frac100", False, False) |
| 124 | +
|
| 125 | +# D_1^{ts} & D_2^{ts} |
| 126 | +d1_ts, d2_ts = data_loading.get_fracdata("tissue", "frac100", False, False) |
| 127 | +
|
| 128 | +# D_1^{dn} & D_2^{dn} |
| 129 | +d1_dn, d2_dn = data_loading.get_fracdata("donor", "frac100", False, False) |
| 130 | +
|
| 131 | +``` |
| 132 | + |
| 133 | +### Example h5ad data ### |
| 134 | +Example data for transforming the h5ad format to the Hugging Face format is provided below. |
| 135 | +To build the pre-training and downstream datasets, we downloaded a series of human h5ad data from [CellxGene](https://chanzuckerberg.github.io/cellxgene-census/). |
| 136 | +```bash |
| 137 | +pip install gdown |
| 138 | +cd ./data/example_h5ad/ |
| 139 | +gdown https://drive.google.com/uc?id=1UsbkhmZwSDWTgY4die60fHvzL_FnXtWE |
| 140 | +``` |
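To confirm that the download landed where the transformation script is expected to look, the directory can be listed from Python. This is a minimal sketch using only the standard library; the helper name `list_h5ad_files` is hypothetical and not part of scCello, and the directory is the one used in the command above.

```python
from pathlib import Path


def list_h5ad_files(directory):
    """Return the .h5ad files found directly inside `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.h5ad"))


if __name__ == "__main__":
    # Directory taken from the download command above.
    for path in list_h5ad_files("./data/example_h5ad/"):
        size_mb = path.stat().st_size / 1e6
        print(f"{path.name}: {size_mb:.1f} MB")
```

An empty listing suggests the `gdown` step did not complete or was run from a different working directory.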
| 141 | + |
| 142 | +## Usage ## |
| 143 | +The `sccello/script` folder contains all executable files. |
| 144 | + |
| 145 | +General configurations: |
| 146 | +``` |
| 147 | +pretrained_ckpt=katarinayuan/scCello-zeroshot |
| 148 | +output_dir=/path/to/output_dir/ |
| 149 | +wandb_run_name=test |
| 150 | +``` |
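When several downstream scripts are run with the same settings, it can be convenient to assemble the commands programmatically. The sketch below is an illustration, not part of scCello: `build_command` and `CONFIG` are hypothetical names, the output directory is a placeholder, and only script paths and flags that appear in this README are used.

```python
import shlex

# General configuration shared by the downstream scripts.
CONFIG = {
    "pretrained_ckpt": "katarinayuan/scCello-zeroshot",
    "output_dir": "/path/to/output_dir/",  # placeholder, replace with a real path
    "wandb_run_name": "test",
}


def build_command(script, launcher="python", **flags):
    """Build an argument list of the form: launcher script --name value ..."""
    cmd = [launcher, script]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_command("./sccello/script/run_cell_type_clustering.py", **CONFIG)
    # Print the shell-quoted command line for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the script; extra flags such as `training_type="linear_probing"` are added as keyword arguments in the same way.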
| 151 | + |
| 152 | +### h5ad Data Format Transformation ### |
| 153 | + |
| 154 | +``` |
| 155 | +python ./sccello/script/run_data_transformation.py |
| 156 | +``` |
| 157 | + |
| 158 | +### Downstream Generalization ### |
| 159 | + |
| 160 | +#### Cell Type Clustering & Batch Integration #### |
| 161 | + |
| 162 | +``` |
| 163 | +python ./sccello/script/run_cell_type_clustering.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --output_dir $output_dir |
| 164 | +``` |
| 165 | + |
| 166 | + |
| 167 | +#### Cell Type Classification #### |
| 168 | +``` |
| 169 | +# Linear Probing |
| 170 | +training_type=linear_probing |
| 171 | +# or Train from Scratch without Loading the Pre-trained Model |
| 172 | +# training_type=from_scratch_linear |
| 173 | +
|
| 174 | +torchrun ./sccello/script/run_cell_type_classification.py --pretrained_ckpt $pretrained_ckpt --training_type $training_type --wandb_run_name $wandb_run_name --further_downsample 0.01 --output_dir $output_dir |
| 175 | +``` |
| 176 | + |
| 177 | +#### Novel Cell Type Classification #### |
| 178 | +``` |
| 179 | +python ./sccello/script/run_novel_cell_type_classification.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --indist_repr_path ./embedding_storage/cellreprs_indist_frac_celltype_data1.pkl --output_dir $output_dir |
| 180 | +``` |
| 181 | + |
| 182 | +### Downstream Transferability ### |
| 183 | +#### Marker Gene Prediction #### |
| 184 | +``` |
| 185 | +python ./sccello/script/run_marker_gene_prediction.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --output_dir $output_dir |
| 186 | +``` |
| 187 | + |
| 188 | +#### Cancer Drug Response Prediction #### |
| 189 | +``` |
| 190 | +python ./sccello/script/run_cancer_drug_response.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name |
| 191 | +``` |
| 192 | + |
| 193 | +### Pre-training ### |
| 194 | +``` |
| 195 | +python -m torch.distributed.run --nproc_per_node=1 ./sccello/script/run_pretrain_prototype_contrastive.py --wandb_run_name pretrain_test |
| 196 | +``` |
| 197 | + |
| 198 | +## Citation ## |
| 199 | + |
| 200 | +If you find this codebase useful in your research, please cite the original papers. |
| 201 | + |
| 202 | +The main scCello paper: |
| 203 | + |
| 204 | +```bibtex |
| 205 | +@inproceedings{yuancell, |
| 206 | + title={Cell ontology guided transcriptome foundation model}, |
| 207 | + author={Yuan, Xinyu and Zhan, Zhihao and Zhang, Zuobai and Zhou, Manqi and Zhao, Jianan and Han, Boyu and Li, Yue and Tang, Jian}, |
| 208 | + booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems} |
| 209 | +} |
| 210 | +``` |