Commit ef57562

camera ready version
1 parent 37af7b7 commit ef57562

279 files changed

+1017277
-1
lines changed

Diff for: README.md

+210-1
@@ -1 +1,210 @@
1-
# scCello
1+
<div align="center">
2+
3+
# scCello: Cell-ontology Guided Transcriptome Foundation Model
4+
5+
[![pytorch](https://img.shields.io/badge/PyTorch_2.5+-ee4c2c?logo=pytorch&logoColor=white)](https://pytorch.org/get-started/locally/)
6+
[![scCello arxiv](http://img.shields.io/badge/arxiv-2408.12373-yellow.svg)](https://arxiv.org/abs/2408.12373)
7+
[![HuggingFace Hub](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-black)](https://huggingface.co/collections/katarinayuan/sccello-67a01b6841f3658ba443c58a)
8+
![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)
9+
10+
</div>
11+
12+
PyG implementation of [scCello], a cell-ontology guided transcriptome foundation model (TFM) for single-cell RNA-seq data. Authored by [Xinyu Yuan] and [Zhihao Zhan].
13+
14+
[Xinyu Yuan]: https://github.com/KatarinaYuan
15+
[Zhihao Zhan]: https://github.com/zhan8855
16+
[scCello]: https://github.com/DeepGraphLearning/scCello
17+
18+
## Overview ##
19+
20+
scCello enhances transcriptome foundation models (TFMs) by integrating cell ontology graphs into pre-training, addressing the limitation of treating cells as independent entities. By incorporating two cell-level objectives, a **cell-type coherence loss** and an **ontology alignment loss**, scCello demonstrates superior or competitive generalization and transferability over existing TFMs on biologically important tasks, including identifying novel cell types among unseen cells, predicting cell-type-specific marker genes, and predicting cancer drug responses.
21+
22+
This repository is based on PyTorch 2.0 and Python 3.9.
23+
24+
![Main Method](asset/main_method_sccello.png)
25+
26+
Table of contents:
27+
* [Features](#features)
28+
* [Updates](#updates)
29+
* [Installation](#installation)
30+
* [Download](#download)
31+
  * [Model checkpoints](#model-checkpoints)
32+
  * [Pre-training and downstream datasets](#pre-training-and-downstream-datasets)
33+
  * [Example h5ad data](#example-h5ad-data)
34+
* [Usage](#usage)
35+
  * [h5ad data format transformation](#h5ad-data-format-transformation)
36+
  * [Downstream generalization](#downstream-generalization)
37+
    * [Cell type clustering & batch integration](#cell-type-clustering--batch-integration)
38+
    * [Cell type classification](#cell-type-classification)
39+
    * [Novel cell type classification](#novel-cell-type-classification)
40+
  * [Downstream transferability](#downstream-transferability)
41+
    * [Marker gene prediction](#marker-gene-prediction)
42+
    * [Cancer drug response prediction](#cancer-drug-response-prediction)
43+
  * [Pre-training](#pre-training)
44+
* [Citation](#citation)
45+
46+
## Features ##
47+
* **Cell-type Specific Learning**: Utilizes cell-type coherence loss to learn specific gene expression patterns relevant to each cell type.
48+
* **Ontology-aware Modeling**: Employs ontology alignment loss to understand and preserve the hierarchical relationships among different cell types.
49+
* **Large-scale Pre-training**: Trained on over 22 million cells from the CellxGene database, ensuring robust and generalizable models.
50+
* **Advanced Generalization and Transferability**: Demonstrates superior performance on various biologically significant tasks such as identifying novel cell types and predicting cell-type-specific marker genes.
51+
52+
53+
54+
## Updates ##
55+
* **Feb 5th, 2025**: scCello code released!
56+
* **Oct 1st, 2024**: scCello was accepted at NeurIPS 2024!
57+
* **Aug 22nd, 2024**: scCello preprint released on arXiv!
58+
59+
## Installation ##
60+
61+
You may install the dependencies via the following bash commands.
62+
63+
```bash
64+
conda install pytorch==2.0.1 pytorch-cuda=11.7 -c pytorch -c nvidia
65+
pip install transformers[torch]
66+
pip install easydict
67+
pip install psutil
68+
pip install wandb
69+
pip install pytz
70+
pip install ipdb
71+
pip install pandas
72+
pip install datasets
73+
pip install torchmetrics
74+
pip install rdflib
75+
pip install hickle
76+
pip install anndata
77+
pip install scikit-learn
78+
pip install scanpy
79+
pip install scib
80+
conda install -c conda-forge cupy
81+
conda install rapidsai::cuml
82+
conda install -c rapidsai -c conda-forge -c nvidia cugraph
83+
```
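
After installation, a quick check that every dependency can be located may save a failed run later. The snippet below is a minimal sketch, not part of the scCello codebase; the import names in `REQUIRED_MODULES` are assumptions, since a package's install name can differ from its import name (for example, `scikit-learn` is imported as `sklearn`).

```python
import importlib.util

# Import names are assumptions: a pip/conda package name can differ from its
# import name (for example, scikit-learn is imported as "sklearn").
REQUIRED_MODULES = [
    "torch", "transformers", "easydict", "psutil", "wandb", "pytz", "ipdb",
    "pandas", "datasets", "torchmetrics", "rdflib", "hickle", "anndata",
    "sklearn", "scanpy", "scib", "cupy", "cuml", "cugraph",
]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("All dependencies found." if not missing else f"Missing: {missing}")
```

The check only locates modules without importing them, so it does not verify versions or GPU availability.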
84+
85+
86+
## Download ##
87+
### Model Checkpoints ###
88+
89+
Quick start guide to load scCello checkpoint:
90+
* for zero-shot inference tasks
91+
```
92+
from sccello.src.model_prototype_contrastive import PrototypeContrastiveForMaskedLM
93+
94+
model = PrototypeContrastiveForMaskedLM.from_pretrained("katarinayuan/scCello-zeroshot", output_hidden_states=True)
95+
```
96+
97+
* for linear probing tasks (see details in `sccello/script/run_cell_type_classification.py`)
98+
```
99+
from sccello.src.model_prototype_contrastive import PrototypeContrastiveForSequenceClassification
100+
101+
model_kwargs = {
102+
    "num_labels": NUM_LABELS, # number of labels for classification
103+
    "total_logging_steps": training_cfg["logging_steps"] * training_args.gradient_accumulation_steps,
104+
}
105+
106+
model = PrototypeContrastiveForSequenceClassification.from_pretrained("katarinayuan/scCello-zeroshot", **model_kwargs)
107+
```
108+
### Pre-training and Downstream Datasets ###
109+
For downstream evaluation, the in-distribution (ID) data $D^{id}$ and the out-of-distribution (OOD) data across cell types $\{D_i^{ct}\}_{i\in\{1,2\}}$, tissues $\{D_i^{ts}\}_{i\in\{1,2\}}$, and donors $\{D_i^{dn}\}_{i\in\{1,2\}}$ can be loaded as follows (see App. B of the paper for data preprocessing details).
110+
111+
112+
```
113+
# Note: some datasets are extremely large. To change the data caching directory (default: "~/.cache/huggingface/datasets/"), run the following in your shell before starting Python:
114+
#   export HF_HOME="/path/to/another/directory/datasets"
115+
116+
from datasets import load_dataset
from sccello.src.utils import data_loading
117+
118+
# pre-training data & D^{id}
119+
train_dataset = load_dataset("katarinayuan/scCello_pretrain_unsplitted")["train"]
120+
# train_test_split returns a DatasetDict with "train" and "test" splits
split = train_dataset.train_test_split(test_size=0.001, seed=237) # seed used in scCello
train_dataset, indist_test_data = split["train"], split["test"]
121+
122+
# D_1^{ct} & D_2^{ct}
123+
d1_ct, d2_ct = data_loading.get_fracdata("celltype", "frac100", False, False)
124+
125+
# D_1^{ts} & D_2^{ts}
126+
d1_ts, d2_ts = data_loading.get_fracdata("tissue", "frac100", False, False)
127+
128+
# D_1^{dn} & D_2^{dn}
129+
d1_dn, d2_dn = data_loading.get_fracdata("donor", "frac100", False, False)
130+
131+
```
132+
133+
### Example h5ad data ###
134+
Example data for converting the h5ad format to the Hugging Face format.
135+
To build the pre-training and downstream datasets, we downloaded a series of human h5ad files from [CellxGene](https://chanzuckerberg.github.io/cellxgene-census/). The example data can be downloaded as follows:
136+
```bash
137+
pip install gdown
138+
cd ./data/example_h5ad/
139+
gdown https://drive.google.com/uc?id=1UsbkhmZwSDWTgY4die60fHvzL_FnXtWE
140+
```
141+
142+
## Usage ##
143+
The `sccello/script` folder contains all executable files.
144+
145+
General configurations:
146+
```
147+
pretrained_ckpt=katarinayuan/scCello-zeroshot
148+
output_dir=/path/to/single_cell_output/  # set to your own output directory
149+
wandb_run_name=test
150+
```
151+
152+
### h5ad Data Format Transformation ###
153+
154+
```
155+
python ./sccello/script/run_data_transformation.py
156+
```
157+
158+
### Downstream Generalization ###
159+
160+
#### Cell Type Clustering & Batch Integration ####
161+
162+
```
163+
python ./sccello/script/run_cell_type_clustering.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --output_dir $output_dir
164+
```
165+
166+
167+
#### Cell Type Classification ####
168+
```
169+
# Linear Probing
170+
training_type=linear_probing
171+
# or Train from Scratch without Loading the Pre-trained Model
172+
# training_type=from_scratch_linear
173+
174+
torchrun ./sccello/script/run_cell_type_classification.py --pretrained_ckpt $pretrained_ckpt --training_type $training_type --wandb_run_name $wandb_run_name --further_downsample 0.01 --output_dir $output_dir
175+
```
176+
177+
#### Novel Cell Type Classification ####
178+
```
179+
python ./sccello/script/run_novel_cell_type_classification.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --indist_repr_path ./embedding_storage/cellreprs_indist_frac_celltype_data1.pkl --output_dir $output_dir
180+
```
181+
182+
### Downstream Transferability ###
183+
#### Marker Gene Prediction ####
184+
```
185+
python ./sccello/script/run_marker_gene_prediction.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --output_dir $output_dir
186+
```
187+
188+
#### Cancer Drug Response Prediction ####
189+
```
190+
python ./sccello/script/run_cancer_drug_response.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name
191+
```
192+
193+
### Pre-training ###
194+
```
195+
python -m torch.distributed.run --nproc_per_node=1 ./sccello/script/run_pretrain_prototype_contrastive.py --wandb_run_name pretrain_test
196+
```
197+
198+
## Citation ##
199+
200+
If you find this codebase useful in your research, please cite the original paper.
201+
202+
The main scCello paper:
203+
204+
```bibtex
205+
@inproceedings{yuancell,
206+
title={Cell ontology guided transcriptome foundation model},
207+
author={Yuan, Xinyu and Zhan, Zhihao and Zhang, Zuobai and Zhou, Manqi and Zhao, Jianan and Han, Boyu and Li, Yue and Tang, Jian},
208+
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
209+
}
210+
```

Diff for: asset/main_method_sccello.png

381 KB

Diff for: data/.DS_Store

8 KB
Binary file not shown.

Diff for: data/README.md

+28
@@ -0,0 +1,28 @@
1+
# scCello Processed Data
2+
3+
## Pretraining Dataset
4+
- The [scCello pretraining dataset](https://huggingface.co/datasets/katarinayuan/scCello_pretrain_unsplitted) is processed from the [CellxGene census LTS release 2023-07-25](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html). We select all primary data sequenced with 10x protocols from non-cancer human cells. See **App. B Data Preprocessing Details** in the paper for details.
5+
6+
## Gene Token Vocabulary
7+
- [token_vocabulary/token_dictionary.pkl](token_vocabulary/token_dictionary.pkl): We use Geneformer's gene vocabulary. The vocabulary has 25424 gene Ensembl IDs plus 3 special tokens, "pad", "mask" and "cls" (total vocab size 25427).
8+
- Matching Ensembl IDs with gene names using `biomart`:
9+
  - Matched names: 25137 genes ([token_vocabulary/vocab_id2name.csv](token_vocabulary/vocab_id2name.csv))
10+
  - Unmatched names: 291 genes ([token_vocabulary/vocab_ids_notFoundName.csv](token_vocabulary/vocab_ids_notFoundName.csv))
11+
- [token_vocabulary/gene_median_dictionary.pkl](token_vocabulary/gene_median_dictionary.pkl): Non-zero median expression value of each detected gene across all cells, used for Geneformer-like gene-wise normalization.
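
To illustrate how the median dictionary and the token dictionary fit together, here is a minimal sketch of Geneformer-style rank value encoding. It assumes scCello follows the same scheme (divide each gene's expression by its corpus-wide non-zero median, then order genes by the result); the function name, the toy gene IDs, and the `max_len` default are hypothetical and not taken from the scCello code.

```python
def rank_value_encode(counts, gene_medians, token_ids, max_len=2048):
    """Order a cell's expressed genes by median-normalized expression.

    counts:       {gene_id: raw count} for one cell
    gene_medians: {gene_id: non-zero median expression across the corpus}
    token_ids:    {gene_id: integer token}
    """
    total = sum(counts.values())
    if total == 0:
        return []
    scored = []
    for gene, count in counts.items():
        # Skip unexpressed genes and genes outside the vocabulary.
        if count <= 0 or gene not in gene_medians or gene not in token_ids:
            continue
        # Per-cell scaling by `total` does not change the within-cell order;
        # dividing by the gene's median is what re-ranks the genes.
        normalized = (count / total) / gene_medians[gene]
        scored.append((normalized, gene))
    # Highest normalized expression first; ties broken by gene id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token_ids[gene] for _, gene in scored[:max_len]]


# Toy data with made-up gene ids, medians, and tokens.
toy_counts = {"GENE_A": 10, "GENE_B": 5, "GENE_C": 0, "GENE_D": 3}
toy_medians = {"GENE_A": 10.0, "GENE_B": 1.0, "GENE_C": 2.0, "GENE_D": 6.0}
toy_tokens = {"GENE_A": 3, "GENE_B": 4, "GENE_C": 5, "GENE_D": 6}
print(rank_value_encode(toy_counts, toy_medians, toy_tokens))  # [4, 3, 6]
```

In the toy cell, `GENE_A` has the highest raw count but ranks below `GENE_B`, because its high corpus-wide median marks it as broadly expressed rather than distinctive for this cell.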
12+
13+
## Cell Type Label
14+
- [new_pretrain/general_CLid2cellname.pkl](new_pretrain/general_CLid2cellname.pkl): Associates textual cell types used in pre-training with their cell type lineage ID (CLID).
15+
- [new_pretrain/pretrain_frac100_clid2name.pkl](new_pretrain/pretrain_frac100_clid2name.pkl): Maps CLID to cell type label indices used in pre-training.
16+
- [new_pretrain/pretrain_frac100_cell_type_idmap.pkl](new_pretrain/pretrain_frac100_cell_type_idmap.pkl): Associates textual cell types used in pre-training with their cell type label indices. Note that this file is not consistent with [new_pretrain/general_CLid2cellname.pkl](new_pretrain/general_CLid2cellname.pkl) and [new_pretrain/pretrain_frac100_clid2name.pkl](new_pretrain/pretrain_frac100_clid2name.pkl). Its dict keys are used to build the correct mapping, which can be obtained from `get_prestored_data` in `sccello/src/utils/data_loading.py`.
17+
18+
## Cell Ontology Graph
19+
- [cell_taxonomy/cl.owl](cell_taxonomy/cl.owl): Cell ontology graph obtained from [Cell Ontology](https://bioportal.bioontology.org/ontologies/CL).
20+
- [cell_taxonomy/celltype_relationship.json](cell_taxonomy/celltype_relationship.json): A simpler version of cell ontology that adopts a tree structure for subclass relationships obtained from the authors of [Cell Taxonomy](https://ngdc.cncb.ac.cn/celltaxonomy/). Note that we only use this data to associate textual cell types with their cell type lineage ID (CLID), since we are using the graph version of cell ontology.
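
As a rough illustration of what subclass relationships in an ontology encode, the sketch below walks toy "is-a" edges to collect a cell type's ancestors and compares two cell types by the overlap of their ancestor sets. The edges are hand-written rather than parsed from `cl.owl`, and the Jaccard overlap is only an illustrative proxy; it is not the similarity measure used by scCello.

```python
from collections import deque

# Toy is-a edges (child -> parents); hypothetical, not read from cl.owl.
# In a graph-structured ontology a cell type may list several parents.
PARENTS = {
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
    "neuron": ["cell"],
    "cell": [],
}


def ancestors(cell_type, parents):
    """Return every cell type reachable from `cell_type` via is-a edges."""
    seen = set()
    queue = deque(parents.get(cell_type, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents.get(node, []))
    return seen


def shared_ancestor_ratio(a, b, parents):
    """Jaccard overlap of ancestor sets, counting each type as its own ancestor."""
    set_a = ancestors(a, parents) | {a}
    set_b = ancestors(b, parents) | {b}
    return len(set_a & set_b) / len(set_a | set_b)


print(sorted(ancestors("T cell", PARENTS)))
print(shared_ancestor_ratio("T cell", "B cell", PARENTS))
print(shared_ancestor_ratio("T cell", "neuron", PARENTS))
```

With these toy edges, T cells and B cells share most of their lineage and score higher than T cells and neurons, which only meet at the root.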
21+
22+
23+
## Marker Gene Label
24+
- [marker_gene/cellmarkers.tsv](marker_gene/cellmarkers.tsv): All cell types with their marker genes obtained from [Cell Marker](http://xteam.xbio.top/CellMarker/download/Human_cell_markers.txt) and [PanglaoDB](https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz).
25+
- [marker_gene/celllabel2typesquality.csv](marker_gene/celllabel2typesquality.csv): Aligns cell labels provided in downstream datasets to cell types.
26+
27+
## Cancer Drug Response
28+
- Data adapted from the [DeepCDR repo](https://github.com/kimmo1019/DeepCDR/tree/master/data).
