Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes
Abstract:
Histopathological images provide the definitive source of cancer diagnosis, containing information used by pathologists to identify and subclassify malignant disease, and to guide therapeutic choices. These images contain vast amounts of information, much of which is currently unavailable to human interpretation. Supervised deep learning approaches have been powerful for classification tasks, but they are inherently limited by the cost and quality of annotations. Therefore, we developed Histomorphological Phenotype Learning, an unsupervised methodology, which requires no annotations and operates via the self-discovery of discriminatory image features in small image tiles. Tiles are grouped into morphologically similar clusters which appear to represent recurrent modes of tumor growth emerging under natural selection. These clusters have distinct features which can be identified using orthogonal methods. Applied to lung cancer tissues, we show that they align closely with patient outcomes, with histopathologically recognised tumor types and growth patterns, and with transcriptomic measures of immunophenotype.
@misc{QuirosCoudray2022,
title={Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes},
author={Adalberto Claudio Quiros and Nicolas Coudray and Anna Yeaton and Xinyu Yang and Luis Chiriboga and Afreen Karimkhan and Navneet Narula and Harvey Pass and Andre L. Moreira and John Le Quesne and Aristotelis Tsirigos and Ke Yuan},
year={2022},
eprint={2205.01931},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Slides summarizing methodology and results:
In this repository you will find the following sections:
- WSI tiling process: Instructions on how to create H5 files from WSI tiles.
- Workspace setup: Details on H5 file content and directory structure.
- HPL instructions: Step-by-step instructions on how to run the complete methodology.
  - Self-supervised Barlow Twins training.
  - Tile vector representations.
  - Combination of all sets into one H5.
  - Fold cross validation files.
  - Include metadata in H5 file.
  - Leiden clustering.
  - Removing background tiles.
  - Logistic regression for lung type WSI classification.
  - Cox proportional hazards for survival regression.
  - Correlation between annotations and clusters.
  - Get tiles and WSI samples for HPCs.
- Frequently Asked Questions.
- TCGA HPL files: HPL output files of paper results.
- Dockers: Docker environments to run HPL steps.
- Python Environment: Python version and packages.
This step divides whole slide images (WSIs) into 224x224 tiles and stores them in H5 files. At the end of this step, you should have three H5 files, one each for the training, validation, and test sets. The training set is used to train the self-supervised CNN; in our work this corresponded to 60% of the TCGA LUAD & LUSC WSIs.
We used the framework provided in Coudray et al., 'Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning', Nature Medicine, 2018. The steps to run the framework are 0.1, 0.2.a, and 4 (end of readme). In our work we used Reinhard normalization, which can be applied at the same time as the tiling through the '-N' option in step 0.1.
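For orientation, the following is a minimal sketch of what this step produces, using openslide and h5py directly rather than the DeepPATH commands above; the function name, the level-0 read, and the absence of background filtering and Reinhard normalization are simplifications, not the repository's actual implementation:

```python
import h5py
import numpy as np
import openslide

def tile_wsi_to_h5(wsi_path, h5_path, tile_size=224, set_name="train"):
    """Cut a WSI into non-overlapping tiles and store them in an H5 file."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.dimensions  # level-0 size; the paper tiles at 5x
    tiles = []
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            region = slide.read_region((x, y), 0, (tile_size, tile_size))
            tiles.append(np.asarray(region.convert("RGB")))
    with h5py.File(h5_path, "w") as f:
        # Dataset names follow the 'set_labelname' convention described below.
        f.create_dataset(f"{set_name}_img", data=np.stack(tiles))
```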
This section specifies the H5 file content and directory structure required to run the pipeline.
In the instructions below we use the following variables and names:
- dataset_name: TCGAFFPE_LUADLUSC_5x_60pc
- marker_name: he
- tile_size: 224
If you are not familiar with H5 files, you can find documentation for the h5py Python package here.
This framework assumes that the datasets inside each H5 file follow the naming format 'set_labelname'. In addition, all H5 files are required to contain the same number of datasets. Example:
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
  - Dataset names: train_img, train_tiles, train_slides, train_samples
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_validation.h5
  - Dataset names: valid_img, valid_tiles, valid_slides, valid_samples
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_test.h5
  - Dataset names: test_img, test_tiles, test_slides, test_samples
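A quick way to verify that a file follows this convention is to list its datasets with h5py; a minimal sketch, using the training-set path from the directory conventions below:

```python
import h5py

# Path follows the directory convention described below.
path = ("datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/"
        "hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5")

with h5py.File(path, "r") as f:
    for name, dset in f.items():
        # Every dataset name should carry the 'train_' prefix in this file,
        # and all datasets index the same tiles, so lengths should match.
        print(name, dset.shape, dset.dtype)
```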
The code makes the following assumptions about where datasets, model training outputs, and image representations are stored:
- Datasets:
  - Dataset folder.
  - Structure: datasets/[dataset_name]/[marker_name]/patches_h[tile_size]_w[tile_size]
  - E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224
  - Train, validation, and test sets:
    - Each dataset is assumed to contain at least a training set.
    - Naming convention: hdf5_[dataset_name]_[marker_name]_[set_name].h5
    - E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
- Data_model_output:
  - Output folder for self-supervised trained models.
  - Structure: data_model_output/[model_name]/[dataset_name]/h[tile_size]_w[tile_size]_n3_zdim[latent_space_size]
  - E.g.: data_model_output/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
- Results:
  - Output folder for self-supervised representation results.
  - This folder will contain the representations, clustering data, and logistic/Cox regression results.
  - Structure: results/[model_name]/[dataset_name]/h[tile_size]_w[tile_size]_n3_zdim[latent_space_size]
  - E.g.: results/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
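Since several scripts rely on these conventions, it can help to build the paths programmatically. A small sketch; the helper below is illustrative, not part of the repository:

```python
import os

def dataset_h5_path(dataset_name, marker_name, tile_size, set_name,
                    root="datasets"):
    """Build an H5 path following the dataset directory convention above."""
    patch_dir = f"patches_h{tile_size}_w{tile_size}"
    fname = f"hdf5_{dataset_name}_{marker_name}_{set_name}.h5"
    return os.path.join(root, dataset_name, marker_name, patch_dir, fname)

# Reproduces the training-set example above:
print(dataset_h5_path("TCGAFFPE_LUADLUSC_5x_60pc", "he", 224, "train"))
```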
The pipeline consists of the following steps:
- Self-supervised Barlow Twins training.
- Tile vector representations.
- Combination of all sets into one H5.
- Fold cross validation files.
- Include metadata in H5 file.
- Leiden clustering.
- Removing background tiles.
- Logistic regression for lung type WSI classification.
- Cox proportional hazards for survival regression.
- Correlation between annotations and clusters.
- Get tiles and WSI samples for HPCs.
You can find the full details here.
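To give a feel for the two downstream regression steps, here is a hedged sketch using scikit-learn and lifelines (both pinned in the Python environment below); the input file and column names are assumptions standing in for the per-patient cluster compositions produced by the pipeline:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from lifelines import CoxPHFitter

# Hypothetical input: one row per WSI/patient, columns 'hpc_0'..'hpc_k' with
# the fraction of tiles assigned to each histomorphological phenotype cluster.
df = pd.read_csv("patient_hpc_compositions.csv")
hpc_cols = [c for c in df.columns if c.startswith("hpc_")]

# LUAD vs LUSC classification from cluster compositions.
clf = LogisticRegression(max_iter=1000)
clf.fit(df[hpc_cols], df["lung_type"])

# Cox proportional hazards regression on the same composition features.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df[hpc_cols + ["os_time", "os_event"]],
        duration_col="os_time", event_col="os_event")
cph.print_summary()
```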
You can find TCGA files, results, and commands to reproduce them here. For any questions regarding the New York University cohorts, please address reasonable requests to the corresponding authors.
You can follow the steps on how to assign tiles to existing clusters here. These instructions will assign your data to the same clusters reported in the publication.
When I run the Leiden clustering step, I get a 'TypeError: can't pickle weakref objects' error in some folds.
Based on experience, this error occurs with incompatible versions of numba, umap-learn, and scanpy. The package versions in the Python Environment section should work, but this alternative package combination also works:
scanpy==1.7.1
pynndescent==0.5.0
numba==0.51.2
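To sanity-check whichever package combination you install, a minimal Leiden run over tile vector representations looks roughly like this (parameter values and file names are illustrative, not the paper's configuration; leidenalg must also be installed):

```python
import anndata
import numpy as np
import scanpy as sc

# Hypothetical input: an (n_tiles, 128) array of Barlow Twins tile
# vector representations loaded from the combined H5 file.
reps = np.load("tile_representations.npy")

adata = anndata.AnnData(reps)
sc.pp.neighbors(adata, n_neighbors=250, use_rep="X")  # kNN graph over representations
sc.tl.leiden(adata, resolution=1.0)  # resolution controls cluster granularity
clusters = adata.obs["leiden"].astype(int).values
```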
This section contains the following TCGA files produced by HPL:
- TCGA LUAD & LUSC WSI tile image datasets.
- TCGA Self-supervised trained weights.
- TCGA tile projections.
- TCGA cluster configurations.
- TCGA WSI & patient representations.
For the New York University cohorts, please send reasonable requests to the corresponding authors.
You can find the WSI tile images at:
- LUAD & LUSC 60% Background max
- LUAD & LUSC 60% Background max 250K subsample for self-supervised model training.
Self-supervised model weights:
- Lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) model.
- PanCancer: BRCA, HNSC, KICH, KIRC, KIRP, LUSC, LUAD.
You can find tile projections for TCGA LUAD and LUSC cohorts at the following locations. These are the projections used in the publication results.
- TCGA LUAD & LUSC tile vector representations (background and artifact tiles unfiltered)
- TCGA LUAD & LUSC tile vector representations
You can find cluster configurations used in the publication results at:
You can find WSI and patient vector representations used in the publication results at:
These are the Docker images with the environments to run the steps of HPL. The Leiden clustering step needs to be run with docker [2]; all other steps can be run with docker [1]:
- Self-Supervised models training and projections:
- Leiden clustering:
The code uses Python 3.7.12 and the following packages:
anndata==0.7.8
autograd==1.3
einops==0.3.0
h5py==3.4.0
lifelines==0.26.3
matplotlib==3.5.1
numba==0.52.0
numpy==1.21.2
opencv-python==4.1.0.25
pandas==1.3.3
Pillow==8.1.0
pycox==0.2.2
scanpy==1.8.1
scikit-bio==0.5.6
scikit-image==0.15.0
scikit-learn==0.24.0
scikit-network==0.24.0
scikit-survival==0.16.0
scipy==1.7.1
seaborn==0.11.2
setuptools-scm==6.3.2
simplejson==3.13.2
sklearn==0.0
sklearn-pandas==2.2.0
statsmodels==0.13.0
tensorboard==1.14.0
tensorflow-gpu==1.14.0
tqdm==4.32.2
umap-learn==0.5.0
wandb==0.12.7