Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes
Abstract:
Histopathological images provide the definitive source of cancer diagnosis, containing information used by pathologists to identify and subclassify malignant disease, and to guide therapeutic choices. These images contain vast amounts of information, much of which is currently unavailable to human interpretation. Supervised deep learning approaches have been powerful for classification tasks, but they are inherently limited by the cost and quality of annotations. Therefore, we developed Histomorphological Phenotype Learning, an unsupervised methodology, which requires no annotations and operates via the self-discovery of discriminatory image features in small image tiles. Tiles are grouped into morphologically similar clusters which appear to represent recurrent modes of tumor growth emerging under natural selection. These clusters have distinct features which can be identified using orthogonal methods. Applied to lung cancer tissues, we show that they align closely with patient outcomes, with histopathologically recognised tumor types and growth patterns, and with transcriptomic measures of immunophenotype.
@misc{QuirosCoudray2022,
title={Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes},
author={Adalberto Claudio Quiros and Nicolas Coudray and Anna Yeaton and Xinyu Yang and Luis Chiriboga and Afreen Karimkhan and Navneet Narula and Harvey Pass and Andre L. Moreira and John Le Quesne and Aristotelis Tsirigos and Ke Yuan},
year={2022},
eprint={2205.01931},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Slides summarizing methodology and results:
In this repository you will find the following sections:
- WSI tiling process: Instructions on how to create H5 files from WSI tiles.
- Workspace setup: Details on H5 file content and directory structure.
- HPL instructions: Step-by-step instructions on how to run the complete methodology.
  - Self-supervised Barlow Twins training.
  - Tile vector representations.
  - Combination of all sets into one H5.
  - Fold cross validation files.
  - Include metadata in H5 file.
  - Leiden clustering.
  - Removing background tiles.
  - Logistic regression for lung type WSI classification.
  - Cox proportional hazards for survival regression.
  - Correlation between annotations and clusters.
  - Get tiles and WSI samples for HPCs.
- Frequently Asked Questions.
- TCGA HPL files: HPL output files of paper results.
- Dockers: Docker environments to run HPL steps.
- Python Environment: Python version and packages.
This step divides whole slide images (WSIs) into 224x224 tiles and stores them in H5 files. At the end of this step, you should have three H5 files, one each for the training, validation, and test sets. The training set is used to train the self-supervised CNN; in our work this corresponded to 60% of the TCGA LUAD & LUSC WSIs.
We used the framework provided in Coudray et al., 'Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning', Nature Medicine, 2018. The steps to run the framework are 0.1, 0.2.a, and 4 (end of readme). In our work we used Reinhard normalization, which can be applied at the same time as the tiling through the '-N' option in step 0.1.
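For orientation, the following is a minimal sketch of what this step produces, using openslide and h5py directly rather than the DeepPATH commands above; the function name, the level-0 read, and the absence of background filtering and Reinhard normalization are simplifications, not the repository's actual implementation:

```python
import h5py
import numpy as np
import openslide

def tile_wsi_to_h5(wsi_path, h5_path, tile_size=224, set_name="train"):
    """Cut a WSI into non-overlapping tiles and store them in an H5 file."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.dimensions  # level-0 size; the paper tiles at 5x
    tiles = []
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            region = slide.read_region((x, y), 0, (tile_size, tile_size))
            tiles.append(np.asarray(region.convert("RGB")))
    with h5py.File(h5_path, "w") as f:
        # Dataset names follow the 'set_labelname' convention described below.
        f.create_dataset(f"{set_name}_img", data=np.stack(tiles))
```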
This section specifies the H5 file content and directory structure required to run the pipeline.
In the instructions below we use the following variables and names:
- dataset_name: TCGAFFPE_LUADLUSC_5x_60pc
- marker_name: he
- tile_size: 224
If you are not familiar with H5 files, you can find documentation for the h5py Python package here.
This framework assumes that the datasets inside each H5 file follow the naming format 'set_labelname'. In addition, all H5 files are required to contain the same number of datasets. Example:
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
  - Dataset names: train_img, train_tiles, train_slides, train_samples
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_validation.h5
  - Dataset names: valid_img, valid_tiles, valid_slides, valid_samples
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_test.h5
  - Dataset names: test_img, test_tiles, test_slides, test_samples
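A quick way to verify that a file follows this convention is to list its datasets with h5py; a minimal sketch, using the training-set path from the directory conventions below:

```python
import h5py

# Path follows the directory convention described below.
path = ("datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/"
        "hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5")

with h5py.File(path, "r") as f:
    for name, dset in f.items():
        # Every dataset name should carry the 'train_' prefix in this file,
        # and all datasets index the same tiles, so lengths should match.
        print(name, dset.shape, dset.dtype)
```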
The code makes the following assumptions about where datasets, model training outputs, and image representations are stored:
- Datasets:
  - Dataset folder.
  - Structure: datasets/[dataset_name]/[marker_name]/patches_h[tile_size]_w[tile_size]
  - E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224
  - Train, validation, and test sets:
    - Each dataset is assumed to contain at least a training set.
    - Naming convention: hdf5_[dataset_name]_[marker_name]_[set_name].h5
    - E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
- Data_model_output:
  - Output folder for self-supervised trained models.
  - Structure: data_model_output/[model_name]/[dataset_name]/h[tile_size]_w[tile_size]_n3_zdim[latent_space_size]
  - E.g.: data_model_output/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
- Results:
  - Output folder for self-supervised representation results.
  - This folder will contain the representations, clustering data, and logistic/Cox regression results.
  - Structure: results/[model_name]/[dataset_name]/h[tile_size]_w[tile_size]_n3_zdim[latent_space_size]
  - E.g.: results/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
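Since several scripts rely on these conventions, it can help to build the paths programmatically. A small sketch; the helper below is illustrative, not part of the repository:

```python
import os

def dataset_h5_path(dataset_name, marker_name, tile_size, set_name,
                    root="datasets"):
    """Build an H5 path following the dataset directory convention above."""
    patch_dir = f"patches_h{tile_size}_w{tile_size}"
    fname = f"hdf5_{dataset_name}_{marker_name}_{set_name}.h5"
    return os.path.join(root, dataset_name, marker_name, patch_dir, fname)

# Reproduces the training-set example above:
print(dataset_h5_path("TCGAFFPE_LUADLUSC_5x_60pc", "he", 224, "train"))
```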
The pipeline consists of the following steps:
- Self-supervised Barlow Twins training.
- Tile vector representations.
- Combination of all sets into one H5.
- Fold cross validation files.
- Include metadata in H5 file.
- Leiden clustering.
- Removing background tiles.
- Logistic regression for lung type WSI classification.
- Cox proportional hazards for survival regression.
- Correlation between annotations and clusters.
- Get tiles and WSI samples for HPCs.
You can find the full details here.
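To give a feel for the two downstream regression steps, here is a hedged sketch using scikit-learn and lifelines (both pinned in the Python environment below); the input file and column names are assumptions standing in for the per-patient cluster compositions produced by the pipeline:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from lifelines import CoxPHFitter

# Hypothetical input: one row per WSI/patient, columns 'hpc_0'..'hpc_k' with
# the fraction of tiles assigned to each histomorphological phenotype cluster.
df = pd.read_csv("patient_hpc_compositions.csv")
hpc_cols = [c for c in df.columns if c.startswith("hpc_")]

# LUAD vs LUSC classification from cluster compositions.
clf = LogisticRegression(max_iter=1000)
clf.fit(df[hpc_cols], df["lung_type"])

# Cox proportional hazards regression on the same composition features.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df[hpc_cols + ["os_time", "os_event"]],
        duration_col="os_time", event_col="os_event")
cph.print_summary()
```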
You can find TCGA files, results, and commands to reproduce them here. For any questions regarding the New York University cohorts, please address reasonable requests to the corresponding authors.
You can follow the steps on how to assign tiles to existing clusters here. These instructions will assign your data to the same clusters reported in the publication.
When I run the Leiden clustering step, I get a 'TypeError: can't pickle weakref objects' error in some folds.
Based on experience, this error occurs with incompatible versions of numba, umap-learn, and scanpy. The package versions in the Python Environment section should work, but this alternative package combination also works:
scanpy==1.7.1
pynndescent==0.5.0
numba==0.51.2
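To sanity-check whichever package combination you install, a minimal Leiden run over tile vector representations looks roughly like this (parameter values and file names are illustrative, not the paper's configuration; leidenalg must also be installed):

```python
import anndata
import numpy as np
import scanpy as sc

# Hypothetical input: an (n_tiles, 128) array of Barlow Twins tile
# vector representations loaded from the combined H5 file.
reps = np.load("tile_representations.npy")

adata = anndata.AnnData(reps)
sc.pp.neighbors(adata, n_neighbors=250, use_rep="X")  # kNN graph over representations
sc.tl.leiden(adata, resolution=1.0)  # resolution controls cluster granularity
clusters = adata.obs["leiden"].astype(int).values
```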
This section contains the following TCGA files produced by HPL:
- TCGA LUAD & LUSC WSI tile image datasets.
- TCGA Self-supervised trained weights.
- TCGA tile projections.
- TCGA cluster configurations.
- TCGA WSI & patient representations.
For the New York University cohorts, please send reasonable requests to the corresponding authors.
You can find the WSI tile images at:
- LUAD & LUSC 60% Background max
- LUAD & LUSC 60% Background max 250K subsample for self-supervised model training.
Self-supervised model weights:
- Lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) model.
- PanCancer: BRCA, HNSC, KICH, KIRC, KIRP, LUSC, LUAD.
You can find tile projections for TCGA LUAD and LUSC cohorts at the following locations. These are the projections used in the publication results.
- TCGA LUAD & LUSC tile vector representations (background and artifact tiles unfiltered)
- TCGA LUAD & LUSC tile vector representations
You can find cluster configurations used in the publication results at:
You can find WSI and patient vector representations used in the publication results at:
These are the Docker images with the environments to run the steps of HPL. The Leiden clustering step needs to be run with docker [2]; all other steps can be run with docker [1]:
- Self-Supervised models training and projections:
- Leiden clustering:
The code uses Python 3.7.12 and the following packages:
anndata==0.7.8
autograd==1.3
einops==0.3.0
h5py==3.4.0
lifelines==0.26.3
matplotlib==3.5.1
numba==0.52.0
numpy==1.21.2
opencv-python==4.1.0.25
pandas==1.3.3
Pillow==8.1.0
pycox==0.2.2
scanpy==1.8.1
scikit-bio==0.5.6
scikit-image==0.15.0
scikit-learn==0.24.0
scikit-network==0.24.0
scikit-survival==0.16.0
scipy==1.7.1
seaborn==0.11.2
setuptools-scm==6.3.2
simplejson==3.13.2
sklearn==0.0
sklearn-pandas==2.2.0
statsmodels==0.13.0
tensorboard==1.14.0
tensorflow-gpu==1.14.0
tqdm==4.32.2
umap-learn==0.5.0
wandb==0.12.7