idekerlab/MutationProjector

MutationProjector

MutationProjector is a neural network that translates clinical gene panels into a foundational representation of tumor subtypes. It is a tumor mutation-based foundation model that predicts cancer therapeutic response and metastatic potential, and it incorporates multiple types of molecular interaction networks.

Pre-training MutationProjector

To pre-train MutationProjector, we leveraged large-scale genomic alteration data, histopathology images, and multiple molecular interaction networks. A simplified overview of the approach is shown below:

[Screenshot]

Environment set up

MutationProjector requires the following environment setup:

  • GPU server with CUDA>=11 installed
  • Python >= 3.6
  • Anaconda: conda
  • PyTorch (ver 2.1.2 was used in the manuscript)
  • To install all dependencies, use the command below: conda env create -f conda-envs/env.yml
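
After creating the conda environment, a quick sanity check can confirm that the requirements listed above appear to be met. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it only tests the Python version, whether PyTorch can be found, and whether PyTorch reports a usable CUDA device.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 6)):
    """Return a dict describing whether the basic requirements appear to be met."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
        report["torch_version"] = torch.__version__
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is reported as False on a GPU server, the installed PyTorch build and the CUDA driver are the first things to check.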

Download protein interaction graphs

All of the networks used in this study are available on NDEx (Network Data Exchange). Use the following links to download the networks. Make sure to place all the network files under /data/networks.

Other requirements

  • Calculate tumor mutation burden: use Maftools
  • Calculate aneuploidy: use ASCETS
  • Calculate mutational signatures from targeted gene panels: use MESiCA
  • Calculate mutational signatures from whole exome/genome sequencing: use SigProfiler

Required input files for downstream tasks

Create a folder under /data/downstream_data/train_dataset and/or /data/downstream_data/eval_dataset, depending on your task requirements, and place all of the following tab-delimited files in that folder.

  1. mut.txt
  2. cna.txt
  3. cnd.txt
  4. covariates.txt
  5. [optional] outcomes.txt
    (needed only if further training MutationProjector on a specific task or dataset). Include two columns, sample and outcomes. The outcomes column should contain a binary outcome label (either 0 or 1).

Example files are under the ./data/downstream_data/sample folder.
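
Before running any downstream task, it can help to verify that a dataset folder contains the files listed above. The sketch below is an illustration only and is not part of the repository: validate_dataset is a hypothetical helper, and it assumes that outcomes.txt has a header row and stores the labels as the literal strings 0 and 1.

```python
import csv
from pathlib import Path

REQUIRED_FILES = ["mut.txt", "cna.txt", "cnd.txt", "covariates.txt"]


def validate_dataset(folder):
    """Return a list of problems found in a downstream dataset folder."""
    folder = Path(folder)
    problems = [f"missing file: {name}" for name in REQUIRED_FILES
                if not (folder / name).is_file()]

    outcomes_path = folder / "outcomes.txt"
    if outcomes_path.is_file():  # outcomes.txt is optional
        with outcomes_path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            columns = reader.fieldnames or []
            if "sample" not in columns or "outcomes" not in columns:
                problems.append("outcomes.txt needs 'sample' and 'outcomes' columns")
            else:
                for row in reader:
                    # Assumption: labels are written exactly as 0 or 1.
                    if row["outcomes"] not in ("0", "1"):
                        problems.append(
                            f"non-binary outcome for sample {row['sample']}: "
                            f"{row['outcomes']!r}")
    return problems
```

An empty list means that the required files are present and that any outcomes.txt found has the expected columns and binary labels; the contents of the other files are not inspected.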

Code for generating the input files for TMB, aneuploidy and mutational signatures

All code related to generating the input files for TMB and mutational signatures is available under the ./src folder. To compute aneuploidy, please use ASCETS.

  1. calculate_TMB.R : calculates TMB from MAF (Mutation Annotation Format) files using Maftools
  2. mutation_signatures-compute_SBS.py : computes mutation signatures from MAF files using SigProfiler
  3. mutation_signatures-identify_dominant_signature.py : identifies the dominant mutation signatures
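
To make the TMB quantity concrete, the sketch below counts non-synonymous mutations per sample in a MAF file and divides by the capture size in megabases. It is an illustration only and does not reproduce calculate_TMB.R or Maftools: the function name tmb_per_sample, the capture_size_mb parameter, and the choice of variant classes counted as non-synonymous are assumptions. Only the standard MAF columns Tumor_Sample_Barcode and Variant_Classification are read.

```python
import csv
from collections import Counter

# Assumption: variant classes treated as non-synonymous (standard MAF values).
NONSYNONYMOUS = {
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site",
}


def tmb_per_sample(maf_path, capture_size_mb):
    """Count non-synonymous mutations per sample and divide by the capture size (Mb)."""
    counts = Counter()
    with open(maf_path, newline="") as handle:
        # MAF files may start with '#' comment lines (for example '#version 2.4').
        rows = (line for line in handle if not line.startswith("#"))
        for record in csv.DictReader(rows, delimiter="\t"):
            if record["Variant_Classification"] in NONSYNONYMOUS:
                counts[record["Tumor_Sample_Barcode"]] += 1
    # Samples without any counted mutation do not appear in the result.
    return {sample: n / capture_size_mb for sample, n in counts.items()}
```

For targeted gene panels, capture_size_mb should reflect the genomic footprint of the panel rather than the whole exome.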

Making predictions using the pre-trained MutationProjector

[Screenshot]

To make predictions for your task of interest using the pre-trained MutationProjector, do the following:

  1. Make sure you have all the mut.txt, cna.txt, cnd.txt, covariates.txt and outcomes.txt files under /data/downstream_data/train_dataset/{your_dataset_name} and /data/downstream_data/eval_dataset/{your_dataset_name}
    (please change {your_dataset_name} to the desired name)
  2. Run the model on a GPU server by executing the following in the /src/ folder:

python predict.py
    -downstream_train [name of the downstream dataset to additionally train]
    -downstream_eval [name of the downstream dataset to predict]
    -max_depth [max depth for downstream random forest model] [OPTIONAL]
    -n_estimators [number of estimators for downstream random forest model] [OPTIONAL]
    -o [file output prefix] [OPTIONAL]

  3. Output files: predicted probabilities for each tumor sample. The output file is available at:
`/data/downstream_data/eval_dataset/{your_dataset_name}/TransferLearning_predictions.txt`
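
When predictions are launched from another script, the documented flags can be assembled programmatically. The sketch below is a hypothetical helper, not part of the repository: build_predict_command and the dataset names in the usage note are made up for illustration, and only the flags listed above are used.

```python
def build_predict_command(train_name, eval_name, max_depth=None,
                          n_estimators=None, output_prefix=None):
    """Assemble the argument list for predict.py from the documented flags."""
    command = ["python", "predict.py",
               "-downstream_train", train_name,
               "-downstream_eval", eval_name]
    # Optional flags are appended only when a value is supplied.
    if max_depth is not None:
        command += ["-max_depth", str(max_depth)]
    if n_estimators is not None:
        command += ["-n_estimators", str(n_estimators)]
    if output_prefix is not None:
        command += ["-o", output_prefix]
    return command
```

For example, build_predict_command("my_train_cohort", "my_eval_cohort", max_depth=5) returns a list that can be passed to subprocess.run(command, cwd="src", check=True), assuming the current directory is the repository root.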

Code used for pre-training

MutationProjector is pre-trained using self-supervised learning and weakly supervised learning. The code for pre-training is in /src/pretrain.py.

Cite

Please cite the MutationProjector paper if using this repo:

1. MutationProjector

If using protein interaction graphs or other tools, please cite the papers below:

2. Networks

  • BioPlex: Huttlin, E. L. et al. Dual proteome-scale networks reveal cell-specific remodeling of the human interactome. Cell 184, 3022–3040.e28 (2021)
  • SIGNOR: Lo Surdo, P. et al. SIGNOR 3.0, the SIGnaling network open resource 3.0: 2022 update. Nucleic Acids Res 51, D631–D637 (2023)
  • SignaLink: Csabai, L. et al. SignaLink3: a multi-layered resource to uncover tissue-specific signaling networks. Nucleic Acids Res 50, D701–D709 (2022)
  • TRRUST v2: Han, H. et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res 46, D380–D386 (2018)
  • PhosphoSitePlus: Hornbeck, P. V. et al. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res 40, D261–70 (2012)
  • UbiNet v2.0: Li, Z. et al. UbiNet 2.0: a verified, classified, annotated and updated database of E3 ubiquitin ligase-substrate interactions. Database (Oxford) 2021, (2021)
  • UbiBrowser v2.0: Wang, X. et al. UbiBrowser 2.0: a comprehensive resource for proteome-wide known and predicted ubiquitin ligase/deubiquitinase-substrate interactions in eukaryotic species. Nucleic Acids Res 50, D719–D728 (2022)
  • ISLE: Lee, J. S. et al. Harnessing synthetic lethality to predict the response to cancer treatment. Nat Commun 9, 2546 (2018)
  • SynLethDB v2.0: Wang, J. et al. SynLethDB 2.0: a web-based knowledge graph database on synthetic lethality for novel anticancer drug discovery. Database (Oxford) 2022, (2022)
  • DDRAM: Kratz, A. et al. A multi-scale map of protein assemblies in the DNA damage response. Cell Syst 14, 447–463.e8 (2023)
  • PCNet v1.3: Huang, J. K. et al. Systematic Evaluation of Molecular Networks for Discovery of Disease Genes. Cell Syst 6, 484–495.e5 (2018)
  • STRING v12: Szklarczyk, D. et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 51, D638–D646 (2023)

3. Network data repository

  • NDEx: Pratt, D. et al. NDEx, the Network Data Exchange. Cell Syst 1, 302–305 (2015)

4. Tumor mutation burden

  • Maftools: Mayakonda, A., Lin, D.-C., Assenov, Y., Plass, C. & Koeffler, H. P. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 28, 1747–1756 (2018)

5. Aneuploidy

  • ASCETS: Spurr, L. F. et al. Quantification of aneuploidy in targeted sequencing data using ASCETS. Bioinformatics 37, 2461–2463 (2021)

6. Mutational signatures (targeted sequencing)

  • MESiCA: Yaacov, A. et al. Cancer mutational signatures identification in clinical assays using neural embedding-based representations. Cell Rep Med 5, 101608 (2024)

7. Mutational signatures (whole exome/genome sequencing)

  • SigProfiler: Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020)
