@AbrahamArellano

This PR adds support for cancer genomics datasets to PyHealth, enabling
variant classification and survival prediction research.

Datasets

  • ClinVarDataset: Variant clinical significance annotations
  • COSMICDataset: Cancer somatic mutations catalogue
  • TCGAPRADDataset: TCGA Prostate Adenocarcinoma multi-omics data

Tasks

  • VariantClassificationClinVar: Classify variants (Pathogenic/Benign/VUS)
  • MutationPathogenicityPrediction: FATHMM-based pathogenicity (COSMIC)
  • CancerSurvivalPrediction: Patient survival outcome prediction
  • CancerMutationBurden: High vs low TMB classification
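To illustrate the kind of labeling CancerMutationBurden performs, here is a minimal sketch of a high vs low TMB split. It is not the code in this PR: the function name `tmb_label`, the constant `TMB_HIGH_THRESHOLD`, and the 10 mutations/Mb cutoff are all assumptions made for the example, and the task's actual threshold and inputs may differ.

```python
from typing import Optional

# Assumed cutoff in mutations per megabase; the PR does not state its threshold.
TMB_HIGH_THRESHOLD = 10.0


def tmb_label(
    mutation_count: int,
    covered_megabases: float,
    threshold: float = TMB_HIGH_THRESHOLD,
) -> Optional[int]:
    """Return 1 for high TMB, 0 for low TMB, or None if TMB cannot be computed."""
    # Guard against a missing or invalid denominator and negative counts.
    if covered_megabases <= 0 or mutation_count < 0:
        return None
    tmb = mutation_count / covered_megabases  # mutations per megabase
    return int(tmb >= threshold)
```

Returning `None` for unusable inputs lets a caller skip the sample instead of assigning it a misleading label.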

Implementation

  • YAML configs following existing PyHealth patterns
  • Helper methods for robust data handling
  • Class constants for ACMG/AMP category mappings
  • Comprehensive docstrings with usage examples

Test Plan

  • 43 unit tests (all passing)

Background

This contribution stems from our cancer genomics benchmarking work:

Prostate-VarBench: A Somatic Variant Calling Benchmark for Prostate Cancer
arXiv: https://arxiv.org/abs/2511.09576
Repository: https://github.com/AbrahamArellano/uiuc-cancer-research

These datasets enable reproducible cancer genomics research within PyHealth.

This PR adds support for cancer genomics data to PyHealth:

Datasets:
- ClinVarDataset: Variant clinical significance annotations
- COSMICDataset: Cancer somatic mutations catalogue
- TCGAPRADDataset: TCGA Prostate Adenocarcinoma multi-omics data

Tasks:
- VariantClassificationClinVar: Predict pathogenic/benign variants
- MutationPathogenicityPrediction: FATHMM-based mutation prediction
- CancerSurvivalPrediction: Patient survival outcome prediction
- CancerMutationBurden: High vs low TMB classification

Features:
- YAML configs for all three datasets
- Helper methods for data cleaning (_safe_float, _extract_genes, etc.)
- Class constants for category mappings (ACMG/AMP guidelines)
- Comprehensive docstrings with examples
- 43 unit tests (all passing)

@jhnwu3 (Collaborator) left a comment

Wow! Thanks for doing this! Can you add a couple doc pages for each of your datasets here? Here's an exemplar with the task and dataset contributions:

https://github.com/sunlabuiuc/PyHealth/pull/392/files

Very nice PR! I'm excited to merge this since I know it's from your guys' paper.

@AbrahamArellano
Author

Hi @jhnwu3, I'm working on it and will push it soon.

- Add RST docs for ClinVarDataset, COSMICDataset, TCGAPRADDataset
- Add RST docs for variant classification and cancer survival tasks
- Update datasets.rst and tasks.rst index files