integrateCNV is a Python package designed to integrate copy number variation (CNV) data from bulk DNA sequencing with single-cell RNA sequencing data to infer single-cell copy number profiles in targeted regions of the genome likely to harbor alterations.
- Integration of bulk DNA-seq CNV calls with scRNA-seq data
- Support for FACETS CNV output format
- Cell-level copy number inference
- Statistical analysis of CNV-expression relationships
- Visualization tools for integrated analysis
You can install integrateCNV using pip:
pip install git+https://github.com/dpeerlab/integrateCNV.git
or poetry:
poetry add git+https://github.com/dpeerlab/integrateCNV.git
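After installing, it can be useful to confirm that the package is visible to the interpreter you plan to use. The sketch below relies only on the standard library; `is_installed` is a hypothetical helper name, and `integratecnv` is the import name used in the usage example in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "integratecnv" is the import name used in the usage example of this README.
    print("integratecnv installed:", is_installed("integratecnv"))
```

If this prints `False`, the install most likely went into a different environment than the one running the script.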
integrateCNV requires three main types of input data:
- Bulk DNA Sequencing CNV Calls
  - Format: FACETS output (_hisens.cncf.txt files)
  - Required columns:
    - ID: Sample identifier (e.g., s_RA19_10_3_s_RA19_10_11_1_hisens)
    - chrom: Chromosome number (e.g., 1)
    - loc.start: Start position (e.g., 13118)
    - loc.end: End position (e.g., 16817418)
    - tcn: Total copy number
    - lcn: Lesser copy number
    - cf: Cellular fraction
  - Additional FACETS columns (optional):
    - seg: Segment identifier
    - num.mark: Number of markers
    - nhet: Number of heterozygous positions
    - cnlr.median: Copy number log-ratio median
    - mafR: Minor allele frequency ratio
    - segclust: Segment cluster
- Gene Annotations
  - Format: Tab-separated BED file
  - Required columns:
    - chromosome: Chromosome identifier (e.g., chr1)
    - start: Gene start position (e.g., 29554)
    - end: Gene end position (e.g., 31109)
    - gene_id: Ensembl gene ID (e.g., ENSG00000243485)
    - gene_name: Gene symbol (e.g., MIR1302-2)
    - strand: Strand direction (+ or -)
- Single-cell RNA Sequencing Data
  - Format: AnnData object (.h5ad)
  - Requirements:
    - Gene expression matrix
    - Gene annotations matching the BED file
    - Cell metadata (optional)
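Before running anything, it may help to check that a FACETS file actually carries the required columns listed above. The following is a minimal sketch using only the standard library, not part of integrateCNV itself. It assumes the file is tab-separated with a header row; `missing_columns` and `REQUIRED_FACETS_COLUMNS` are hypothetical names, and the in-memory example stands in for a real `_hisens.cncf.txt` file.

```python
import csv
import io

# Required FACETS columns as listed in this README.
REQUIRED_FACETS_COLUMNS = ["ID", "chrom", "loc.start", "loc.end", "tcn", "lcn", "cf"]


def missing_columns(handle, required, delimiter="\t"):
    """Return the required column names absent from the header row of `handle`."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Tiny in-memory stand-in for a FACETS file (values are illustrative only).
    example = io.StringIO(
        "ID\tchrom\tloc.start\tloc.end\ttcn\tlcn\n"
        "sample_1\t1\t13118\t16817418\t2\t1\n"
    )
    print(missing_columns(example, REQUIRED_FACETS_COLUMNS))  # prints ['cf']
```

For a file on disk, pass an open handle instead, for example `with open(path, newline="") as fh: missing_columns(fh, REQUIRED_FACETS_COLUMNS)`.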
- Initialize and Load Data

    import integratecnv as cnv
    import pandas as pd

    # Set paths
    gene_annot_tab = "path/to/annotations.gtf.bed"
    cna_dir = "path/to/facets/output/"
- Process CNV Files

    # Find FACETS output files
    cna_paths = cnv.prepare_regions.find_files(cna_dir, "_hisens.cncf.txt")

    # Determine regions that are neutral in all samples from WGS data
    cna_neutral_bed = cnv.prepare_regions.get_neutral_regions(cna_paths)

    # Determine regions that contain alterations from WGS data
    cna_alterations_bed_filtered = cnv.prepare_regions.get_altered_regions(
        cna_paths,
        gene_annot_tab,
        filter_inconsistent=True,
        gene_count_cutoff=20,
    )
    n_altered = cna_alterations_bed_filtered.shape[0]
    print(f'Found {n_altered} altered regions.')
- Map CNVs to Genes

    # Get genes that fall in altered and neutral regions
    cna_gene_map_alt = cnv.prepare_regions.map_gene_to_cna(cna_alterations_bed_filtered, gene_annot_tab)
    cna_gene_map_nm = cnv.prepare_regions.map_gene_to_cna(cna_neutral_bed, gene_annot_tab)
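To make the gene-mapping step more concrete, here is a small standalone sketch of assigning genes to regions by interval overlap. It is an illustration of the general idea, not the implementation of `map_gene_to_cna`. The function names are hypothetical, coordinates are treated as closed intervals (an assumption), and the `chr` prefix is stripped because the examples above show FACETS using `1` while the BED file uses `chr1`. `GENE_B` is a made-up gene used only to show a non-match.

```python
from collections import defaultdict


def normalize_chrom(chrom):
    """Strip an optional 'chr' prefix so '1' and 'chr1' compare equal."""
    chrom = str(chrom)
    return chrom[3:] if chrom.startswith("chr") else chrom


def map_genes_to_regions(regions, genes):
    """Assign each gene to every region it overlaps on the same chromosome.

    regions: iterable of (chrom, start, end) tuples
    genes:   iterable of (chrom, start, end, gene_name) tuples
    Returns a dict mapping each region tuple to a list of gene names.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[normalize_chrom(chrom)].append((chrom, start, end))

    mapping = {region: [] for group in by_chrom.values() for region in group}
    for g_chrom, g_start, g_end, name in genes:
        for region in by_chrom.get(normalize_chrom(g_chrom), []):
            _, r_start, r_end = region
            # Two closed intervals overlap when each starts before the other ends.
            if g_start <= r_end and r_start <= g_end:
                mapping[region].append(name)
    return mapping


if __name__ == "__main__":
    regions = [("1", 13118, 16817418)]
    genes = [
        ("chr1", 29554, 31109, "MIR1302-2"),
        ("chr2", 100, 200, "GENE_B"),  # hypothetical gene on another chromosome
    ]
    print(map_genes_to_regions(regions, genes))
    # prints {('1', 13118, 16817418): ['MIR1302-2']}
```

The nested loop is fine for a handful of regions; for genome-wide annotation tables, an interval index or a sorted sweep would scale better.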
You can then run the entire pipeline with a single call to cnv.score.run_integrateCNV(ad, normal_celltypes, cna_gene_map_alt, cna_gene_map_nm).
The function takes the following arguments:
- ad: AnnData object
- normal_celltypes: List of cell types to use as normal cells
- cna_gene_map_alt: Gene-to-CNV mapping for altered regions
- cna_gene_map_nm: Gene-to-CNV mapping for non-altered regions
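The steps above never show where `ad` comes from, so here is a hedged sketch of one way to load it and make the call. `run_from_h5ad` is a hypothetical wrapper, not part of integrateCNV. It assumes the single-cell data is stored in an `.h5ad` file readable with `anndata.read_h5ad`, and that cell types are given as a list of string labels. What `run_integrateCNV` returns is not documented here, so the wrapper simply passes it through.

```python
def run_from_h5ad(h5ad_path, normal_celltypes, cna_gene_map_alt, cna_gene_map_nm):
    """Load an .h5ad file and pass it to run_integrateCNV (hypothetical helper)."""
    # Check the cheap argument first so mistakes surface before any heavy imports.
    # Assumption: cell types are supplied as a non-empty list of string labels.
    if (
        not isinstance(normal_celltypes, (list, tuple))
        or not normal_celltypes
        or not all(isinstance(label, str) for label in normal_celltypes)
    ):
        raise ValueError("normal_celltypes must be a non-empty list of strings")

    # Imported lazily: both packages are only needed once the call is actually made.
    import anndata
    import integratecnv as cnv

    ad = anndata.read_h5ad(h5ad_path)
    return cnv.score.run_integrateCNV(
        ad, normal_celltypes, cna_gene_map_alt, cna_gene_map_nm
    )
```

A call might look like `run_from_h5ad("path/to/data.h5ad", ["T cell", "B cell"], cna_gene_map_alt, cna_gene_map_nm)`, where the path and the cell type labels are placeholders for your own data.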
Alternatively, you can run the pipeline step by step using the functions described in our example notebook.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use integrateCNV in your research, please cite our paper.