scDECAF: Single-Cell Disentanglement by Canonical Factors

Overview

scDECAF is a statistical learning algorithm for single-cell RNA-seq analysis. It identifies gene signatures, cell states, and transcriptional programs by learning vector representations of gene sets.

Through sparse selection, scDECAF highlights the most biologically relevant programs, improving interpretability of single-cell data.

Installation

Requires R ≥ 4.0.0.

install.packages("devtools")
devtools::install_github("DavisLaboratory/scDECAF")

Quick Start

Here is a minimal runnable example with toy data:

library(scDECAF)

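# Simulated expression matrix (100 genes x 200 cells).
# Raw Poisson counts are used only to keep the example short;
# real input should be log-normalised (see Input Requirements).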
set.seed(123)
x <- matrix(rpois(100*200, lambda = 5), nrow = 100, ncol = 200)
rownames(x) <- paste0("Gene", 1:100)
colnames(x) <- paste0("Cell", 1:200)

# Define toy gene sets
genesetlist <- list(
  Pathway_A = rownames(x)[1:15],
  Pathway_B = rownames(x)[16:30],
  Pathway_C = rownames(x)[31:45]
)

# Highly variable genes (here, all genes for simplicity)
hvg <- rownames(x)

# Dummy 2D embedding (e.g., from PCA/UMAP)
cell_embedding <- matrix(rnorm(200*2), ncol = 2)
rownames(cell_embedding) <- colnames(x)

# Sparse selection of relevant gene sets
selected_gs <- pruneGenesets(
  data = x,
  genesetlist = genesetlist,
  hvg = hvg,
  embedding = cell_embedding,
  min_gs_size = 5,
  lambda = exp(-3)
)

# Build gene–set assignment matrix
target <- genesets2ids(
  x[match(hvg, rownames(x)), ],
  genesetlist[selected_gs]
)

# Compute gene-set scores
ann_res <- scDECAF(
  data = x,
  gs = target,
  hvg = hvg,
  k = 5,
  embedding = cell_embedding,
  n_components = min(2, ncol(target) - 1),
  max_iter = 2,
  thresh = 0.5
)

# Extract per-cell scores
scores <- attributes(ann_res)$raw_scores
# Preview up to the first three columns (fewer may remain after pruning)
head(scores[, seq_len(min(3, ncol(scores))), drop = FALSE])

You can now add scores to your single-cell object (SingleCellExperiment, Seurat, or AnnData) and visualize them per cell.
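
As a minimal sketch of that last step, the per-cell scores can be joined to the embedding by cell name and plotted with base R. The `scores` matrix below is simulated as a stand-in for `attributes(ann_res)$raw_scores`, and its layout (cells in rows, one named column per gene set) is an assumption rather than documented behaviour, so check the dimensions of your own result before adapting this.

```r
set.seed(123)
n_cells <- 200
cell_ids <- paste0("Cell", seq_len(n_cells))

# Dummy 2D embedding, as in the Quick Start
cell_embedding <- matrix(rnorm(n_cells * 2), ncol = 2,
                         dimnames = list(cell_ids, c("Dim1", "Dim2")))

# Stand-in for attributes(ann_res)$raw_scores (assumed layout: cells x gene sets)
scores <- matrix(runif(n_cells * 3), ncol = 3,
                 dimnames = list(cell_ids, paste0("Pathway_", c("A", "B", "C"))))

# Align by cell name rather than by position
common <- intersect(rownames(cell_embedding), rownames(scores))
plot_df <- data.frame(cell_embedding[common, , drop = FALSE],
                      scores[common, , drop = FALSE])

# Colour each cell by its Pathway_A score
pal <- colorRampPalette(c("grey85", "firebrick"))(50)
bins <- cut(plot_df$Pathway_A, breaks = 50, labels = FALSE)
plot(plot_df$Dim1, plot_df$Dim2, col = pal[bins], pch = 16,
     xlab = "Dim1", ylab = "Dim2", main = "Pathway_A score")
```

The same columns of `plot_df` could instead be attached to the cell metadata of a SingleCellExperiment or Seurat object; matching on cell names first, as above, guards against a silent mismatch in cell order.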

Input Requirements

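data: log-normalised single-cell expression matrix (genes in rows, cells in columns)
genesetlist: named list of gene sets, each a character vector of gene names
hvg: character vector of highly variable gene names
embedding: reduced-dimension cell embedding (UMAP, PCA, PHATE, etc.), one row per cell
min_gs_size: minimum gene set size
lambda: shrinkage penalty used in sparse gene-set selection
n_components: number of canonical correlation analysis (CCA) components
k: number of nearest neighbors used for refinement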
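
The requirements above can be checked before running the pipeline. The sketch below is illustrative only: `check_scdecaf_inputs()` is a hypothetical helper, not part of scDECAF; the library-size scaling followed by `log1p` is one common log-normalisation, not necessarily the one used in the manuscript; and applying `min_gs_size` to the overlap between each gene set and `hvg` is an assumption about how the size filter is counted.

```r
# Hypothetical helper (not part of scDECAF): basic consistency checks on the inputs
check_scdecaf_inputs <- function(data, genesetlist, hvg, embedding, min_gs_size = 5) {
  stopifnot(
    is.matrix(data), !is.null(rownames(data)), !is.null(colnames(data)),
    is.list(genesetlist), !is.null(names(genesetlist)),
    all(hvg %in% rownames(data)),
    nrow(embedding) == ncol(data),
    identical(rownames(embedding), colnames(data))
  )
  # Assumption: gene set size is counted after intersecting with the HVGs
  overlap <- vapply(genesetlist, function(gs) sum(gs %in% hvg), integer(1))
  names(overlap)[overlap >= min_gs_size]
}

set.seed(123)
counts <- matrix(rpois(100 * 200, lambda = 5), nrow = 100,
                 dimnames = list(paste0("Gene", 1:100), paste0("Cell", 1:200)))

# One common log-normalisation (an assumption): scale each cell to 10,000 counts, then log1p
x <- log1p(sweep(counts, 2, colSums(counts), "/") * 1e4)

genesetlist <- list(
  Pathway_A = rownames(x)[1:15],
  Pathway_B = rownames(x)[16:30],
  Too_small = rownames(x)[31:33]
)
hvg <- rownames(x)
cell_embedding <- matrix(rnorm(200 * 2), ncol = 2,
                         dimnames = list(colnames(x), NULL))

# Names of the gene sets that pass the size filter
usable <- check_scdecaf_inputs(x, genesetlist, hvg, cell_embedding, min_gs_size = 5)
print(usable)
```

Failing early on mismatched gene or cell names tends to be easier to debug than an error raised deep inside the scoring step.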

Reproducibility

Full analysis notebooks reproducing the manuscript are available in the reproducibility repository.

Citation

If you use scDECAF, please cite:

Hediyehzadeh, Whitfield, et al., Identification of cell types, states and programs by learning gene set representations, bioRxiv (2023). https://doi.org/10.1101/2023.09.08.556842

License

This project is released under the same license as the Davis Laboratory repositories.
