- Introduction
- The Problem
- How FracFixR Works
- Installation
- Dependencies
- Quick Start
- Input Requirements
- Output Description
- Detailed Workflow
- Function Reference
- Test Data Generation
- Examples
- Troubleshooting
- Citation
- Contact
FracFixR is a compositional statistical framework for absolute proportion estimation between fractions in RNA sequencing data. It addresses the fundamental challenge in fractionated RNA-seq experiments where library preparation and sequencing depth obscure the original proportions of RNA fractions.
- Reconstructs original fraction proportions using non-negative linear regression
- Estimates the "lost" unrecoverable fraction not captured in sequencing
- Corrects individual transcript frequencies for accurate abundance estimation
- Performs differential proportion testing between conditions
- Handles any RNA fractionation protocol (polysome profiling, subcellular localization, RNA-protein complexes)
- Polysome profiling experiments
- Subcellular RNA localization studies
- Nuclear-cytoplasmic fractionation
- RNA-protein complex isolation
- Any experiment involving RNA fractionation
In fractionated RNA-seq experiments, several biases compromise data interpretation:
- Library preparation effects: Amplification and sampling introduce variable efficiency across fractions
- Sequencing depth differences: Different fractions may be sequenced at different depths
- Lost material: Some RNA is unrecoverable during fractionation (e.g., degraded RNA, non-fractionated material)
- Compositional nature: Fractions are parts of a whole, requiring specialized statistical treatment
Standard RNA-seq analysis tools (DESeq2, edgeR) are inappropriate because they:
- Assume independent samples rather than compositional data
- Cannot estimate the unobserved "lost" fraction
- Fail to account for global shifts in RNA distribution between conditions
FracFixR uses a mathematical model based on the compositional relationship between whole cell RNA and its fractions:
Total = α₀ + α₁×Fraction1 + α₂×Fraction2 + ... + ε
Where:
- α₀ represents the "lost" fraction (intercept)
- α₁, α₂, ... are fraction weights
- ε is the error term
- Transcript Selection: Identifies informative transcripts in the 70-96% abundance quantile range
- NNLS Regression: Fits non-negative least squares model to estimate fraction weights
- Proportion Estimation: Calculates individual transcript proportions in each fraction
- Count Correction: Normalizes counts to account for library size differences
- Statistical Testing: Performs differential proportion analysis using GLM/Logit/Wald tests
install.packages("https://github.com/Arnaroo/FracFixR/releases/download/v1.0.0/FracFixR_1.0.0.tar.gz",repos = NULL,type = "source")# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
# Install FracFixR from GitHub
devtools::install_github("Arnaroo/FracFixR/CRAN")library(FracFixR)
packageVersion("FracFixR")FracFixR requires the following R packages:
- nnls (≥ 1.4): Non-negative least squares regression
- future.apply (≥ 1.8.1): Parallel processing
- dplyr (≥ 1.0.7): Data manipulation
- tidyr (≥ 1.1.3): Data tidying
- matrixStats (≥ 0.60.0): Efficient matrix operations
- ggplot2 (≥ 3.3.5): Plotting framework
- RColorBrewer (≥ 1.1-2): Color palettes
- EnhancedVolcano (≥ 1.10.0): Volcano plots
- aod (≥ 1.3.1): Beta-binomial regression
- R ≥ 4.0.0
- RAM: Minimum 8GB (16GB recommended for large datasets)
- CPU: Multi-core processor recommended for parallel processing
library(FracFixR)
# Load your data
counts <- read.csv("counts_matrix.csv", row.names = 1)
annotation <- read.csv("annotation.csv")
# Convert to matrix
counts_matrix <- as.matrix(counts)
# Run FracFixR
results <- FracFixR(counts_matrix, annotation)
# Visualize fraction proportions
PlotFractions(results)
# Differential proportion testing
diff_results <- DiffPropTest(results,
Conditions = c("Control", "Treatment"),
Types = "Heavy_Polysome",
Test = "GLM")
# Visualize differential results
PlotComparison(diff_results)A numeric matrix with:
- Rows: Transcripts/genes (must have unique row names)
- Columns: Samples (must match Sample column in annotation)
- Values: Raw read counts (not normalized)
# Example structure
# Sample1 Sample2 Sample3 Sample4
# Gene1 234 567 123 890
# Gene2 456 789 234 567
# Gene3 123 345 567 234A data frame with required columns:
- Sample: Sample identifiers matching column names in count matrix
- Condition: Experimental condition (e.g., "Control", "Treatment")
- Type: Fraction type (must include at least one "Total")
- Replicate: Replicate identifier (e.g., "Rep1", "Rep2")
# Example structure
# Sample Condition Type Replicate
# 1 Sample1 Control Total Rep1
# 2 Sample2 Control Heavy_P Rep1
# 3 Sample3 Treatment Total Rep1
# 4 Sample4 Treatment Heavy_P Rep1- Each condition-replicate combination must have one "Total" sample
- Fraction names must be consistent across replicates
- Minimum 2 replicates per condition recommended for statistical testing
The FracFixR() function returns a list containing:
results <- FracFixR(counts_matrix, annotation)
# Access components:
results$OriginalData # Filtered input count matrix
results$Annotation # Input annotation data
results$Propestimates # Proportion estimates for each transcript
results$NewData # Corrected count matrix
results$Coefficients # Regression coefficients (α values)
results$Fractions # Estimated fraction proportions
results$plots # List of diagnostic plotsThe DiffPropTest() function returns a data frame with:
# Column descriptions:
# transcript - Transcript identifier
# mean_success_cond1 - Mean proportion in condition 1
# mean_success_cond2 - Mean proportion in condition 2
# mean_diff - Difference in proportions (cond2 - cond1)
# log2FC - Log2 fold change
# pval - Raw p-value
# padj - FDR-adjusted p-value# Read count data
counts <- read.csv("polysome_counts.csv", row.names = 1)
counts_matrix <- as.matrix(counts)
# Create annotation
annotation <- data.frame(
Sample = colnames(counts_matrix),
Condition = c(rep("Control", 6), rep("Treatment", 6)),
Type = rep(c("Total", "Light_Poly", "Heavy_Poly"), 4),
Replicate = c(rep("Rep1", 3), rep("Rep2", 3), rep("Rep1", 3), rep("Rep2", 3))
)
# Verify data integrity
all(colnames(counts_matrix) %in% annotation$Sample) # Should be TRUE
any(annotation$Type == "Total") # Should be TRUE# Run with default parameters
results <- FracFixR(MatrixCounts = counts_matrix,
Annotation = annotation)
# The function will:
# 1. Filter transcripts present in Total samples
# 2. Process each condition-replicate combination
# 3. Fit NNLS regression
# 4. Calculate fraction proportions and lost fraction
# 5. Correct individual transcript abundances# Plot fraction proportions
frac_plot <- PlotFractions(results)
ggsave("fraction_proportions.pdf", frac_plot, width = 10, height = 8)
# Access specific results
fraction_data <- results$Fractions
write.csv(fraction_data, "estimated_fractions.csv", row.names = FALSE)# Compare conditions for specific fraction
diff_heavy <- DiffPropTest(NormObject = results,
Conditions = c("Control", "Treatment"),
Types = "Heavy_Poly",
Test = "GLM")
# Compare combined fractions
diff_combined <- DiffPropTest(NormObject = results,
Conditions = c("Control", "Treatment"),
Types = c("Light_Poly", "Heavy_Poly"),
Test = "GLM")
# Filter significant results
sig_transcripts <- diff_heavy[diff_heavy$padj < 0.01, ]# Volcano plot
volcano <- PlotComparison(diff_heavy,
Conditions = c("Control", "Treatment"),
Types = "Heavy_Poly",
cutoff = 50)
ggsave("volcano_plot.pdf", volcano, width = 12, height = 10)
# Export results
write.csv(diff_heavy, "differential_proportion_results.csv", row.names = FALSE)FracFixR(MatrixCounts, Annotation)Parameters:
MatrixCounts: Numeric matrix of raw counts (genes × samples)Annotation: Data frame with Sample, Condition, Type, and Replicate columns
Returns: List containing corrected data and analysis results
Details:
- Automatically detects and uses available CPU cores for parallel processing
- Filters transcripts to those present in Total samples
- Uses 70-96% quantile range for regression transcript selection
DiffPropTest(NormObject, Conditions, Types, Test = c("GLM", "Logit", "Wald"))Parameters:
NormObject: Output from FracFixR()Conditions: Character vector of exactly 2 conditions to compareTypes: Character vector of fraction type(s) to analyzeTest: Statistical test method"GLM": Binomial generalized linear model (most powerful, default)"Logit": Logit transformation test (faster)"Wald": Beta-binomial Wald test (for overdispersed data)
Returns: Data frame with differential proportion results
PlotFractions(FracFixed)Parameters:
FracFixed: Output from FracFixR()
Returns: ggplot2 object showing stacked bar plot of fraction proportions
PlotComparison(DiffPropResult, Conditions = NULL, Types = NULL, cutoff = NULL)Parameters:
DiffPropResult: Output from DiffPropTest()Conditions: Optional character vector for plot titleTypes: Optional character vector for plot subtitlecutoff: Optional y-axis maximum for volcano plot
Returns: Enhanced volcano plot
FracFixR includes a Python script for generating synthetic test data with known ground truth:
pip install numpy pandas# generate_data.py usage
python generate_data.py \
--total_reads 1000000 \
--n_transcripts 10000 \
--fraction_weights "0.3,0.5" \
--lost_fraction 0.2 \
--output_dir test_data/--total_reads: Total number of reads in dataset--n_transcripts: Number of transcripts to simulate--fraction_weights: Comma-separated fraction proportions--lost_fraction: Proportion of lost/unrecoverable material--recovery_rates: Library recovery rates for each fraction--output_dir: Output directory for generated files
# Load generated test data
test_counts <- read.csv("test_data/counts_matrix.csv", row.names = 1)
test_annotation <- read.csv("test_data/annotation.csv")
ground_truth <- read.csv("test_data/ground_truth.csv")
# Run FracFixR
test_results <- FracFixR(as.matrix(test_counts), test_annotation)
# Compare with ground truth
cor(test_results$Fractions$Lost[1], ground_truth$lost_fraction[1])# Minimal example with polysome data
library(FracFixR)
# Create example data
set.seed(123)
n_genes <- 1000
n_samples <- 12
# Generate count matrix
counts <- matrix(
rnbinom(n_genes * n_samples, mu = 100, size = 10),
nrow = n_genes,
dimnames = list(
paste0("Gene", 1:n_genes),
paste0("Sample", 1:n_samples)
)
)
# Create annotation
annotation <- data.frame(
Sample = paste0("Sample", 1:12),
Condition = rep(c("Control", "Stressed"), each = 6),
Type = rep(c("Total", "Monosome", "Polysome"), 4),
Replicate = c(rep("Rep1", 3), rep("Rep2", 3), rep("Rep1", 3), rep("Rep2", 3))
)
# Run analysis
results <- FracFixR(counts, annotation)
# Check results
print(results$Fractions)
PlotFractions(results)# Nuclear-cytoplasmic fractionation example
annotation_subcell <- data.frame(
Sample = colnames(counts_matrix),
Condition = rep(c("WT", "Mutant"), each = 6),
Type = rep(c("Total", "Nuclear", "Cytoplasmic"), 4),
Replicate = rep(c("Rep1", "Rep2"), each = 3, times = 2)
)
# Run FracFixR
subcell_results <- FracFixR(counts_matrix, annotation_subcell)
# Test for differential nuclear localization
diff_nuclear <- DiffPropTest(subcell_results,
Conditions = c("WT", "Mutant"),
Types = "Nuclear",
Test = "GLM")
# Find transcripts with altered localization
nuclear_shifted <- diff_nuclear[abs(diff_nuclear$mean_diff) > 0.2 &
diff_nuclear$padj < 0.01, ]# Combine multiple fractions for analysis
combined_poly <- DiffPropTest(results,
Conditions = c("Control", "Stressed"),
Types = c("Monosome", "Polysome"),
Test = "GLM")
# This tests for overall ribosome association changes
# regardless of whether transcripts shift between mono- and polysomesCause: Too few transcripts in the 70-96% quantile range
Solution:
# Check transcript abundance distribution
total_counts <- rowSums(counts_matrix[, annotation$Type == "Total"])
hist(log10(total_counts + 1), breaks = 50)
# Adjust if needed (not recommended unless necessary)
# Contact package maintainers for guidanceCause: Mismatch between count matrix columns and annotation
Solution:
# Check for mismatches
setdiff(colnames(counts_matrix), annotation$Sample)
setdiff(annotation$Sample, colnames(counts_matrix))
# Ensure exact matching
colnames(counts_matrix) <- gsub(" ", "_", colnames(counts_matrix))
annotation$Sample <- gsub(" ", "_", annotation$Sample)Solution:
# Reduce parallel workers
library(future)
plan("multisession", workers = 2) # Use only 2 cores
# Process in batches
# Contact maintainers for batch processing scriptsCause: Extreme proportions or low counts
Solution:
# Try alternative test
diff_results <- DiffPropTest(results,
Conditions = c("Control", "Treatment"),
Types = "Heavy_Poly",
Test = "Wald") # More robust to extremes-
Parallel Processing: FracFixR automatically uses available cores
# Check current plan future::plan() # Adjust if needed future::plan("multisession", workers = 4)
-
Memory Management: For datasets >50,000 transcripts
# Pre-filter low-abundance transcripts keep <- rowSums(counts_matrix) >= 10 counts_filtered <- counts_matrix[keep, ]
-
Speed Optimization: Use Logit test for initial exploration
# Fast initial scan quick_test <- DiffPropTest(results, Conditions = c("A", "B"), Types = "Fraction1", Test = "Logit")
If you use FracFixR in your research, please cite:
Cleynen A, Ravindran A, Shirokikh N (2025). FracFixR: A compositional statistical framework for absolute proportion estimation between fractions
- Bug Reports: (https://github.com/Arnaroo/FracFixR/issues)
- Feature Requests: (https://github.com/Arnaroo/FracFixR/issues)
- Alice Cleynen ([email protected]) - Statistical methodology
- Agin Ravindran ([email protected]) - Experimental data generation
- Nikolay Shirokikh ([email protected]) - Conceptual framework
We welcome contributions! Please see our Contributing Guidelines for details.
FracFixR is released under the CC BY-NC ND License. See LICENSE file for details.