RNA-Seq Pipeline: End-to-End Gene Expression and Differential Analysis

Release page: https://github.com/saidmlonji/rnaseq_pipeline/releases

🧬 A complete workflow to turn FASTQ data into gene expression profiles and differential expression results. Built around Bowtie2 for alignment, SAMtools for data handling, and R for statistics and visualization (DESeq2, Gviz, and more).

Table of contents

Overview
What this project does
Key concepts
Repository structure
Getting started
Installation
Data and input formats
How the pipeline runs
Configuration and customization
Outputs and interpretation
Visualization and reporting
Reproducibility and environments
Examples and tutorials
Performance and scalability
Troubleshooting
Testing and quality assurance
Security and data privacy
Collaboration and contribution
Licensing and credits
Resources and references

Overview This repository hosts an end-to-end RNA-seq pipeline. It starts with FASTQ data and ends with differential expression results and gene-level visuals. It uses Bowtie2 for alignment, SAMtools for file handling, and R with DESeq2 and Gviz for statistics and visualization. The pipeline is modular. You can run it as a whole or run individual steps.

What this project does

Accepts raw FASTQ data and a sample sheet that defines groups and conditions.
Performs quality control and optional trimming.
Aligns reads to a reference genome using Bowtie2.
Converts, sorts, and indexes alignments with SAMtools.
Quantifies gene expression with a counting tool.
Analyzes differential expression with DESeq2.
Produces plots and interactive visualizations with Gviz and base R plots.
Outputs a clear, shareable report and a complete results package.

Key concepts

FASTQ: raw sequence reads that feed the pipeline.
Alignment: mapping reads to a reference genome.
BAM/SAM: formats for aligned reads.
Counting: assigning reads to genes or features.
Differential expression: comparing expression across conditions.
Visualization: showing results with genome views and plots.

Repository structure

docs/ and tutorials/ for guidance and examples.
pipelines/ contains the main bash scripts that stitch steps together.
src/ includes helper scripts for preprocessing, alignment, and counting.
config/ stores example configuration templates and sample sheets.
data/ contains example data for testing and demonstration.
reports/ stores generated reports and visuals.
envs/ or containers/ for environment definitions (Conda, Docker, Singularity).
tests/ contains unit and integration tests.

Getting started

Prerequisites:
- Linux or macOS operating system.
- Command line access with a reasonable amount of RAM and CPU.
- Tools: Bowtie2, SAMtools, R (with Bioconductor), and a counting tool (such as featureCounts).
- Optional: Python for wrapper utilities.
Quick start plan:
- Install the required tools or pull a container image.
- Prepare a small sample dataset to test the pipeline.
- Create a minimal configuration file.
- Run the pipeline on the sample data.
- Inspect the output and adjust parameters as needed.

Installation

Option 1: Native installation
- Install Bowtie2, SAMtools, and a compatible R version.
- Install Bioconductor packages DESeq2, Gviz, and Rsamtools within R.
- Set up environment variables to locate references and index files.
- Ensure PATH and R_LIBS_USER point to the right locations.
Option 2: Conda environment
- Create an environment file (yaml) and install with conda.
- Activate the environment before running the pipeline.
- This keeps tools and libraries isolated and compatible.
Option 3: Containers
- Docker image: use a prebuilt image that includes Bowtie2, SAMtools, and R with Bioconductor libraries.
- Singularity image: for HPC use; pull or build from a definition file.
Example commands (conda):
- conda env create -f envs/rnaseq.yaml
- conda activate rnaseq
- conda install -c bioconda -c conda-forge deseq2 gviz Rsamtools
Example commands (docker):
- docker pull saidmlonji/rnaseq_pipeline:latest
- docker run --rm -v /path/to/data:/data saidmlonji/rnaseq_pipeline:latest /bin/bash

Data and input formats

Input FASTQ files
- Paired-end: sample_id_R1.fastq.gz and sample_id_R2.fastq.gz
- Single-end is supported with corresponding options in the config
Sample sheet
- A simple tab-delimited file listing sample IDs, file paths, and group/condition labels
- Example: sample_id fastq_r1 fastq_r2 condition sample1 data/sample1_R1.fq.gz data/sample1_R2.fq.gz treated
Reference data
- Reference genome FASTA
- Gene annotation GTF/GFF file
- Bowtie2 index directory or packaged index
Output layout
- results/
  - qc/ (quality control plots)
  - alignments/ (BAM/SAM files and statistics)
  - counts/ (gene-level counts)
  - deseq_results/ (differential expression results)
  - viz/ (plots and genome browser tracks)
- reports/ (HTML or PDF reports)
Example data
- A small, synthetic dataset is included under data/ for demonstrations.
- Use this dataset to validate the workflow before running on full samples.

How the pipeline runs

Step 1: Quality control
- Run FastQC on all FASTQ files.
- Compile reports with MultiQC.
Step 2: Preprocessing (optional)
- Trim adapters and filter low-quality reads (if configured).
- Re-run QC on cleaned data.
Step 3: Alignment
- Bowtie2 aligns reads to the reference genome.
- Generate SAM files and convert to sorted BAM files with SAMtools.
Step 4: Post-alignment processing
- Mark duplicates if needed.
- Index BAM files for fast access.
Step 5: Counting
- Use a gene annotation to count reads mapping to genes.
- Produce a counts matrix with samples as columns and genes as rows.
Step 6: Differential expression
- Load the counts into R.
- Run DESeq2 to identify differentially expressed genes.
- Save results with key statistics (log2 fold change, p-value, adjusted p-value).
Step 7: Visualization
- Generate genome tracks with Gviz.
- Produce plots for sample relationships, dispersion estimates, and DEG lists.
Step 8: Reporting
- Create a comprehensive report summarizing QC, alignment metrics, counts, DEG results, and visualizations.
Step 9: Packaging
- Package outputs for sharing with collaborators or submitting to a repository.

Configuration and customization

Core ideas
- The pipeline is driven by a configuration file.
- You set file paths, sample metadata, reference files, and analysis options in YAML or JSON.
Minimal config example (YAML) project_name: "RNA-Seq Demo" reference: genome_fasta: "data/reference/genome.fa" annotation_gtf: "data/reference/genes.gtf" bowtie2_index: "data/reference/bowtie2_index" fastq_dir: "data/fastq" sample_sheet: "data/samples.tsv" design_formula: "~ condition" run_parameters: qc: true trim: false aligner: "bowtie2" quant_method: "featureCounts" deseq2_method: "DESeq" output_dir: "results" threads: 8
Advanced options
- You can enable multi-threading for alignment, counting, and statistics.
- You can switch counting backends (featureCounts, HTSeq) if needed.
- You can enable a subset of steps for debugging or quick checks.
How to set up a sample sheet
- Include sample IDs, file paths, and group labels.
- Ensure consistent naming between fastq_dir and sample_sheet.
- Validate that all required fields exist for each sample.

Outputs and interpretation

Differential expression results
- A table with gene IDs, log2 fold changes, standard errors, test statistics, p-values, and adjusted p-values.
- Genes with adjusted p-values below a chosen threshold (commonly 0.05) are considered differentially expressed.
QC and alignment metrics
- Per-sample read counts, alignment rates, and library complexity metrics.
- Boxplots, MA plots, and dispersion estimates to assess data quality.
Gene-level visualizations
- Genome browser tracks showing gene models and read coverage.
- Interactive plots for exploration of top DE genes.
Reports
- A consolidated report with figures, methods, and a summary table.
- Reproducibility notes and environment details.

Visualization and reporting

Gviz integration
- Use Gviz to display tracks such as gene models, read density, and coverage across chromosomes.
- Create custom plots to highlight genes of interest.
Plot types
- PCA or MDS plots for sample relationships.
- MA plots for expression changes.
- Volcano plots for significance vs effect size.
- Heatmaps for top DE genes.
Export formats
- PNG, PDF, and interactive HTML where possible.
- Data tables in CSV or TSV formats.

Reproducibility and environments

Environment options
- Conda environments for reproducible installs.
- Docker containers for consistent runtimes.
- Singularity images for HPC clusters.
Version control
- The pipeline tracks tool versions and references.
- A config-driven approach ensures that runs are reproducible.
Data provenance
- Each run records input data, reference files, and parameters.
- Outputs include a run log with timestamps and command history.
Example container usage
- Docker: docker run -v /path/to/data:/data saidmlonji/rnaseq_pipeline:latest
- Singularity: singularity pull library://rnaseq_pipeline:latest

Examples and tutorials

Beginner walk-through
- Use the included small dataset to run the pipeline end-to-end.
- Validate each step with basic QC metrics and simple counts.
Intermediate scenario
- Add a second condition, adjust the design matrix, and re-run differential expression.
- Compare results across runs and interpret the changes.
Advanced scenario
- Integrate a custom annotation, apply a custom contrast, or add a post-analysis step with additional plots.
Step-by-step notes
- Each tutorial includes a checklist: inputs present, config valid, memory limits adequate, and outputs examined.

Performance and scalability

Parallelism
- The pipeline supports multi-threading for alignment, counting, and summaries.
- Use HPC-friendly options to request more CPUs when available.
Memory usage
- Bowtie2 and BAM processing require substantial memory for large genomes.
- Plan for peak memory during alignment and counting.
Large datasets
- Break data into batches if needed.
- Run QC on a per-sample basis to identify outliers early.
Scaling tips
- Use distributed file systems for data storage.
- Keep reference indices on fast storage to speed up access.

Troubleshooting

Common issues
- Bowtie2 not found: ensure the binary is on PATH or use the container image.
- Reference index missing: verify the bowtie2_index path in the config.
- GTF/GFF mismatch: confirm the annotation matches the genome build.
- Insufficient permissions: check read/write permissions on output directories.
Debug steps
- Run a single step manually with a small sample to isolate the problem.
- Check log files for error messages and stack traces.
- Validate sample metadata before starting the run.
Logging
- The pipeline writes a log file per run with all commands and timings.
- Review logs to understand failures and recoveries.

Testing and quality assurance

Automated tests
- A lightweight test suite runs a small sample to verify that all steps complete.
- Tests cover configuration parsing, basic file existence checks, and a minimal end-to-end run.
CI integration
- A CI workflow runs tests on push and pull requests.
- Checks include environment consistency and basic output validation.
Code quality
- The project uses linting for shell and R scripts.
- Style guidelines help keep scripts readable and maintainable.

Security and data privacy

Data handling
- The pipeline processes sensitive sequencing data; protect source data.
- Prefer local runs or controlled storage when handling real datasets.
Access controls
- Restrict access to data directories and results.
- Use signed containers and verify image integrity before use.
Dependencies
- Pin tool versions to reduce drift and vulnerabilities.
- Regularly update and audit dependencies.

Collaboration and contribution

How to contribute
- Open issues to report bugs or request features.
- Submit pull requests with focused changes and tests.
- Follow the contribution guidelines in the CONTRIBUTING file.
Community standards
- Be clear, concise, and respectful.
- Include tests for new features.
- Update documentation for any user-facing changes.

Licensing and credits

License
- This project is released under the MIT license.
Credits
- Core contributors and the open-source tools leveraged by the pipeline.
- Acknowledgments for collaborators and testers.

Resources and references

Bowtie2 documentation
SAMtools documentation
DESeq2 documentation and vignettes
Gviz documentation
Bioconductor suite for RNA-seq workflows
FastQC and MultiQC tools for quality control

Notes on usage and download

The release page contains bundles that you can download. The release asset is a packaged file with the pipeline, scripts, and example data. The file needs to be downloaded and executed as part of the setup.
The release page is the central place to obtain stable versions and updates. For quick access to the latest stable package, visit: https://github.com/saidmlonji/rnaseq_pipeline/releases
The same link is used here as a reference point for distribution and validation. Ensure you check the Releases section for the most recent version, test data, and example configurations.

FAQ

Q: Can I run this on Windows?
- A: The pipeline is designed for Unix-like systems. Use a Linux VM or Windows Subsystem for Linux (WSL) if you must run on Windows.
Q: Do I need internet access during a run?
- A: For most runs, you need file access to reference genomes, indices, and tools. If you preinstall dependencies, the run itself can be offline.
Q: Can I swap tools (e.g., HTSeq for counting) with this pipeline?
- A: Yes, the counting step can be configured to use alternative tools. Update the config to select the desired backend.
Q: How do I add my own annotations?
- A: Provide a compatible GTF/GFF file and align it with the reference genome. Update the config to point to the new annotation.

Future directions

Expand support for single-cell RNA-seq workflows.
Add integration with alternate aligners and pseudo-alignment tools.
Improve parallelization strategies for very large cohorts.
Enhance visualizations with more interactive genome browser tracks.

Community guidelines

Share reproducible configurations.
Include sample datasets when possible.
Document any deviations from the standard workflow.
Help others reproduce results and interpret outputs.

Appendix: sample workflow commands

Quick end-to-end run (illustrative)
- Prepare: ensure data/ and config/ exist with proper files.
- Run: bash pipelines/run_pipeline.sh --config config/sample_config.yaml
- Inspect: results/ and reports/ for outputs.
Minimal configuration (snippet) project_name: "RNA-Seq Demo" design_formula: "~ condition" fastq_dir: "data/fastq" sample_sheet: "data/samples.tsv" reference: genome_fasta: "data/reference/genome.fa" annotation_gtf: "data/reference/genes.gtf" bowtie2_index: "data/reference/bowtie2_index" output_dir: "results" threads: 4

Appendix: glossary

FASTQ: a text-based format for sequencing reads.
Bowtie2: a fast aligner for short reads.
SAMtools: tools for manipulating SAM and BAM files.
DESeq2: a Bioconductor package for differential expression testing.
Gviz: a Bioconductor package for genomic visuals.
FeatureCounts: a tool for counting reads per feature.
GTF/GFF: genome annotation formats.

End of document

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.Rproj.user		.Rproj.user
results		results
rnaseq_pipeline		rnaseq_pipeline
.gitignore		.gitignore
README.md		README.md
RNAseq_analysis.R		RNAseq_analysis.R
deseq2_analysis.R		deseq2_analysis.R
gene_counting.R		gene_counting.R
rnaseq_pipeline.Rproj		rnaseq_pipeline.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

RNA-Seq Pipeline: End-to-End Gene Expression and Differential Analysis

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

saidmlonji/rnaseq_pipeline

Folders and files

Latest commit

History

Repository files navigation

RNA-Seq Pipeline: End-to-End Gene Expression and Differential Analysis

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages