Reference Information

This workshop is developed based on the RNA-seq Bioinformatics Course by the Griffith Lab at Washington University, with extensive help from ChatGPT.

What is RNA-Seq Analysis?

RNA-Seq (RNA sequencing) is a powerful technique for studying the transcriptome, the complete set of RNA transcripts produced by a genome at a given time. It allows researchers to identify and quantify RNA molecules, providing insights into gene expression, alternative splicing, non-coding RNAs, and other molecular phenomena.

Attention! RNA-seq usually does not sequence RNA directly; it sequences cDNA. Here is why:

  • RNA isolation: The first step in RNA-seq is extracting RNA from the sample, which consists of various types of RNA, such as mRNA, rRNA, and non-coding RNAs.
  • cDNA synthesis: Most sequencing platforms (e.g., Illumina) read DNA, not RNA, so the extracted RNA is reverse-transcribed into cDNA using the enzyme reverse transcriptase.
  • Sequencing: The cDNA, which mirrors the RNA sequence (except for the replacement of uracil with thymine), is then fragmented, and libraries are prepared for sequencing.
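The relationship between an RNA molecule and its cDNA can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and only serve to show the uracil-to-thymine replacement and the reverse-complement nature of first-strand cDNA.

```python
# Complement table for DNA bases.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rna_to_sense_dna(rna: str) -> str:
    """Return the DNA sequence that mirrors the RNA (U replaced by T)."""
    return rna.upper().replace("U", "T")


def first_strand_cdna(rna: str) -> str:
    """Return the first-strand cDNA written 5'->3'.

    Reverse transcriptase synthesises a strand complementary to the RNA,
    so this is the reverse complement of the RNA sequence.
    """
    return rna_to_sense_dna(rna).translate(_DNA_COMPLEMENT)[::-1]


rna = "AUGGCUUAA"  # hypothetical toy transcript
print(rna_to_sense_dna(rna))   # ATGGCTTAA
print(first_strand_cdna(rna))  # TTAAGCCAT
```

Whether a read matches the RNA itself or its reverse complement depends on the library protocol, which is why strandedness matters later in the workflow.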

Before the analyses

Experimental design, library preparation, and sequencing technology all influence downstream analyses, including the choice of pipelines and software, quality control, and more.

There are many RNA-seq library construction strategies.

Types of RNA-seq Analysis

RNA-seq is a versatile technology with diverse applications, and the specifics of experimental design, library preparation, sequencing technology, and downstream analysis vary significantly based on the research goal.

Summary of applications, and which ones are we focusing on?

  • Differential gene expression (DGE) analysis
  • Single-cell RNA-seq analysis
  • Transcript isoform analysis
  • Small RNA analysis
  • De novo transcriptome assembly
  • RNA editing and modifications
  • Fusion gene detection
  • Metatranscriptomics

In this workshop, we will be focusing on differential gene expression analysis and single-cell RNA-seq analysis.

Below are examples of how different the steps are for different types of analyses.

Differential Gene Expression (DGE) Analysis

  • Goal: Compare gene expression levels between conditions (e.g., treated vs. untreated, diseased vs. healthy).
  • Design: Biological replicates are critical.
  • Library Prep: Typically uses poly(A) enrichment or rRNA depletion for mRNA-focused studies.
  • Sequencing: Moderate depth (20-50M reads per sample) is sufficient.
  • Downstream Analysis:
    • Alignment tools: STAR, HISAT2.
    • Differential gene expression tools: DESeq2, edgeR.

Single-cell RNA-seq

  • Goal: Investigate gene expression at the resolution of individual cells.
  • Design: Optimise for cell capture efficiency and number of cells.
  • Library Prep: Droplet-based methods (e.g., 10x Genomics) or plate-based methods (e.g., Smart-seq2).
  • Sequencing: Shallow sequencing per cell (~50K-100K reads per cell) but many cells (>10,000).
  • Downstream Analysis:
    • Tools: Seurat, Scanpy, or Monocle.
    • Focus: Clustering, trajectory analysis, cell type identification.

Transcript Isoform Analysis

  • Goal: Detect alternative splicing, isoform usage, or novel transcripts.
  • Design: High sequencing depth for robust isoform detection.
  • Library Prep: Use protocols that preserve strand information (e.g., stranded RNA-seq).
  • Sequencing: Paired-end sequencing with ~150bp reads.
  • Downstream Analysis:
    • Tools: StringTie, Cufflinks, or IsoQuant.
    • Focus: Identify isoforms, splicing junctions, and exon usage.

Small RNA Analysis

  • Goal: Study non-coding RNAs (e.g., miRNAs, siRNAs).
  • Design: Small RNA-specific library prep kits.
  • Library Prep: Capture RNAs with sizes between 15 and 50 nt.
  • Sequencing: Use short-read sequencing platforms.
  • Downstream Analysis:
    • Tools: miRDeep (miRNA discovery and quantification); miRBase (reference database of known miRNAs).
    • Focus: miRNA quantification and novel miRNA discovery.

General computational steps of RNA-seq workflows

Each type of RNA-seq analysis has distinct requirements and challenges, but most workflows share a common set of computational steps:

  1. Quality control
  2. Trimming
  3. Read alignment
  4. Read quantification
  5. Downstream analysis
    • Differential gene expression analysis
    • Functional annotation and enrichment
    • Alternative splicing
    • Isoform quantification
    • and more...

Quality Control

  • The first step after obtaining raw data.
  • To check read quality, adapter contamination, GC content, etc.
  • Tools: FastQC, MultiQC
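To make these checks concrete, here is a minimal Python sketch that computes two of the metrics that tools such as FastQC report: mean base quality and GC content per read. It assumes a simple four-line-per-record FASTQ with Phred+33 quality encoding; the helper names and the two example reads are made up for illustration, and this is not how FastQC itself is implemented.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 by default)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


# Two toy reads: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
example = io.StringIO(
    "@read1\nACGTGGCC\n+\nIIIIIIII\n"
    "@read2\nATATATAT\n+\nIIII####\n"
)
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, mean_phred(qual), gc_fraction(seq))
```

In practice you would run FastQC on every sample and summarise the reports with MultiQC; the sketch only shows what the underlying numbers mean.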

Trimming

  • To remove low-quality bases and adapter sequences.
  • Tools: Trimmomatic, Cutadapt, fastp
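As an illustration of quality trimming, the sketch below removes low-quality bases from the 3' end of a read. It is a deliberately simple approach (drop trailing bases below a threshold) and assumes Phred+33 encoding; the function name and the default threshold of 20 are assumptions for this example, and the tools listed above use more robust strategies such as sliding windows.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_q.

    Returns the trimmed (sequence, quality_string) pair.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(trim_3prime("ACGTACGT", "IIIII###"))  # ('ACGTA', 'IIIII')
```

After trimming, reads that have become very short are usually discarded, because short reads align ambiguously.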

Read Alignment

  • Align reads to reference genome/transcriptome to identify their origins.
  • Tools: STAR, HISAT2
  • Pseudo-alignment: Kallisto, Salmon for faster processing.
  • Outputs: SAM/BAM files (aligned reads).
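To show what an aligned read looks like, the following sketch parses the mandatory columns of a single SAM record and decodes three commonly used FLAG bits. The example record is made up; for real data a dedicated library or samtools would normally be used instead of manual parsing.

```python
def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),           # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "unmapped": bool(flag & 0x4),         # read did not align
        "reverse_strand": bool(flag & 0x10),  # aligned to the reverse strand
        "secondary": bool(flag & 0x100),      # secondary alignment
    }


# Hypothetical record: a 10 bp read aligned to the reverse strand of chr1.
line = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(line)
print(record["reference"], record["position"], record["reverse_strand"])
```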

Read Quantification

  • To count reads that mapped to genes or transcripts to generate a count matrix.
  • Tools: HTSeq, featureCounts
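The idea of a count matrix can be sketched in a few lines of Python. The gene coordinates, sample names, and alignments below are invented, and the assignment rule (count a read only if it overlaps exactly one gene) is a simplification of what HTSeq and featureCounts offer; strand and exon structure are ignored here.

```python
from collections import Counter

# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200)}


def assign_gene(chrom, start, end, genes):
    """Return the single gene overlapped by the read, or None."""
    hits = [g for g, (c, s, e) in genes.items()
            if c == chrom and start <= e and end >= s]
    return hits[0] if len(hits) == 1 else None  # skip ambiguous or no-feature reads


def count_reads(alignments_by_sample, genes):
    """Build a gene-by-sample count matrix as a nested dict."""
    matrix = {g: {} for g in genes}
    for sample, alignments in alignments_by_sample.items():
        counts = Counter(assign_gene(c, s, e, genes) for c, s, e in alignments)
        for g in genes:
            matrix[g][sample] = counts.get(g, 0)
    return matrix


alignments = {
    "ctrl": [("chr1", 120, 170), ("chr1", 450, 520),
             ("chr1", 900, 950), ("chr1", 600, 650)],
    "treated": [("chr1", 810, 860), ("chr1", 1000, 1050), ("chr2", 120, 170)],
}
matrix = count_reads(alignments, genes)
print(matrix)
```

Rows are genes and columns are samples, which is the layout that DESeq2 and edgeR expect as input.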

Normalisation

  • Normalise counts to correct for differences in sequencing depth (library size) and RNA composition between samples.
  • Tools: Built into DESeq2, edgeR, or limma.
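Two common normalisation ideas can be sketched in plain Python: counts per million (CPM), which corrects only for library size, and a median-of-ratios size factor in the spirit of the method used by DESeq2. This is an illustrative sketch on a made-up count table, not the DESeq2 implementation; the function names are hypothetical.

```python
import math
from statistics import median

# Hypothetical raw counts: gene -> [sample1, sample2].
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
    "geneD": [0, 10],
}


def cpm(counts):
    """Counts per million: scale each sample by its total read count."""
    n = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n)]
    return {g: [row[j] / totals[j] * 1e6 for j in range(n)]
            for g, row in counts.items()}


def size_factors(counts):
    """Median-of-ratios size factors.

    For each gene, compute the geometric mean across samples; the size
    factor of a sample is the median ratio of its counts to those means.
    Genes with a zero count in any sample are skipped.
    """
    n = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n)]
    for row in counts.values():
        if min(row) == 0:
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


sf = size_factors(counts)
normalised = {g: [c / f for c, f in zip(row, sf)] for g, row in counts.items()}
print(sf)
print(cpm(counts)["geneA"])
```

In this toy table sample2 was sequenced about twice as deeply as sample1, so after dividing by the size factors geneA to geneC have equal normalised counts in both samples.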

Differential Gene Expression (DGE) Analysis

  • To identify genes that are differentially expressed between experimental conditions.
  • Tools: DESeq2, edgeR, or limma.

Functional Annotation and Enrichment

  • Perform Gene Ontology (GO) or KEGG pathway analysis to interpret biological significance.
  • Tools: DAVID, clusterProfiler, GOseq.
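Many enrichment tools are built around an over-representation test, which asks whether a gene set (for example a GO term or KEGG pathway) contains more differentially expressed genes than expected by chance. A minimal sketch using the hypergeometric distribution is shown below; the function name and the numbers are hypothetical, and real tools additionally correct for multiple testing.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: number of genes tested in total
    n_in_term:    background genes annotated to the term or pathway
    n_selected:   differentially expressed genes
    n_overlap:    differentially expressed genes annotated to the term
    """
    upper = min(n_in_term, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_selected)


# Toy example: 10 genes in total, 4 in the pathway, 3 selected, 2 of them in the pathway.
p = enrichment_p_value(10, 4, 3, 2)
print(p)
```

A small p-value suggests that the term is over-represented among the selected genes.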

Visualisation

  • Visualise results with PCA, heatmaps, volcano plots, and more.
  • Tools: ggplot2, pheatmap, Seurat.

Other questions that can help you understand

What is Sequencing Depth?

Sequencing depth (or coverage) refers to the number of sequencing reads you want to generate per sample. You decide on it when designing an RNA-seq experiment, before sending your samples for sequencing. This decision is crucial because it impacts the sensitivity and resolution of your results, determining how well you can capture the transcriptome's full complexity.
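A quick back-of-the-envelope calculation helps when planning depth. The sketch below multiplies the number of samples (or cells) by the target reads per sample and estimates how many sequencing lanes that requires. The lane yield of 400 million reads is a placeholder assumption; check the actual output of your platform with your sequencing facility.

```python
import math


def total_reads(n_units, reads_per_unit):
    """Total reads needed for n samples (or cells) at a target depth."""
    return n_units * reads_per_unit


def lanes_needed(n_units, reads_per_unit, reads_per_lane):
    """Number of lanes required, rounded up to whole lanes."""
    return math.ceil(total_reads(n_units, reads_per_unit) / reads_per_lane)


READS_PER_LANE = 400_000_000  # hypothetical lane yield

# Bulk DGE: 12 samples at 30M reads each.
print(total_reads(12, 30_000_000))                   # 360000000
print(lanes_needed(12, 30_000_000, READS_PER_LANE))  # 1

# Single-cell: 10,000 cells at 50K reads per cell.
print(total_reads(10_000, 50_000))                   # 500000000
print(lanes_needed(10_000, 50_000, READS_PER_LANE))  # 2
```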

How long are transcripts?

...

What is adapter contamination?

Adapter contamination refers to the presence of adapter sequences in RNA-seq reads. Adapters are short, synthetic DNA sequences that are ligated to RNA or cDNA fragments during library preparation. They are essential for binding the fragments to the sequencing platform and for amplification and sequencing. However, when a fragment is shorter than the read length, the sequencer reads past the end of the fragment and into the adapter sequence. As a result, the 3' end of the read contains adapter sequence rather than sample sequence, which is what "adapter contamination" means in sequencing data.
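The read-through described above can be illustrated with a small Python sketch that removes an adapter from the 3' end of a read. It uses exact string matching only, whereas tools such as Cutadapt also tolerate sequencing errors. The function name, the `min_overlap` parameter, and the example reads are assumptions for illustration; the adapter string is the commonly used Illumina adapter prefix, but always confirm the adapter for your own library kit.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove adapter sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the full adapter occurs somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter, so look for the
    # longest adapter prefix (at least min_overlap bases) that ends the read.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


ADAPTER = "AGATCGGAAGAGC"  # check against your library kit
insert = "ACGTACGTAC"      # hypothetical short fragment

print(trim_adapter(insert + ADAPTER, ADAPTER))      # ACGTACGTAC
print(trim_adapter(insert + ADAPTER[:5], ADAPTER))  # ACGTACGTAC
print(trim_adapter(insert, ADAPTER))                # ACGTACGTAC
```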

When to align to genome and when to align to transcriptome?

Deciding whether to align RNA-seq reads to a genome or transcriptome depends on your research goals and the type of analysis you are performing.

When to align to the genome:

  • Novel transcript discovery: If you want to identify new transcripts, novel splice junctions, or previously unannotated regions of transcription, aligning to the genome is essential. Examples include identifying non-coding RNAs, novel isoforms, or novel genes.
  • Alternative Splicing Analysis: For detailed analysis of splicing events (e.g., exon skipping, intron retention), genome alignment is crucial since splice junctions need to be inferred.
  • Poor Transcriptome Annotation: If the transcriptome of your organism is incomplete or poorly annotated, genome alignment lets reads from unannotated transcripts still be mapped to their genomic locations.
  • Non-Model Organisms: For non-model organisms, where a high-quality transcriptome may not exist, aligning to the genome is necessary to extract transcript-level information.
  • Other Specialised Applications: Studying intronic/intergenic regions. Detecting unspliced pre-mRNA.

When to align to the transcriptome:

  • Gene/Transcript Quantification: If your primary goal is to quantify gene or transcript expression levels (e.g., for differential expression analysis), aligning to the transcriptome is typically faster and sufficient.
  • Well-Annotated Transcriptomes: When working with well-studied organisms that have high-quality transcriptome annotations, transcriptome alignment can simplify the analysis.
  • Pseudoalignment Methods: For fast, lightweight quantification (e.g., tools like Salmon or Kallisto) that use pseudoalignment, transcriptome alignment is standard.
  • Single-Cell RNA-seq: Single-cell RNA-seq analysis often uses transcriptome alignment or pseudoalignment, as the focus is typically on transcript quantification rather than novel discovery.

What is pseudoalignment and why are Kallisto and Salmon faster?

Pseudoalignment is a computational method used in RNA-seq analysis to assign reads directly to transcripts or genes without performing traditional base-by-base alignment to a reference genome or transcriptome. Instead of aligning reads precisely to a sequence, pseudoalignment focuses on identifying the compatibility of a read with one or more transcripts. Skipping the base-level alignment step is the main reason these tools are faster.

How does pseudoalignment work?

  • Indexing the transcriptome: A database of transcript sequences is preprocessed to create an index, which maps k-mers (short subsequences of a fixed length) to transcripts. Tools like Kallisto or Salmon use this approach. This step builds an efficient data structure, such as a k-mer hash table or a de Bruijn graph.
  • Read querying: For each RNA-seq read, its k-mers are extracted and compared to the indexed transcriptome. Instead of finding the exact position where the read aligns, pseudoalignment determines which transcripts contain the k-mers in the read.
  • Transcript compatibility: The method determines a set of compatible transcripts for each read - i.e., the transcripts where the read could have originated. This avoids computing an exact alignment but still provides information about transcript-level expression.
  • Quantification: Using a statistical model, pseudoalignment tools estimate the abundance of each transcript based on the number of reads compatible with it.
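The indexing, querying, and compatibility steps above can be sketched with a toy k-mer index in Python. This is a simplified illustration, not the actual Kallisto or Salmon algorithm: it uses a plain dictionary instead of a de Bruijn graph, treats any k-mer missing from the index as making the read incompatible, and omits the statistical quantification step. The transcript names and sequences are invented.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compat = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compat = hits if compat is None else compat & hits
        if not compat:
            break  # no transcript contains all k-mers seen so far
    return compat or set()


# Two hypothetical isoforms that share their first exon.
transcripts = {
    "tx1": "ACGTACGGTTAGC" + "CCATGGAT",
    "tx2": "ACGTACGGTTAGC" + "TTGACCGA",
}
K = 5
index = build_index(transcripts, K)

print(compatible_transcripts("CGTACGGTT", index, K))  # shared exon: both isoforms
print(compatible_transcripts("TAGCCCATG", index, K))  # spans the tx1-specific junction
print(compatible_transcripts("GGGGGGGGG", index, K))  # no compatible transcript
```

Reads with the same compatibility set can be grouped together, and the abundance model then decides how to split ambiguous reads, such as the first one, between tx1 and tx2.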

When do I need to do read quantification? Is it needed for every type of analysis?

Read quantification is the process of counting how many sequencing reads map to a particular gene, transcript, or genomic feature. These counts represent the expression levels of genes or transcripts and are essential for comparing expression across samples.

When do you need read quantification?

  • Differential expression analysis (DEA)
  • Gene/transcript expression profiling
  • Single-cell RNA-seq analysis
  • Gene set enrichment analysis (GSEA)

When is read quantification not necessary?

  • Novel transcript discovery
  • Alternative splicing analysis
  • Variant calling
  • Fusion transcript detection

What are Gene Ontology (GO) and KEGG pathways? What are their application scenarios?