- Define the biological question (e.g., comparing gene expression between conditions).
- Plan replicates to ensure statistical power (at least 3 biological replicates per group is recommended).
- Select an appropriate sequencing depth (typically 20-50 million reads per sample).
- Extract high-quality RNA from your samples.
- Assess RNA quality (e.g., using an Agilent Bioanalyzer to measure the RNA Integrity Number (RIN)).
- Prepare cDNA libraries for sequencing.
- Perform sequencing on a platform (e.g., Illumina) to generate raw reads.
- Inspect raw sequencing data using tools like:
- FastQC: Provides an overview of quality metrics (base quality, GC content, adapter contamination).
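To make the quality metrics concrete, here is a minimal sketch of how per-read mean base quality and GC content can be derived from a FASTQ file. It is not how FastQC is implemented; the function names and the two toy reads are made up for illustration, and Phred+33 encoding is assumed.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def gc_fraction(seq):
    """Fraction of G and C bases in the read."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Two made-up reads: 'I' encodes Phred 40, '#' encodes Phred 2.
toy_fastq = "@read1\nGCGCATAT\n+\nIIIIIIII\n@read2\nATATATAT\n+\nII##II##\n"

summary = [
    (read_id, mean_quality(qual), gc_fraction(seq))
    for read_id, seq, qual in parse_fastq(io.StringIO(toy_fastq))
]
for row in summary:
    print(row)
```

In practice FastQC reports these metrics per position and per file; the sketch only shows where the numbers come from.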
- Trim low-quality bases and remove adapters using tools like:
- Trimmomatic or Cutadapt.
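The idea behind trimming can be sketched in a few lines. This is a simplified illustration, not the actual Trimmomatic or Cutadapt algorithm: it removes an exact adapter match and then cuts the read at the first window whose mean quality falls below a threshold. The function names, the window size, and the threshold are assumptions chosen for the example.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter sequence."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]

def sliding_window_trim(seq, qual, window=4, min_mean_q=20, offset=33):
    """Cut the read at the start of the first window whose mean Phred score is too low."""
    scores = [ord(ch) - offset for ch in qual]
    for start in range(0, len(scores) - window + 1):
        win = scores[start:start + window]
        if sum(win) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual

# Made-up read: 10 bases of insert followed by an adapter prefix.
adapter = "AGATCGGAAGAGC"
raw_seq = "ACGTACGTAC" + adapter
raw_qual = "IIIIII####" + "I" * len(adapter)

seq, qual = trim_adapter(raw_seq, raw_qual, adapter)
clean_seq, clean_qual = sliding_window_trim(seq, qual)
print(clean_seq, clean_qual)
```

Real trimmers also tolerate mismatches in the adapter and handle paired reads, which this sketch leaves out.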
- Map the cleaned reads to a reference genome (or transcriptome) using aligners like:
- STAR: Fast and widely used for RNA-seq.
- HISAT2: Efficient for spliced alignments.
- Output: Aligned reads in a BAM file.
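??? What is a transcriptome? How does it differ from a genome? The genome is the organism's complete DNA sequence; the transcriptome is the set of RNA transcripts expressed from it. A transcriptome reference therefore contains only spliced transcript sequences, while a genome reference also contains introns and intergenic regions.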
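The alignment step can be scripted by assembling the aligner command in Python. The sketch below builds a STAR command for paired-end, gzip-compressed reads; the index directory, file names, and output prefix are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex

def star_command(genome_dir, fastq_r1, fastq_r2, out_prefix, threads=8):
    """Build an argument list for a paired-end STAR alignment."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,                     # pre-built STAR index
        "--readFilesIn", fastq_r1, fastq_r2,
        "--readFilesCommand", "zcat",                  # inputs are gzip-compressed
        "--outSAMtype", "BAM", "SortedByCoordinate",   # write a sorted BAM file
        "--outFileNamePrefix", out_prefix,
    ]

# Hypothetical paths, for illustration only.
cmd = star_command(
    "star_index",
    "sampleA_R1.fastq.gz",
    "sampleA_R2.fastq.gz",
    "aligned/sampleA_",
)
print(shlex.join(cmd))
```

To run the alignment, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where STAR and the index are available.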
- Evaluate alignment results:
- Use samtools flagstat to check the percentage of mapped reads.
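- Use RSeQC to assess read distribution across genomic features (exons, introns, intergenic regions). WHY??? For poly(A)-selected libraries most reads should fall in exons; a high intronic or intergenic fraction can point to genomic DNA contamination or unprocessed pre-mRNA.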
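- Check for biases (e.g., 3' bias due to degraded RNA). WHY??? Uneven coverage along the gene body distorts gene-level counts and can confound comparisons between samples of different RNA quality.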
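The mapped-read percentage can be pulled out of `samtools flagstat` output with a small parser. This is a sketch that assumes the default text format of recent samtools versions; the numbers in the example text are made up.

```python
import re

def mapped_fraction(flagstat_text):
    """Return mapped / total from the text output of `samtools flagstat`."""
    total = re.search(r"^(\d+) \+ \d+ in total", flagstat_text, re.MULTILINE)
    mapped = re.search(r"^(\d+) \+ \d+ mapped \(", flagstat_text, re.MULTILINE)
    if total is None or mapped is None:
        raise ValueError("unrecognised flagstat output")
    n_total = int(total.group(1))
    return int(mapped.group(1)) / n_total if n_total else 0.0

# Made-up, abbreviated flagstat output.
example = """1000000 + 0 in total (QC-passed reads + QC-failed reads)
1000000 + 0 primary
0 + 0 secondary
0 + 0 supplementary
950000 + 0 mapped (95.00% : N/A)
950000 + 0 primary mapped (95.00% : N/A)
"""

frac = mapped_fraction(example)
print(f"{frac:.1%} of reads mapped")
```

Note that the "mapped" line counts secondary and supplementary alignments as well, so for some aligners the "primary mapped" line may be the more informative figure.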
- Count the number of reads mapped to each gene using tools like:
- HTSeq or featureCounts: Count reads based on gene annotation files (e.g., GTF/GFF).
- Output: A count matrix, where rows are genes and columns are samples.
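The counting step and the resulting matrix can be illustrated with toy data. The sketch below is a heavily simplified stand-in for HTSeq or featureCounts: genes are single intervals, reads are already aligned, and a read is counted only if it overlaps exactly one gene. All gene names, coordinates, and sample names are invented.

```python
from collections import Counter

# Toy annotation: gene -> (chromosome, start, end), inclusive coordinates.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 100, 300),
}

def overlapping_genes(read, genes):
    """Return the genes whose interval overlaps the aligned read."""
    chrom, start, end = read
    return [
        name for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and start <= g_end and end >= g_start
    ]

def count_reads(reads, genes):
    """Count reads per gene; track unassigned and ambiguous reads separately."""
    counts = Counter({name: 0 for name in genes})
    for read in reads:
        hits = overlapping_genes(read, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts

# Toy aligned reads per sample: (chromosome, start, end).
samples = {
    "ctrl_1": [("chr1", 120, 170), ("chr1", 460, 510), ("chr2", 150, 200)],
    "treat_1": [("chr1", 600, 650), ("chr1", 700, 750), ("chr3", 10, 60)],
}

per_sample = {name: count_reads(reads, genes) for name, reads in samples.items()}
# Count matrix: rows are genes, columns follow the order of `samples`.
matrix = {gene: [per_sample[s][gene] for s in samples] for gene in genes}
print(matrix)
```

Real counters work on exon-level annotation and account for strandedness and multi-mapping reads, which are omitted here.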
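- Normalise the count data to account for differences in sequencing depth and gene length. HOW??? In general, by dividing each count by a sample-specific scaling factor (for depth) and, where genes are compared with one another, by gene length.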
- Common normalisation methods:
- TPM (Transcripts Per Million): For comparing gene expression within a sample.
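- RPKM/FPKM: Normalised for length and depth, but largely superseded by TPM because the values do not sum to the same total in every sample.
- DESeq2 (median of ratios) or edgeR (TMM) normalisation: Estimates per-sample scaling factors from raw counts for differential expression analysis.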
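Two of these methods are simple enough to write out. The sketch below computes TPM for one sample and median-of-ratios size factors across samples (the idea behind DESeq2's normalisation, not its actual code). The counts and gene lengths are made-up toy values.

```python
import math
from statistics import median

def tpm(counts, lengths_kb):
    """Transcripts Per Million: divide by length first, then rescale to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

def size_factors(matrix):
    """Median-of-ratios size factors; matrix is a list of gene rows (counts per sample)."""
    n_samples = len(matrix[0])
    # Only genes with non-zero counts in every sample have a usable geometric mean.
    usable = [row for row in matrix if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy sample: three genes with counts proportional to their length.
tpm_values = tpm([100, 200, 300], [1.0, 2.0, 3.0])
print(tpm_values)

# Toy matrix: the second sample was sequenced twice as deeply as the first.
toy_matrix = [[10, 20], [100, 200], [50, 100]]
factors = size_factors(toy_matrix)
print(factors)
```

Dividing each column of raw counts by its size factor puts the samples on a common scale; the TPM values are equal here because each gene has the same number of reads per kilobase.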
- Identify genes with significant expression differences between experimental groups.
- Common tools:
- DESeq2 (R-based): Handles raw counts directly.
- edgeR (R-based): Suitable for small sample sizes.
- Output:
- List of differentially expressed genes (DEGs) with log2 fold changes, p-values, and p-values adjusted for multiple testing.
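The two numbers reported for each gene can be illustrated directly. This sketch computes a log2 fold change from group means and applies the Benjamini-Hochberg correction to a list of p-values. It does not reproduce the statistical models that DESeq2 or edgeR fit; the pseudocount and the example values are assumptions.

```python
import math

def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

lfc = log2_fold_change(7, 1)            # (7 + 1) / (1 + 1) = 4, so log2 = 2
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(lfc, padj)
```

Genes are then typically called differentially expressed when the adjusted p-value is below a chosen cutoff, often combined with a minimum absolute log2 fold change.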
- Interpret the biological relevance of DEGs by performing:
- Gene Ontology (GO) Enrichment Analysis: Identify enriched biological processes, molecular functions, or cellular components.
- Pathway Analysis: Map DEGs to pathways using databases such as KEGG or Reactome, or test ranked gene lists with a method such as GSEA.
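Over-representation analysis usually comes down to a hypergeometric (one-sided Fisher) test per term. The following is a minimal sketch of that test; the function name and the toy numbers are made up, and real tools add multiple-testing correction across terms.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_term, n_degs, n_overlap):
    """P(overlap >= n_overlap) when n_degs genes are drawn from n_universe,
    of which n_term are annotated with the term."""
    upper = min(n_term, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 10 genes tested, 4 annotated with the term, 3 DEGs, 2 of them annotated.
p = hypergeom_enrichment_p(n_universe=10, n_term=4, n_degs=3, n_overlap=2)
print(p)
```

The choice of universe matters: it should be the set of genes that could have been detected in the experiment, not every gene in the annotation.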
- Quality Control:
- PCA plot: Visualise sample clustering.
- Heatmaps: Show clustering of samples/genes.
- Differential Expression:
- MA plot: Log fold change vs. mean expression.
- Volcano plot: Statistical significance (-log10 p-value) vs. log2 fold change.
- Pathway Analysis:
- Enrichment bar plots or network diagrams.
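As an example of preparing data for one of these plots, the sketch below turns differential expression results into volcano plot coordinates and labels each gene as up, down, or not significant. The cutoffs and the three example genes are assumptions for illustration.

```python
import math

def volcano_points(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Convert (gene, log2FC, adjusted p) rows into (gene, x, y, label) points."""
    points = []
    for gene, lfc, padj in results:
        # Guard against p-values reported as exactly zero.
        y = -math.log10(max(padj, 1e-300))
        if padj < padj_cutoff and lfc >= lfc_cutoff:
            label = "up"
        elif padj < padj_cutoff and lfc <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"
        points.append((gene, lfc, y, label))
    return points

# Made-up results for three genes.
toy_results = [("geneA", 2.5, 0.001), ("geneB", -1.8, 0.01), ("geneC", 0.3, 0.5)]
points = volcano_points(toy_results)
for point in points:
    print(point)
```

The x and y values can then be passed to any scatter-plot function, with the label used to colour the points.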
- Validate key findings using an independent method:
- qRT-PCR: Validate differential expression for a subset of genes.
- Cross-reference with existing datasets or prior research.
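For qRT-PCR validation, relative expression is commonly computed with the 2^-ΔΔCt method and then compared with the RNA-seq log2 fold change. The sketch below assumes roughly 100% amplification efficiency and a stable reference gene; the Ct values are made up.

```python
import math

def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene (treated vs. control) by the 2^-ddCt method."""
    delta_treated = ct_target_treated - ct_ref_treated    # normalise to reference gene
    delta_control = ct_target_control - ct_ref_control
    ddct = delta_treated - delta_control
    return 2 ** (-ddct)

# Made-up Ct values: the target amplifies 3 cycles earlier in treated samples.
fold = relative_expression(
    ct_target_treated=22, ct_ref_treated=18,
    ct_target_control=25, ct_ref_control=18,
)
print(fold, math.log2(fold))
```

A log2 value that agrees in direction and rough magnitude with the RNA-seq estimate supports the finding; exact agreement is not expected between the two methods.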
- Compile findings into a report or publication:
- Document methods, quality control steps, and statistical analyses.
- Share data and scripts for reproducibility (e.g., GitHub or public repositories).