
OPM_transcriptome_assembly


Pipeline of the de novo transcriptome assembly of three larval stages of the oak processionary moth (OPM) (Thaumetopoea processionea).

Objective

Thaumetopoea processionea moths cause health problems due to their urticating hairs (Rahlenbeck 2015). The group of Prof. Dr. med. Timo Buhl wants to identify candidate allergens that start being produced from larval stage L3. Therefore, transcriptomic data are needed at larval stages before and after this transition to allergenicity. This work was published in Zicola et al., 2024.

Experimental design

Larvae of Thaumetopoea processionea were all collected from a single nest in an English oak tree (Quercus robur) in Briesener Zootzen (Germany, 52.755167N, 12.674806E), on May 14, 2022 (L2 and L4 stages) and June 15, 2022 (L5 stage). The larvae were then brought to the laboratory, snap-frozen in liquid nitrogen, and stored at -80°C. Larvae were homogenized with mortar and pestle in liquid nitrogen, and 20 mg of tissue was used for total RNA extraction with the Quick-RNA Tissue/Insect Microprep kit (Zymo, R2030). Eleven RNA-seq libraries (4 x L2 larvae, 4 x L4 larvae, 3 x L5 larvae) were prepared with the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina (NEB E7760L). Paired-end sequencing (100+100 bp) of the 11 pooled libraries was performed on a MGISEQ-2000 (BGI), yielding about 30-55 million reads per library.

Data

The data comprise 3 stages (L2, L4, L5) with at least 3 biological replicates each, 11 samples in total. Raw fastq files and the transcriptome shotgun assembly (TSA) are available on NCBI under BioProject PRJNA1072613 and accession GKRZ00000000, respectively.

Summary from fastqc and multiqc on the 22 raw fastq files:

Sample Name % Dups % GC M Seqs
OPM_2_1_L1_1 74.3% 45% 48.2
OPM_2_1_L1_2 78.4% 45% 48.2
OPM_2_2_L1_1 72.1% 45% 38.3
OPM_2_2_L1_2 75.5% 45% 38.3
OPM_2_3_L1_1 70.0% 45% 42.0
OPM_2_3_L1_2 74.6% 45% 42.0
OPM_2_4_L1_1 72.1% 45% 46.6
OPM_2_4_L1_2 75.8% 45% 46.6
OPM_4_1_L1_1 80.2% 46% 54.8
OPM_4_1_L1_2 84.4% 46% 54.8
OPM_4_2_L1_1 87.0% 51% 30.8
OPM_4_2_L1_2 89.2% 51% 30.8
OPM_4_3_L1_1 83.2% 49% 41.6
OPM_4_3_L1_2 86.0% 49% 41.6
OPM_4_4_L1_1 84.4% 49% 44.0
OPM_4_4_L1_2 87.1% 49% 44.0
OPM_5_1_L1_1 84.9% 49% 44.0
OPM_5_1_L1_2 87.5% 49% 44.0
OPM_5_2_L1_1 84.4% 49% 48.5
OPM_5_2_L1_2 86.9% 49% 48.5
OPM_5_4_L1_1 83.0% 49% 47.3
OPM_5_4_L1_2 86.2% 49% 47.3
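
The summary table above comes from FastQC reports aggregated with MultiQC; a minimal sketch of the commands (the output directory names are arbitrary choices, not from the original pipeline):

# Run FastQC on all raw fastq files, then aggregate the reports
mkdir -p fastqc_raw
fastqc -t 12 -o fastqc_raw *.fq.gz
multiqc -o multiqc_raw fastqc_raw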

Read processing

  1. Examine quality metrics (fastqc)
  2. Remove erroneous k-mers from the reads with Rcorrector
  3. Discard read pairs for which one of the reads is deemed unfixable, using a python parser script
  4. Trim adapters and low-quality bases from the fastq files

More general good practices are described in https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8#Sec3.

The scripts below are given for one sample (OPM_2_1_L1) to illustrate the name changes taking place at each step. OPM_2_1_L1_1.fq.gz contains read 1 and OPM_2_1_L1_2.fq.gz contains read 2.

Remove erroneous k-mers with Rcorrector


git clone https://github.com/mourisl/rcorrector.git
cd rcorrector
make

perl run_rcorrector.pl -t 12 -1 OPM_2_1_L1_1.fq.gz -2 OPM_2_1_L1_2.fq.gz

Output files will be OPM_2_1_L1_1.cor.fq.gz and OPM_2_1_L1_2.cor.fq.gz.
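
The parser in the next step expects uncompressed fastq (note the .cor.fq file names below), so the Rcorrector output is decompressed first:

gunzip OPM_2_1_L1_1.cor.fq.gz OPM_2_1_L1_2.cor.fq.gz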

Discard unfixable read pairs

Discard read pairs for which one of the reads is deemed unfixable, using a python script from https://github.com/harvardinformatics/TranscriptomeAssemblyTools#filteruncorrectablepefastqpy


git clone https://github.com/harvardinformatics/TranscriptomeAssemblyTools.git

python2 TranscriptomeAssemblyTools/FilterUncorrectabledPEfastq.py \
  -1 OPM_2_1_L1_1.cor.fq \
  -2 OPM_2_1_L1_2.cor.fq \
  -s OPM_2_1_L1

Output files will be unfixrm_OPM_2_1_L1_1.cor.fq and unfixrm_OPM_2_1_L1_2.cor.fq.

Trimming

Remove 3' adapters from read 1 and read 2, allowing no Ns in the retained reads.

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --quality-base 33 --max-n 0  -o unfixrm_OPM_2_1_L1_1.trimmed.fastq.gz \
    -p unfixrm_OPM_2_1_L1_2.trimmed.fastq.gz unfixrm_OPM_2_1_L1_1.cor.fq \
    unfixrm_OPM_2_1_L1_2.cor.fq

Output files will be unfixrm_OPM_2_1_L1_1.trimmed.fastq.gz and unfixrm_OPM_2_1_L1_2.trimmed.fastq.gz.

Remove rRNA

Download the rRNA sequences (SSU and LSU, as fasta files without gaps) for the taxonomy Lepidoptera from SILVA (https://www.arb-silva.de); this yields 1068 hits for SSU and 40 hits for LSU.


cat arb-silva.de_2023-03-22_id1242715_tax_silva_SSU.fasta \
  arb-silva.de_2023-03-22_id1242716_tax_silva_LSU.fasta > LSU_SSU.fa

bowtie2-build -f LSU_SSU.fa LSU_SSU

bowtie2 --nofw --quiet --very-sensitive-local --phred33  \
    -x LSU_SSU -1 unfixrm_OPM_2_1_L1_1.trimmed.fastq.gz \
    -2 unfixrm_OPM_2_1_L1_2.trimmed.fastq.gz \
    --un-conc-gz unfixrm_OPM_2_1_L1.rRNA_removed.fq.gz > /dev/null

Output files will be unfixrm_OPM_2_1_L1.rRNA_removed.fq.1.gz and unfixrm_OPM_2_1_L1.rRNA_removed.fq.2.gz.

Remove remaining over-represented sequences

Remove reads matching the over-represented sequences flagged by fastqc, providing the fastqc tables for read 1 and read 2 via -fql and -fqr:

python2 TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py \
      -1 unfixrm_OPM_2_1_L1.rRNA_removed.fq.1.gz \
      -2 unfixrm_OPM_2_1_L1.rRNA_removed.fq.2.gz \
      -fql rRNA_removed.1.fastqc \
      -fqr rRNA_removed.2.fastqc

Replace empty reads

Cutadapt removed the sequence and quality strings of some reads but kept their headers to preserve complete pairing between the two fastq files. However, these empty records cause an exit failure in the seqtk step of Trinity (conversion of fastq to fasta format). The simplest fix, found in lh3/seqtk#3, is to replace the empty lines with a single N.


perl -i -p -e 's/^$/N/g;' rmoverrep_unfixrm_OPM_2_1_L1_1.rRNA_removed.1.fq
perl -i -p -e 's/^$/N/g;' rmoverrep_unfixrm_OPM_2_1_L1_2.rRNA_removed.2.fq

De novo transcriptome assembly

Pool all data

cat *.1.fq > reads.ALL.1.fq
cat *.2.fq > reads.ALL.2.fq

Run Trinity

Download the latest version of Trinity and install it.

tar -xzvf trinityrnaseq-v2.15.1.FULL.tar.gz

cd trinityrnaseq-v2.15.1

module load cmake

make

The dependencies of Trinity were pre-installed on our HPC and loaded using Modules:


module load samtools python/3.9.0 openjdk/17.0.0_35 \
  salmon/0.10.2 bowtie/2.3.4.1 jellyfish/2.2.10 samtools/1.12

Trinity \
      --seqType fq \
      --max_memory 100G \
      --left reads.ALL.1.fq \
      --right reads.ALL.2.fq \
      --SS_lib_type RF \
      --output trinity_out_dir \
      --CPU 8

Time needed on our HPC with 8 cores and 100 GB of RAM: 2045 min (about 34 hours). The output is Trinity.fasta.

Summary of the Trinity run:


$TRINITY_HOME/util/TrinityStats.pl  Trinity.fasta

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  160417
Total trinity transcripts:      216476
Percent GC: 41.56

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 5438
        Contig N20: 3611
        Contig N30: 2644
        Contig N40: 1997
        Contig N50: 1522

        Median contig length: 413
        Average contig: 836.07
        Total assembled bases: 180989306

NCBI submission

Submit the assembly fasta file to NCBI as a TSA (transcriptome shotgun assembly) project. NCBI assesses potential contaminant transcripts (from plants, fungi, bacteria, ...) and indicates which transcripts to exclude.

In total, out of the 216,476 transcripts assembled by Trinity, NCBI identified 71,225 transcripts to exclude. The suspect transcripts are listed in the NCBI RemainingContamination.txt file:

  • 17 transcripts shorter than the 200-nucleotide minimum length required for TSA
  • 16 transcripts that match Illumina adapter sequences
  • 458 transcripts that are reverse complements of another transcript (duplicates)
  • 71,128 transcripts that belong to another kingdom (fung: 1,119; plnt: 38,429; prok: 31,368; prst: 2; virs: 210)

A final set of 145,251 transcripts was accepted by NCBI and used for ORF prediction and Trinotate annotation.

# Exclude all transcripts found in id_contaminants.txt
# (transcript IDs provided by NCBI RemainingContamination.txt output files)
seqkit grep -v -f id_contaminants.txt Trinity.fasta > cleaned_up/Trinity.fasta
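
As a sanity check, count the transcripts remaining after contaminant removal (expected: 145,251):

grep -c '^>' cleaned_up/Trinity.fasta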

After filtering, resubmit the fasta file to NCBI.

Final statistics of the cleaned assembly


$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta > Trinity_OPM.fsa.stats

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  99868
Total trinity transcripts:      145251
Percent GC: 40.22

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 6139
        Contig N20: 4232
        Contig N30: 3153
        Contig N40: 2414
        Contig N50: 1837

        Median contig length: 454
        Average contig: 952.17
        Total assembled bases: 138302981


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 4678
        Contig N20: 3148
        Contig N30: 2311
        Contig N40: 1719
        Contig N50: 1239

        Median contig length: 373
        Average contig: 727.88
        Total assembled bases: 72691842

ORF prediction

Open reading frames and peptide sequences are predicted using TransDecoder. The minimum peptide length is set by default to 100 amino acids.


wget https://github.com/TransDecoder/TransDecoder/archive/refs/tags/TransDecoder-v5.7.0.tar.gz

tar -zxvf  TransDecoder-v5.7.0.tar.gz

Step 1: extract the long open reading frames

~/bin/TransDecoder-TransDecoder-v5.7.0/TransDecoder.LongOrfs -t Trinity.fasta \
      --output_dir $output_folder

Step 2: predict the likely coding regions

~/bin/TransDecoder-TransDecoder-v5.7.0/TransDecoder.Predict -t Trinity.fasta \
    --output_dir $output_folder

TransDecoder provides four output files (Trinity.fasta.transdecoder.bed, Trinity.fasta.transdecoder.cds, Trinity.fasta.transdecoder.gff3, and Trinity.fasta.transdecoder.pep) that contain the predicted coding regions as transcript and peptide sequences along with their annotations.
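
The peptide file can be summarized, for example with seqkit stats (a sketch, assuming seqkit is installed), which produces a table like the one below:

seqkit stats Trinity.fasta.transdecoder.pep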

file                              format  type     num_seqs     sum_len  min_len  avg_len  max_len
Trinity.fasta.transdecoder.pep  FASTA   Protein    45,364  18,267,489       85    402.7   20,578

Trinity.fasta.transdecoder.pep contains 45,364 peptides between 85 and 20,578 amino acids in length.

Trinotate annotation

See Trinotate documentation and tutorial.

Install Trinotate and dependencies

# Download the latest Trinotate version (4.0.1) from https://github.com/Trinotate/Trinotate/releases
cd  ~
tar -xvf Trinotate-Trinotate-v4.0.1.tar.gz

Infernal


# Download infernal (cmpress utility)  http://eddylab.org/infernal/ 
# (see https://www.biostars.org/p/9569120/#9569121)
tar -xvf infernal-1.1.4.tar.gz
cd infernal-1.1.4
./configure --prefix=/usr/users/zicola/bin/infernal-1.1.4
make
make check
make install

# Add to .bashrc
export PATH="${PATH}:/usr/users/zicola/bin/infernal-1.1.4/src"

Diamond

Diamond is a faster alternative to BLAST.

# Download diamond (https://github.com/bbuchfink/diamond/wiki)
cd ~/bin
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

Sqlite3

# sqlite is already installed
sqlite3 --help

HMMER

Already preinstalled and available via Modules:

# Load dependency (hmmpress)
module load hmmer/3.3

SignalP6

Download SignalP 6 (an email address is required) from https://services.healthtech.dtu.dk/services/SignalP-6.0/

Follow the instructions at https://github.com/fteufel/signalp-6.0/blob/main/installation_instructions.md

tar xvf signalp-6.0h.fast.tar.gz

cd signalp6_fast

pip install signalp-6-package/

SIGNALP_DIR=$(python3 -c "import signalp; import os; print(os.path.dirname(signalp.__file__))" )
cp -r signalp-6-package/models/* $SIGNALP_DIR/model_weights/

TMHMM

Download TMHMM 2.0 for Linux (https://services.healthtech.dtu.dk)


cd /usr/users/zicola/bin/tmhmm-2.0c/bin

# Change the shebang of the tmhmm and tmhmmformat.pl perl scripts
# to use /usr/bin/perl

# Then edit line 33 of tmhmm, removing the comment '#' and setting
# $opt_basedir to the directory where the software is installed:
#   before: #$opt_basedir = "/usr/cbs/packages/tmhmm/2.0c/tmhmm-2.0c/";
#   after:   $opt_basedir = "/usr/users/zicola/bin/tmhmm-2.0c/";

# Add this to .bashrc
export PATH="${PATH}:/usr/users/zicola/bin/tmhmm-2.0c/bin"

Run Trinotate


# Modules to load
module load gcc
module load hmmer/3.3


# Test that all programs needed are installed
test_program(){
  command -v "$1" >/dev/null || \
    { echo >&2 "$1 required but not installed. Aborting"; exit 1; }
}

test_program "diamond"
test_program "cmpress"
test_program "hmmpress"
test_program "tmhmm"
test_program "signalp6"


# Variables
TRINOTATE_HOME="/path/to/Trinotate-Trinotate-v4.0.1"

sqlite_db="/path/to/Trinotate-Trinotate-v4.0.1/Trinotate_OPM.sqlite"

# Output from Trinity
gene_trans_map="/path/to/Trinity.fasta.gene_trans_map"

transcript_fasta="/path/to/Trinity.fasta"

# Output from Transdecoder
transdecoder_pep_file="/path/to/Trinity.fasta.transdecoder.pep"


echo "############################"
echo "Trinotate create"
date
echo "############################"

$TRINOTATE_HOME/Trinotate --create \
    --db $sqlite_db \
    --trinotate_data_dir $TRINOTATE_HOME/TRINOTATE_DATA_DIR \
    --use_diamond

echo "###############################"
echo "Trinotate Init"
date
echo "###############################"


$TRINOTATE_HOME/Trinotate --db $sqlite_db --init \
           --gene_trans_map $gene_trans_map \
           --transcript_fasta $transcript_fasta \
           --transdecoder_pep $transdecoder_pep_file

echo "##############################"
echo  "Trinotate Run"
date
echo "##############################"
 
$TRINOTATE_HOME/Trinotate --db ${sqlite_db} --run ALL \
    --trinotate_data_dir $TRINOTATE_HOME/TRINOTATE_DATA_DIR \
    --transcript_fasta $transcript_fasta \
    --transdecoder_pep $transdecoder_pep_file \
    --use_diamond

echo "###########################"
echo "Generating report table"
date
echo "###########################"

$TRINOTATE_HOME/Trinotate --db ${sqlite_db} \
  --report --incl_pep --incl_trans > Trinotate_report.tsv

##################
## Misc value adds
##################

# Extract GO terms
${TRINOTATE_HOME}/util/extract_GO_assignments_from_Trinotate_xls.pl \
  --Trinotate_xls Trinotate_report.tsv -G -I > Trinotate_report.tsv.gene_ontology

# Generate trinotate report summary statistics
${TRINOTATE_HOME}/util/report_summary/trinotate_report_summary.pl \
  Trinotate_report.tsv Trinotate_report_stats

echo "##########################"
echo "done.  See annotation summary file:  Trinotate_report.tsv"
date
echo "##########################"

Job took 2311 minutes (38.5 hours).

RNA-seq analysis

Samples file

Create a samples file for Salmon processing with four columns: stage, replicate, left-read fastq file name, and right-read fastq file name. Copy the columns without headers into the file samples_file.txt. This file is provided to the --samples_file flag of the align_and_estimate_abundance.pl command, as sketched below.
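
A minimal sketch of the expected tab-separated layout (no header; the fastq file names below are hypothetical placeholders):

L2	OPM_2_rep1	OPM_2_rep1_1.fq	OPM_2_rep1_2.fq
L2	OPM_2_rep2	OPM_2_rep2_1.fq	OPM_2_rep2_2.fq
L4	OPM_4_rep1	OPM_4_rep1_1.fq	OPM_4_rep1_2.fq
L5	OPM_5_rep1	OPM_5_rep1_1.fq	OPM_5_rep1_2.fq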

Estimating Transcript Abundance

See Trinity tutorial for more information.

Salmon index

Build a salmon index (see https://combine-lab.github.io/salmon/getting_started/)

module load salmon/0.10.2

salmon index -t Trinity.fasta -i Trinity.fasta.salmon.idx

Align and estimate abundance


module load samtools python/3.9.0 openjdk/17.0.0_35 salmon/0.10.2 \
  bowtie/2.3.4.1 jellyfish/2.2.10 samtools/1.12

perl $TRINITY_HOME/util/align_and_estimate_abundance.pl \
          --transcripts Trinity.fasta \
          --seqType fq \
          --samples_file samples_file.txt \
          --gene_trans_map Trinity.fasta.gene_trans_map \
          --est_method salmon \
          --SS_lib_type RF \
          --thread_count 12

For each sample, I get two files:

quant.sf : transcript abundance estimates (generated by salmon)
quant.sf.genes : gene-level abundance estimates (generated here by summing transcript values)

These files can be imported into DESeq2 for expression analysis.
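
quant.sf is a plain tab-separated table with the standard salmon columns (Name, Length, EffectiveLength, TPM, NumReads); a quick look (the sample directory name below is a placeholder):

head -n 3 OPM_2_rep1/quant.sf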

DESeq2 analysis

An example tutorial is available at http://sthda.com/english/wiki/rna-seq-differential-expression-work-flow-using-deseq2

R libraries

library("tidyverse")
library("ggplot2")
library("DESeq2")
library("tximport")

Coldata

Create a coldata text file. The first column should contain the sample names as in the count matrix, the second column the larval stage, and the third the allergenicity status.

sample stage allergenicity
OPM_2_rep1 L2 non_allergenic
OPM_2_rep2 L2 non_allergenic
OPM_2_rep3 L2 non_allergenic
OPM_2_rep4 L2 non_allergenic
OPM_4_rep1 L4 allergenic
OPM_4_rep2 L4 allergenic
OPM_4_rep3 L4 allergenic
OPM_4_rep4 L4 allergenic
OPM_5_rep1 L5 allergenic
OPM_5_rep2 L5 allergenic
OPM_5_rep4 L5 allergenic

Read count matrix

Import the salmon quant.sf files with the tximport package, as recommended in the tutorial linked above.

Create a tx2gene file (transcript ID in the first column and gene ID in the second; header names are not important). It is basically the Trinity.fasta.gene_trans_map file with the column order reversed.

# Make tx2gene.txt file
awk '{print $2,$1}' Trinity.fasta.gene_trans_map > tx2gene.txt

In R:


# Load the coldata file
coldata <- read.table("coldata", header=TRUE)
col_names <- names(coldata)
coldata[,col_names] <- lapply(coldata[,col_names], factor)
rm(col_names)

# Path to quant.sf files
files <- file.path("expression_analysis", coldata$sample, "quant.sf")
names(files) <- coldata$sample

# Import tx2gene file
tx2gene <- read.table("tx2gene.txt", quote="\"", comment.char="")

# Import salmon data
txi <- tximport(files, type="salmon", tx2gene=tx2gene)

# Create a DESeq2 matrix
dds <- DESeqDataSetFromTximport(txi,
                  colData = coldata,
                  design = ~ allergenicity)

PCA (vst) of all samples


# Variance stabilizing transformation
vsd <- vst(dds, blind=TRUE)

# Create PCA
pcaData <- plotPCA(vsd, intgroup=c("sample", "stage"), returnData=TRUE)

# Get percentage variation
percentVar <- round(100 * attr(pcaData, "percentVar"))

p <- ggplot(pcaData, aes(PC1, PC2, color=stage, label = sample)) +
    xlab(paste0("PC1: ",percentVar[1],"% variance")) +
    ylab(paste0("PC2: ",percentVar[2],"% variance")) +  
    theme_bw()

p + geom_text(size = 4)

The samples OPM_5_rep1 and OPM_4_rep2 cluster far apart from all other samples and are removed from the rest of the analysis.

dds <- dds[,!colnames(dds) %in% c("OPM_4_rep2","OPM_5_rep1")]

PCA without outliers

We now have 9 samples left.

# Variance stabilizing transformation
vsd <- vst(dds, blind=TRUE)

# Create PCA
pcaData <- plotPCA(vsd, intgroup=c("sample", "stage"), returnData=TRUE)

# Get percentage variation
percentVar <- round(100 * attr(pcaData, "percentVar"))

p <- ggplot(pcaData, aes(PC1, PC2, color=stage, label = sample)) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) + 
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +  theme_bw()


p + geom_text(size = 4)

The samples cluster separately by stage, with the L2 stage (red) samples clustering closest together.

# Create PCA
pcaData <- plotPCA(vsd, intgroup=c("sample", "allergenicity"), returnData=TRUE)

# Get percentage variation
percentVar <- round(100 * attr(pcaData, "percentVar"))

p <- ggplot(pcaData, aes(PC1, PC2, color=allergenicity, label = sample)) + 
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +  theme_bw()

p + geom_text(size = 4)

If we color-code the samples by allergenicity, we can clearly see that the non-allergenic stage (L2) and the two allergenic stages (L4 and L5) cluster separately on the first principal component.

Read count filtering

Remove all genes with fewer than 10 reads summed across all 9 samples.


keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Filtering out lowly expressed genes kept 73,576/160,417 genes (46%).
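
These counts can be checked directly from the logical vector (a quick sketch):

# Number of genes passing (TRUE) and failing (FALSE) the filter
table(keep)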

Comparison Allergenic vs non-allergenic

Remove all genes with fewer than 10 reads summed across all 9 samples.


# Create a DESeq2 matrix
dds <- DESeqDataSetFromTximport(txi,
                  colData = coldata,
                  design = ~ allergenicity)

dds <- dds[,!colnames(dds) %in% c("OPM_4_rep2","OPM_5_rep1")]

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Filtering out lowly expressed genes kept 55,509/99,868 genes (56%).

Use shrunken log2 fold changes (LFC); see https://hbctraining.github.io/DGE_workshop/lessons/05_DGE_DESeq2_analysis2.html

Check the trick for lfcShrink with multiple levels (https://support.bioconductor.org/p/p132853/)


# Remake a deseq object to integrate the change
dds <- DESeq(dds)

# Compare allergenic vs non-allergenic (non-allergenic as reference)
res_allergenic_vs_non_allergenic <- results(dds, 
  contrast=c("allergenicity","allergenic","non_allergenic"))

# LFC shrinkage
res_allergenic_vs_non_allergenic_lfc <- lfcShrink(dds=dds, 
        coef="allergenicity_allergenic_vs_non_allergenic",
        res=res_allergenic_vs_non_allergenic, type="apeglm")

# Retrieve only significant DEGs
df_res_allergenic_vs_non_allergenic <- as.data.frame(res_allergenic_vs_non_allergenic_lfc) %>% 
  rownames_to_column("geneID") %>% filter(padj < 0.05)

# Export as tabulated text files
write.table(df_res_allergenic_vs_non_allergenic, 
            "data/expression_analysis/df_res_allergenic_vs_non_allergenic.txt",
            quote=FALSE, row.names=FALSE, sep="\t")

Pair-wise comparison of larval stages

Compare each pair of stages, using the younger stage as the reference.

Remove all genes with fewer than 10 reads summed across all 9 samples.


# Create a DESeq2 matrix
dds <- DESeqDataSetFromTximport(txi,
                  colData = coldata,
                  design = ~ stage)

dds <- dds[,!colnames(dds) %in% c("OPM_4_rep2","OPM_5_rep1")]

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Filtering out lowly expressed genes kept 55,509/99,868 genes (56%).

Now compare each pair of stages, using the younger stage as the reference.


# Remake a deseq object to integrate the change
dds <- DESeq(dds)

# L5 vs L4
dds$stage <- relevel(dds$stage, ref = "L4")  
dds <- nbinomWaldTest(dds)
resultsNames(dds)

# Do pair-wise comparison of the 3 stages. Fix earlier stage as reference
res_L5_vs_L4 <- results(dds, contrast=c("stage","L5","L4"))
res_L5_vs_L4_lfc <- lfcShrink(dds=dds, coef="stage_L5_vs_L4", type="apeglm")

df_res_L5_vs_L4 <- as.data.frame(res_L5_vs_L4_lfc) %>% 
  rownames_to_column("geneID") %>% filter(padj < 0.05)

write.table(df_res_L5_vs_L4, "data/expression_analysis/df_res_L5_vs_L4.txt", 
            quote=FALSE, row.names=FALSE, sep="\t")


# L5 vs L2
dds$stage <- relevel(dds$stage, ref = "L2")  
dds <- nbinomWaldTest(dds)
resultsNames(dds)

res_L5_vs_L2 <- results(dds, contrast=c("stage","L5","L2"))
res_L5_vs_L2_lfc <- lfcShrink(dds=dds, coef="stage_L5_vs_L2",type="apeglm")

df_res_L5_vs_L2 <- as.data.frame(res_L5_vs_L2_lfc) %>% 
  rownames_to_column("geneID") %>% filter(padj < 0.05)

write.table(df_res_L5_vs_L2, "data/expression_analysis/df_res_L5_vs_L2.txt", 
            quote=FALSE, row.names=FALSE, sep="\t")

# L4 vs L2
dds$stage <- relevel(dds$stage, ref = "L2")  
dds <- nbinomWaldTest(dds)
resultsNames(dds)

res_L4_vs_L2 <- results(dds, contrast=c("stage","L4","L2"))
res_L4_vs_L2_lfc <- lfcShrink(dds=dds, coef="stage_L4_vs_L2", 
                              res=res_L4_vs_L2, type="apeglm")

df_res_L4_vs_L2 <- as.data.frame(res_L4_vs_L2_lfc) %>%
  rownames_to_column("geneID") %>% filter(padj < 0.05)

write.table(df_res_L4_vs_L2, "data/expression_analysis/df_res_L4_vs_L2.txt", 
            quote=FALSE, row.names=FALSE, sep="\t")

Gather all the output files into one Excel file (Data File 5).
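
One way to gather them is sketched below, using the openxlsx R package (the package and the output file name are assumptions, not part of the original pipeline):

# Combine the three pairwise DEG tables into one Excel workbook,
# one sheet per comparison
library(openxlsx)
write.xlsx(list(L5_vs_L4 = df_res_L5_vs_L4,
                L5_vs_L2 = df_res_L5_vs_L2,
                L4_vs_L2 = df_res_L4_vs_L2),
           file = "data/expression_analysis/pairwise_DEGs.xlsx")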

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details
