Skip to content

Latest commit

 

History

History
156 lines (116 loc) · 5.14 KB

quantification.md

File metadata and controls

156 lines (116 loc) · 5.14 KB

Analyzing RNAseq expression with Salmon

Based on the Eel-Pond RNAseq workflow protocol here.

Getting Started

If needed, start up a fresh Ubuntu 16.04 instance on Jetstream using the instructions here.

On a Jetstream instance, run the following commands to update the base software:

sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
        default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \
        python-matplotlib python-pip python-virtualenv sysstat fastqc \
        trimmomatic bowtie samtools blast2 wget bowtie2 openjdk-8-jre \
        hmmer ruby

Installation

We will use Salmon to quantify expression. Salmon is a new breed of software for quantifying RNAseq reads that is both really fast and takes transcript length into consideration (Patro et al. 2015).

For further reading, see

Download and make Salmon available in the path

cd
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.8.2/Salmon-0.8.2_linux_x86_64.tar.gz
tar xvfz Salmon-0.8.2_linux_x86_64.tar.gz
export PATH=$PATH:$HOME/Salmon-0.8.2_linux_x86_64/bin 

Then, let's check we still have our reads from yesterday's QC work

set -u
printf "\nMy trimmed data is in $PROJECT/quality/, and consists of $(ls -1 ${PROJECT}/quality/*.qc.fq.gz | wc -l) files\n\n"
set +u

where set -u should let you know if you have any unset variables, i.e. if the $PROJECT variable is not defined.

If you see -bash: PROJECT: unbound variable, then you need to set the $PROJECT variable.

export PROJECT=/mnt/work

and then re-run the printf code block.

NOTE: if you do not have files, please rerun quality trimming steps here

Download or link an assembly

We can download a full assembly to use for mapping. This assembly was made with all Nematostella vectensis reads, rather than the subset we used in the assembly tutorial.

   cd ${PROJECT}
   mkdir -p quant
   cd quant
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz
   gunzip trinity-nematostella-raw.fa.gz
   ln -s trinity-nematostella-raw.fa trinity.nema.full.fasta

Note: if you prefer, you can use the annotated assembly we generated with the read subsets instead

   #ln -s ${PROJECT}/annotation/trinity.nema.fasta.dammit/trinity.nema.fasta.dammit.fasta ./trinity.nema.annot.fasta 

Run Salmon

Build an index for your new transcriptome:

   salmon index --index nema --transcripts trinity.nema.annot.fasta --type quasi

Next, link in the QC reads (produced in quality:

   ln -s ../quality/*R1*.qc.fq.gz .
   ln -s ../quality/*R2*.qc.fq.gz .

Then, run the salmon command:

  for R1 in *R1*.qc.fq.gz
  do
    sample=$(basename $R1 extract.qc.fq.gz)
    echo sample is $sample, R1 is $R1
    R2=${R1/R1/R2}
    echo R2 is $R2
    salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant
  done

This will create a bunch of directories named something like 0Hour_ATCACG_L002001.quant, containing a bunch of files. Take a look at what files there are:

    find 0Hour_ATCACG_L002_R1_001* -type f

You should see::

    0Hour_ATCACG_L002_R1_001.extract.quant/quant.sf
    0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias.gz
    0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias_3p.gz
    0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/fld.gz
    0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/expected_bias.gz
    0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/meta_info.json
    0Hour_ATCACG_L002_R1_001.extract.quant/cmd_info.json
    0Hour_ATCACG_L002_R1_001.extract.quant/logs/salmon_quant.log
    0Hour_ATCACG_L002_R1_001.extract.quant/libParams/flenDist.txt
    0Hour_ATCACG_L002_R1_001.extract.quant/lib_format_counts.json
    0Hour_ATCACG_L002_R1_001.extract.quant.counts

The two most interesting files are salmon_quant.log and quant.sf. The latter contains the counts; the former contains the log information from running things.

Working with the counts

The quant.sf files actually contain the relevant information about expression -- take a look

    head -20 0Hour_ATCACG_L002_R1_001.quant/quant.sf

The first column contains the transcript names, and the fifth column is what edgeR etc will want - the "raw counts". However, they're not in a convenient location / format for edgeR to use; let's fix that.

Other useful tutorials and references

https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst http://angus.readthedocs.io/en/2016/rob_quant/tut.html https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html