Based on the Eel-Pond RNAseq workflow protocol here.
If needed, start up a fresh Ubuntu 16.04 instance on Jetstream using the instructions here.
On a Jetstream instance, run the following commands to update the base software:
sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \
python-matplotlib python-pip python-virtualenv sysstat fastqc \
trimmomatic bowtie samtools blast2 wget bowtie2 openjdk-8-jre \
hmmer ruby
We will use Salmon to quantify expression. Salmon is a new breed of software for quantifying RNAseq reads that is both really fast and takes transcript length into consideration (Patro et al. 2015).
For further reading, see
- Intro blog post: http://robpatro.com/blog/?p=248
- A 2016 blog post evaluating and comparing methods here
- Salmon github repo here
cd
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.8.2/Salmon-0.8.2_linux_x86_64.tar.gz
tar xvfz Salmon-0.8.2_linux_x86_64.tar.gz
export PATH=$PATH:$HOME/Salmon-0.8.2_linux_x86_64/bin
set -u
printf "\nMy trimmed data is in $PROJECT/quality/, and consists of $(ls -1 ${PROJECT}/quality/*.qc.fq.gz | wc -l) files\n\n"
set +u
where set -u should let you know if you have any unset variables, i.e. if the $PROJECT
variable is not defined.
If you see -bash: PROJECT: unbound variable
, then you need to set the $PROJECT variable.
export PROJECT=/mnt/work
and then re-run the printf
code block.
NOTE: if you do not have files, please rerun quality trimming steps here
We can download a full assembly to use for mapping. This assembly was made with all Nematostella vectensis reads, rather than the subset we used in the assembly tutorial.
cd ${PROJECT}
mkdir -p quant
cd quant
curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz
gunzip trinity-nematostella-raw.fa.gz
ln -s trinity-nematostella-raw.fa trinity.nema.full.fasta
Note: if you prefer, you can use the annotated assembly we generated with the read subsets instead
#ln -s ${PROJECT}/annotation/trinity.nema.fasta.dammit/trinity.nema.fasta.dammit.fasta ./trinity.nema.annot.fasta
Build an index for your new transcriptome:
salmon index --index nema --transcripts trinity.nema.annot.fasta --type quasi
Next, link in the QC reads (produced in quality:
ln -s ../quality/*R1*.qc.fq.gz .
ln -s ../quality/*R2*.qc.fq.gz .
Then, run the salmon command:
for R1 in *R1*.qc.fq.gz
do
sample=$(basename $R1 extract.qc.fq.gz)
echo sample is $sample, R1 is $R1
R2=${R1/R1/R2}
echo R2 is $R2
salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant
done
This will create a bunch of directories named something like
0Hour_ATCACG_L002001.quant
, containing a bunch of files. Take a
look at what files there are:
find 0Hour_ATCACG_L002_R1_001* -type f
You should see::
0Hour_ATCACG_L002_R1_001.extract.quant/quant.sf
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias_3p.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/fld.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/expected_bias.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/meta_info.json
0Hour_ATCACG_L002_R1_001.extract.quant/cmd_info.json
0Hour_ATCACG_L002_R1_001.extract.quant/logs/salmon_quant.log
0Hour_ATCACG_L002_R1_001.extract.quant/libParams/flenDist.txt
0Hour_ATCACG_L002_R1_001.extract.quant/lib_format_counts.json
0Hour_ATCACG_L002_R1_001.extract.quant.counts
The two most interesting files are salmon_quant.log
and
quant.sf
. The latter contains the counts; the former contains the
log information from running things.
The quant.sf
files actually contain the relevant information about
expression -- take a look
head -20 0Hour_ATCACG_L002_R1_001.quant/quant.sf
The first column contains the transcript names, and the fifth column is what edgeR etc will want - the "raw counts". However, they're not in a convenient location / format for edgeR to use; let's fix that.
https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst http://angus.readthedocs.io/en/2016/rob_quant/tut.html https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html