ERVmap is one part curated database of human proviral ERV loci and one part stringent algorithm for determining which ERVs are transcribed in RNA-seq data.
Tokuyama M. et al., ERVmap analysis reveals genome-wide transcription of human endogenous retroviruses. Proc Natl Acad Sci USA 2018 Dec 11;115(50):12565-12572. doi: 10.1073/pnas.1814589115.
This version of the tool consists of 2 steps: 1. alignment to the human genome (GRCh38) and 2. quantification of the ERV regions. To download and install the latest version of ERVmap, provided as a Docker image, simply type:
docker pull eipm/ervmap:latest
NOTE: for a specific version, replace latest with the release version.
To run ERVmap, you need: 1. an indexed genome reference for STAR; 2. a BED file with the curated ERV regions on the human genome (see ERVmap.bed); 3. the input FASTQ data (gzipped). Assuming that your sample is called SAMPLE and has 2 FASTQ files (one per read) in the folder /path/to/input/data, that the reference genome is in /path/to/genome, and that the ERV BED file is in /path/to/erv/file, here is the command:
docker run --rm \
-u $(id -u):$(id -g) \
-v /path/to/input/data:/data:ro \
-v /path/to/genome:/genome:ro \
-v /path/to/erv/file:/resources:ro \
-v /path/to/output:/results \
ervmap \
--read1 /data/SAMPLE_1.fastq.gz \
--read2 /data/SAMPLE_2.fastq.gz \
--output SAMPLE/SAMPLE. \
--mode ALL
This command will generate the alignment files (BAMs) in the /path/to/output/SAMPLE/ folder, and all files will have the prefix "SAMPLE." (including the trailing dot). The generated files will be:
SAMPLE.Aligned.sortedByCoord.out.bam
SAMPLE.Aligned.sortedByCoord.out.bam.bai
SAMPLE.ERVresults.txt
SAMPLE.Log.final.out
SAMPLE.Log.out
SAMPLE.Log.progress.out
SAMPLE.SJ.out.tab
(See the STAR documentation for a description of the output files of the STAR aligner.)
The results of ERV quantification will be in the SAMPLE.ERVresults.txt file. This is a tab-delimited file with 7 columns from bedtools. For example:
1 896176 898458 5803 500 + 70
1 1412251 1418852 5804 500 + 36
1 3801730 3806808 5807 500 + 6
1 4178468 4187573 5808 500 + 1
The --mode option can only have 3 values: { ALL, STAR, BED }:
- ALL to run both the STAR aligner and the ERV quantification from start to finish;
- STAR to only perform the alignment;
- BED to only run the ERV quantification.
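Once a run in ALL or BED mode has produced SAMPLE.ERVresults.txt, the counts can be filtered with standard command-line tools. The sketch below is only an illustration: it assumes that the fourth column is the ERV locus identifier and the seventh column is the read count, which is what the example rows above suggest but is not stated explicitly. The helper name top_ervs and the threshold of 10 reads are hypothetical choices, not part of ERVmap.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of ERVmap): print "locus<TAB>count" for every
# row whose 7th column (assumed to be the read count) is at least the given
# minimum, sorted by count in descending order.
top_ervs() {
    local file="$1" min="${2:-10}"
    awk -F'\t' -v min="$min" '$7 >= min { print $4 "\t" $7 }' "$file" \
        | sort -t $'\t' -k2,2nr
}

# Demo input: the four example rows shown above, written to a temporary file.
results="$(mktemp)"
printf '1\t896176\t898458\t5803\t500\t+\t70\n'    >  "$results"
printf '1\t1412251\t1418852\t5804\t500\t+\t36\n'  >> "$results"
printf '1\t3801730\t3806808\t5807\t500\t+\t6\n'   >> "$results"
printf '1\t4178468\t4187573\t5808\t500\t+\t1\n'   >> "$results"

top_ervs "$results" 10
rm -f "$results"
```

With the four example rows and a minimum of 10 reads, only loci 5803 and 5804 are listed, in that order. On real data, replace the temporary file with the path to your own SAMPLE.ERVresults.txt.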
There are a few parameters that can be passed to the ERVmap image to make the process more efficient.
- --cpus 20: if you have a multi-core system (and you should have one), you can specify the number of CPUs to use (e.g. 20);
- --limit-ram 48000000000: this limits the amount of RAM used, to avoid overusing the resources.
You can see the full set of parameters by typing:
docker run --rm ervmap
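As a rough illustration of how these options combine with the run command shown earlier, the following dry-run sketch prints one command per sample instead of executing it, so the result can be inspected before anything is launched. The helper name ervmap_cmd and the sample names SAMPLE_A and SAMPLE_B are hypothetical; the paths are the same placeholders used above, and the FASTQ naming (SAMPLE_1.fastq.gz, SAMPLE_2.fastq.gz) is assumed to match the earlier example.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of ERVmap): print the docker command for one
# sample. Arguments: sample name, number of CPUs, RAM limit in bytes.
ervmap_cmd() {
    local sample="$1" cpus="${2:-20}" ram="${3:-48000000000}"
    # printf reuses the format for every argument, so the command is emitted
    # as a single space-separated line.
    printf '%s ' \
        docker run --rm \
        -u "$(id -u):$(id -g)" \
        -v /path/to/input/data:/data:ro \
        -v /path/to/genome:/genome:ro \
        -v /path/to/erv/file:/resources:ro \
        -v /path/to/output:/results \
        ervmap \
        --read1 "/data/${sample}_1.fastq.gz" \
        --read2 "/data/${sample}_2.fastq.gz" \
        --output "${sample}/${sample}." \
        --mode ALL \
        --cpus "$cpus" \
        --limit-ram "$ram"
    printf '\n'
}

# Dry run: only print the commands for two hypothetical samples, using 8 CPUs.
for s in SAMPLE_A SAMPLE_B; do
    ervmap_cmd "$s" 8
done
```

After checking the printed lines, they can be pasted into a terminal or saved to a script and executed one sample at a time.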
There are also other parameters from Docker that should be included before ervmap in the command line, e.g.
--memory 50G \
--memory-swap 100G
To run this pipeline using Nextflow, simply run the following:
nextflow -C nextflow.config run main.nf
where nextflow.config includes the minimum set of parameters to run ERVmap within the Docker container. Specifically:
params {
genome='/path/to/genome' // external path to the indexed genome for the STAR aligner
inputDir='/path/to/input/folder' // external path of the input data
inputPattern="*{1,2}.fastq.gz" // pattern to search for input FASTQ files, or BAM files (*.{bam,bam.bai})
skipAlignment=false // if skipAlignment is true, the process ERValign is skipped, and the input dir and pattern should point to the BAM files
outputDir='/path/to/output/folder' // external path of the output results
starTmpDir='/path/to/STAR/temp/folder' // external path of the STAR aligner temporary folder. REQUIRED
localOutDir='.' // internal path of the results
cpus=20 // number of CPUs/threads to use for the alignment
limitMemory=1850861158 // memory limit for STAR
debug='off' // either [on|off]
}
NOTE: Adjust the memory settings of the docker container if needed, but recall that STAR requires about 32G of RAM (see Optional Parameters).
NOTE: The BAM files are rsync'ed into the outputDir folder. Make sure to have sufficient disk space. Cleaning up the work folder, e.g. by running nextflow clean, will remove the BAM files. The ERVmap results are copied into outputDir and thus are permanent.
Please note that the instructions hereafter refer to the original published version (see ERVmap on GitHub).
bedtools2
cufflinks
bwa-0.7.17
cufflinks-2.2.1.Linux_x86_64
python
samtools-1.8
tophat-2.1.1.Linux_x86_64
tophat2
trim (http://graphics.med.yale.edu/trim/)
erv_genome.pl
interleaved.pl
run_clean_htseq.pl
clean_htseq.pl
merge_count.pl
normalize_with_file.pl
normalize_deseq.r
This step will yield raw counts for cellular genes and ERVmap loci as separate files.
erv_genome.pl -stage 1 -stage2 6 -fastq /${i}_SS.fastq.gz
interleaved.pl --read1 ${i}_R1.fastq.gz --read2 ${i}_R2.fastq.gz > ${i}.fastq.gz
erv_genome.pl -stage 1 -stage2 6 -fastq /${i}.fastq.gz
mkdir -p ./output/erv ./output/cellular
mv ./sample/herv_coverage_GRCh38_genome.txt ./output/erv/${i}.e
mv ./sample/GRCh38/htseq.cnt ./output/cellular/${i}.c
These steps will yield normalized ERV read counts based on size factors obtained through DESeq2 analysis. Use the output files from above.
run_clean_htseq.pl ./output/cellular c c2 __
merge_count.pl 3 6 e ./output/erv > ./output/erv/merged_erv.txt
merge_count.pl 0 1 c2 ./output/cellular > ./output/cellular/merged_cellular.txt
normalize_deseq.r ./output/cellular/merged_cellular.txt ./output/cellular/normalized_cellular ./output/cellular/normalized_factors
normalize_with_file.pl ./output/cellular/normalized_factors ./output/erv/merged_erv.txt > ./output/$folder_name.txt
- Maria Tokuyama
- Yong Kong