version update for LRSDAY: v1.2.0 -> v1.3.0

yjx1217 · Nov 13, 2018 · ed48f2c
1 parent 304ceb4
Showing 19 changed files with 409 additions and 309 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,17 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ## [Unreleased]
 
+## [1.3.0] - 2018-11-13
+### Added
+- Support for one more alternative assembler: wtdbg2.
+### Changed
+- Substantially more automated installation/setup process.
+- Updated software versions or download URLs for a number of dependencies.
+### Fixed
+- Bugs in the LRSDAY.01.Long-read-based_Genome_Assembly.sh script that affected some alternative assemblers, introduced by earlier changes to file/parameter names.
+- Mismatched step numbers and file names in the manual due to previous version changes.
+- Typos in the manual.
+
 ## [1.2.0] - 2018-10-15
 ### Added
 - Support for adapter trimming for Nanopore reads (via Porechop).

diff --git a/Example_Outputs/SK1.assembly.final.fa.gz b/Example_Outputs/SK1.assembly.final.fa.gz
diff --git a/Example_Outputs/SK1.assembly.final.filter.mummer2vcf.INDEL.vcf.gz b/Example_Outputs/SK1.assembly.final.filter.mummer2vcf.INDEL.vcf.gz
diff --git a/Example_Outputs/SK1.assembly.final.filter.mummer2vcf.SNP.vcf.gz b/Example_Outputs/SK1.assembly.final.filter.mummer2vcf.SNP.vcf.gz
diff --git a/Example_Outputs/SK1.assembly.final.filter.pdf b/Example_Outputs/SK1.assembly.final.filter.pdf
diff --git a/Example_Outputs/SK1.assembly.final.stats.txt b/Example_Outputs/SK1.assembly.final.stats.txt
@@ -1,17 +1,17 @@
-total sequence count: 33
-total sequence length: 12490496
+total sequence count: 34
+total sequence length: 12448004
 min sequence length: 1248
-max sequence length: 1480288
-mean sequence length: 378499.88
-median sequence length: 84643.00
-N50: 923711
+max sequence length: 1480301
+mean sequence length: 366117.76
+median sequence length: 60826.50
+N50: 923676
 L50: 6
-N90: 341493
+N90: 341518
 L90: 14
-A%: 30.88
-T%: 30.79
-G%: 19.13
-C%: 19.16
-AT%: 61.67
-GC%: 38.29
+A%: 30.89
+T%: 30.81
+G%: 19.14
+C%: 19.13
+AT%: 61.70
+GC%: 38.26
 N%: 0.04
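
The N50/L50 and N90/L90 lines in this stats file follow the usual definition: sort the sequences longest first and report the sequence at which the running total first reaches 50% (or 90%) of the total length. A minimal sketch of that calculation, assuming one sequence length per line on standard input (for example, column 2 of a samtools `.fai` index). The name `nx_stats` is hypothetical, and this is not the script LRSDAY uses to produce the file:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LRSDAY).
# nx_stats FRACTION: read one sequence length per line on stdin and
# print "<Nx> <Lx>" for the given fraction (0.5 gives N50 and L50).
nx_stats() {
    local fraction="$1"
    # Longest sequences first, then accumulate until the target is reached.
    sort -nr | awk -v f="$fraction" '
        { len[NR] = $1; total += $1 }
        END {
            target = total * f
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= target) { print len[i], i; exit }
            }
        }'
}
```

For instance, `cut -f2 assembly.fa.fai | nx_stats 0.5` would print the N50 and L50 of a hypothetical indexed `assembly.fa`.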
diff --git a/Example_Outputs/SK1.final.cds.fa.gz b/Example_Outputs/SK1.final.cds.fa.gz
diff --git a/Example_Outputs/SK1.final.gff3.gz b/Example_Outputs/SK1.final.gff3.gz
diff --git a/Example_Outputs/SK1.final.manual_check.list b/Example_Outputs/SK1.final.manual_check.list
diff --git a/Example_Outputs/SK1.final.pep.fa.gz b/Example_Outputs/SK1.final.pep.fa.gz
diff --git a/Example_Outputs/SK1.final.trimmed_cds.fa.gz b/Example_Outputs/SK1.final.trimmed_cds.fa.gz
diff --git a/LRSDAY_flowchart.png b/LRSDAY_flowchart.png
diff --git a/Manual.pdf b/Manual.pdf
diff --git a/..._Template/01.Long-read-based_Genome_Assembly/LRSDAY.01.Long-read-based_Genome_Assembly.sh b/..._Template/01.Long-read-based_Genome_Assembly/LRSDAY.01.Long-read-based_Genome_Assembly.sh
@@ -11,8 +11,8 @@ prefix="SK1" # The file name prefix for the output files.
 long_reads="./../00.Long_Reads/SK1.filtered_subreads.fastq.gz" # The file path of the long reads file (in fastq or fastq.gz format).
 long_reads_type="pacbio-raw" # The long reads data type. Use "pacbio-raw" or "pacbio-corrected" or "nanopore-raw" or "nanopore-corrected". Default = "pacbio-raw" for the testing example
 genome_size="12.5m" # The estimated genome size with the format of <number>[g|m|k], e.g. 12.5m for 12.5 Mb. Default = "12.5m".
-assembler="canu" # The long-read assembler: Use "canu" or "flye" or "smartdenovo" or "canu-flye" or "canu-smartdenovo". For "canu-flye" and "canu-smartdenovo", the assembler canu is used first to generate error-corrected reads from the raw reads and then the assembler flye/smartdenovo is used to assemble the genome. Based on our test, assembler="canu" generally gives the best result but will take substantially longer time than the other options.
-customized_canu_parameters="-correctedErrorRate=0.04" # For assembler="canu" only. Users can set customized Canu assembly parameters here or simply leave it empty like "" to use Canu's default assembly parameter. For example you could set "-correctedErrorRate=0.04" for high coverage (>60X) PacBio data and "-correctedErrorRate=0.12 -overlapper=mhap -utgReAlign=true" for high coverage (>60X) Nanopore data to improve the assembly speed. More than one customized parameters can be set here as long as they are separeted by space (e.g. "-option1=XXX -option2=YYY -option3=ZZZ"). Please consult Canu's manual "http://canu.readthedocs.io/en/latest/faq.html#what-parameters-can-i-tweak" for advanced customization settings. Default = "-correctedErrorRate=0.04" for the testing example.
+assembler="canu" # The long-read assembler: Use "canu" or "flye" or "wtdbg2" or "smartdenovo" or "canu-flye" or "canu-wtdbg2" or "canu-smartdenovo". For "canu-flye", "canu-wtdbg2", and "canu-smartdenovo", the assembler canu is used first to generate error-corrected reads from the raw reads and then the assembler flye/wtdbg2/smartdenovo is used to assemble the genome. Based on our test, assembler="canu" generally gives the best result but will take substantially longer time than the other options.
+customized_canu_parameters="-correctedErrorRate=0.04" # For assembler="canu" only. Users can set customized Canu assembly parameters here or simply leave it empty like "" to use Canu's default assembly parameters. For example, you could set "-correctedErrorRate=0.04" for high coverage (>60X) PacBio data and "-overlapper=mhap -utgReAlign=true" for high coverage (>60X) Nanopore data to improve the assembly speed. More than one customized parameter can be set here as long as they are separated by spaces (e.g. "-option1=XXX -option2=YYY -option3=ZZZ"). Please consult Canu's manual "http://canu.readthedocs.io/en/latest/faq.html#what-parameters-can-i-tweak" for advanced customization settings. Default = "-correctedErrorRate=0.04" for the testing example.
 threads=1 # The number of threads to use. Default = 1.
 vcf="yes" # Use "yes" if prefer to have vcf file generated to show SNP and INDEL differences between the assembled genome and the reference genome for their uniquely alignable regions. Otherwise use "no". Default = "yes".
 dotplot="yes" # Use "yes" if prefer to plot genome-wide dotplot based on the comparison with the reference genome below. Otherwise use "no". Default = "yes".
@@ -59,25 +59,33 @@ then
 	-${long_reads_type} $long_reads \
 	$customized_canu_parameters
 
-    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i ./$out_dir/$prefix.contigs.fasta -o $prefix.assembly.$assembler.fa
+    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/$prefix.contigs.fasta -o $prefix.assembly.$assembler.fa
 elif [[ "$assembler" == "flye" ]]
 then
     if [[ "$long_reads_type" == "pacbio-corrected" ]]
     then
-	reads_type="pacbio-corr"
+	long_reads_type="pacbio-corr"
     elif [[ "$long_reads_type" == "nanopore-raw" ]]
     then
-        reads_type="nano-raw"
+        long_reads_type="nano-raw"
     elif [[ "$long_reads_type" == "nanopore-corrected" ]]
     then
-        reads_type="nano-corr"
+        long_reads_type="nano-corr"
     fi
-    $flye_new_dir/flye -o $out_dir \
+    $flye_dir/flye -o $out_dir \
 	-t $threads \
 	-g $genome_size \
 	--${long_reads_type} $long_reads \
 	-i 2
-    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i ./$out_dir/contigs.fasta -o $prefix.assembly.$assembler.fa
+    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/scaffolds.fasta -o $prefix.assembly.$assembler.fa
+elif [[ "$assembler" == "wtdbg2" ]]
+then
+    mkdir $out_dir
+    cd $out_dir
+    $wtdbg2_dir/wtdbg2 -t $threads -L 5000 -i ./../$long_reads -fo $prefix
+    $wtdbg2_dir/wtpoa-cns -t $threads -i $prefix.ctg.lay.gz -fo $prefix.ctg.lay.fa
+    cd ..
+    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/$prefix.ctg.lay.fa -o $prefix.assembly.$assembler.fa
 elif [[ "$assembler" == "smartdenovo" ]]
 then
     mkdir $out_dir
@@ -95,7 +103,7 @@ then
 	maxThreads=$threads \
 	genomeSize=$genome_size \
 	gnuplot=$gnuplot_dir/gnuplot \
-	-${reads_type} $long_reads \
+	-${long_reads_type} $long_reads \
 	# $customized_canu_parameters
 
     if [[ "$long_reads_type" == "pacbio-raw" ]]
@@ -116,7 +124,23 @@ then
 	-g $genome_size \
 	--${long_reads_type} $out_dir/canu/$prefix.correctedReads.fasta.gz \
 	-i 2
-    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i ./$out_dir/flye/contigs.fasta -o $prefix.assembly.$assembler.fa
+    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/flye/scaffolds.fasta -o $prefix.assembly.$assembler.fa
+elif [[ "$assembler" == "canu-wtdbg2" ]]
+then
+    $canu_dir/canu -correct -p $prefix -d $out_dir/canu \
+        useGrid=false \
+        maxThreads=$threads \
+        genomeSize=$genome_size \
+        gnuplot=$gnuplot_dir/gnuplot \
+        -${long_reads_type} $long_reads \
+	# $customized_canu_parameters
+
+    mkdir -p $out_dir/wtdbg2
+    cd $out_dir/wtdbg2
+    $wtdbg2_dir/wtdbg2 -t $threads -L 5000 -i ./../canu/$prefix.correctedReads.fasta.gz -fo $prefix
+    $wtdbg2_dir/wtpoa-cns -t $threads -i $prefix.ctg.lay.gz -fo $prefix.ctg.lay.fa
+    cd ../..
+    perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/wtdbg2/$prefix.ctg.lay.fa -o $prefix.assembly.$assembler.fa
 elif [[ "$assembler" == "canu-smartdenovo" ]]
 then
     $canu_dir/canu -correct -p $prefix -d $out_dir/canu \
@@ -129,9 +153,9 @@ then
 
     mkdir -p $out_dir/smartdenovo
     cd $out_dir/smartdenovo
-    $smartdenovo_dir/smartdenovo.pl -p $prefix -t $threads -c 1 $out_dir/canu/$prefix.correctedReads.fasta.gz  > $prefix.mak
+    $smartdenovo_dir/smartdenovo.pl -p $prefix -t $threads -c 1 ./../canu/$prefix.correctedReads.fasta.gz  > $prefix.mak
     make -f $prefix.mak
-    cd ..
+    cd ../..
     perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/smartdenovo/$prefix.dmo.cns  -o $prefix.assembly.$assembler.fa
 fi
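
The `cd ..` to `cd ../..` fix in the canu-smartdenovo branch is needed because the assembler runs two directory levels below the starting directory. A minimal sketch of one way to avoid tracking that depth by hand, assuming a hypothetical helper named `run_in_dir` that is not part of LRSDAY:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LRSDAY): run a command inside a
# directory without changing the caller's working directory.
run_in_dir() {
    local dir="$1"
    shift
    mkdir -p "$dir" || return 1
    # The parentheses start a subshell, so the cd is undone automatically
    # when the subshell exits, however deep "$dir" is.
    ( cd "$dir" && "$@" )
}
```

With such a helper, a step like `run_in_dir "$out_dir/smartdenovo" make -f "$prefix.mak"` would return to the starting directory without an explicit `cd ../..`. The trade-off is that relative input paths passed to the command are still resolved from inside the target directory.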
 
@@ -168,13 +192,10 @@ fi
 # clean up intermediate files
 if [[ $debug == "no" ]]
 then
-    if [[ "$reads_type" == "nanopore-raw" || "$reads_type" == "nanopore-corrected" ]]
-    then
-	rm reads.cleaned.fastq.gz
-    fi
     rm *.delta
     rm *.delta_filter
     rm ref_genome.fa
+    rm ref_genome.fa.fai
     if [[ $vcf == "yes" ]] 
     then
 	rm *.filter.coords
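
In the flye branch above, the commit overwrites `$long_reads_type` with the option name Flye expects (`pacbio-corr`, `nano-raw`, `nano-corr`). A minimal sketch of an alternative that leaves the user's setting untouched, assuming a hypothetical helper named `flye_read_type` that is not part of LRSDAY. The mapping itself is taken from this diff:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LRSDAY): translate an LRSDAY
# long_reads_type value into the read-type option name used by Flye.
flye_read_type() {
    case "$1" in
        pacbio-raw)         echo "pacbio-raw" ;;
        pacbio-corrected)   echo "pacbio-corr" ;;
        nanopore-raw)       echo "nano-raw" ;;
        nanopore-corrected) echo "nano-corr" ;;
        *)
            # Unknown values are reported on stderr and signalled by status.
            echo "unsupported long_reads_type: $1" >&2
            return 1
            ;;
    esac
}
```

A caller could then pass `--$(flye_read_type "$long_reads_type")` to Flye while later steps still see the original `$long_reads_type` value.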

diff --git a/Project_Template/07.Supervised_Final_Assembly/LRSDAY.07.Supervised_Final_Assembly.2.sh b/Project_Template/07.Supervised_Final_Assembly/LRSDAY.07.Supervised_Final_Assembly.2.sh
@@ -12,6 +12,7 @@ prefix="SK1" # The file name prefix for the output files.
 vcf="yes" # Whether to generate a vcf file showing SNP and INDEL differences between the assembled genome and the reference genome for their uniquely alignable regions. Use "yes" to generate it, otherwise use "no". Default = "yes".
 dotplot="yes" # Whether to plot genome-wide dotplot based on the comparison with the reference genome below. Use "yes" if prefer to plot, otherwise use "no". Default = "yes".
 ref_genome_raw="./../00.Ref_Genome/S288C.ASM205763v1.fa" # The path of the raw reference genome, only needed when dotplot="yes" or vcf="yes".
+threads=1 # The number of threads to use. Default = 1.
 debug="no" # Whether to keep intermediate files for debugging. Use "yes" if prefer to keep intermediate files, otherwise use "no". Default = "no".
 
 #######################################
@@ -46,13 +47,13 @@ then
 fi
 
 # make the comparison between the assembled genome and the reference genome
-$mummer_dir/nucmer --maxmatch --nosimplify  -p $prefix.assembly.final  $ref_genome_raw $prefix.assembly.final.fa 
-$mummer_dir/delta-filter -m  $prefix.assembly.final.delta > $prefix.assembly.final.delta_filter
+$mummer_dir/nucmer -t $threads --maxmatch --nosimplify  -p $prefix.assembly.final  $ref_genome_raw $prefix.assembly.final.fa 
+$mummer_dir/delta-filter -m $prefix.assembly.final.delta > $prefix.assembly.final.delta_filter
 
 # generate the vcf output
 if [[ $vcf == "yes" ]]
 then
-    $mummer_dir/show-coords -b -T -r -c -l -d   $prefix.assembly.final.delta_filter > $prefix.assembly.final.filter.coords
+    $mummer_dir/show-coords -b -T -r -c -l -d $prefix.assembly.final.delta_filter > $prefix.assembly.final.filter.coords
     $mummer_dir/show-snps -C -T -l -r $prefix.assembly.final.delta_filter > $prefix.assembly.final.filter.snps
     perl $LRSDAY_HOME/scripts/mummer2vcf.pl -r ref_genome.fa -i $prefix.assembly.final.filter.snps -t SNP -p $prefix.assembly.final.filter
     perl $LRSDAY_HOME/scripts/mummer2vcf.pl -r ref_genome.fa -i $prefix.assembly.final.filter.snps -t INDEL -p $prefix.assembly.final.filter
@@ -75,8 +76,8 @@ if [[ $debug == "no" ]]
 then
     rm *.delta
     rm *.delta_filter
-    # rm ref_genome.fa
-    # rm ref_genome.fa.fai
+    rm ref_genome.fa
+    rm ref_genome.fa.fai
     if [[ $vcf == "yes" ]] 
     then
         rm *.filter.coords

diff --git a/README.md b/README.md
@@ -1,20 +1,24 @@
 # LRSDAY
+
 **LRSDAY: Long-read Sequencing Data Analysis for Yeasts**
 
 A highly transparent, automated and powerful computational framework for high-quality genome assembly and annotation.
 
-![LRSDAY_flowchart](https://github.com/yjx1217/LRSDAY/blob/master/LRSDAY_flowchart.png)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
 ## Description
 Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, *Saccharomyces cerevisiae*, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although LRSDAY is tailored for *S. cerevisiae*, we designed it to be highly modular and customizable, making it adaptable to virtually any eukaryotic organism. 
 
+![LRSDAY_flowchart](https://github.com/yjx1217/LRSDAY/blob/master/LRSDAY_flowchart.png)
+
 
 ## Citations
 Jia-Xing Yue & Gianni Liti. (2018) Long-read sequencing data analysis for yeasts. *Nature Protocols*, 13:1213–1231. 
 
 Jia-Xing Yue, Jing Li, Louise Aigrain, Johan Hallin, Karl Persson, Karen Oliver, Anders Bergström, Paul Coupland, Jonas Warringer, Marco Cosentino Lagomarsino, Gilles Fischer, Richard Durbin, Gianni Liti. (2017) Contrasting evolutionary genome dynamics between domesticated and wild yeasts. *Nature Genetics*, 49:913-924.
 
 ## Release history
+* v1.3.0 Released on 2018/11/13
 * v1.2.0 Released on 2018/10/15
 * v1.1.0 Released on 2018/07/11
 * v1.0.0 Released on 2018/02/04