version update for LRSDAY: v1.2.0 -> v1.3.0
yjx1217 committed Nov 13, 2018

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
1 parent 304ceb4 commit ed48f2c
Showing 19 changed files with 409 additions and 309 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -7,6 +7,17 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

## [Unreleased]

## [1.3.0] - 2018-11-13
### Added
- Support for one more alternative assembler: wtdbg2.
### Changed
- Substantially more automated installation/setup process.
- Software version or downloading URL updates for a number of dependencies.
### Fixed
- Bugs introduced due to changes made for file/parameter names in the LRSDAY.01.Long-read-based_Genome_Assembly.sh script when using some alternative assemblers.
- Mismatched step numbers and file names in the manual due to previous version changes.
- Typos in the manual.

## [1.2.0] - 2018-10-15
### Added
- Support for adapter trimming for Nanopore reads (via Porechop).
Binary file modified Example_Outputs/SK1.assembly.final.fa.gz
Binary file not shown.
Binary file modified Example_Outputs/SK1.assembly.final.filter.mummer2vcf.INDEL.vcf.gz
Binary file not shown.
Binary file modified Example_Outputs/SK1.assembly.final.filter.mummer2vcf.SNP.vcf.gz
Binary file not shown.
Binary file modified Example_Outputs/SK1.assembly.final.filter.pdf
Binary file not shown.
26 changes: 13 additions & 13 deletions Example_Outputs/SK1.assembly.final.stats.txt
@@ -1,17 +1,17 @@
total sequence count: 33
total sequence length: 12490496
total sequence count: 34
total sequence length: 12448004
min sequence length: 1248
max sequence length: 1480288
mean sequence length: 378499.88
median sequence length: 84643.00
N50: 923711
max sequence length: 1480301
mean sequence length: 366117.76
median sequence length: 60826.50
N50: 923676
L50: 6
N90: 341493
N90: 341518
L90: 14
A%: 30.88
T%: 30.79
G%: 19.13
C%: 19.16
AT%: 61.67
GC%: 38.29
A%: 30.89
T%: 30.81
G%: 19.14
C%: 19.13
AT%: 61.70
GC%: 38.26
N%: 0.04
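The N50/L50 and N90/L90 values in this stats file follow the usual contiguity definition: sort the sequences from longest to shortest, accumulate their lengths, and report the length (NX) and rank (LX) of the sequence at which the running total first reaches X% of the total assembly length. The sketch below illustrates that definition only. `nx_lx` is a hypothetical helper, not a script shipped with LRSDAY, and the way LRSDAY itself computes these numbers is not shown in this commit.

```shell
# nx_lx is a hypothetical helper, not part of LRSDAY.
# stdin: one sequence length per line; $1: the X in NX/LX (default 50).
nx_lx() {
  sort -rn | awk -v x="${1:-50}" '
    { len[NR] = $1; total += $1 }          # lengths arrive longest first
    END {
      threshold = total * x / 100          # X% of the total assembly length
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= threshold) {            # first sequence reaching the threshold
          printf "N%d: %d\nL%d: %d\n", x, len[i], x, i
          exit
        }
      }
    }'
}
```

For example, `printf '%s\n' 2 3 4 5 6 | nx_lx 50` prints `N50: 5` and `L50: 2`: the two longest sequences (6 + 5 = 11) are the first to cover half of the total length of 20.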
Binary file modified Example_Outputs/SK1.final.cds.fa.gz
Binary file not shown.
Binary file modified Example_Outputs/SK1.final.gff3.gz
Binary file not shown.
455 changes: 227 additions & 228 deletions Example_Outputs/SK1.final.manual_check.list

Large diffs are not rendered by default.

Binary file modified Example_Outputs/SK1.final.pep.fa.gz
Binary file not shown.
Binary file modified Example_Outputs/SK1.final.trimmed_cds.fa.gz
Binary file not shown.
Binary file modified LRSDAY_flowchart.png
Binary file modified Manual.pdf
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -11,8 +11,8 @@ prefix="SK1" # The file name prefix for the output files.
long_reads="./../00.Long_Reads/SK1.filtered_subreads.fastq.gz" # The file path of the long reads file (in fastq or fastq.gz format).
long_reads_type="pacbio-raw" # The long reads data type. Use "pacbio-raw" or "pacbio-corrected" or "nanopore-raw" or "nanopore-corrected". Default = "pacbio-raw" for the testing example
genome_size="12.5m" # The estimated genome size with the format of <number>[g|m|k], e.g. 12.5m for 12.5 Mb. Default = "12.5m".
assembler="canu" # The long-read assembler: Use "canu" or "flye" or "smartdenovo" or "canu-flye" or "canu-smartdenovo". For "canu-flye" and "canu-smartdenovo", the assembler canu is used first to generate error-corrected reads from the raw reads and then the assembler flye/smartdenovo is used to assemble the genome. Based on our test, assembler="canu" generally gives the best result but will take substantially longer time than the other options.
customized_canu_parameters="-correctedErrorRate=0.04" # For assembler="canu" only. Users can set customized Canu assembly parameters here or simply leave it empty like "" to use Canu's default assembly parameter. For example you could set "-correctedErrorRate=0.04" for high coverage (>60X) PacBio data and "-correctedErrorRate=0.12 -overlapper=mhap -utgReAlign=true" for high coverage (>60X) Nanopore data to improve the assembly speed. More than one customized parameters can be set here as long as they are separeted by space (e.g. "-option1=XXX -option2=YYY -option3=ZZZ"). Please consult Canu's manual "http://canu.readthedocs.io/en/latest/faq.html#what-parameters-can-i-tweak" for advanced customization settings. Default = "-correctedErrorRate=0.04" for the testing example.
assembler="canu" # The long-read assembler: Use "canu" or "flye" or "wtdbg2" or "smartdenovo" or "canu-flye" or "canu-wtdbg2" or "canu-smartdenovo". For "canu-flye", "canu-wtdbg2", and "canu-smartdenovo", the assembler canu is used first to generate error-corrected reads from the raw reads and then the assembler flye/wtdbg2/smartdenovo is used to assemble the genome. Based on our test, assembler="canu" generally gives the best result but will take substantially longer time than the other options.
customized_canu_parameters="-correctedErrorRate=0.04" # For assembler="canu" only. Users can set customized Canu assembly parameters here or simply leave it empty like "" to use Canu's default assembly parameter. For example you could set "-correctedErrorRate=0.04" for high coverage (>60X) PacBio data and "-overlapper=mhap -utgReAlign=true" for high coverage (>60X) Nanopore data to improve the assembly speed. More than one customized parameters can be set here as long as they are separeted by space (e.g. "-option1=XXX -option2=YYY -option3=ZZZ"). Please consult Canu's manual "http://canu.readthedocs.io/en/latest/faq.html#what-parameters-can-i-tweak" for advanced customization settings. Default = "-correctedErrorRate=0.04" for the testing example.
threads=1 # The number of threads to use. Default = 1.
vcf="yes" # Use "yes" if prefer to have vcf file generated to show SNP and INDEL differences between the assembled genome and the reference genome for their uniquely alignable regions. Otherwise use "no". Default = "yes".
dotplot="yes" # Use "yes" if prefer to plot genome-wide dotplot based on the comparison with the reference genome below. Otherwise use "no". Default = "yes".
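With this commit the `assembler` option accepts seven values. Because the script selects its branch through an `if`/`elif` chain, a misspelled value would presumably match no branch at all. A minimal sketch of an up-front check is given below. `is_supported_assembler` is a hypothetical helper that does not exist in the LRSDAY script, and the list of names is copied from the option comment above.

```shell
# is_supported_assembler is a hypothetical helper, not part of LRSDAY.
# Returns 0 when $1 is one of the assembler names listed in the option comment.
is_supported_assembler() {
  case "$1" in
    canu|flye|wtdbg2|smartdenovo|canu-flye|canu-wtdbg2|canu-smartdenovo)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

A caller could stop early with something like `is_supported_assembler "$assembler" || { echo "unknown assembler: $assembler" >&2; exit 1; }` before any output directory is created.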
@@ -59,25 +59,33 @@ then
-${long_reads_type} $long_reads \
$customized_canu_parameters

perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i ./$out_dir/$prefix.contigs.fasta -o $prefix.assembly.$assembler.fa
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/$prefix.contigs.fasta -o $prefix.assembly.$assembler.fa
elif [[ "$assembler" == "flye" ]]
then
if [[ "$long_reads_type" == "pacbio-corrected" ]]
then
reads_type="pacbio-corr"
long_reads_type="pacbio-corr"
elif [[ "$long_reads_type" == "nanopore-raw" ]]
then
reads_type="nano-raw"
long_reads_type="nano-raw"
elif [[ "$long_reads_type" == "nanopore-corrected" ]]
then
reads_type="nano-corr"
long_reads_type="nano-corr"
fi
$flye_new_dir/flye -o $out_dir \
$flye_dir/flye -o $out_dir \
-t $threads \
-g $genome_size \
--${long_reads_type} $long_reads \
-i 2
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i ./$out_dir/contigs.fasta -o $prefix.assembly.$assembler.fa
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/scaffolds.fasta -o $prefix.assembly.$assembler.fa
elif [[ "$assembler" == "wtdbg2" ]]
then
mkdir $out_dir
cd $out_dir
$wtdbg2_dir/wtdbg2 -t $threads -L 5000 -i ./../$long_reads -fo $prefix
$wtdbg2_dir/wtpoa-cns -t $threads -i $prefix.ctg.lay.gz -fo $prefix.ctg.lay.fa
cd ..
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/$prefix.ctg.lay.fa -o $prefix.assembly.$assembler.fa
elif [[ "$assembler" == "smartdenovo" ]]
then
mkdir $out_dir
@@ -95,7 +103,7 @@ then
maxThreads=$threads \
genomeSize=$genome_size \
gnuplot=$gnuplot_dir/gnuplot \
-${reads_type} $long_reads \
-${long_reads_type} $long_reads \
# $customized_canu_parameters

if [[ "$long_reads_type" == "pacbio-raw" ]]
@@ -116,7 +124,23 @@ then
-g $genome_size \
--${long_reads_type} $out_dir/canu/$prefix.correctedReads.fasta.gz \
-i 2
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i ./$out_dir/flye/contigs.fasta -o $prefix.assembly.$assembler.fa
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/flye/scaffolds.fasta -o $prefix.assembly.$assembler.fa
elif [[ "$assembler" == "canu-wtdbg2" ]]
then
$canu_dir/canu -correct -p $prefix -d $out_dir/canu \
useGrid=false \
maxThreads=$threads \
genomeSize=$genome_size \
gnuplot=$gnuplot_dir/gnuplot \
-${long_reads_type} $long_reads \
# $customized_canu_parameters

mkdir -p $out_dir/wtdbg2
cd $out_dir/wtdbg2
$wtdbg2_dir/wtdbg2 -t $threads -L 5000 -i ./../canu/$prefix.correctedReads.fasta.gz -fo $prefix
$wtdbg2_dir/wtpoa-cns -t $threads -i $prefix.ctg.lay.gz -fo $prefix.ctg.lay.fa
cd ../..
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/wtdbg2/$prefix.ctg.lay.fa -o $prefix.assembly.$assembler.fa
elif [[ "$assembler" == "canu-smartdenovo" ]]
then
$canu_dir/canu -correct -p $prefix -d $out_dir/canu \
@@ -129,9 +153,9 @@ then

mkdir -p $out_dir/smartdenovo
cd $out_dir/smartdenovo
$smartdenovo_dir/smartdenovo.pl -p $prefix -t $threads -c 1 $out_dir/canu/$prefix.correctedReads.fasta.gz > $prefix.mak
$smartdenovo_dir/smartdenovo.pl -p $prefix -t $threads -c 1 ./../canu/$prefix.correctedReads.fasta.gz > $prefix.mak
make -f $prefix.mak
cd ..
cd ../..
perl $LRSDAY_HOME/scripts/simplify_seq_name.pl -i $out_dir/smartdenovo/$prefix.dmo.cns -o $prefix.assembly.$assembler.fa
fi
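In the `flye` branch above, the commit translates the LRSDAY read-type names into Flye's option names by overwriting `long_reads_type` itself. One alternative, shown here only as a sketch, is to keep the user-facing setting untouched and derive the Flye flag through a small lookup. `flye_reads_flag` is a hypothetical function, not something the LRSDAY script defines. The four name pairs are the ones that appear in the diff.

```shell
# flye_reads_flag is a hypothetical helper, not part of LRSDAY.
# Maps an LRSDAY long_reads_type value to the corresponding Flye option.
flye_reads_flag() {
  case "$1" in
    pacbio-raw)         printf '%s\n' "--pacbio-raw" ;;
    pacbio-corrected)   printf '%s\n' "--pacbio-corr" ;;
    nanopore-raw)       printf '%s\n' "--nano-raw" ;;
    nanopore-corrected) printf '%s\n' "--nano-corr" ;;
    *)
      echo "unsupported long_reads_type: $1" >&2
      return 1 ;;
  esac
}
```

The Flye call could then pass `"$(flye_reads_flag "$long_reads_type")" "$long_reads"` in place of `--${long_reads_type} $long_reads`, so later steps that test `long_reads_type` would still see the value the user configured.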

@@ -168,13 +192,10 @@ fi
# clean up intermediate files
if [[ $debug == "no" ]]
then
if [[ "$reads_type" == "nanopore-raw" || "$reads_type" == "nanopore-corrected" ]]
then
rm reads.cleaned.fastq.gz
fi
rm *.delta
rm *.delta_filter
rm ref_genome.fa
rm ref_genome.fa.fai
if [[ $vcf == "yes" ]]
then
rm *.filter.coords
Original file line number Diff line number Diff line change
@@ -12,6 +12,7 @@ prefix="SK1" # The file name prefix for the output files.
vcf="yes" # Whether to generate a vcf file generated to show SNP and INDEL differences between the assembled genome and the reference genome for their uniquely alignable regions. Use "yes" if prefer to have vcf file generated to show SNP and INDEL differences between the assembled genome and the reference genome. Default = "yes".
dotplot="yes" # Whether to plot genome-wide dotplot based on the comparison with the reference genome below. Use "yes" if prefer to plot, otherwise use "no". Default = "yes".
ref_genome_raw="./../00.Ref_Genome/S288C.ASM205763v1.fa" # The path of the raw reference genome, only needed when dotplot="yes" or vcf="yes".
threads=1 # The number of threads to use. Default = 1.
debug="no" # Whether to keep intermediate files for debugging. Use "yes" if prefer to keep intermediate files, otherwise use "no". Default = "no".

#######################################
@@ -46,13 +47,13 @@ then
fi

# make the comparison between the assembled genome and the reference genome
$mummer_dir/nucmer --maxmatch --nosimplify -p $prefix.assembly.final $ref_genome_raw $prefix.assembly.final.fa
$mummer_dir/delta-filter -m $prefix.assembly.final.delta > $prefix.assembly.final.delta_filter
$mummer_dir/nucmer -t $threads --maxmatch --nosimplify -p $prefix.assembly.final $ref_genome_raw $prefix.assembly.final.fa
$mummer_dir/delta-filter -m $prefix.assembly.final.delta > $prefix.assembly.final.delta_filter

# generate the vcf output
if [[ $vcf == "yes" ]]
then
$mummer_dir/show-coords -b -T -r -c -l -d $prefix.assembly.final.delta_filter > $prefix.assembly.final.filter.coords
$mummer_dir/show-coords -b -T -r -c -l -d $prefix.assembly.final.delta_filter > $prefix.assembly.final.filter.coords
$mummer_dir/show-snps -C -T -l -r $prefix.assembly.final.delta_filter > $prefix.assembly.final.filter.snps
perl $LRSDAY_HOME/scripts/mummer2vcf.pl -r ref_genome.fa -i $prefix.assembly.final.filter.snps -t SNP -p $prefix.assembly.final.filter
perl $LRSDAY_HOME/scripts/mummer2vcf.pl -r ref_genome.fa -i $prefix.assembly.final.filter.snps -t INDEL -p $prefix.assembly.final.filter
@@ -75,8 +76,8 @@ if [[ $debug == "no" ]]
then
rm *.delta
rm *.delta_filter
# rm ref_genome.fa
# rm ref_genome.fa.fai
rm ref_genome.fa
rm ref_genome.fa.fai
if [[ $vcf == "yes" ]]
then
rm *.filter.coords
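The cleanup step in both scripts removes intermediate files only when `debug` is "no", and removes the `show-coords` output only when `vcf` is "yes". A plain `rm` prints an error when a pattern matches nothing, for example if an earlier step was skipped. The sketch below shows the same logic with `rm -f`, which ignores missing files. `cleanup_intermediates` is a hypothetical wrapper, not a function in LRSDAY, and it covers only the files visible in this hunk.

```shell
# cleanup_intermediates is a hypothetical helper, not part of LRSDAY.
# $1: value of debug ("yes"/"no"); $2: value of vcf ("yes"/"no").
# Operates on the current working directory.
cleanup_intermediates() {
  if [ "$1" = "no" ]; then
    # -f keeps rm quiet when a glob matches no file
    rm -f -- *.delta *.delta_filter ref_genome.fa ref_genome.fa.fai
    if [ "$2" = "yes" ]; then
      rm -f -- *.filter.coords
    fi
  fi
}
```

It would be called as `cleanup_intermediates "$debug" "$vcf"` at the point where the script currently runs its `rm` commands.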
6 changes: 5 additions & 1 deletion README.md
@@ -1,20 +1,24 @@
# LRSDAY

**LRSDAY: Long-read Sequencing Data Analysis for Yeasts**

A highly transparent, automated and powerful computational framework for high-quality genome assembly and annotation.

![LRSDAY_flowchart](https://github.com/yjx1217/LRSDAY/blob/master/LRSDAY_flowchart.png)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Description
Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, *Saccharomyces cerevisiae*, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for *S. cerevisiae*, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organisms.

![LRSDAY_flowchart](https://github.com/yjx1217/LRSDAY/blob/master/LRSDAY_flowchart.png)


## Citations
Jia-Xing Yue & Gianni Liti. (2018) Long-read sequencing data analysis for yeasts. *Nature Protocols*, 13:1213–1231.

Jia-Xing Yue, Jing Li, Louise Aigrain, Johan Hallin, Karl Persson, Karen Oliver, Anders Bergström, Paul Coupland, Jonas Warringer, Marco Cosentino Lagomarsino, Gilles Fischer, Richard Durbin, Gianni Liti. (2017) Contrasting evolutionary genome dynamics between domesticated and wild yeasts. *Nature Genetics*, 49:913-924.

## Release history
* v1.3.0 Released on 2018/11/13
* v1.2.0 Released on 2018/10/15
* v1.1.0 Released on 2018/07/11
* v1.0.0 Released on 2018/02/04