alignment optimization suggestions #47

yuankunzhu · 2018-02-08T17:59:21Z

BWA Mem (scattered by read group)

Current setup:

Picard SamToFastq | BWA | Picard MergeBamAlignment

Proposed setup:

adamstruck · 2018-02-08T19:28:51Z

I believe your missing the merge step from the proposed setup above.

The PCAWG project had success using the biobambam2 suite of tools.

Using those the setup would be:

bamtofastq | BWA | bamsormadup | bammerge

bogdang989 · 2018-02-14T10:27:33Z

After some testing this is what is recommended:

Extract RG info from header &&
Picard SamToFastq |
BWA |
Samblaster |
Sambamba View |
Sambamba Sort

Example command line looks like this:

/opt/samtools-1.7/samtools  view -H /root_samtools_split/2895820049.bam | 
grep ^@RG > RGINFO && sed -i 's/ /\t/g' RGINFO && >&2 cat RGINFO && 
java -Xms5000m -jar /picard.jar SamToFastq INPUT=/root_samtools_split/2895820049.bam FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true | 
bwa mem -R $(awk '{ print $1"\\t"$2"\\t"$3"\\t"$4"\\t"$5"\\t"$6 }' RGINFO) -K 100000000 -p -v 3 -t 16 -Y /Homo_sapiens_assembly38.fasta - | 
/opt/samblaster/samblaster -i /dev/stdin -o /dev/stdout | 
/opt/sambamba_v0.6.0 view -t 16 -f bam -l 0 -S /dev/stdin | 
/opt/sambamba_v0.6.0 sort -t 16 --natural-sort -m 5GiB --tmpdir ./ -o 2895820049.aligned.unsorted.bam -l 5 /dev/stdin

We returned Picard SamToFastq since samtools bam2fq did not do a great job in conversion - a significant number of the reads were converted as single-end for some reason which might affect the final outcome. Extract RG info from header part is some linux maneuvering to get the RG info from input BAM header (using BWA -R parameter), since in the original setup the RG information was added to the output BAM with Picard MergeBAMAlignment.

For one random pilot sample tested, average single BWA job duration is reduced from ~35min to ~28min and duplicate marking is done on-the-fly, saving nearly 5 hours needed for executing Picard Markduplicates.

@adamstruck Merge step is not included in the pipe, since alignment step is scattered by read group, so merge comes in later as a separate step in the current flow.
Regarding biobambam2 suite, bamsormadup might also be an option instead of the samblaster+sambamba combo in the current proposed setup, definitely could be worth exploring.
As for biobambam2 bam2fastq, I know it was also used in the GDC TCGA harmonization pipeline. I am not sure if this tool has an option to output interleaved fastq files, so it couldn't be just swapped into the pipe and it would probably require some reworking of the workflow configuration. Also, the output should also be checked and compared to the Picard execution. Other than that, it also might be worth exploring as an option.

bogdang989 · 2018-02-28T15:59:57Z

Inputs with a single read group are aligned twice as long because of using 18 threads. An expression tool should be added to autodetect number of read groups in input BAM and set threads:
If there is 1 RG, set 36;
Else set 18

Done here bogdang989@b4a6228

bogdang989 · 2018-03-05T17:02:38Z

@adamstruck I tested the biobambam bamtofastq and it indeed has the option to output interleaved fastq. What's even better it has the option to restore original quality scores before recalibration, thus fully allowing the use of samtools split instead of picard revertsam.
So the proposed setup now would be:

Extract RG info from header &&
biobambam2 bamtofastq |
BWA |
Samblaster |
Sambamba View |
Sambamba Sort

Added this change here: bogdang989@0e9d63c

Still didn't test bamsormadup.

yuankunzhu · 2018-04-16T17:54:03Z

@adamstruck do you know whether bamsormadup is library aware when handling the duplicate reads marking?

yuankunzhu mentioned this issue Feb 8, 2018

optimization suggestions list #46

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alignment optimization suggestions #47

alignment optimization suggestions #47

yuankunzhu commented Feb 8, 2018

adamstruck commented Feb 8, 2018

bogdang989 commented Feb 14, 2018

bogdang989 commented Feb 28, 2018 •

edited

Loading

bogdang989 commented Mar 5, 2018

yuankunzhu commented Apr 16, 2018

alignment optimization suggestions #47

alignment optimization suggestions #47

Comments

yuankunzhu commented Feb 8, 2018

Current setup:

Proposed setup:

adamstruck commented Feb 8, 2018

bogdang989 commented Feb 14, 2018

bogdang989 commented Feb 28, 2018 • edited Loading

bogdang989 commented Mar 5, 2018

yuankunzhu commented Apr 16, 2018

bogdang989 commented Feb 28, 2018 •

edited

Loading