Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alignment optimization suggestions #47

Open
yuankunzhu opened this issue Feb 8, 2018 · 5 comments
Open

alignment optimization suggestions #47

yuankunzhu opened this issue Feb 8, 2018 · 5 comments

Comments

@yuankunzhu
Copy link
Member

BWA Mem (scattered by read group)

Current setup:

Picard SamToFastq | BWA | Picard MergeBamAlignment

Proposed setup:

Samtools bam2fq | BWA |
Samblaster | (mark duplicates here and remove picard_markduplicates; It makes sense to mark duplicates only within the same read group and saves a lot of time by piping it with BWA) |
Sambamba view (sam to bam) |
Sambamba Sort

@adamstruck
Copy link

I believe your missing the merge step from the proposed setup above.

The PCAWG project had success using the biobambam2 suite of tools.

Using those the setup would be:

bamtofastq | BWA | bamsormadup | bammerge

@bogdang989
Copy link

After some testing this is what is recommended:

Extract RG info from header &&
Picard SamToFastq |
BWA |
Samblaster |
Sambamba View |
Sambamba Sort

Example command line looks like this:

/opt/samtools-1.7/samtools  view -H /root_samtools_split/2895820049.bam | 
grep ^@RG > RGINFO && sed -i 's/ /\t/g' RGINFO && >&2 cat RGINFO && 
java -Xms5000m -jar /picard.jar SamToFastq INPUT=/root_samtools_split/2895820049.bam FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true | 
bwa mem -R $(awk '{ print $1"\\t"$2"\\t"$3"\\t"$4"\\t"$5"\\t"$6 }' RGINFO) -K 100000000 -p -v 3 -t 16 -Y /Homo_sapiens_assembly38.fasta - | 
/opt/samblaster/samblaster -i /dev/stdin -o /dev/stdout | 
/opt/sambamba_v0.6.0 view -t 16 -f bam -l 0 -S /dev/stdin | 
/opt/sambamba_v0.6.0 sort -t 16 --natural-sort -m 5GiB --tmpdir ./ -o 2895820049.aligned.unsorted.bam -l 5 /dev/stdin

We returned Picard SamToFastq since samtools bam2fq did not do a great job in conversion - a significant number of the reads were converted as single-end for some reason which might affect the final outcome. Extract RG info from header part is some linux maneuvering to get the RG info from input BAM header (using BWA -R parameter), since in the original setup the RG information was added to the output BAM with Picard MergeBAMAlignment.

For one random pilot sample tested, average single BWA job duration is reduced from ~35min to ~28min and duplicate marking is done on-the-fly, saving nearly 5 hours needed for executing Picard Markduplicates.

@adamstruck Merge step is not included in the pipe, since alignment step is scattered by read group, so merge comes in later as a separate step in the current flow.
Regarding biobambam2 suite, bamsormadup might also be an option instead of the samblaster+sambamba combo in the current proposed setup, definitely could be worth exploring.
As for biobambam2 bam2fastq, I know it was also used in the GDC TCGA harmonization pipeline. I am not sure if this tool has an option to output interleaved fastq files, so it couldn't be just swapped into the pipe and it would probably require some reworking of the workflow configuration. Also, the output should also be checked and compared to the Picard execution. Other than that, it also might be worth exploring as an option.

@bogdang989
Copy link

bogdang989 commented Feb 28, 2018

Inputs with a single read group are aligned twice as long because of using 18 threads. An expression tool should be added to autodetect number of read groups in input BAM and set threads:
If there is 1 RG, set 36;
Else set 18

Done here bogdang989@b4a6228

@bogdang989
Copy link

@adamstruck I tested the biobambam bamtofastq and it indeed has the option to output interleaved fastq. What's even better it has the option to restore original quality scores before recalibration, thus fully allowing the use of samtools split instead of picard revertsam.
So the proposed setup now would be:

Extract RG info from header &&
biobambam2 bamtofastq |
BWA |
Samblaster |
Sambamba View |
Sambamba Sort

Added this change here: bogdang989@0e9d63c

Still didn't test bamsormadup.

@yuankunzhu
Copy link
Member Author

@adamstruck do you know whether bamsormadup is library aware when handling the duplicate reads marking?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants