📝 updated readme to reflect new updates

🔨 added param to allow for cram re-mapping
🔨 edited all related tools and subwf to accept new cram input
kids-first · Aug 22, 2022 · 637e33f
1 parent 74fb0da
commit 637e33f
Showing 10 changed files with 168 additions and 64 deletions.
diff --git a/docs/KFDRC_SENTIEON_ALIGNMENT_GATK_HAPLOTYPER_WORKFLOW_README.md b/docs/KFDRC_SENTIEON_ALIGNMENT_GATK_HAPLOTYPER_WORKFLOW_README.md
@@ -41,8 +41,8 @@ For more information see: https://github.com/kids-first/kf-alignment-workflow#ou
 ## Sentieon Alignment: Similarities and Differences
 
 The two workflows start identically; both workflows start by splitting the
-input BAMs into read group (RG) BAMs using samtools split then convert those RG
-BAMs into FASTQ files using biobambam2 bamtofastq. After FASTQ creation, the
+input SAMs/BAMs/CRAMs (Alignment/Map files, or AMs) into read group (RG) AMs using samtools split then convert those RG
+AMs into FASTQ files using biobambam2 bamtofastq. After FASTQ creation, the
 two workflows diverge in software usage. Whereas the KFDRC GATK pipeline uses a
 wide variety of tools (bwa, sambamba, samblaster, GATK, Picard, and samtools)
 to generate the realigned CRAMs, the KFDRC Sentieon pipeline uses exclusively

diff --git a/readme.md b/readme.md
@@ -11,7 +11,7 @@ this can be used later on for further analysis in joint trio genotyping and subs
  This workflow is the current production workflow, equivalent to this [Cavatica public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-alignment-workflow) and supersedes the [old workflow](https://github.com/kids-first/kf-alignment-workflow/tree/1.0.0) and [public app](https://cavatica.sbgenomics.com/public/apps#kids-first-drc/kids-first-drc-alignment-workflow/kfdrc-alignment-bam2cram2gvcf/); however outputs are considered equivalent.
 
 ## Input Agnostic Alignment Workflow
-Workflow for the alignment or realignment of input BAMs, PE reads, and/or SE reads; conditionally generate gVCF and metrics.
+Workflow for the alignment or realignment of input SAMs/BAMs/CRAMs (Alignment/Map files, or AMs), PE reads, and/or SE reads; conditionally generate gVCF and metrics.
 
 This workflow is a all-in-one workflow for handling any kind of reads inputs: BAM inputs, PE reads
 and mates inputs, SE reads inputs,  or any combination of these. The workflow will naively attempt
@@ -68,6 +68,7 @@ to `true`; no additonal inputs are required.
   input_se_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in SE processing" }
   run_bam_processing: { type: boolean, doc: "BAM processing will be run. Requires: input_bam_list" }
   run_pe_reads_processing: { type: boolean, doc: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list" }
+  cram_reference: { type: 'File?', doc: "If aligning from cram, need to provided reference used to generate that cram" }
   run_se_reads_processing: { type: boolean, doc: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list" }
   # IF WGS or CREATE gVCF
   wgs_calling_interval_list: { type: 'File?', doc: "WGS interval list used to aid scattering Haplotype caller" }
@@ -113,18 +114,18 @@ to `true`; no additonal inputs are required.
 
 #### Detailed Input Information:
 The pipeline is build to handle three distinct input types:
-1. BAMs
+1. SAMs/BAMs/CRAMs (Alignment/Map files, or AMs)
 1. PE Fastqs
 1. SE Fastqs
 
-Additionally, the workflow supports these three in any combination. You can have PE Fastqs and BAMs,
-PE Fastqs and SE Fastqs, BAMS and PE Fastqs and SE Fastqs, etc. Each of these three classes will be
+Additionally, the workflow supports these three in any combination. You can have PE Fastqs and AMs,
+PE Fastqs and SE Fastqs, AMs and PE Fastqs and SE Fastqs, etc. Each of these three classes will be
 procsessed and aligned separately and the resulting BWA aligned bams will be merged into a final BAM
 before performing steps like BQSR and Metrics collection.
 
-##### BAM Inputs
-The BAM processing portion of the pipeline is the simplest when it comes to inputs. You may provide
-a single BAM or many BAMs. The input for BAMs is a file list. In Cavatica or other GUI interfaces,
+#####  Alignment/Map Inputs
+The Alignment/Map processing portion of the pipeline is the simplest when it comes to inputs. You may provide
+a single Alignment/Map file or many AMs. The input for AMs is a file list. In Cavatica or other GUI interfaces,
 simply select the files you wish to process. For command line interfaces such as cwltool, your input
 should look like the following.
 ```json

diff --git a/subworkflows/kfdrc_process_bam.cwl b/subworkflows/kfdrc_process_bam.cwl
@@ -11,6 +11,8 @@ inputs:
     secondaryFiles: ['.64.amb', '.64.ann', '.64.bwt', '.64.pac', '.64.sa', '.64.alt', '^.dict']
   sample_name: string
   min_alignment_score: int?
+  cram_reference: { type: 'File?', doc: "If aligning from cram, need to provided reference used to generate that cram" }
+
 outputs:
   unsorted_bams:
     type:
@@ -35,5 +37,6 @@ steps:
       sample_name: sample_name
       input_rgbam: samtools_split/bam_files
       min_alignment_score: min_alignment_score
+      cram_reference: cram_reference
     scatter: [input_rgbam]
     out: [unsorted_bams] #+1 Nesting File[][]
diff --git a/subworkflows/kfdrc_process_bamlist.cwl b/subworkflows/kfdrc_process_bamlist.cwl
@@ -13,6 +13,8 @@ inputs:
   sample_name: string
   conditional_run: int
   min_alignment_score: int?
+  cram_reference: { type: 'File?', doc: "If aligning from cram, need to provided reference used to generate that cram" }
+
 outputs:
   unsorted_bams:
     type:
@@ -32,6 +34,7 @@ steps:
       indexed_reference_fasta: indexed_reference_fasta
       sample_name: sample_name
       min_alignment_score: min_alignment_score
+      cram_reference: cram_reference
     scatter: input_bam
     out: [unsorted_bams] #+2 Nesting File[][][]
 

diff --git a/subworkflows/kfdrc_rgbam_to_realnbam.cwl b/subworkflows/kfdrc_rgbam_to_realnbam.cwl
@@ -10,6 +10,8 @@ inputs:
     secondaryFiles: ['.64.amb', '.64.ann', '.64.bwt', '.64.pac', '.64.sa', '.64.alt', '^.dict']
   sample_name: string
   min_alignment_score: int?
+  cram_reference: { type: 'File?', doc: "If aligning from cram, need to provided reference used to generate that cram" }
+
 outputs:
   unsorted_bams:
     type: File[]
@@ -19,8 +21,8 @@ steps:
   bamtofastq_chomp:
     run: ../tools/bamtofastq_chomp.cwl
     in:
-      input_bam: input_rgbam
-#      sample: sample_name
+      input_align: input_rgbam
+      reference: cram_reference
     out: [output, rg_string]
 
   expression_updatergsample:
@@ -37,7 +39,6 @@ steps:
       reads: bamtofastq_chomp/output
       interleaved:
         default: true
-#      rg: bamtofastq_chomp/rg_string
       rg: expression_updatergsample/rg_str
       min_alignment_score: min_alignment_score
     scatter: [reads]

diff --git a/subworkflows/rgbam_to_bwa_payload.cwl b/subworkflows/rgbam_to_bwa_payload.cwl
@@ -8,6 +8,8 @@ requirements:
 inputs:
   input_rgbam: File
   sample_name: string
+  cram_reference: { type: 'File?', doc: "Fasta file if input is cram", secondaryFiles: [.fai] }
+
 outputs:
   bwa_payload:
     type:
@@ -42,7 +44,8 @@ steps:
   bamtofastq:
     run: ../tools/biobambam_bamtofastq.cwl
     in:
-      input_bam: input_rgbam
+      input_align: input_rgbam
+      reference: cram_reference
     out: [output]
 
   clt_prepare_bwa_payload:

diff --git a/tools/bamtofastq_chomp.cwl b/tools/bamtofastq_chomp.cwl
@@ -23,33 +23,38 @@ arguments:
     valueFrom: |-
       set -eo pipefail
 
-      samtools view -H $(inputs.input_bam.path) | grep ^@RG > rg.txt
+      samtools view -H $(inputs.input_align.path) | grep ^@RG > rg.txt
 
-      if [ $(inputs.input_bam.size) -gt $(inputs.max_size) ]; then
-        bamtofastq tryoq=1 filename=$(inputs.input_bam.path) | split -dl 680000000 - reads-
+      EXT=$(inputs.input_align.nameext.toLowerCase().substr(1))
+
+      if [ $(inputs.input_align.size) -gt $(inputs.max_size) ]; then
+        bamtofastq tryoq=1 filename=$(inputs.input_align.path) inputformat=$EXT ${
+          if (inputs.reference != null){
+              return "reference=" + inputs.reference.path;
+            }
+          else{
+              return "";
+            }
+          } | split -dl 680000000 - reads-
         ls reads-* | xargs -i mv {} {}.fq
       else
-        bamtofastq tryoq=1 filename=$(inputs.input_bam.path) > reads-00.fq
+        bamtofastq tryoq=1 filename=$(inputs.input_align.path) inputformat=$EXT ${
+          if (inputs.reference != null){
+              return "reference=" + inputs.reference.path;
+            }
+          else{
+              return "";
+            }
+          } > reads-00.fq
       fi
 inputs:
-  input_bam: { type: File, doc: "Input bam file" }
-  max_size: { type: long, default: 20000000000, doc: "The maximum size (in bytes) that an input bam can be before the FASTQ is split" }
-#  sample: { type: string, doc: "String name of the sample used to relabel the rg string" }
+  input_align: { type: File, doc: "Input alignment file" }
+  max_size: { type: 'long?', default: 20000000000, doc: "The maximum size (in bytes) that an input bam can be before the FASTQ is split" }
+  reference: { type: 'File?', doc: "Fasta file if input is cram", secondaryFiles: [.fai] }
+
 outputs:
   output: { type: 'File[]', outputBinding: { glob: '*.fq' } }
   rg_string:
-#    type: string
     type: File
     outputBinding:
       glob: rg.txt
-#      loadContents: true
-#      outputEval:
-#        ${
-#          var arr = self[0].contents.split('\n')[0].split('\t');
-#          for (var i=1; i<arr.length; i++){
-#            if (arr[i].startsWith('SM')){
-#              arr[i] = 'SM:' + inputs.sample;
-#            }
-#          }
-#          return arr.join('\\t');
-#        }
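
Review note: the inline expression added to `tools/bamtofastq_chomp.cwl` does two things. It derives `inputformat` from the input file extension, and it appends `reference=` only when the optional `reference` input is set. The sketch below restates that logic as a plain JavaScript function so it can be checked outside a CWL runner. The function name `buildBamtofastqCommand` and the example paths are hypothetical, and the objects only mimic the `path` and `nameext` fields of a CWL `File`. It uses `slice(1)` where the tool uses `substr(1)`; both drop the leading dot here.

```javascript
// Hypothetical helper mirroring the command assembly in the hunk above.
// inputAlign and reference stand in for CWL File objects (only the fields used).
function buildBamtofastqCommand(inputAlign, reference) {
  // CWL's nameext keeps the leading dot (".cram"); lowercase it and drop the dot.
  const ext = inputAlign.nameext.toLowerCase().slice(1);
  const parts = [
    "bamtofastq",
    "tryoq=1",
    "filename=" + inputAlign.path,
    "inputformat=" + ext,
  ];
  // reference= is appended only when the optional input was supplied.
  if (reference != null) {
    parts.push("reference=" + reference.path);
  }
  return parts.join(" ");
}

// Example paths are made up for illustration.
const cramInput = { path: "/data/sample.CRAM", nameext: ".CRAM" };
const fastaRef = { path: "/ref/genome.fasta" };
console.log(buildBamtofastqCommand(cramInput, fastaRef));
console.log(buildBamtofastqCommand({ path: "/data/sample.bam", nameext: ".bam" }, null));
```

Because the format is taken from the extension alone, an input whose extension is not `sam`, `bam`, or `cram` would be passed through unchanged as `inputformat=`, which may be worth guarding against upstream.
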
diff --git a/tools/biobambam_bamtofastq.cwl b/tools/biobambam_bamtofastq.cwl
@@ -18,9 +18,15 @@ arguments:
   - position: 0
     shellQuote: false
     valueFrom: |-
-      bamtofastq tryoq=1 filename=$(inputs.input_bam.path) > reads-00.fq
+      bamtofastq tryoq=1 inputformat=$(inputs.input_align.nameext.toLowerCase().substr(1))
+  - position: 2
+    shellQuote: false
+    valueFrom: |-
+      > reads-00.fq
 inputs:
-  input_bam: { type: File, doc: "Input bam file" }
+  input_align: { type: File, doc: "Input alignment file", inputBinding: { position: 1, prefix: "filename=", separate: false } }
+  reference: { type: 'File?', doc: "Fasta file if input is cram", secondaryFiles: [.fai],
+    inputBinding: { position: 1, prefix: "reference=", separate: false } }
 outputs:
   output: { type: 'File', outputBinding: { glob: '*.fq' } }
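
Review note: `tools/biobambam_bamtofastq.cwl` now relies on `position` to place the two input bindings between the base command (position 0) and the output redirect (position 2). The sketch below is a simplified model of that ordering, assuming bindings are sorted by position and ties are broken by name; it is not a full implementation of the CWL binding rules. The names `buildBindings` and `assembleCommand`, the `key` field, and the example paths are all hypothetical.

```javascript
// Simplified model of how the tool's command line is assembled.
// Assumption: sort by position, then by key for ties.
function buildBindings(inputAlign, reference) {
  const ext = inputAlign.nameext.toLowerCase().slice(1);
  const bindings = [
    { position: 0, key: "arg0", text: "bamtofastq tryoq=1 inputformat=" + ext },
    { position: 2, key: "arg1", text: "> reads-00.fq" },
    // prefix "filename=" with separate: false joins prefix and value directly.
    { position: 1, key: "input_align", text: "filename=" + inputAlign.path },
  ];
  // An optional input that is null contributes nothing to the command line.
  if (reference != null) {
    bindings.push({ position: 1, key: "reference", text: "reference=" + reference.path });
  }
  return bindings;
}

function assembleCommand(bindings) {
  return bindings
    .slice()
    .sort((a, b) => a.position - b.position || (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((b) => b.text)
    .join(" ");
}

// Example paths are made up for illustration.
const cram = { path: "/data/sample.cram", nameext: ".cram" };
const ref = { path: "/ref/genome.fasta" };
console.log(assembleCommand(buildBindings(cram, ref)));
console.log(assembleCommand(buildBindings({ path: "/data/sample.bam", nameext: ".bam" }, null)));
```

Under these assumptions the redirect always lands last and `reference=` disappears cleanly for BAM inputs, which is the behavior the hunk appears to be aiming for.
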