Merge pull request #81 from visze/feature/NGmerge

visze · web-flow · commit b8c5666f85ae · 2022-11-10T11:35:57.000+01:00
feat: NGmerge
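
For reviewers, a minimal sketch of the header rewrite that the new `assignment_attach_idx` step performs, based on `attachBCToFastQ.py` in this diff. The function name `attach_barcode` and the sample records are hypothetical. The sketch assumes that `read_fastq` yields `(seqid, sequence, quality)` tuples without the leading `@`.

```python
def attach_barcode(read, barcode):
    """Return the read with the barcode sequence and quality appended to its header.

    Both arguments are (seqid, sequence, quality) tuples (assumed record shape).
    """
    seqid_read, seq_read, qual_read = read
    seqid_bc, seq_bc, qual_bc = barcode
    # Compare only the part of each ID before the first space.
    seqid_read = seqid_read.split(" ")[0]
    seqid_bc = seqid_bc.split(" ")[0]
    if seqid_read != seqid_bc:
        raise ValueError(
            "Sequence IDs do not match: %s != %s" % (seqid_read, seqid_bc)
        )
    # Same header layout as in the script: barcode bases in XI, qualities in YI.
    header = "%s XI:Z:%s,YI:Z:%s" % (seqid_read, seq_bc, qual_bc)
    return header, seq_read, qual_read


# Made-up example records, not taken from real data.
read = ("read1 1:N:0:1", "ACGTACGT", "FFFFFFFF")
barcode = ("read1 2:N:0:1", "TTGCA", "FF:FF")
header, seq, qual = attach_barcode(read, barcode)
print("@%s\n%s\n+\n%s" % (header, seq, qual))
```

The mapping rule passes `-C` to `bwa mem`, which appends the FASTQ comment to the SAM output, so the barcode stored in the header stays with the merged read through mapping.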
diff --git a/config/sbatch.yml b/config/sbatch.yml
@@ -9,14 +9,11 @@ __default__:
 ##################
 ### ASSIGNMENT ###
 ##################
-assignment_getInputs:
-  time: "0-10:00"
-  queue: medium
 assignment_merge:
   time: "0-08:00"
   queue: medium
 assignment_fastq_split:
-  time: "0-02:00"
+  time: "0-04:00"
   threads: 1
   mem: 10G
   queue: medium
@@ -31,9 +28,9 @@ assignment_collect:
   mem: 10G
   queue: medium
 assignment_getBCs:
-  time: "0-04:00"
+  time: "1-08:00"
   threads: 1
-  queue: short
+  queue: medium
 assignment_statistic_totalCounts:
   time: "0-01:00"
   threads: 1
diff --git a/docs/assignment_example1.rst b/docs/assignment_example1.rst
@@ -123,14 +123,13 @@ You should see a list of rules that will be executed. This is the summary:
    assignment_filter                          2              1              1
    assignment_flagstat                        1              1              1
    assignment_getBCs                          1              1              1
-   assignment_getInputs                       3              1              1
    assignment_idx_bam                         1              1              1
    assignment_mapping                         1              1              1
    assignment_merge                           30             10             10
    assignment_statistic_assignedCounts        2              1              1
    assignment_statistic_assignment            2              1              1
    assignment_statistic_totalCounts           1              1              1
-   total                                     49              1              1
+   total                                     46              1              1
 
 
 When dry-drun does not give any errors we will run the workflow. We use a machine with 30 threads/cores to run the workflow. Therefore :code:`split_number` is set to 30 to parallize the workflow. Also we are using 10 threads for mapping (bwa mem). But snakemake takes care that no more than 30 threads are used.
@@ -142,16 +141,14 @@ When dry-drun does not give any errors we will run the workflow. We use a machin
 
 .. note:: Please modify your code when running in a cluster environment. We have an example SLURM config file here :code:`config/sbatch.yml`.
 
-If everything works fine the 13 rules showed above will run:
+If everything works fine the 12 rules shown above will run:
 
 all
    The overall all rule. Here is defined what final output files are expected.
 assignment_bwa_ref
    Create mapping reference for BWA from design file.
 assignment_fastq_split
    Split the fastq files into n files for parallelisation. N is given by split_read in the configuration file.
-assignment_getInputs
-   Concat the input fastq files per R1,R2,R3. If only single fastq file is provided a symbolic link is created.
 assignment_merge
    Merge the FW,REV and BC fastq files into one. Extract the index sequence from the middle and end of an Illumina run. Separates reads for Paired End runs. Merge/Adapter trim reads stored in BAM.
 assignment_mapping
diff --git a/docs/combined_example1.rst b/docs/combined_example1.rst
@@ -137,7 +137,6 @@ You should see a list of rules that will be executed. This is the summary:
     assignment_filter                                                   1              1              1
     assignment_flagstat                                                 1              1              1
     assignment_getBCs                                                   1              1              1
-    assignment_getInputs                                                3              1              1
     assignment_idx_bam                                                  1              1              1
     assignment_mapping                                                  1             10             10
     assignment_merge                                                   30              1              1
@@ -168,7 +167,7 @@ You should see a list of rules that will be executed. This is the summary:
     statistic_counts_frequent_umis                                      6              1              1
     statistic_counts_stats_merge                                        2              1              1
     statistic_counts_table                                             12              1              1
-    total                                                             139              1             10
+    total                                                             136              1             10
 
 
 When dry-drun does not give any errors we will run the workflow. We use a machine with 30 threads/cores to run the workflow. Therefore :code:`split_number` is set to 30 to parallize the workflow. Also we are using 10 threads for mapping (bwa mem). But snakemake takes care that no more than 30 threads are used.
@@ -180,7 +179,7 @@ When dry-drun does not give any errors we will run the workflow. We use a machin
 
 .. note:: Please modify your code when running in a cluster environment. We have an example SLURM config file here :code:`config/sbatch.yml`.
 
-If everything works fine the 41 rules showed above will run. Please goto the :ref:`Assignment example`_ and the :ref:`Count example`_ 
+If everything works fine the 40 rules shown above will run. Please go to the :ref:`Assignment example`_ and the :ref:`Count example`_
 
 Results
 -----------------
diff --git a/workflow/envs/NGmerge.yaml b/workflow/envs/NGmerge.yaml
@@ -0,0 +1,10 @@
+---
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - ngmerge=0.3
+  - python
+  - click
+  - htslib
diff --git a/workflow/envs/fastq_join.yaml b/workflow/envs/fastq_join.yaml
@@ -0,0 +1,10 @@
+---
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - fastq-join=1.3.1
+  - python
+  - click
+  - htslib
diff --git a/workflow/rules/assignment.smk b/workflow/rules/assignment.smk
@@ -1,39 +1,13 @@
-SPLIT_FILES_NUMBER = 1
-
-
 include: "assignment/statistic.smk"
 
 
-rule assignment_getInputs:
-    """
-    Concat the input fastq files per R1,R2,R3. 
-    If only single fastq file is provided a symbolic link is created.
-    """
-    conda:
-        "../envs/default.yaml"
-    input:
-        lambda wc: config["assignments"][wc.assignment][wc.R],
-    output:
-        R1=temp("results/assignment/{assignment}/fastq/{R}.fastq.gz"),
-    log:
-        temp("results/logs/assignment/getInputs.{assignment}.{R}.log"),
-    shell:
-        """
-        if [[ "$(ls {input} | wc -l)" -eq 1 ]]; then 
-            ln -rs {input} {output}; 
-        else
-            zcat {input} | gzip -c > {output};
-        fi &> {log}
-        """
-
-
 rule assignment_fastq_split:
     """
     Split the fastq files into n files for parallelisation. 
     n is given by split_read in the configuration file.
     """
     input:
-        "results/assignment/{assignment}/fastq/{R}.fastq.gz",
+        lambda wc: config["assignments"][wc.assignment][wc.R],
     output:
         temp(
             expand(
@@ -59,54 +33,69 @@ rule assignment_fastq_split:
         ),
     shell:
         """
-        fastqsplitter -i {input} -t 1 {params.files} &> {log}
+        fastqsplitter -i <(zcat {input}) -t 1 {params.files} &> {log}
+        """
+
+
+rule assignment_attach_idx:
+    """
+    Extract the index sequence and add it to the header.
+    """
+    conda:
+        "../envs/NGmerge.yaml"
+    input:
+        read="results/assignment/{assignment}/fastq/splits/R{R}.split{split}.fastq.gz",
+        BC="results/assignment/{assignment}/fastq/splits/R2.split{split}.fastq.gz",
+        script=getScript("attachBCToFastQ.py"),
+    output:
+        read=temp(
+            "results/assignment/{assignment}/fastq/splits/R{R}.split{split}.BCattached.fastq.gz"
+        ),
+    log:
+        temp("results/logs/assignment/attach_idx.{assignment}.{split}.{R}.log"),
+    shell:
+        """
+        python {input.script} -r {input.read} -b {input.BC} 2> {log} | bgzip -c > {output.read}
         """
 
 
 rule assignment_merge:
     """
     Merge the FW,REV and BC fastq files into one. 
-    Extract the index sequence from the middle and end of an Illumina run. 
-    Separates reads for Paired End runs. Merge/Adapter trim reads stored in BAM.
+    The index sequence is already in the read header (added by assignment_attach_idx).
     """
     conda:
-        "../envs/python27.yaml"
+        "../envs/NGmerge.yaml"
     input:
-        R1="results/assignment/{assignment}/fastq/splits/R1.split{split}.fastq.gz",
-        R2="results/assignment/{assignment}/fastq/splits/R2.split{split}.fastq.gz",
-        R3="results/assignment/{assignment}/fastq/splits/R3.split{split}.fastq.gz",
-        script_FastQ2doubleIndexBAM=getScript("count/FastQ2doubleIndexBAM.py"),
-        script_MergeTrimReadsBAM=getScript("count/MergeTrimReadsBAM.py"),
+        R1="results/assignment/{assignment}/fastq/splits/R1.split{split}.BCattached.fastq.gz",
+        R3="results/assignment/{assignment}/fastq/splits/R3.split{split}.BCattached.fastq.gz",
     output:
-        bam=temp("results/assignment/{assignment}/bam/merge_split{split}.bam"),
+        un=temp("results/assignment/{assignment}/fastq/merge_split{split}.un.fastq.gz"),
+        join=temp(
+            "results/assignment/{assignment}/fastq/merge_split{split}.join.fastq.gz"
+        ),
     params:
-        bc_length=lambda wc: config["assignments"][wc.assignment]["bc_length"],
+        min_overlap=lambda wc: config["assignments"][wc.assignment]["NGmerge"][
+            "min_overlap"
+        ],
+        frac_mismatches_allowed=lambda wc: config["assignments"][wc.assignment][
+            "NGmerge"
+        ]["frac_mismatches_allowed"],
+        min_dovetailed_overlap=lambda wc: config["assignments"][wc.assignment][
+            "NGmerge"
+        ]["min_dovetailed_overlap"],
     log:
         temp("results/logs/assignment/merge.{assignment}.{split}.log"),
     shell:
         """
-        set +o pipefail;
-
-        fwd_length=`zcat {input.R1} | head -2 | tail -1 | wc -c`;
-        fwd_length=$(expr $(($fwd_length-1)));
-
-        rev_start=$(expr $(($fwd_length+1+{params.bc_length})));
-
-
-        paste <( zcat {input.R1} ) <( zcat {input.R2} ) <( zcat {input.R3} ) | \
-        awk '{{ 
-            count+=1; 
-            if ((count == 1) || (count == 3)) {{ 
-                print $1 
-            }} else {{
-                print $1$2$3 
-            }}; 
-            if (count == 4) {{
-                count=0 
-            }} 
-        }}' | \
-        python {input.script_FastQ2doubleIndexBAM} -p -s $rev_start -l {params.bc_length} -m 0 | \
-        python {input.script_MergeTrimReadsBAM} -c '' -f CATTGCGTGAACCGACAATTCGTCGAGGGACCTAATAAC -s AGTTGATCCGGTCCTAGGTCTAGAGCGGGCCCTGGCAGA --mergeoverlap -p > {output} 2> {log}
+        NGmerge \
+        -1 {input.R1} \
+        -2 {input.R3} \
+        -m {params.min_overlap} -p {params.frac_mismatches_allowed} -e {params.min_dovetailed_overlap} \
+        -z \
+        -o  {output.join} \
+        -i -f {output.un} \
+        -l {log}
         """
 
 
@@ -144,7 +133,7 @@ rule assignment_mapping:
     Map the reads to the reference and sort.
     """
     input:
-        bams="results/assignment/{assignment}/bam/merge_split{split}.bam",
+        reads="results/assignment/{assignment}/fastq/merge_split{split}.join.fastq.gz",
         reference="results/assignment/{assignment}/reference/reference.fa",
         bwa_index=expand(
             "results/assignment/{{assignment}}/reference/reference.fa.{ext}",
@@ -160,8 +149,7 @@ rule assignment_mapping:
     shell:
         """
         bwa mem -t {threads} -L 80 -M -C {input.reference} <(
-            samtools view -F 513 {input.bams} | \
-            awk 'BEGIN{{ OFS="\\n"; FS="\\t" }}{{ print "@"$1" "$12","$13,$10,"+",$11 }}';
+            gzip -dc {input.reads}
         )  | samtools sort -l 0 -@ {threads} > {output} 2> {log}
         """
 
diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk
@@ -408,7 +408,7 @@ def withoutZeros(project, conf):
 
 
 def getSplitNumber():
-    split = SPLIT_FILES_NUMBER
+    split = 1
 
     if "global" in config:
         if "assignments" in config["global"]:
diff --git a/workflow/schemas/config.schema.yaml b/workflow/schemas/config.schema.yaml
@@ -68,6 +68,24 @@ properties:
               type: string
             minItems: 1
             uniqueItems: true
+          NGmerge:
+            type: object
+            properties:
+              min_overlap:
+                type: integer
+                default: 20
+              frac_mismatches_allowed:
+                type: number
+                default: 0.1
+              min_dovetailed_overlap:
+                type: integer
+                default: 20
+            required:
+              - min_overlap
+              - frac_mismatches_allowed
+              - min_dovetailed_overlap
+            default: {}
+            additionalProperties: false
           reference:
             type: string
           configs:
@@ -106,6 +124,7 @@ properties:
           - configs
           - alignment_start
           - sequence_length
+          - NGmerge
         additionalProperties: false
     additionalProperties: false
     minProperties: 1
diff --git a/workflow/scripts/attachBCToFastQ.py b/workflow/scripts/attachBCToFastQ.py
@@ -0,0 +1,36 @@
+import click
+from common import read_fastq
+import gzip
+
+
+def read_sequence_files(read_file, bc_file):
+    for read, bc in zip(read_fastq(read_file), read_fastq(bc_file)):
+        seqid_read, seq_read, qual_read = read
+        seqid_read = seqid_read.split(" ")[0]
+        seqid_bc, seq_bc, qual_bc = bc
+        seqid_bc = seqid_bc.split(" ")[0]
+        if seqid_read != seqid_bc:
+            raise Exception('Sequence IDs do not match: %s != %s' % (seqid_read, seqid_bc))
+        seqid = "%s XI:Z:%s,YI:Z:%s" % (seqid_read, seq_bc, qual_bc)
+        yield seqid, seq_read, qual_read
+    return
+
+
+@click.command()
+@click.option('--reads', '-r',
+              "read_file",
+              type=click.Path(exists=True, readable=True),
+              required=True)
+@click.option('--barcodes', '-b',
+              "barcode_file",
+              type=click.Path(exists=True, readable=True),
+              required=True)
+def cli(read_file, barcode_file):
+    """Attach each barcode to the header of its read and print FASTQ to stdout."""
+    with gzip.open(read_file, 'rt') as r_file, gzip.open(barcode_file, 'rt') as bc_file:
+        for seqid, seq, qual in read_sequence_files(r_file, bc_file):
+            click.echo("@%s\n%s\n+\n%s" % (seqid, seq, qual))
+
+
+if __name__ == '__main__':
+    cli()
diff --git a/workflow/scripts/common.py b/workflow/scripts/common.py