Skip to content

Commit b8c5666

Browse files
authored
Merge pull request #81 from visze/feature/NGmerge
feat: NGmerge
2 parents 1e262bb + be307da commit b8c5666

File tree

10 files changed

+387
-78
lines changed

10 files changed

+387
-78
lines changed

Diff for: config/sbatch.yml

+3-6
Original file line numberDiff line numberDiff line change
@@ -9,14 +9,11 @@ __default__:
99
##################
1010
### ASSIGNMENT ###
1111
##################
12-
assignment_getInputs:
13-
time: "0-10:00"
14-
queue: medium
1512
assignment_merge:
1613
time: "0-08:00"
1714
queue: medium
1815
assignment_fastq_split:
19-
time: "0-02:00"
16+
time: "0-04:00"
2017
threads: 1
2118
mem: 10G
2219
queue: medium
@@ -31,9 +28,9 @@ assignment_collect:
3128
mem: 10G
3229
queue: medium
3330
assignment_getBCs:
34-
time: "0-04:00"
31+
time: "1-08:00"
3532
threads: 1
36-
queue: short
33+
queue: medium
3734
assignment_statistic_totalCounts:
3835
time: "0-01:00"
3936
threads: 1

Diff for: docs/assignment_example1.rst

+2-5
Original file line numberDiff line numberDiff line change
@@ -123,14 +123,13 @@ You should see a list of rules that will be executed. This is the summary:
123123
assignment_filter 2 1 1
124124
assignment_flagstat 1 1 1
125125
assignment_getBCs 1 1 1
126-
assignment_getInputs 3 1 1
127126
assignment_idx_bam 1 1 1
128127
assignment_mapping 1 1 1
129128
assignment_merge 30 10 10
130129
assignment_statistic_assignedCounts 2 1 1
131130
assignment_statistic_assignment 2 1 1
132131
assignment_statistic_totalCounts 1 1 1
133-
total 49 1 1
132+
total 46 1 1
134133
135134
136135
When dry-drun does not give any errors we will run the workflow. We use a machine with 30 threads/cores to run the workflow. Therefore :code:`split_number` is set to 30 to parallize the workflow. Also we are using 10 threads for mapping (bwa mem). But snakemake takes care that no more than 30 threads are used.
@@ -142,16 +141,14 @@ When dry-drun does not give any errors we will run the workflow. We use a machin
142141
143142
.. note:: Please modify your code when running in a cluster environment. We have an example SLURM config file here :code:`config/sbatch.yml`.
144143

145-
If everything works fine the 13 rules showed above will run:
144+
If everything works fine the 12 rules showed above will run:
146145

147146
all
148147
The overall all rule. Here is defined what final output files are expected.
149148
assignment_bwa_ref
150149
Create mapping reference for BWA from design file.
151150
assignment_fastq_split
152151
Split the fastq files into n files for parallelisation. N is given by split_read in the configuration file.
153-
assignment_getInputs
154-
Concat the input fastq files per R1,R2,R3. If only single fastq file is provided a symbolic link is created.
155152
assignment_merge
156153
Merge the FW,REV and BC fastq files into one. Extract the index sequence from the middle and end of an Illumina run. Separates reads for Paired End runs. Merge/Adapter trim reads stored in BAM.
157154
assignment_mapping

Diff for: docs/combined_example1.rst

+2-3
Original file line numberDiff line numberDiff line change
@@ -137,7 +137,6 @@ You should see a list of rules that will be executed. This is the summary:
137137
assignment_filter 1 1 1
138138
assignment_flagstat 1 1 1
139139
assignment_getBCs 1 1 1
140-
assignment_getInputs 3 1 1
141140
assignment_idx_bam 1 1 1
142141
assignment_mapping 1 10 10
143142
assignment_merge 30 1 1
@@ -168,7 +167,7 @@ You should see a list of rules that will be executed. This is the summary:
168167
statistic_counts_frequent_umis 6 1 1
169168
statistic_counts_stats_merge 2 1 1
170169
statistic_counts_table 12 1 1
171-
total 139 1 10
170+
total 136 1 10
172171
173172
174173
When dry-drun does not give any errors we will run the workflow. We use a machine with 30 threads/cores to run the workflow. Therefore :code:`split_number` is set to 30 to parallize the workflow. Also we are using 10 threads for mapping (bwa mem). But snakemake takes care that no more than 30 threads are used.
@@ -180,7 +179,7 @@ When dry-drun does not give any errors we will run the workflow. We use a machin
180179
181180
.. note:: Please modify your code when running in a cluster environment. We have an example SLURM config file here :code:`config/sbatch.yml`.
182181

183-
If everything works fine the 41 rules showed above will run. Please goto the :ref:`Assignment example`_ and the :ref:`Count example`_
182+
If everything works fine the 40 rules showed above will run. Please goto the :ref:`Assignment example`_ and the :ref:`Count example`_
184183

185184
Results
186185
-----------------

Diff for: workflow/envs/NGmerge.yaml

+10
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
---
2+
channels:
3+
- conda-forge
4+
- bioconda
5+
- defaults
6+
dependencies:
7+
- ngmerge=0.3
8+
- python
9+
- click
10+
- htslib

Diff for: workflow/envs/fastq_join.yaml

+11
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
---
2+
channels:
3+
- conda-forge
4+
- bioconda
5+
- defaults
6+
dependencies:
7+
- fastq-join=1.3.1
8+
- python
9+
- click
10+
11+
- htslib

Diff for: workflow/rules/assignment.smk

+51-63
Original file line numberDiff line numberDiff line change
@@ -1,39 +1,13 @@
1-
SPLIT_FILES_NUMBER = 1
2-
3-
41
include: "assignment/statistic.smk"
52

63

7-
rule assignment_getInputs:
8-
"""
9-
Concat the input fastq files per R1,R2,R3.
10-
If only single fastq file is provided a symbolic link is created.
11-
"""
12-
conda:
13-
"../envs/default.yaml"
14-
input:
15-
lambda wc: config["assignments"][wc.assignment][wc.R],
16-
output:
17-
R1=temp("results/assignment/{assignment}/fastq/{R}.fastq.gz"),
18-
log:
19-
temp("results/logs/assignment/getInputs.{assignment}.{R}.log"),
20-
shell:
21-
"""
22-
if [[ "$(ls {input} | wc -l)" -eq 1 ]]; then
23-
ln -rs {input} {output};
24-
else
25-
zcat {input} | gzip -c > {output};
26-
fi &> {log}
27-
"""
28-
29-
304
rule assignment_fastq_split:
315
"""
326
Split the fastq files into n files for parallelisation.
337
n is given by split_read in the configuration file.
348
"""
359
input:
36-
"results/assignment/{assignment}/fastq/{R}.fastq.gz",
10+
lambda wc: config["assignments"][wc.assignment][wc.R],
3711
output:
3812
temp(
3913
expand(
@@ -59,54 +33,69 @@ rule assignment_fastq_split:
5933
),
6034
shell:
6135
"""
62-
fastqsplitter -i {input} -t 1 {params.files} &> {log}
36+
fastqsplitter -i <(zcat {input}) -t 1 {params.files} &> {log}
37+
"""
38+
39+
40+
rule assignment_attach_idx:
41+
"""
42+
Extract the index sequence and add it to the header.
43+
"""
44+
conda:
45+
"../envs/NGmerge.yaml"
46+
input:
47+
read="results/assignment/{assignment}/fastq/splits/R{R}.split{split}.fastq.gz",
48+
BC="results/assignment/{assignment}/fastq/splits/R2.split{split}.fastq.gz",
49+
script=getScript("attachBCToFastQ.py"),
50+
output:
51+
read=temp(
52+
"results/assignment/{assignment}/fastq/splits/R{R}.split{split}.BCattached.fastq.gz"
53+
),
54+
log:
55+
temp("results/logs/assignment/attach_idx.{assignment}.{split}.{R}.log"),
56+
shell:
57+
"""
58+
python {input.script} -r {input.read} -b {input.BC} | bgzip -c > {output.read}
6359
"""
6460

6561

6662
rule assignment_merge:
6763
"""
6864
Merge the FW,REV and BC fastq files into one.
69-
Extract the index sequence from the middle and end of an Illumina run.
70-
Separates reads for Paired End runs. Merge/Adapter trim reads stored in BAM.
65+
Extract the index sequence and add it to the header.
7166
"""
7267
conda:
73-
"../envs/python27.yaml"
68+
"../envs/NGmerge.yaml"
7469
input:
75-
R1="results/assignment/{assignment}/fastq/splits/R1.split{split}.fastq.gz",
76-
R2="results/assignment/{assignment}/fastq/splits/R2.split{split}.fastq.gz",
77-
R3="results/assignment/{assignment}/fastq/splits/R3.split{split}.fastq.gz",
78-
script_FastQ2doubleIndexBAM=getScript("count/FastQ2doubleIndexBAM.py"),
79-
script_MergeTrimReadsBAM=getScript("count/MergeTrimReadsBAM.py"),
70+
R1="results/assignment/{assignment}/fastq/splits/R1.split{split}.BCattached.fastq.gz",
71+
R3="results/assignment/{assignment}/fastq/splits/R3.split{split}.BCattached.fastq.gz",
8072
output:
81-
bam=temp("results/assignment/{assignment}/bam/merge_split{split}.bam"),
73+
un=temp("results/assignment/{assignment}/fastq/merge_split{split}.un.fastq.gz"),
74+
join=temp(
75+
"results/assignment/{assignment}/fastq/merge_split{split}.join.fastq.gz"
76+
),
8277
params:
83-
bc_length=lambda wc: config["assignments"][wc.assignment]["bc_length"],
78+
min_overlap=lambda wc: config["assignments"][wc.assignment]["NGmerge"][
79+
"min_overlap"
80+
],
81+
frac_mismatches_allowed=lambda wc: config["assignments"][wc.assignment][
82+
"NGmerge"
83+
]["frac_mismatches_allowed"],
84+
min_dovetailed_overlap=lambda wc: config["assignments"][wc.assignment][
85+
"NGmerge"
86+
]["min_dovetailed_overlap"],
8487
log:
8588
temp("results/logs/assignment/merge.{assignment}.{split}.log"),
8689
shell:
8790
"""
88-
set +o pipefail;
89-
90-
fwd_length=`zcat {input.R1} | head -2 | tail -1 | wc -c`;
91-
fwd_length=$(expr $(($fwd_length-1)));
92-
93-
rev_start=$(expr $(($fwd_length+1+{params.bc_length})));
94-
95-
96-
paste <( zcat {input.R1} ) <( zcat {input.R2} ) <( zcat {input.R3} ) | \
97-
awk '{{
98-
count+=1;
99-
if ((count == 1) || (count == 3)) {{
100-
print $1
101-
}} else {{
102-
print $1$2$3
103-
}};
104-
if (count == 4) {{
105-
count=0
106-
}}
107-
}}' | \
108-
python {input.script_FastQ2doubleIndexBAM} -p -s $rev_start -l {params.bc_length} -m 0 | \
109-
python {input.script_MergeTrimReadsBAM} -c '' -f CATTGCGTGAACCGACAATTCGTCGAGGGACCTAATAAC -s AGTTGATCCGGTCCTAGGTCTAGAGCGGGCCCTGGCAGA --mergeoverlap -p > {output} 2> {log}
91+
NGmerge \
92+
-1 {input.R1} \
93+
-2 {input.R3} \
94+
-m {params.min_overlap} -p {params.frac_mismatches_allowed} -e {params.min_dovetailed_overlap} \
95+
-z \
96+
-o {output.join} \
97+
-i -f {output.un} \
98+
-l {log}
11099
"""
111100

112101

@@ -144,7 +133,7 @@ rule assignment_mapping:
144133
Map the reads to the reference and sort.
145134
"""
146135
input:
147-
bams="results/assignment/{assignment}/bam/merge_split{split}.bam",
136+
reads="results/assignment/{assignment}/fastq/merge_split{split}.join.fastq.gz",
148137
reference="results/assignment/{assignment}/reference/reference.fa",
149138
bwa_index=expand(
150139
"results/assignment/{{assignment}}/reference/reference.fa.{ext}",
@@ -160,8 +149,7 @@ rule assignment_mapping:
160149
shell:
161150
"""
162151
bwa mem -t {threads} -L 80 -M -C {input.reference} <(
163-
samtools view -F 513 {input.bams} | \
164-
awk 'BEGIN{{ OFS="\\n"; FS="\\t" }}{{ print "@"$1" "$12","$13,$10,"+",$11 }}';
152+
gzip -dc {input.reads}
165153
) | samtools sort -l 0 -@ {threads} > {output} 2> {log}
166154
"""
167155

Diff for: workflow/rules/common.smk

+1-1
Original file line numberDiff line numberDiff line change
@@ -408,7 +408,7 @@ def withoutZeros(project, conf):
408408

409409

410410
def getSplitNumber():
411-
split = SPLIT_FILES_NUMBER
411+
split = 1
412412

413413
if "global" in config:
414414
if "assignments" in config["global"]:

Diff for: workflow/schemas/config.schema.yaml

+19
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,24 @@ properties:
6868
type: string
6969
minItems: 1
7070
uniqueItems: true
71+
NGmerge:
72+
type: object
73+
properties:
74+
min_overlap:
75+
type: integer
76+
default: 20
77+
frac_mismatches_allowed:
78+
type: number
79+
default: 0.1
80+
min_dovetailed_overlap:
81+
type: integer
82+
default: 20
83+
required:
84+
- min_overlap
85+
- frac_mismatches_allowed
86+
- min_dovetailed_overlap
87+
default: {}
88+
additionalProperties: false
7189
reference:
7290
type: string
7391
configs:
@@ -106,6 +124,7 @@ properties:
106124
- configs
107125
- alignment_start
108126
- sequence_length
127+
- NGmerge
109128
additionalProperties: false
110129
additionalProperties: false
111130
minProperties: 1

Diff for: workflow/scripts/attachBCToFastQ.py

+36
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
import click
2+
from common import read_fastq
3+
import gzip
4+
5+
6+
def read_sequence_files(read_file, bc_file):
7+
for read, bc in zip(read_fastq(read_file), read_fastq(bc_file)):
8+
seqid_read, seq_read, qual_read = read
9+
seqid_read = seqid_read.split(" ")[0]
10+
seqid_bc, seq_bc, qual_bc = bc
11+
seqid_bc = seqid_bc.split(" ")[0]
12+
if seqid_read != seqid_bc:
13+
raise Exception('Sequence IDs do not match: %s != %s' % (seqid_read, seqid_bc))
14+
seqid = "%s XI:Z:%s,YI:Z:%s" % (seqid_read, seq_bc, qual_bc)
15+
yield seqid, seq_read, qual_read
16+
return
17+
18+
19+
@click.command()
20+
@click.option('--reads', '-r',
21+
"read_file",
22+
type=click.Path(exists=True, readable=True),
23+
required=True)
24+
@click.option('--barcodes', '-b',
25+
"barcode_file",
26+
type=click.Path(exists=True, readable=True),
27+
required=True)
28+
def cli(read_file, barcode_file):
29+
30+
with gzip.open(read_file, 'rt') as r_file, gzip.open(barcode_file, 'rt') as bc_file:
31+
for seqid, seq, qual in read_sequence_files(r_file, bc_file):
32+
click.echo("@%s\n%s\n+\n%s" % (seqid, seq, qual))
33+
34+
35+
if __name__ == '__main__':
36+
cli()

0 commit comments

Comments
 (0)