Skip to content

Commit d1cb257

Browse files
committed
Adding result parsers and updates throughout
1 parent 363d1eb commit d1cb257

30 files changed

+533
-124
lines changed

README.md

+9-1
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,12 @@
1-
# Pipeline
1+
# GMO-Check pipeline
2+
3+
This pipeline is being developed to detect GMO content from short-read (amplicon) data. Currently, only the detection of GABA mutations in tomato are supported. As this particular modification is characterized by a specific nucleotide insertion into the SIGAB3 gene, this is what the pipeline will currently try to check for. It does this by two independent approaches - classical variant calling on the one hand and the detection of the insertion in merged and dereplicated amplicon "ZOTUS" against a blast database containing the wild type gene sequence on the other.
4+
5+
The variant calling workflow will first align quality- and adapter trimmed reads against the tomato reference genome (v3.0). It then masks out bases that start and overlap with the curated primer site locations using samtools ampliconclip. No deduplication will be performed to enable the accurate determination of GMO content in mixed samples. Finally, the read alignment is analyzed with Freebayes to determine the presence of any diagnostically relevant variants.
6+
7+
For the amplicon-assembly approach, primer sequences are stripped from the reads using Ptrimmer. The stripped reads are then merged, filtered and reduced to unique "ZOTUS" with Vsearch. The resulting sequences are blasted against a built-in database to check for evidence of the diagnostically relevant insertion. GMO content is determined by extracting the overall raw read count as annotated into the assembled and dereplicated reads for which a positive signal was determined with Blast versus the total number of raw reads represented in the reduced amplicon sequence set.
8+
9+
Using a set of 126 samples, both approaches yielded very similar estimates for % GMO content in given sample (+/- 1%).
210

311
## Documentation
412

assets/genomes/test/amplicon.txt

+3
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
CGAACCCTAGCAGATCGTCT TCAAAACAACCATTAATCCTTCCCT 162 IL.S.tSIGAD3
2+
AAGACAATAGCCTCCACAACG AGTCAGTACAAGACATAATAATACAAAGAG 438 N028_SiGAD3_N-term-seq2
3+
AGGGATATCGAAATGTAATGGAAAATTG CAATTCAATAGAACAAAGGATGATACATTC 510 N029_SiGAD3_N-term-seq1

assets/genomes/test/genome.fa

+23
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
>SiGAD3|NM_001246898.2
2+
ATGGTTCTCTCAAAAACTCCTTCTGATGATTCTGTACACTCCACATTTGCTTCTCGCTATGTTCGAACTT
3+
CACTACCAAGGTTTGAGATGCTAGAGAAGTCTATACCAAAAGAGGCAGCATACCAAATGATTAATGATGA
4+
GTTAATGCTTGATGGGAATCCAAGGTTAAATTTGGCATCATTTGTAACCACATGGATGGAACCAGAATGT
5+
GATAAGCTTATGATGGCTTCAATTAACAAGAATTATGTTGACATGGATGAATACCCTGTCACCACTGAGC
6+
TTCAGAATCGATGTGTAAACATGATAGCGCGTTTATTCAATGCGCCTTTGAAAGAGGAAGAAATAGGAAT
7+
TGGTGTGGGGACAGTGGGGTCATCAGAGGCCATAATGTTAGCAGGGCTGGCCTTCAAGAGGAACTGGCAA
8+
AACAAACGCAAAGCTGAGGGAAAGCCTTATGATAAGCCCAACATTGTCACTGGTGCTAATGTTCAGGTGT
9+
GTTGGGAGAAATTTGCAAACTACTTTGAAGTGGAATTGAAACAAGTCAAGTTAAGTGAAGGGTACTATGT
10+
GATGGACCCAATCAAAGCTGTGGAAATGGTAGATGACAACACTATTTGTGTTGCTGCTATTTTGGGTTCA
11+
ACACTTAATGGAGAATTTGAAGATGTCAAACTCTTGAATGATCTTTTGATTGAAAAGAATAAACAAACTG
12+
GATGGGACACACCTATTCATGTGGATGCAGCAAGTGGTGGATTCATTGCACCATTTATCTATCCAGAGTT
13+
GGAATGGGATTTTAGGCTTCCTTTAGTGAAAAGTATTAATGTGAGTGGACACAAATATGGGCTTGTTTAT
14+
GCTGGTATTGGTTGGGTTATTTGGAGAACTAAACAAGACTTGCCTCAACAACTCATTTTTCATATCAATT
15+
ATCTTGGTGCTGATCAGCCTACTTTTACTCTCAATTTCTCTAAAGGTTCAAGTCAAGTCATTGCTCAATA
16+
TTATCAGCTTATCCGCTTGGGCTATGAGGGATATCGAAATGTAATGGAAAATTGTCGTGAAAATGCAATT
17+
GTGCTAAGAAAAGGACTTGAAAAAACAGGACGTTTCAATATAATCTCCAAAGATGAAGGTATACCCTTGG
18+
TGGCATTTTCCCTCAAAGACAATAGCCTCCACAACGAATTCGAGGTCTCTGAGACCCTCCGTAGGTTTGG
19+
GTGGATTGTCCCAGCCTACACTATGCCAGCTGACCTGCAACATGTTACAGTGTTGCGCGTTGTGATTAGA
20+
GAGGACTTCTCCCGAACCCTAGCAGATCGTCTTGTCTCTGACATCGTCAAGGTCCTCCACGAGCTCCCGA
21+
ATGCCAAAAAAGTGGAGGATAATTTGATGATCAATAATGAGAAGAAAACAGAAATTGAAGTTCAAAGGGC
22+
AATTGCTGAGTTTTGGAAGAAATATGTTTTAGCTAGGAAAGCATCTATTTGTTAGGGAAGGATTAATGGT
23+
TGTTTTGAAGGAATAAGTATTAATTACTAGTAACGTTTTGGTATTAATTATAAAAAATGTG

assets/genomes/test/primers.bed

+7
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
1 14628 14648 N028_SiGAD3_N-term-seq2_LEFT 60 +
2+
1 15086 15115 N030_SiGAD3_N-term-seq2_RIGHT 60 -
3+
1 14499 14526 N027_SiGAD3_N-term-seq1_LEFT 60 +
4+
1 15036 15065 N029_SiGAD3_N-term-seq1_RIGHT 60 -
5+
1 14765 14784 IL.S.t.SIGAD3.5s 60 +
6+
1 14946 14970 IL.S.t.SIGAD3.4as 60 -
7+

assets/genomes/test/rules.json

+29
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
{
2+
"rules": {
3+
"vsearch-blast": {
4+
"payload": [
5+
{
6+
"format": "JSON",
7+
"name": "GABA Mutation in SIGAD3",
8+
"target": "SiGAD3|NM_001246898.2",
9+
"matcher": "AAAG-TGGA",
10+
"positive_report": "Diese Probe enthält eine GABA Mutation in SIGAD3. Nachweis erbraucht über: Amplicon Analyse.",
11+
"negative_report": "Für diese Probe konnte keine GABA Mutation in SIGAD3 nachgewiesen werden."
12+
}
13+
]
14+
15+
},
16+
"bwa-freebayes": {
17+
"payload": [
18+
{
19+
"format": "VCF",
20+
"target": "1:14834",
21+
"name": "GABA Mutation in SIGAD3",
22+
"matcher": "1\t14834\t.\tGTG\tGTTG",
23+
"positive_report": "Diese Probe enthält eine GABA Mutation in SIGAD3. Nachweis erbracht über: Varianten Analyse.",
24+
"negative_report": "Für diese Probe konnte keine GABA Mutation in SIGAD3 nachgewiesen werden."
25+
}
26+
]
27+
}
28+
}
29+
}

assets/genomes/test/targets.bed

+1
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
1 14784 14946 SIGAD3

assets/genomes/tomato/rules.json

+7-3
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,12 @@
33
"vsearch-blast": {
44
"payload": [
55
{
6-
"format": "XML",
6+
"format": "JSON",
7+
"name": "GABA Mutation in SIGAD3",
78
"target": "SiGAD3|NM_001246898.2",
89
"matcher": "AAAG-TGGA",
9-
"yields": "Diese Probe enthält eine GABA Mutation in SIGAD3. Nachweis erbraucht über: Amplicon Analyse."
10+
"positive_report": "Diese Probe enthält eine GABA Mutation in SIGAD3. Nachweis erbraucht über: Amplicon Analyse.",
11+
"negative_report": "Für diese Probe konnte keine GABA Mutation in SIGAD3 nachgewiesen werden."
1012
}
1113
]
1214

@@ -16,8 +18,10 @@
1618
{
1719
"format": "VCF",
1820
"target": "1:14834",
21+
"name": "GABA Mutation in SIGAD3",
1922
"matcher": "1\t14834\t.\tGTG\tGTTG",
20-
"yields": "Diese Probe enthält eine GABA Mutation in SIGAD3. Nachweis erbracht über: Varianten Analyse."
23+
"positive_report": "Diese Probe enthält eine GABA Mutation in SIGAD3. Nachweis erbracht über: Varianten Analyse.",
24+
"negative_report": "Für diese Probe konnte keine GABA Mutation in SIGAD3 nachgewiesen werden."
2125
}
2226
]
2327
}

bin/analyze_blast.rb

100644100755
+69-4
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
require 'optparse'
44
require 'ostruct'
5-
require 'nokogiri'
5+
require 'json'
66

77
### Define modules and classes here
88

@@ -11,18 +11,83 @@
1111
opts = OptionParser.new()
1212
opts.banner = "Reads Fastq files from a folder and writes a sample sheet to STDOUT"
1313
opts.separator ""
14-
opts.on("-b","--blast", "=BLAST","Blast report to read") {|argument| options.vcf = argument }
14+
opts.on("-b","--blast", "=BLAST","Blast report to read") {|argument| options.blast = argument }
1515
opts.on("-j","--json", "=JSON","JSON to read") {|argument| options.json = argument }
16+
opts.on("-s","--sample", "=SAMPLE","Sample name") {|argument| options.sample = argument }
1617
opts.on("-h","--help","Display the usage information") {
1718
puts opts
1819
exit
1920
}
2021

2122
opts.parse!
2223

24+
min_coverage = 100
25+
26+
output = { "sample" => options.sample, "matches" => [] }
27+
2328
date = Time.now.strftime("%Y-%m-%d")
2429

25-
file = File.open(options.blast)
30+
json = JSON.parse(IO.readlines(options.json).join)
31+
32+
rules = json["rules"]["vsearch-blast"]["payload"]
33+
34+
blast = JSON.parse(IO.readlines(options.blast).join)
35+
36+
reports = blast["BlastOutput2"]
37+
38+
findings = []
39+
total_cov = 0
40+
carrier_cov = 0
41+
42+
rules.each do |rule|
43+
44+
total_cov = 0
45+
query_cov = 0
46+
47+
rule_name = rule["name"]
48+
rule_string = rule["matcher"]
49+
50+
has_matched = false
51+
reports.each do |r|
52+
53+
report = r["report"]
54+
results = report["results"]["search"]
55+
56+
query_string = results["query_title"]
57+
58+
query,coverage = query_string.split(";")
59+
coverage = coverage.gsub("size=", "").to_i
60+
61+
total_cov += coverage
62+
63+
hits = results["hits"]
64+
65+
hits.each do |hit|
66+
67+
target = hit["description"][0]["title"]
68+
69+
hsps = hit["hsps"]
70+
71+
hsps.each do |hsp|
72+
target_seq = hsp["hseq"]
73+
74+
if target_seq.include?(rule_string)
75+
has_matched = true
76+
carrier_cov += coverage
77+
end
78+
79+
end
80+
end
81+
82+
end
83+
84+
if has_matched
85+
perc = (carrier_cov.to_f / total_cov.to_f) * 100
86+
output["matches"] << { "rule" => rule_name , "Befund" => rule["positive_report"], "Anteil Variante %" => perc.round(2), "Abdeckung Referenzallel" => total_cov-carrier_cov, "Abdeckung Variantenallel" => carrier_cov }
87+
else
88+
output["matches"] << { "rule" => rule_name , "Befund" => rule["negative_report"] }
89+
end
2690

27-
xml = Nokogiri::XML(file)
91+
end
2892

93+
puts output.to_json

bin/analyze_vcf.rb

100644100755
+37-21
Original file line numberDiff line numberDiff line change
@@ -72,6 +72,7 @@ def parse_vcf(file)
7272
opts.separator ""
7373
opts.on("-v","--vcf", "=VCF","VCF to read") {|argument| options.vcf = argument }
7474
opts.on("-j","--json", "=JSON","JSON to read") {|argument| options.json = argument }
75+
opts.on("-s","--sample", "=SAMPLE","Sample name") {|argument| options.sample = argument }
7576
opts.on("-h","--help","Display the usage information") {
7677
puts opts
7778
exit
@@ -81,43 +82,58 @@ def parse_vcf(file)
8182

8283
date = Time.now.strftime("%Y-%m-%d")
8384

85+
result = { "sample" => options.sample, "matches" => [] }
86+
8487
json = JSON.parse(IO.readlines(options.json).join)
8588

8689
rules = json["rules"]["bwa-freebayes"]["payload"]
8790

8891
vcf = parse_vcf(options.vcf)
8992

90-
vcf.each do |entry|
91-
92-
allele = entry.allele_string
93+
rules.each do |rule|
94+
95+
this_match = {}
9396

94-
sample_name = entry.sample_names[0]
95-
puts ">>>" + sample_name + "<<<"
97+
rule_name = rule["name"]
98+
rule_string = rule["matcher"]
9699

97100
has_matched = false
98101

99-
rules.each do |rule|
100-
string = rule["matcher"]
101-
if string == allele
102+
vcf.each do |entry|
103+
104+
allele_string = entry.allele_string
105+
this_sample = entry.samples[0]
106+
107+
# A match, presumably
108+
if rule_string == allele_string
109+
102110
has_matched = true
103-
104-
puts rule["yields"]
105-
sample = entry.samples[0]
106-
genotype = sample["GT"]
111+
112+
this_match["rule"] = rule_name
113+
this_match["Befund"] = rule["positive_report"]
114+
115+
genotype = this_sample["GT"]
116+
107117
if genotype == "0/0"
108-
puts "Varianten Frequenz unter Detektierungsschwelle!"
118+
this_match["Anmerkung"] = "Variantenfrequenz unter Call-Schwelle!"
109119
end
110-
rcov,acov = sample["AD"].split(",")
111-
perc = (acov.to_f / rcov.to_f)*100.0
112-
puts "\tGenotyp: #{sample["GT"]}\tAnteil: #{perc.round(2)}%\tRef: #{rcov}\tAlt: #{acov}\t"
120+
121+
rcov,acov = this_sample["AD"].split(",")
122+
cov_sum = acov.to_i + rcov.to_i
123+
perc = (acov.to_f / cov_sum.to_f)*100.0
124+
this_match["Anteil Variante %"] = perc.round(2)
125+
this_match["Abdeckung Referenzallel"] = rcov
126+
this_match["Abdeckung Variantenallel"] = acov
127+
128+
result["matches"] << this_match
113129

114130
end
115131
end
116-
117-
if !has_matched
118-
puts "Keine GABA Mutation nachgewiesen!"
132+
133+
unless has_matched
134+
result["matches"] << { "rule" => rule_name, "Befund" => rule["negative_report"]}
119135
end
120136

121-
puts "==============================================================================="
137+
end
122138

123-
end
139+
puts result.to_json

bin/reports_to_table.rb

+70
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
#!/bin/env ruby
2+
3+
require 'optparse'
4+
require 'ostruct'
5+
require 'json'
6+
require 'csv'
7+
8+
### Define modules and classes here
9+
10+
def parse_json(filename)
11+
12+
return JSON.parse(IO.readlines(filename).join)
13+
14+
end
15+
16+
### Get the script arguments and open relevant files
17+
options = OpenStruct.new()
18+
opts = OptionParser.new()
19+
opts.banner = "Reads reports and makes a table"
20+
opts.separator ""
21+
opts.on("-h","--help","Display the usage information") {
22+
puts opts
23+
exit
24+
}
25+
26+
opts.parse!
27+
28+
files = Dir["*.json"]
29+
30+
rows = []
31+
32+
header = [ "Sample", "Blast", "Freebayes" ]
33+
34+
rows << header
35+
36+
files.group_by{|f| f.split(".")[0..-3].join}.each do |group,reports|
37+
38+
blast = reports.find {|r| r.include?("blast.json")}
39+
freebaytes = reports.find { |r| r.include?("freebayes.json")}
40+
41+
this_data = []
42+
43+
sample = group
44+
this_data << sample
45+
46+
if blast
47+
json = parse_json(blast)
48+
matches = json["matches"]
49+
this_data << matches[0]["Befund"]
50+
else
51+
this_data << ""
52+
end
53+
54+
if freebayes
55+
json = parse_json(freebayes)
56+
matches = json["matches"]
57+
this_data << matches[0]["Befund"]
58+
else
59+
this_data << ""
60+
end
61+
62+
rows << this_data
63+
64+
end
65+
66+
File.write("summary_mqc.csv", rows.map(&:to_csv).join)
67+
68+
69+
70+

0 commit comments

Comments
 (0)