Description
Hi,
I am encountering unexpected behaviour when using bcftools annotate with an additional INFO field as a matching key.
Background
I decomposed and normalised a multi-allelic VCF using bcftools norm with the --old-rec-tag option, specifying the tag name SOURCE_RECORD. After normalisation, some variants become identical in terms of CHROM, POS, REF and ALT. However, they remain distinguishable by SOURCE_RECORD because they originate from different sources.
I then generated a TSV file of per-variant metrics from the normalised VCF. I want to annotate the original VCF with these metrics using bcftools annotate. Because there are duplicate CHROM, POS, REF, ALT entries after normalisation, I followed the approach suggested in issue #2151, where an additional INFO field can be used as a matching key to disambiguate records.
Expected behaviour
When including SOURCE_RECORD as an additional key, I expect bcftools annotate to:
- Match records using CHROM, POS, REF, ALT and SOURCE_RECORD
- Treat SOURCE_RECORD as a literal string
- Only annotate records where the full SOURCE_RECORD value matches exactly
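To make the expectation concrete, the matching described above can be sketched outside bcftools as a plain awk join whose key is the literal concatenation of CHROM, POS, REF, ALT and SOURCE_RECORD. This only illustrates the intended semantics and says nothing about how bcftools annotate works internally. The function name literal_join, the file names in the usage comment, and the column layout (a TSV with SOURCE_RECORD in column 6 and the two metrics in columns 7 and 8; a VCF whose INFO column holds only SOURCE_RECORD) are assumptions made for the sketch.

```shell
# Sketch only: annotate VCF records from a TSV using a strictly literal key.
# Assumed TSV columns: CHR POS REF ALT VARIANT SOURCE_RECORD METRIC1 METRIC2
# Assumed VCF INFO column: exactly "SOURCE_RECORD=<value>" and nothing else.
literal_join() {
    # $1 = TSV file, $2 = uncompressed VCF file (both tab-separated)
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                        # first file: the TSV
            if (FNR > 1) {                 # skip the TSV header row
                key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $6
                metrics[key] = "METRIC1=" $7 ";METRIC2=" $8
            }
            next
        }
        /^#/ { print; next }               # pass VCF header lines through
        {
            src = $8
            sub(/^SOURCE_RECORD=/, "", src)
            # VCF column order is CHROM POS ID REF ALT, so REF=$4 and ALT=$5
            key = $1 SUBSEP $2 SUBSEP $4 SUBSEP $5 SUBSEP src
            # Array lookup compares whole strings; "|" and "," have no special meaning
            if (key in metrics) $8 = $8 ";" metrics[key]
            print
        }
    ' "$1" "$2"
}

# Hypothetical usage:
#   literal_join metrics.tsv normalised.vcf > annotated.vcf
```

With this kind of matching, two records that share CHROM, POS, REF and ALT but differ in SOURCE_RECORD each receive their own metrics.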
Observed behaviour
SOURCE_RECORD values have the format CHROM|POS|REF|ALT|USED_ALT_INDEX. When this field is used as an additional key, bcftools annotate does not appear to treat it as a strict literal string. Instead, it behaves as though the pipe characters are interpreted as OR separators. As a result:
- Records are matched if any substring between pipes matches
- The first duplicate variant receives the correct annotations
- Subsequent duplicates (same CHROM, POS, REF, ALT but different SOURCE_RECORD) incorrectly receive the annotations from the first occurrence, leading to incorrect assignment of metrics.
Example VCF:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr5 123456 . C CT 35.2 . SOURCE_RECORD=chr5|123456|C|A,CT,G|2
chr5 123456 . C CT 21.6 . SOURCE_RECORD=chr5|123457|TA|AA,CA,TGA|3
Example TSV:
CHR POS REF ALT VARIANT SOURCE_RECORD METRIC1 METRIC2
chr1 123660895 C CT chr1:123660895-C/CT chr5|123456|C|A,CT,G|2 1.0 700
chr1 123660895 C CT chr1:123660895-C/CT chr5|123457|TA|AA,CA,TGA|3 0.0 70
Annotation command:
bcftools annotate ${vcf} \
--annotations ${tsv} \
--columns CHROM,POS,REF,ALT,-,SOURCE_RECORD,METRIC1,METRIC2 \
--include 'SOURCE_RECORD={SOURCE_RECORD}' \
--keep-sites \
--header-lines ${header}
Resulting VCF:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr5 123456 . C CT 35.2 . SOURCE_RECORD=chr5|123456|C|A,CT,G|2;METRIC1=1.0;METRIC2=700
chr5 123456 . C CT 21.6 . SOURCE_RECORD=chr5|123457|TA|AA,CA,TGA|3;METRIC1=1.0;METRIC2=700
Note the second variant is assigned the wrong metrics.
If I remove the pipe characters in SOURCE_RECORD, the annotation behaves as expected and matching is correct, suggesting the issue is specifically related to how bcftools annotate interprets pipe characters in INFO fields used as matching keys.
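The workaround of getting rid of the pipe characters could be scripted roughly as follows, applied to both the TSV and the VCF before running bcftools annotate. This is a sketch under assumptions: the function names replace_pipes_tsv and replace_pipes_vcf and the file names in the usage comment are hypothetical, SOURCE_RECORD is assumed to sit in TSV column 6, and the VCF is assumed to be uncompressed. Only removal of the pipes has been confirmed to restore correct matching; substituting another character (here "_" by default) rather than deleting them is an untested assumption, chosen so that the sub-fields stay separated.

```shell
# Sketch only: replace "|" inside SOURCE_RECORD so the matching key contains no pipes.
# The replacement defaults to "_" (an arbitrary assumption); avoid "&" and "\",
# which awk treats specially in gsub replacements.
replace_pipes_tsv() {
    # $1 = TSV file, $2 = optional replacement string; column 6 holds SOURCE_RECORD
    awk -F'\t' -v OFS='\t' -v rep="${2:-_}" \
        'FNR > 1 { gsub(/[|]/, rep, $6) } { print }' "$1"
}

replace_pipes_vcf() {
    # $1 = uncompressed VCF, $2 = optional replacement string
    awk -F'\t' -v OFS='\t' -v rep="${2:-_}" '
        /^#/ { print; next }               # leave header lines untouched
        {
            n = split($8, info, ";")       # INFO entries are ";"-separated
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^SOURCE_RECORD=/) gsub(/[|]/, rep, info[i])
            $8 = info[1]                   # rebuild the INFO column
            for (i = 2; i <= n; i++) $8 = $8 ";" info[i]
            print
        }
    ' "$1"
}

# Hypothetical usage:
#   replace_pipes_tsv metrics.tsv > metrics.nopipe.tsv
#   replace_pipes_vcf input.vcf   > input.nopipe.vcf
```

Other INFO entries and the header are left as they are, so the rewritten files can be passed to the same annotation command.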
Questions
- Is there a way to force bcftools annotate to treat the INFO field as a strict literal string when used as a key?
- Alternatively (or additionally), would it be possible to make the format of --old-rec-tag customisable (for example, allowing the delimiter to be specified by the user)?
Thank you for your time.
Lisa