Bcftools annotate does not treat INFO field containing pipes (|) as literal string when used as additional matching key #2506

@lisam-02

Hi,
I am encountering unexpected behaviour when using bcftools annotate with an additional INFO field as a matching key.

Background

I decomposed and normalised a multi-allelic VCF using bcftools norm with the --old-rec-tag option, specifying the tag name SOURCE_RECORD. After normalisation, some variants become identical in terms of CHROM, POS, REF and ALT. However, they remain distinguishable by SOURCE_RECORD because they originate from different source records.

I then generated a TSV file of per-variant metrics from the normalised VCF. I want to annotate the original VCF with these metrics using bcftools annotate. Because there are duplicate CHROM, POS, REF, ALT entries after normalisation, I followed the approach suggested in issue #2151, where an additional INFO field can be used as a matching key to disambiguate records.

Expected behaviour

When including SOURCE_RECORD as an additional key, I expect bcftools annotate to:

  • Match records using CHROM, POS, REF, ALT and SOURCE_RECORD
  • Treat SOURCE_RECORD as a literal string
  • Only annotate records where the full SOURCE_RECORD value matches exactly

Observed behaviour

SOURCE_RECORD values have the format CHROM|POS|REF|ALT|USED_ALT_INDEX. When this field is used as an additional key, bcftools annotate does not appear to treat it as a strict literal string. Instead, it behaves as though the pipe characters are interpreted as OR separators. As a result:

  • Records are matched if any substring between pipes matches
  • The first duplicate variant receives the correct annotations
  • Subsequent duplicates (same CHROM, POS, REF, ALT but different SOURCE_RECORD) incorrectly receive the annotations from the first occurrence, leading to incorrect assignment of metrics.
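
To make the suspected difference concrete, here is a minimal shell sketch of the two comparison modes. It is only a model of the behaviour described above, not bcftools code: the function names `literal_match` and `or_match` are hypothetical, and the OR-style rule (any pipe-separated token of one value equals any token of the other) is an assumption inferred from the observed output.

```shell
# Literal comparison: the two full strings must be identical.
literal_match() {
    [ "$1" = "$2" ]
}

# Assumed OR-style comparison (a model of the suspected behaviour, not bcftools code):
# succeeds if any pipe-separated token of $1 equals any pipe-separated token of $2.
or_match() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, ta, "|")
        nb = split(b, tb, "|")
        for (i = 1; i <= na; i++)
            for (j = 1; j <= nb; j++)
                if ((ta[i] "") == (tb[j] ""))   # force string comparison
                    exit 0
        exit 1
    }'
}

first='chr5|123456|C|A,CT,G|2'
second='chr5|123457|TA|AA,CA,TGA|3'

literal_match "$first" "$second" && echo "literal: match" || echo "literal: no match"
or_match "$first" "$second" && echo "or-style: match" || echo "or-style: no match"
```

Under this model the two example values differ as whole strings but share the token chr5, which would explain why the second record picks up the annotations of the first.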

Example VCF:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr5    123456       .       C       CT      35.2    .       SOURCE_RECORD=chr5|123456|C|A,CT,G|2
chr5    123456       .       C       CT      21.6    .       SOURCE_RECORD=chr5|123457|TA|AA,CA,TGA|3

Example TSV:

CHR     POS     REF     ALT     VARIANT SOURCE_RECORD   METRIC1        METRIC2
chr1    123660895       C       CT      chr1:123660895-C/CT     chr5|123456|C|A,CT,G|2      1.0     700
chr1    123660895       C       CT      chr1:123660895-C/CT     chr5|123457|TA|AA,CA,TGA|3  0.0     70

Annotation command:

bcftools annotate ${vcf} \
    --annotations ${tsv} \
    --columns CHROM,POS,REF,ALT,-,SOURCE_RECORD,METRIC1,METRIC2 \
    --include 'SOURCE_RECORD={SOURCE_RECORD}' \
    --keep-sites \
    --header-lines ${header}

Resulting VCF:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr5    123456       .       C       CT      35.2    .       SOURCE_RECORD=chr5|123456|C|A,CT,G|2;METRIC1=1.0;METRIC2=700
chr5    123456       .       C       CT      21.6    .       SOURCE_RECORD=chr5|123457|TA|AA,CA,TGA|3;METRIC1=1.0;METRIC2=700

Note the second variant is assigned the wrong metrics.

If I remove the pipe characters in SOURCE_RECORD, the annotation behaves as expected and matching is correct, suggesting the issue is specifically related to how bcftools annotate interprets pipe characters in INFO fields used as matching keys.
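
For reference, the pipe-removal workaround can be sketched roughly as follows. This is a sketch under assumptions, not a recommended fix: the helper names `sanitise_tsv_column` and `sanitise_vcf_tag` are hypothetical, the replacement character `_` is arbitrary and must not otherwise occur in the values (some contig names contain underscores), and the TSV is assumed to have a single header line.

```shell
# Hypothetical helper: replace '|' with '_' in one column of a TSV read from stdin.
# Assumes the first line is a header and leaves it untouched.
sanitise_tsv_column() {
    col="$1"
    awk -F'\t' -v OFS='\t' -v c="$col" 'NR > 1 { gsub(/\|/, "_", $c) } { print }'
}

# Hypothetical helper: replace '|' with '_' only inside the SOURCE_RECORD value
# of the INFO column (column 8) of VCF lines read from stdin.
# Header lines and all other INFO tags are passed through unchanged.
sanitise_vcf_tag() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { print; next }
        {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SOURCE_RECORD=/) gsub(/\|/, "_", kv[i])
            info = kv[1]
            for (i = 2; i <= n; i++) info = info ";" kv[i]
            $8 = info
            print
        }'
}
```

With the TSV layout shown above, SOURCE_RECORD is column 6, so usage would look like `sanitise_tsv_column 6 < "${tsv}" > sanitised.tsv` and `bcftools view "${vcf}" | sanitise_vcf_tag > sanitised.vcf`, where the output file names are placeholders. Both files need the same substitution so that the keys still agree.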

Questions

  • Is there a way to force bcftools annotate to treat the INFO field as a strict literal string when used as a key?
  • Alternatively (or additionally), would it be possible to make the format of --old-rec-tag customisable (for example, allowing the delimiter to be specified by the user)?

Thank you for your time.

Lisa
