Bug genotyping low coverage loci in genotyper_cluster.rs #58

Open
VCMason opened this issue Feb 20, 2025 · 3 comments

@VCMason

VCMason commented Feb 20, 2025

Here is a locus with read depth of 4. The locus has an unbalanced distribution of reads: three from parent1 and one from parent2. The one read from parent2 is exactly like the reference, while the three reads from parent1 are identical to one another but differ from the reference in both length and sequence.

[image attachment]

The true genotype should be 0/1 (reference/non-referenceTRUE), but the cluster genotyper (v1.4.1) gives 1/2 (non-referenceTRUE/non-referenceFAKE). What ended up happening is that the genotyper put the reference sequence in the same group as one of the very different sequences (which has the 158-base deletion and SNPs), then called a consensus. This created a fake allele that doesn't resemble either parent (non-referenceFAKE). The fake allele included the deletions and some (but not all) of the SNPs, so it looked nothing like the reference sequence, but also not the same as the 3 other reads from parent1.
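To illustrate this failure mode, here is a toy sketch (not TRGT code) of how a per-base consensus over two very different reads can produce a chimeric allele that matches neither input. The sequences and the tie-breaking rule are invented purely for illustration; the real locus additionally involves the 158-base deletion, which is omitted here for simplicity.

```rust
// Toy per-base consensus of exactly two reads. With only two sequences there
// is no majority at mismatching positions, so any tie-breaking rule ends up
// mixing bases from both reads; here ties alternate between the two inputs.
fn consensus(a: &str, b: &str) -> String {
    a.chars()
        .zip(b.chars())
        .enumerate()
        .map(|(i, (x, y))| if x == y { x } else if i % 2 == 0 { x } else { y })
        .collect()
}

fn main() {
    // Stand-ins for the parent2 read (reference-like) and a parent1 read with SNVs.
    let reference_like = "ACGTACGTACGT";
    let variant_like   = "ACCTAAGTACTT";
    // Prints "ACGTAAGTACGT" -- a hybrid that matches neither input read,
    // analogous to the non-referenceFAKE allele described above.
    println!("{}", consensus(reference_like, variant_like));
}
```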

I looked into the source code (of v1.5.1). In the cluster function of the genotyper_cluster.rs script, clusters of 1 read are never allowed to be considered truthful, and when a cluster of 1 read does exist, the code treats the locus like a homozygous locus (by never allowing a cutoff value > 0.0 to be defined). This results in the 4 reads being sorted into two groups evenly ('even' reads to one group, 'odd' reads to the other). By not allowing these loci to act as heterozygotes, and therefore not trusting the one sequence, hybrid alleles are produced by taking a consensus of the one 'different' read that sits in a cluster by itself (in this case the reference read) and a read with the deletion and SNPs. Note that this also produces false read depth support values in the final VCF, which reported 2,2 when it should have been 1,3. Also note that I can genotype this locus with the size genotyper and it gets the answer correct, because it recognizes the difference in size between the 3 identical reads and the 1 different one.

I changed the code in the cluster function
from (v1.5.1):
if min_size >= min_cluster_size {
to:
if min_size >= min_cluster_size || (num_seqs < 6 && min_size > 0) {

I recompiled and re-genotyped this same locus, I got the correct genotype of 0/1 and the correct read depth values of 1,3.

This change just allows low coverage loci (fewer than 6 reads) to have a cutoff defined and therefore be considered heterozygous when unbalanced parental reads are present (i.e. a cluster can be supported by 1 read, e.g. splits of 1 & 3 or 1 & 4). 6 is certainly not the correct value to use, but it would at least avoid the worst cases, where depths of 4 and 5 with unbalanced parental reads cause a consensus sequence to be created from two reads that can be quite different...
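To make the effect of the changed condition concrete, here is a minimal standalone sketch of just that acceptance check, reusing the variable names from the snippet above (min_size, num_seqs, min_cluster_size). The concrete value of min_cluster_size and everything around the check are assumptions for illustration; this is not TRGT's actual implementation.

```rust
// Sketch of the acceptance check discussed above. `min_size` is the size of the
// smallest cluster at a candidate cut, `num_seqs` is the total read depth, and
// `min_cluster_size` is the existing threshold (assumed to be 2 here; the real
// default is not stated in this thread).
fn accept_cut(min_size: usize, num_seqs: usize, min_cluster_size: usize) -> bool {
    // v1.5.1 behaviour: only `min_size >= min_cluster_size`, so a 3+1 split at
    // depth 4 is rejected. The extra clause trusts a single-read cluster when
    // total depth is below 6.
    min_size >= min_cluster_size || (num_seqs < 6 && min_size > 0)
}

fn main() {
    // Depth-4 locus from the report: 3 reads from parent1, 1 from parent2.
    assert!(accept_cut(1, 4, 2)); // now accepted -> genotype 0/1, depths 1,3
    // A 39+1 split at depth 40 is still rejected, as intended.
    assert!(!accept_cut(1, 40, 2));
    println!("low-depth single-read clusters accepted; high-depth ones still rejected");
}
```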

Why is 1 read trusted by the size genotyper, while 1 read is not allowed to represent a cluster in the cluster genotyper? I understand the logic of iterating backwards through the nodes step by step and saying that the 1 read that differs from all the others could be generated by technical artifacts like sequencing error or poor alignment... but in low depth situations, 1 read truthfully representing a parent will be common and needs to be accounted for.

Unfortunately, at low coverage the probability of having only 1 read represent the allele of a parent is very high (roughly 0.25 at depth 4 and 0.16 at depth 5). Which is why high depth is good... I know, but I don't have high depth across all samples.
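For reference, these figures follow from a simple model in which each read is drawn from either parental haplotype with probability 1/2, so the chance that exactly one of n reads represents a given parent is C(n,1) * (1/2)^n = n / 2^n. The short program below (an illustrative calculation, not part of TRGT) reproduces the figures quoted in this thread, up to rounding.

```rust
// Probability that exactly one of `depth` reads comes from a given parental
// haplotype, assuming each read is drawn from either haplotype with p = 1/2:
// C(depth, 1) * (1/2)^depth = depth / 2^depth.
fn prob_single_read_cluster(depth: u32) -> f64 {
    depth as f64 / 2f64.powi(depth as i32)
}

fn main() {
    // 4 and 5 are the low-depth cases above; 15, 20, 25 are the candidate
    // thresholds listed below; 35 corresponds to the 34+1 "high depth" note.
    for depth in [4u32, 5, 15, 20, 25, 35] {
        let p = prob_single_read_cluster(depth);
        println!("depth {:>2}: p = {:.9} (about 1 in {:.0})", depth, p, 1.0 / p);
    }
}
```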

A couple solutions could be:

  • to flag the 'very different' read as untrustworthy and remove it from the consensus calling of each cluster in homozygous or heterozygous situations. This avoids making a hybrid allele and avoids the 'risk' of trusting 1 sequence to be true.
  • just put a . for the allele if you don't trust the consensus product, or add a metadata flag marking the untrustworthy allele
  • find a value that allows single reads to represent a cluster at 'low coverage' loci and just trust the sequence (as mentioned above)
    • maybe 15? 20? 25? The probabilities of having 1 read represent a cluster and be true are ~1 in 2184, 1 in 52,429, and 1 in 1,342,177 respectively
  • provide an argument that the user could set to what they consider 'low coverage' (replace 6 with this variable)
  • if multiple samples are being genotyped, the other high-depth samples could be checked to see if the sequence of the 1-read cluster is present in the population

Note: I do not want 1 sequence to represent a cluster at 'high' depths, because the probability that 34 reads were sequenced from one parent and only 1 read from the other is extremely low...

I haven't read through all of the code; maybe you have better ideas on how to solve this?

@pbsena
Contributor

pbsena commented Feb 21, 2025

Hello,

The cluster genotyper was designed to predominantly be used for high-depth targeted assays, so we set a minimum number of reads an allele needs to have to identify a genotype. We will adjust the cut-offs for low coverages in the next iteration to accommodate these cases.

May I ask why you used the cluster genotyper for this sample, as opposed to the size genotyper?

@VCMason
Author

VCMason commented Feb 24, 2025

Hello,

Great! I'm looking forward to the changes.

I am looking for sequence/motif-based changes, in addition to length-based changes, in VNTRs in a population. The cluster genotyper seems like a good choice for sorting reads by both size and sequence, although it is slower.

Is it better to rely on the size genotyper when the lengths of the sequences differ? In the paper there was a bit of a problem with 'off-by-one' genotypes with regard to Mendelian accuracy; any idea whether the cluster genotyper handles these situations better, worse, or the same?

I found cases like the example above by comparing the resulting genotypes between size and cluster.

Another note about the example above: I had another sample with slightly higher depth and the same parental alleles as the example, and the cluster genotyper (v1.4.1) genotyped that locus correctly (as did the size genotyper). So it really does seem to be just a depth issue.

@egor-dolzhenko
Collaborator

It sounds like you are working on a very interesting project! May I ask which VNTR definitions you are using? And how are you planning to characterize sequence-level changes in VNTR regions? We are working on a new method (that will run on the output of TRGT) that might be useful for this type of analysis. If you'd like to try it out, please reach out to us by email.

Best wishes,
Egor
