Bug genotyping low coverage loci in genotyper_cluster.rs #58

Open
VCMason opened this issue Feb 20, 2025 · 3 comments

@VCMason

VCMason commented Feb 20, 2025

Here is a locus with read depth of 4. The locus has an unbalanced distribution of reads: three from parent1 and one from parent2. The one read from parent2 is exactly like the reference, while the three reads from parent1 are identical to one another but differ from the reference in both length and sequence.

[image attachment]

The true genotype should be 0/1 (reference/non-referenceTRUE), but the cluster genotyper (v1.4.1) gives 1/2 (non-referenceTRUE/non-referenceFAKE). What ended up happening is that the genotyper put the reference sequence in the same group as one of the very different sequences (which has the 158-base deletion and SNPs), then called a consensus. This created a fake allele that doesn't resemble either parent (non-referenceFAKE). The fake allele included the deletions and some (but not all) of the SNPs, so it looked nothing like the reference sequence, but also not the same as the 3 other reads from parent1.
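To illustrate this failure mode, here is a toy sketch (not TRGT code) of how a per-base consensus over two very different reads can produce a chimeric allele that matches neither input. The sequences and the tie-breaking rule are invented purely for illustration; the real locus additionally involves the 158-base deletion, which is omitted here for simplicity.

```rust
// Toy per-base consensus of exactly two reads. With only two sequences there
// is no majority at mismatching positions, so any tie-breaking rule ends up
// mixing bases from both reads; here ties alternate between the two inputs.
fn consensus(a: &str, b: &str) -> String {
    a.chars()
        .zip(b.chars())
        .enumerate()
        .map(|(i, (x, y))| if x == y { x } else if i % 2 == 0 { x } else { y })
        .collect()
}

fn main() {
    // Stand-ins for the parent2 read (reference-like) and a parent1 read with SNVs.
    let reference_like = "ACGTACGTACGT";
    let variant_like   = "ACCTAAGTACTT";
    // Prints "ACGTAAGTACGT" -- a hybrid that matches neither input read,
    // analogous to the non-referenceFAKE allele described above.
    println!("{}", consensus(reference_like, variant_like));
}
```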

I looked into the source code (of v1.5.1). In the cluster function of the genotyper_cluster.rs script, clusters of 1 read are never allowed to be considered truthful, and when a cluster of 1 read does exist, the code treats the locus like a homozygous locus (by never allowing a cutoff value > 0.0 to be defined). This results in the 4 reads being sorted into two groups evenly ('even' reads to one group, 'odd' reads to the other). By not allowing these loci to act as heterozygotes, and therefore not trusting the one sequence, hybrid alleles are produced by taking a consensus of the one 'different' read that sits in a cluster by itself (in this case the reference read) and a read with the deletion and SNPs. Note that this also produces false read depth support values in the final VCF, which reported 2,2 when it should have been 1,3. Also note that I can genotype this locus with the size genotyper and it gets the answer correct, because it recognizes the difference in size between the 3 identical reads and the 1 different one.

I changed the code in the cluster function
from (v1.5.1):
if min_size >= min_cluster_size {
to:
if min_size >= min_cluster_size || (num_seqs < 6 && min_size > 0) {

I recompiled and re-genotyped this same locus, I got the correct genotype of 0/1 and the correct read depth values of 1,3.

This change just allows low coverage loci (fewer than 6 reads) to have a cutoff defined and therefore be considered heterozygous when unbalanced parental reads are present (i.e. a cluster can be supported by 1 read, e.g. splits of 1 & 3 or 1 & 4). 6 is certainly not the correct value to use, but it would at least avoid the worst cases, where depths of 4 and 5 with unbalanced parental reads cause a consensus sequence to be created from two reads that can be quite different...
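To make the effect of the changed condition concrete, here is a minimal standalone sketch of just that acceptance check, reusing the variable names from the snippet above (min_size, num_seqs, min_cluster_size). The concrete value of min_cluster_size and everything around the check are assumptions for illustration; this is not TRGT's actual implementation.

```rust
// Sketch of the acceptance check discussed above. `min_size` is the size of the
// smallest cluster at a candidate cut, `num_seqs` is the total read depth, and
// `min_cluster_size` is the existing threshold (assumed to be 2 here; the real
// default is not stated in this thread).
fn accept_cut(min_size: usize, num_seqs: usize, min_cluster_size: usize) -> bool {
    // v1.5.1 behaviour: only `min_size >= min_cluster_size`, so a 3+1 split at
    // depth 4 is rejected. The extra clause trusts a single-read cluster when
    // total depth is below 6.
    min_size >= min_cluster_size || (num_seqs < 6 && min_size > 0)
}

fn main() {
    // Depth-4 locus from the report: 3 reads from parent1, 1 from parent2.
    assert!(accept_cut(1, 4, 2)); // now accepted -> genotype 0/1, depths 1,3
    // A 39+1 split at depth 40 is still rejected, as intended.
    assert!(!accept_cut(1, 40, 2));
    println!("low-depth single-read clusters accepted; high-depth ones still rejected");
}
```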

Why is 1 read trusted by the size genotyper, while 1 read is not allowed to represent a cluster in the cluster genotyper? I understand the logic of iterating backwards through the nodes step by step and saying that the 1 read that differs from all the others could be generated by technical artifacts like sequencing error or poor alignment... but in low depth situations, 1 read truthfully representing a parent will be common and needs to be accounted for.

Unfortunately, at low coverage the probability of having only 1 read represent the allele of a parent is very high (roughly 0.25 at depth 4 and 0.16 at depth 5). Which is why high depth is good... I know, but I don't have high depth across all samples.
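For reference, these figures follow from a simple model in which each read is drawn from either parental haplotype with probability 1/2, so the chance that exactly one of n reads represents a given parent is C(n,1) * (1/2)^n = n / 2^n. The short program below (an illustrative calculation, not part of TRGT) reproduces the figures quoted in this thread, up to rounding.

```rust
// Probability that exactly one of `depth` reads comes from a given parental
// haplotype, assuming each read is drawn from either haplotype with p = 1/2:
// C(depth, 1) * (1/2)^depth = depth / 2^depth.
fn prob_single_read_cluster(depth: u32) -> f64 {
    depth as f64 / 2f64.powi(depth as i32)
}

fn main() {
    // 4 and 5 are the low-depth cases above; 15, 20, 25 are the candidate
    // thresholds listed below; 35 corresponds to the 34+1 "high depth" note.
    for depth in [4u32, 5, 15, 20, 25, 35] {
        let p = prob_single_read_cluster(depth);
        println!("depth {:>2}: p = {:.9} (about 1 in {:.0})", depth, p, 1.0 / p);
    }
}
```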

A couple solutions could be:

  • to flag the 'very different' read as untrustworthy and remove it from the consensus calling of each cluster in homozygous or heterozygous situations. This avoids making a hybrid allele and avoids the 'risk' of trusting 1 sequence to be true.
  • just put a . for the allele if you don't trust the consensus product, or add a metadata flag marking the untrustworthy allele
  • find a value that allows single reads to represent a cluster at 'low coverage' loci and just trust the sequence (as mentioned above)
    • maybe 15? 20? 25? The probabilities of having 1 read represent a cluster and be true are ~1 in 2184, 1 in 52,429, and 1 in 1,342,177 respectively
  • provide an argument that the user could set to what they consider 'low coverage' (replace 6 with this variable)
  • if multiple samples are being genotyped, the other high-depth samples could be checked to see if the sequence of the 1-read cluster is present in the population

Note: I do not want 1 sequence to represent a cluster at 'high' depths, because the probability that 34 reads were sequenced from one parent and only 1 read from the other is extremely low...

I haven't read through all of the code; maybe you have better ideas on how to solve this?

@pbsena
Contributor

pbsena commented Feb 21, 2025

Hello,

The cluster genotyper was designed to predominantly be used for high-depth targeted assays, so we set a minimum number of reads an allele needs to have to identify a genotype. We will adjust the cut-offs for low coverages in the next iteration to accommodate these cases.

May I ask why you used the cluster genotyper for this sample, as opposed to the size genotyper?

@VCMason
Author

VCMason commented Feb 24, 2025

Hello,

Great! I'm looking forward to the changes.

I am looking for sequence/motif-based changes, in addition to length-based changes, in VNTRs in a population. The cluster genotyper seems like a good choice for sorting reads by both size and sequence, although it is slower.

Is it better to rely on the size genotyper when the lengths of the sequences differ? In the paper there was a bit of a problem with 'off-by-one' genotypes with regard to Mendelian accuracy; any idea whether the cluster genotyper handles these situations better, worse, or the same?

I found cases like the example above by comparing the resulting genotypes between size and cluster.

Another note about the example above: I had another sample with slightly higher depth and the same parental alleles as the example, and the cluster genotyper (v1.4.1) genotyped that locus correctly (as did the size genotyper). So it really does seem to be just a depth issue.

@egor-dolzhenko
Collaborator

It sounds like you are working on a very interesting project! May I ask which VNTR definitions you are using? And how are you planning to characterize sequence-level changes in VNTR regions? We are working on a new method (that will run on the output of TRGT) that might be useful for this type of analysis. If you'd like to try it out, please reach out to us by email.

Best wishes,
Egor
