-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug genotyping low coverage loci in genotyper_cluster.rs #58
Comments
Hello, The cluster genotyper was designed to predominantly be used for high-depth targeted assays, so we set a minimum number of reads an allele needs to have to identify a genotype. We will adjust the cut-offs for low coverages in the next iteration to accommodate these cases. May I ask why use the cluster genotyper in this sample as opposed to size? |
Hello, Great! I'm looking forward to the changes. I am looking for sequence/motif based changes in addition to length based changes in VNTRs in a population. The cluster genotyper seems like a good choice to sort reads by size and sequence, although slower. Is it better to rely on the size genotyper when the length of sequence is different? In the paper, there was a bit of a problem with 'off-by-one' genotypes in regards to mendelian accuracy, any idea if the cluster genotyper handles these situations better/worse/the same? I found cases like the example above by comparing the resulting genotypes between size and cluster. Another note about the example above. I had another sample with slightly higher depth with the same parental alleles (as the example) and the cluster genotyper (v1.4.1) genotyped the locus correctly (like the size genotyper). So it really seems to be just the depth issue. |
It sounds like you are working on a very interesting project! May I ask which VNTR definitions you are using? And how are you Best wishes, |
Here is a locus with read depth of 4. The locus has an unbalanced distribution of reads: three from parent1 and one from parent2. The one read from parent2 is exactly like the reference, while the 3 others from parent1 are exactly the same, but different from the reference in length and sequence.
The true genotype should be 0/1 (reference/non-referenceTRUE), but the cluster genotyper(v1.4.1) gives 1/2 (non-referenceTRUE/non-referenceFAKE). What ended up happening is the genotyper put the reference sequence in the same group as one of the very different sequences (which has the 158 base deletion and SNPs), then called a consensus. This created a fake allele that doesn't resemble either parent (non-referenceFAKE). The fake allele included the deletions and some (but not all SNPs) so the fake read looked nothing like the reference sequence, but also not the same as the 3 other reads from parent1.
I looked into the source code (of v1.5.1). I found in the cluster function of the genotyper_cluster.rs script that clusters of 1 read are never allowed to be considered truthful, and when a cluster of 1 read does exist, the code treats the locus like a homozygous locus (by never allowing a cutoff value to be defined > 0.0). This results into the 4 reads being sorted into two groups evenly ('even' reads to one group, and 'odd' reads to the other group). By not allowing these loci to act as heterozygotes and therefore not trusting the one sequence, this causes hybrid alleles to be produced by making a consensus of the one 'different' read in a cluster by itself (in this case the reference read) and the read with the deletion and SNPs. Note that this results in false read depth support values in the final VCF which reported 2,2 when it should have been 1/3. Also note that i can genotype this locus with the size genotyper and it gets the answer correct because it can recognize the difference in size between the 3 reads and the 1 different one.
I changed the code in the cluster function
from (v1.5.1):
if min_size >= min_cluster_size {
to:
if min_size >= min_cluster_size || (num_seqs < 6 && min_size > 0) {
I recompiled and re-genotyped this same locus, I got the correct genotype of 0/1 and the correct read depth values of 1,3.
This change just allows low coverage loci < 6 to have a cutoff defined and therefore be considered heterozygous when unbalanced parental reads are present (this allows a cluster to be supported by 1 read (ex: 1 & 3, 1 & 4)). 6 is certainly not the correct value to be used, but it would at least avoid the worst cases where in read depths of 4 and 5 with unbalanced parental reads causes a consensus sequence to be created from two reads that can be quite different...
Why is 1 read trusted by the size genotyper while 1 read not allowed to represent a cluster in the cluster genotyper? I understand the logic of iterating backwards through the nodes step by step and saying that the 1 read being different from all others could be generated from technical artifacts like sequencing error or poor alignment... but in low depth situations 1 read truthfully representing a parent will be common and needs to be accounted for.
Unfortunately at low coverage the probability of having 1 read represent the allele of a parent is very high (depth 4 = prob 0.25, 5, . Which is why high depth is good... i know but i don't have high depth across all samples.
A couple solutions could be:
Note: i do not want 1 sequence to represent a cluster at 'high' depths because the probability that 34 reads sequenced from one parent and only 1 read was sequenced from the other is extremely low...
I haven't read through all the code, maybe you have better ideas on how to solve this?
The text was updated successfully, but these errors were encountered: