Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variant Filtering for input #57

Closed
jennyp76 opened this issue Jan 18, 2025 · 6 comments
Closed

Variant Filtering for input #57

jennyp76 opened this issue Jan 18, 2025 · 6 comments

Comments

@jennyp76
Copy link

Thank you for providing such an amazing tool!

I am currently working with HiFi reads and using DeepVariant to call germline variants. From one of your previous issues, I learned that Hiphase does not consider the "FILTER" column in the VCF file but instead uses the GQ value for filtering variants.

However, I have a question regarding the use of RefCall in phasing. If RefCall variants are included in the phasing process, doesn’t that mean Hiphase is incorporating non-PASS germline variants for phasing?

From my understanding, shouldn't Hiphase prioritize variants that are confidently classified as germline variants and have a considerable GQ value? May I know reasons for considering only GQ value? For instance, some RefCall variants have very high GQ values, whereas some PASS variants have low GQ values, such as 5.

Wouldn’t this variability potentially introduce confusion or reduce the accuracy of phasing in Hiphase?
Thank you for your time and support!

Sincerely
Jen

@holtjma
Copy link
Collaborator

holtjma commented Jan 18, 2025

Hi Jen,

I think there's two main parts to your question, one caused by a misconception and one a philosophical one.

First, the misconception: My understanding of the RefCall filter is that it specifically indicates a VCF entry that is not a variant from DeepVariant output. To quote the docs, "RefCalls are candidates that were determined to match the reference and are therefore not variants, although they are included in the VCF file (see FILTER column)." HiPhase will correctly ignore these non-variant VCF entries for phasing, and simply pass them through to the output VCF.

Second, the philosophical question which I will simplify to, "Why didn't you use these other filter criteria XYZ?" In our experience, no single set of filters will work perfectly in all scenarios. Additionally, users (such as yourself) each have different preferences for how filters should be applied, usually based on the problem they're trying to solve (which in some instances, can be very niche).

We selected filters that:

  1. Broadly work across the whole genome based on established benchmark datasets (see our paper for accuracy results), and
  2. Are easily disabled in our CLI for users who are willing to risk more false positives

If you find that the HiPhase defaults are not working well enough for your use case, then I would encourage you to do a parameter sweep on filters you wish to apply. bcftools is useful for applying any pre-filters you might desire.

Happy to answer any follow-ups,
Matt

@jennyp76
Copy link
Author

Thank you for the prompt response!
I completely understand everything you explained.

Could you share some of the criteria you use to evaluate whether HiPhase is working well?
I’m new to the concept of phasing and would like to better understand how progress is measured.

Thank you
Jen

@holtjma
Copy link
Collaborator

holtjma commented Jan 21, 2025

I would recommend reading through the HiPhase paper to understand the metrics we use to measure accuracy and performance. I can answer any specific follow-ups, but that's the best place to start. The WhatsHap paper will also be useful if you want to go back a bit more.

EDIT: I forgot we also have a tiny primer in our performance file on GitHub.

Matt

@jennyp76
Copy link
Author

jennyp76 commented Jan 23, 2025

Thank you!
I’ll make sure to read through the HiPhase paper.
I have couple of question I would really appreciate if you could help me.

  1. I want to calculate the percentage of reads that were able to be assign to its haplotype.
    For the example I only confined it to chromsome 11.
    When I run samtools view -c 231222.pbmm2.bam chr11 , I get 27,238 reads and less 231222.haplotags | grep 'chr11' | wc -l gives me 22,906 reads. Can I (22906/27238)*100 = 84% of reads were phased?
    Then, what is it different from "num_reads" - "skipped reads" from stats.csv?

  2. Would there be a method to calculate the percentage of phased regions. I noticed several unphased region in the IGV of the haplotagged.bam where not a single read was phased. How can such a large region be unable to be phased????

Image

Thanks!
Sincerely
Jen

@holtjma
Copy link
Collaborator

holtjma commented Jan 23, 2025

  1. This question is more complicated than it seems. HiPhase assigns phase tags to mappings of reads A single read can have multiple mappings within a BAM file. If you consider mappings (which may double-count reads), then you can do some relatively simple BAM filtering based on the "HP" or "PS" tags. If you want to actually look at reads, then you'll need to somehow count up the number of unique read IDs and then filter out any that are assigned to at least one haplotype across all phase blocks. It's a bit complicated, and I do not have a script set up to do this.
  2. Phasing is the process of assigning heterozygous variants to a haplotype. If there are no heterozygous variants, then there is nothing to phase. There are several biological reasons this might happen, but the most common are: 1) chrX in an XY or XO individual, 2) loss of heterozygosity (LOH) due to inheritance (happens frequently in isolated communities), 3) large scale deletions or aneuploidies.

Matt

@holtjma
Copy link
Collaborator

holtjma commented Feb 3, 2025

Closing due to inactivity, feel free to re-open if there are follow up questions.

@holtjma holtjma closed this as completed Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants