Skip to content

feat: add support for over-writing genotypes for males on x non par #1030

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Feb 18, 2025

Conversation

bpblanken
Copy link
Collaborator

@bpblanken bpblanken commented Feb 5, 2025

Implementation of:

Screenshot 2025-02-05 at 5 28 50 PM

@bpblanken bpblanken changed the title Add support for over-writing genotypes for males on x non par feat: add support for over-writing genotypes for males on x non par Feb 5, 2025
@bpblanken bpblanken marked this pull request as ready for review February 6, 2025 19:45
@bpblanken bpblanken requested a review from a team as a code owner February 6, 2025 19:45
@@ -24,5 +27,6 @@ class FeatureFlag:
EXPECT_TDR_METRICS: bool = EXPECT_TDR_METRICS
EXPECT_WES_FILTERS: bool = EXPECT_WES_FILTERS
INCLUDE_PIPELINE_VERSION_IN_PREFIX: bool = INCLUDE_PIPELINE_VERSION_IN_PREFIX
OVERWRITE_SV_MALE_NON_PAR_CALLS: bool = OVERWRITE_SV_MALE_NON_PAR_CALLS
Copy link
Collaborator Author

@bpblanken bpblanken Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of this logic is it will only apply for internal SV callsets. From notes:

 cohort specific of the GREGoR project

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we wouldn't apply this to all SVs in seqr, presumably we are doing this because of desired seqr search behavior. I don't think its rreasonable for us to write search queries that are able to handle branching logic for determining what a genotype for a sample is based on whether its internal or external data

Copy link
Contributor

@matren395 matren395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👁️ 👁️

Comment on lines 36 to 50
mt = mt.annotate_entries(
GT=hl.if_else(
(
male_sample_ids.contains(mt.s)
& non_par_interval.overlaps(
hl.interval(
mt.start_locus,
mt.end_locus,
),
)
),
hl.Call([1, 1], phased=False),
mt.GT,
),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would this handle SVs overlapping the start of a non-PAR regions, and do we want to be handling them this way ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and what do the GTs look like elsewise? are they left haploid ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe overlaps handles cases where the end of the SV is > the start of the non-Par region.

@matren395
Copy link
Contributor

I guess the broader question is what is that code ur implementing in the first place

)
& mt.GT.is_het()
),
hl.Call([1], phased=False),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tweaked this to 1) only change het diploid calls and 2) replace with a haploid call.

@@ -121,6 +122,10 @@ def create_table(self) -> hl.MatrixTable:
if self.dataset_type.has_multi_allelic_variants:
# NB: throws SeqrValidationError
mt = split_multi_hts(mt, self.skip_validation)

if self.dataset_type.overwrite_male_non_par_calls:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer feature flagged and will run for all SVs.

@bpblanken bpblanken requested a review from hanars February 18, 2025 19:15
GT=hl.if_else(
(
male_sample_ids.contains(mt.s)
& non_par_interval.overlaps(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct - if a single SV call overlaps both non-par and par regions I think we want to keep the diploid call because its real for the par region. I think we only want to change it if the SV is 100% overlapped by non-par region

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm, I think the current behavior is to "obtain those calls overlapping non-PAR regions and update the GT to 1/1 in males."

I think your logic to keep as diploid makes sense though, if we're ok with the behavior change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not like super confident in my knowledge of this data/biology, so it might be better to maintain parity

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be worth asking lynn + alba on this. I think we can get a fast answer.

@bpblanken bpblanken merged commit 4ea8740 into main Feb 18, 2025
3 checks passed
@bpblanken bpblanken deleted the benb/x_males_non_par branch February 18, 2025 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants