Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analysis: re-run civic/moa variant analysis + add zip for moa features to reuse #25

Merged
merged 4 commits into from
Aug 21, 2023

Conversation

korikuzma
Copy link
Member

There are some VRS ID changes. Here's an issue I made for gnomad vcf queries in the variation normalizer. I'm going to keep this as a draft PR for now to investigate why the normalization rate went down

@korikuzma korikuzma requested a review from wesleygoar August 10, 2023 14:56
@korikuzma korikuzma self-assigned this Aug 10, 2023
@korikuzma
Copy link
Member Author

@wesleygoar Some reasons why the normalization rate dropped is because our regex is case sensitive in the refactor. Here are some examples from CIViC that fail due to case sensitivity:

E588_Y589INSKYFYVDFRE
M552_W557DEL

We added case sensitivity to make it clear what the variant passed. For example, if E588_Y589DELINSKYFYVDFRE was passed would this be a deletion (with invalid deleted sequence at the end) or a delins?

@korikuzma
Copy link
Member Author

korikuzma commented Aug 10, 2023

@wesleygoar we dropped in both civic and moa but not by much (see analysis files for percent changes), so let me know if you want me to merge these changes or if you'd like us to come up with solutions in the normalizer refactor.

Copy link
Collaborator

@wesleygoar wesleygoar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@korikuzma The VRS ID changes were due to the variant being valid in both builds?

@korikuzma
Copy link
Member Author

@wesleygoar Yup

@wesleygoar
Copy link
Collaborator

Yeah we should fix that.

@korikuzma
Copy link
Member Author

I know, we discussed me solving the linked issue before via slack

@wesleygoar
Copy link
Collaborator

I know. ; ) Just doing my part to keep provenance via github like you wanted.

@ahwagner
Copy link
Member

ahwagner commented Aug 15, 2023

I think we can resolve this without case delineation. For any given expression of this form, there can only be:
1 valid interpretation
0 valid interpretation

For example, E588_Y589DELINSKYFYVDFRE is not a valid deletion (as Kori pointed out) but could be a valid delins, and therefore we would resolve as DELINS.

If a string has no valid interpretation, then it is simply invalid and we move on.

@korikuzma
Copy link
Member Author

@ahwagner What if we had a case like I588_I591DELINSI. Where I does exist at pos 588 and I does exist at pos 591. This could be a deletion of 588-591 where the sequence is INSI. Or it could be a delins where 588-591 is deleted and I is inserted. Would we want to use delins or deletion?

@ahwagner
Copy link
Member

Yeah. Wes and I were discussing this down here while you typed up this example. This is part of the narrow class of n-3 length expressions where we have this challenge. I would say we just attempt to interpret both ways, and if it is ambiguous we say we can't normalize due to ambiguity.

@korikuzma
Copy link
Member Author

Yeah we should fix that.

@wesleygoar PR is made here.

@korikuzma
Copy link
Member Author

@wesleygoar re-ran with input_assembly parameter for civic.

@wesleygoar
Copy link
Collaborator

@korikuzma thanks. Ill take a gander at the data.

@korikuzma korikuzma marked this pull request as ready for review August 21, 2023 16:16
@korikuzma korikuzma merged commit 9fa9e8f into main Aug 21, 2023
@korikuzma korikuzma deleted the refactor-run branch August 21, 2023 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants