make gene name sanitation optional #1084
base: main
Conversation
src/utils/set_var_index.py
Outdated
# then an eleven digit number, optionally followed by .version_number
ensembl_pattern = re.compile(r"^(ENS.*\d{11})(?:\.\d+)?$")

return [
Would it be possible to use the index as input here and use index.to_series().str.startswith()?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.startswith.html
Can we rule out that any gene names start with ENS?
I think you can invert the boolean mask returned by startswith using ~
I meant https://pandas.pydata.org/docs/reference/api/pandas.Series.str.match.html, sorry about that ^^
All good, it's adjusted now!
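For readers following the thread, here is a minimal sketch of the suggestion, assuming the regex quoted in the diff and a small made-up index. The variable names `index`, `is_ensembl`, and `not_ensembl` are illustrative only and are not taken from the PR.

```python
import pandas as pd

# Regex quoted from the diff above: "ENS", any characters, an eleven digit
# number, optionally followed by ".version_number".
ENSEMBL_REGEX = r"^(ENS.*\d{11})(?:\.\d+)?$"

# Hypothetical example index (not data from the PR).
index = pd.Index(["ENSMUSG00000017167.6", "ENSG00000139618", "AL627309.1"])

# Series.str.match returns a boolean mask: True where the regex matches.
is_ensembl = index.to_series().str.match(ENSEMBL_REGEX)

# The mask can be inverted with ~ to select everything that is not an Ensembl ID.
not_ensembl = ~is_ensembl

print(is_ensembl.tolist())   # [True, True, False]
print(not_ensembl.tolist())  # [False, False, True]
```

Compared with `str.startswith("ENS")`, matching the full pattern also requires the eleven digit number, so a gene name that merely begins with ENS would not be flagged.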
Changelog
Make gene name sanitation optional:
Recommended for removing version suffixes from Ensembl IDs (e.g. ENSMUSG00000017167.6), but not for gene names with splice variants (e.g. AL627309.1)
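As a rough illustration of what an optional sanitation step could look like, here is a sketch under stated assumptions. The function name `sanitize_gene_names` and the `sanitize` flag are hypothetical and are not the names used in `src/utils/set_var_index.py`; only the regex comes from the diff in the conversation.

```python
import re

import pandas as pd

# Regex quoted from the reviewed diff; group 1 is the ID without the version.
ensembl_pattern = re.compile(r"^(ENS.*\d{11})(?:\.\d+)?$")


def sanitize_gene_names(index: pd.Index, sanitize: bool = True) -> pd.Index:
    """Hypothetical helper: strip version suffixes from Ensembl IDs.

    With sanitize=False the index is returned unchanged.
    """
    if not sanitize:
        return index
    # Names that do not match the pattern are left as they are by sub().
    cleaned = [ensembl_pattern.sub(r"\1", name) for name in index]
    return pd.Index(cleaned, name=index.name)


names = pd.Index(["ENSMUSG00000017167.6", "AL627309.1"])
print(list(sanitize_gene_names(names)))                  # ['ENSMUSG00000017167', 'AL627309.1']
print(list(sanitize_gene_names(names, sanitize=False)))  # ['ENSMUSG00000017167.6', 'AL627309.1']
```

In this sketch only names matching the Ensembl pattern are altered, so a name such as AL627309.1 keeps its suffix either way; whether the PR behaves the same way is not stated in the description.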
Note: #1083 needs to be merged first
Issue ticket number and link
Closes #xxxx (Replace xxxx with the GitHub issue number)
Checklist before requesting a review
I have performed a self-review of my code
Conforms to the Contributor's guide
Check the correct box. Does this PR contain:
Proposed changes are described in the CHANGELOG.md
CI tests succeed!