Investigate Metadome #634
Comments
Have fired off some questions to Peer.

Fri Jun 10: Read the paper and checked out the site, and it looks like a good idea. A few issues:

Genome build
The data is only available for GRCh37, not GRCh38. I searched Zenodo and there are no other data releases for "MetaDome". It's possible to convert the data to a BED file and then use NCBI Remap, but ideally the Metadome team could provide it in GRCh38 coordinates as well?

Other data?
It only has sw_dn_ds and associated data, i.e. the ClinVar count which is displayed on the site is not available in the TSV data. Not sure if they need this?

Transcript choice
As it's protein-domain based, using the correct transcript seems to be important. If we want to do per-transcript annotation in VariantGrid, then we may run into trouble, e.g. the data may be for "ENST00000420083.1" but we may use a different version of that transcript, or use RefSeq annotation (it is quite hard to map between Ensembl and RefSeq transcripts).

The number of transcripts over a coordinate is:
Transcript count
count 1.088185e+07
However, the number of different sw_dn_ds scores is much lower:
Score count
count 1.088185e+07
So only just under 2% of coordinates have >1 score. Looking at them, though, the scores for a coordinate can vary quite a bit (as you'd expect).

Dealing with indels
If an indel spans multiple coordinates, how do we deal with it? I don't want to show all scores (e.g. "0.1875&0.2005813953488372"), as then we can't store it as a floating-point number and easily do less-than/greater-than searches. So possibly you could take the highest/lowest/average for a particular score.
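To make the "how many coordinates have >1 score" check and the highest/lowest/average idea concrete, here is a minimal sketch. The row layout `(chrom, pos, transcript_id, sw_dn_ds)` and the transcript names are assumptions for illustration; the real TSV columns may be named and ordered differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (chrom, pos, transcript_id, sw_dn_ds).
# These values are made up for illustration, not taken from the MetaDome TSV.
rows = [
    ("chr1", 100, "TX1", 0.1875),
    ("chr1", 100, "TX2", 0.2005813953488372),
    ("chr1", 101, "TX1", 0.5),
    ("chr1", 101, "TX2", 0.5),
    ("chr2", 200, "TX3", 1.2),
]


def scores_by_coordinate(rows):
    """Group the distinct sw_dn_ds scores seen at each (chrom, pos)."""
    grouped = defaultdict(set)
    for chrom, pos, _transcript, score in rows:
        grouped[(chrom, pos)].add(score)
    return grouped


def fraction_multi_score(grouped):
    """Fraction of coordinates that carry more than one distinct score."""
    multi = sum(1 for scores in grouped.values() if len(scores) > 1)
    return multi / len(grouped)


def collapse(grouped, reducer=max):
    """Reduce each coordinate to a single float so it can be stored and range-searched."""
    return {coord: reducer(scores) for coord, scores in grouped.items()}


grouped = scores_by_coordinate(rows)
fraction = fraction_multi_score(grouped)
highest = collapse(grouped, max)
```

Passing `min` or `mean` as `reducer` gives the lowest or average instead. Either way the stored value stays a plain float rather than an "&"-joined string.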
I've requested the GRCh38 data.
After talking to Peer, I think the most value will come from just having the highest number per transcript.
It's designed for SNVs, so no need to worry about indels.
Metadome raised an issue in 2020 about GRCh38 support: cmbi/metadome#62
I pinged it asking about an update.
Another thing to think about is just running it through a liftover ourselves, or just making it GRCh37-only.
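If we go the liftover route, the first step is getting each scored position into BED form. A rough sketch follows. It assumes the TSV positions are 1-based and packs the transcript and score into the BED name column so they survive the remap. Both the function names and the `transcript|score` packing are hypothetical choices, not anything Metadome provides.

```python
def to_bed_line(chrom, pos, transcript_id, score):
    """Convert a 1-based position (assumed) to a BED4 line (0-based, half-open)."""
    start = pos - 1  # BED start is 0-based
    end = pos        # BED end is exclusive, so a single base spans [pos-1, pos)
    name = f"{transcript_id}|{score}"  # hypothetical packing to carry data through the remap
    return f"{chrom}\t{start}\t{end}\t{name}"


def from_bed_line(line):
    """Recover (chrom, 1-based pos, transcript_id, score) from a lifted BED4 line."""
    chrom, _start, end, name = line.rstrip("\n").split("\t")
    transcript_id, score = name.split("|")
    return chrom, int(end), transcript_id, float(score)


bed = to_bed_line("chr1", 100, "TX1", 0.1875)
round_trip = from_bed_line(bed)
```

The resulting file can then be fed to whichever remapping tool we settle on. Positions that fail to lift over would simply be missing from the output and need to be counted.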
Peer says:
We use the Metadome web app for regional constraints within genes all the time in research and diagnostics. Now the data has become available, is that something you could add to VG?
Or is that something that should be included earlier in the pipeline?
Data is available:
https://zenodo.org/record/6625251#.YqHL5BbSWEc
Data looks like:
The transcripts per gene look like:
And it's all Ensembl
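Since the data is keyed on versioned Ensembl transcript IDs (e.g. "ENST00000420083.1"), one possible way to join it against our own annotation is to prefer an exact match and fall back to the version-less accession, flagging the fallback. This is only a sketch: `match_transcript` is a hypothetical helper, and the ".2" version below is invented for the example.

```python
def split_version(transcript_id):
    """Split 'ENST00000420083.1' into ('ENST00000420083', 1); version is None if absent."""
    accession, _, version = transcript_id.partition(".")
    return accession, (int(version) if version else None)


def match_transcript(metadome_id, annotation_ids):
    """Return (matched_id, exact) where exact is False for a version-less fallback."""
    if metadome_id in annotation_ids:
        return metadome_id, True
    accession, _ = split_version(metadome_id)
    for candidate in annotation_ids:
        if split_version(candidate)[0] == accession:
            return candidate, False  # same transcript accession, different version
    return None, False


# The ".2" version is made up to show the fallback path.
exact = match_transcript("ENST00000420083.1", ["ENST00000420083.1"])
fallback = match_transcript("ENST00000420083.1", ["ENST00000420083.2"])
missing = match_transcript("ENST00000420083.1", ["ENST00000000001.1"])
```

A version mismatch does not guarantee the protein sequence is unchanged, so fallback matches would probably want to be marked as such wherever the score is shown. This does nothing for RefSeq transcripts, which would need a separate mapping.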