Investigate Metadome #634

Open
davmlaw opened this issue Jun 10, 2022 · 5 comments

Comments

davmlaw commented Jun 10, 2022

Peer says:

We use the Metadome web app for regional constraints within genes all the time in research and diagnostics. Now the data has become available, is that something you could add to VG?
Or is that something that should be included earlier in the pipeline?

Data is available:

https://zenodo.org/record/6625251#.YqHL5BbSWEc

Data looks like:

chrom,pos_start,pos_stop,strand,symbol,gencode_transcription_id,sw_dn_ds,sw_coverage,sw_size,domain_id,consensus_pos
chr19,10781282,10781284,+,ILF3,ENST00000420083.1,0.1875,0.5238095238095238,10,,
chr19,10781650,10781652,+,ILF3,ENST00000420083.1,0.2005813953488372,0.5714285714285714,10,,
chr19,10781653,10781655,+,ILF3,ENST00000420083.1,0.1914893617021277,0.6190476190476191,10,,
chr19,10781656,10781658,+,ILF3,ENST00000420083.1,0.162,0.6666666666666666,10,,
chr19,10781659,10781661,+,ILF3,ENST00000420083.1,0.14862385321100918,0.7142857142857143,10,,
chr19,10781662,10781664,+,ILF3,ENST00000420083.1,0.13043478260869565,0.7619047619047619,10,,
chr19,10781665,10781667,+,ILF3,ENST00000420083.1,0.10889929742388757,0.8095238095238095,10,,
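A minimal sketch of loading rows like these with the standard library. The `MetadomeRow` class and `parse_rows` helper are hypothetical names for illustration, and the column types are assumptions inferred from the sample (the type of `consensus_pos` is unknown because it is empty in every row shown, so it is kept as a string).

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO

# Header plus the first two rows of the sample above
SAMPLE = """chrom,pos_start,pos_stop,strand,symbol,gencode_transcription_id,sw_dn_ds,sw_coverage,sw_size,domain_id,consensus_pos
chr19,10781282,10781284,+,ILF3,ENST00000420083.1,0.1875,0.5238095238095238,10,,
chr19,10781650,10781652,+,ILF3,ENST00000420083.1,0.2005813953488372,0.5714285714285714,10,,
"""


@dataclass
class MetadomeRow:  # hypothetical container, not part of MetaDome
    chrom: str
    pos_start: int
    pos_stop: int
    strand: str
    symbol: str
    transcript_id: str
    sw_dn_ds: float
    sw_coverage: float
    sw_size: int
    domain_id: Optional[str]
    consensus_pos: Optional[str]


def parse_rows(handle: TextIO) -> Iterator[MetadomeRow]:
    """Yield one typed row per CSV record; empty fields become None."""
    for rec in csv.DictReader(handle):
        yield MetadomeRow(
            chrom=rec["chrom"],
            pos_start=int(rec["pos_start"]),
            pos_stop=int(rec["pos_stop"]),
            strand=rec["strand"],
            symbol=rec["symbol"],
            transcript_id=rec["gencode_transcription_id"],
            sw_dn_ds=float(rec["sw_dn_ds"]),
            sw_coverage=float(rec["sw_coverage"]),
            sw_size=int(rec["sw_size"]),
            domain_id=rec["domain_id"] or None,
            consensus_pos=rec["consensus_pos"] or None,
        )


rows = list(parse_rows(io.StringIO(SAMPLE)))
```

Each sample row spans three bases (`pos_stop - pos_start + 1 == 3`), which suggests one row per codon with inclusive coordinates.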

The number of transcripts per gene looks like:

count    18541.000000
mean         2.252953
std          1.726056
min          1.000000
25%          1.000000
50%          2.000000
75%          3.000000
max         30.000000
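For reference, a rough sketch of how a per-gene transcript count like the one above can be computed without pandas. The `transcripts_per_gene` helper is a hypothetical name, and the `TOYGENE` rows are made-up toy data added only so that more than one gene appears; only the ILF3 row comes from the real file.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """chrom,pos_start,pos_stop,strand,symbol,gencode_transcription_id,sw_dn_ds,sw_coverage,sw_size,domain_id,consensus_pos
chr19,10781282,10781284,+,ILF3,ENST00000420083.1,0.1875,0.5238095238095238,10,,
chrT,100,102,+,TOYGENE,TOY_TX_A.1,0.5,1.0,10,,
chrT,100,102,+,TOYGENE,TOY_TX_B.1,0.6,1.0,10,,
"""


def transcripts_per_gene(records):
    """Map each gene symbol to the number of distinct transcript IDs seen."""
    transcripts = defaultdict(set)
    for rec in records:
        transcripts[rec["symbol"]].add(rec["gencode_transcription_id"])
    return {gene: len(ids) for gene, ids in transcripts.items()}


counts = transcripts_per_gene(csv.DictReader(io.StringIO(SAMPLE)))
```

Summary statistics (mean, quartiles, max) can then be taken over `counts.values()`.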

All of the transcripts are Ensembl.

davmlaw commented Jun 10, 2022

I sent some questions to Peer on Fri Jun 10:

I read the paper and checked out the site, and it looks like a good idea. A few issues:

Genome build

The data is only available for GRCh37, not GRCh38. I searched Zenodo and there are no other data releases for "MetaDome".

It's possible to convert the data to a BED file and then use NCBI Remap, but ideally the Metadome team could provide it in GRCh38 coordinates as well.
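A hedged sketch of the BED conversion step. It assumes the Metadome `pos_start`/`pos_stop` values are 1-based and inclusive (each sample row spans exactly three bases under that reading); this has not been confirmed against the Metadome documentation. BED is 0-based and half-open, so only the start is shifted. The `to_bed_line` helper is a hypothetical name, and packing the transcript and score into the BED name column is just one way to carry them through a remap.

```python
import csv
import io

SAMPLE = """chrom,pos_start,pos_stop,strand,symbol,gencode_transcription_id,sw_dn_ds,sw_coverage,sw_size,domain_id,consensus_pos
chr19,10781282,10781284,+,ILF3,ENST00000420083.1,0.1875,0.5238095238095238,10,,
chr19,10781650,10781652,+,ILF3,ENST00000420083.1,0.2005813953488372,0.5714285714285714,10,,
"""


def to_bed_line(rec):
    """Convert one Metadome record to a 6-column BED line.

    ASSUMPTION: input coordinates are 1-based inclusive.
    """
    start0 = int(rec["pos_start"]) - 1  # BED start is 0-based
    end = int(rec["pos_stop"])          # BED end is exclusive, so unchanged
    # Keep transcript and score in the name field so they survive remapping
    name = f'{rec["gencode_transcription_id"]}|{rec["sw_dn_ds"]}'
    # The BED score column must be an integer, so use a placeholder 0
    return "\t".join([rec["chrom"], str(start0), str(end), name, "0", rec["strand"]])


bed_lines = [to_bed_line(rec) for rec in csv.DictReader(io.StringIO(SAMPLE))]
```

After remapping, the name field can be split on `|` to recover the transcript and score for each GRCh38 interval.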

Other data?

It only has sw_dn_ds and associated data, i.e. the ClinVar count which is displayed on the site is not available in the TSV data. Not sure if this is needed?

Transcript choice

As it's protein-domain based, using the correct transcript seems to be important.

If we want to do per-transcript annotation in VariantGrid, we may run into trouble: e.g. the data may be for "ENST00000420083.1" but we may use a different version of that transcript, or use RefSeq annotation (it is quite hard to map between Ensembl and RefSeq transcripts).
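To make the version problem concrete, here is a small sketch of comparing a Metadome transcript ID against the one we annotate with. Both helper names are hypothetical, and the `.2` version and the RefSeq-style ID used in the usage lines are placeholders, not real annotations.

```python
def split_transcript_version(transcript_id):
    """Split 'ENST00000420083.1' into ('ENST00000420083', 1).

    Returns (accession, None) when there is no '.version' suffix.
    """
    accession, sep, version = transcript_id.partition(".")
    return accession, (int(version) if sep else None)


def match_transcript(metadome_id, our_id):
    """Classify how two transcript IDs relate: 'exact', 'version_mismatch' or 'none'."""
    meta_acc, meta_ver = split_transcript_version(metadome_id)
    our_acc, our_ver = split_transcript_version(our_id)
    if meta_acc != our_acc:
        return "none"
    return "exact" if meta_ver == our_ver else "version_mismatch"


# Placeholder IDs for illustration only
same = match_transcript("ENST00000420083.1", "ENST00000420083.1")
newer = match_transcript("ENST00000420083.1", "ENST00000420083.2")
refseq = match_transcript("ENST00000420083.1", "NM_000000.1")
```

A `version_mismatch` would need a policy decision (accept, warn, or skip), and a `none` against a RefSeq ID would need a separate Ensembl/RefSeq mapping, which this sketch does not attempt.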

The number of transcripts overlapping each coordinate is:

Transcript count

count    1.088185e+07
mean     2.134153e+00
std      1.615101e+00
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      3.000000e+01

However, the number of distinct sw_dn_ds scores per coordinate is much lower:

Score count

count 1.088185e+07
mean 1.021144e+00
std 1.588360e-01
min 1.000000e+00
25% 1.000000e+00
50% 1.000000e+00
75% 1.000000e+00
max 2.200000e+01

So only just under 2% of coordinates have >1 score
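A sketch of how the "more than one score" fraction can be computed. The helper names are hypothetical and the rows below are made-up toy data (chromosome `chrT`, invented scores), not values from the Metadome file.

```python
from collections import defaultdict

# Toy rows: (chrom, pos_start, pos_stop, transcript, sw_dn_ds); all invented
TOY_ROWS = [
    ("chrT", 100, 102, "TOY_TX_A.1", 0.20),
    ("chrT", 100, 102, "TOY_TX_B.1", 0.35),  # same coordinate, different score
    ("chrT", 103, 105, "TOY_TX_A.1", 0.25),
    ("chrT", 103, 105, "TOY_TX_B.1", 0.25),  # same coordinate, same score
    ("chrT", 106, 108, "TOY_TX_A.1", 0.30),
    ("chrT", 109, 111, "TOY_TX_A.1", 0.40),
]


def distinct_scores_per_coordinate(rows):
    """Count the distinct scores seen at each (chrom, start, stop) coordinate."""
    scores = defaultdict(set)
    for chrom, start, stop, _transcript, score in rows:
        scores[(chrom, start, stop)].add(score)
    return {coord: len(values) for coord, values in scores.items()}


def fraction_with_multiple_scores(score_counts):
    """Fraction of coordinates that carry more than one distinct score."""
    multi = sum(1 for n in score_counts.values() if n > 1)
    return multi / len(score_counts)


score_counts = distinct_scores_per_coordinate(TOY_ROWS)
fraction = fraction_with_multiple_scores(score_counts)
```

In the toy data one of four coordinates has two distinct scores, so the fraction is 0.25; on the real file the same calculation is what produces the "just under 2%" figure quoted above.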

Looking at them, though, the scores for a coordinate can vary quite a bit (as you'd expect)

Dealing with Indels

If an indel spans multiple coordinates, how do we deal with it?

I don't want to show all scores (e.g. "0.1875&0.2005813953488372"), as then we can't store the value as a floating point number and easily do less than/greater than searches.

So possibly we could take the highest, lowest or average of the scores the variant overlaps.
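A rough sketch of that aggregation, so a multi-base variant still ends up with a single float. The `aggregate_scores` helper is a hypothetical name, coordinates are assumed to be 1-based inclusive, and the deletion span in the usage lines is invented; the three windows are rows from the sample data in the first comment.

```python
import statistics

# (pos_start, pos_stop, sw_dn_ds) for three consecutive rows of the sample
WINDOWS = [
    (10781650, 10781652, 0.2005813953488372),
    (10781653, 10781655, 0.1914893617021277),
    (10781656, 10781658, 0.162),
]

AGGREGATORS = {"max": max, "min": min, "mean": statistics.fmean}


def aggregate_scores(windows, start, end, how="max"):
    """Collapse the scores of all windows overlapping [start, end] into one float.

    Returns None when the variant overlaps no scored window.
    """
    overlapping = [
        score for w_start, w_stop, score in windows
        if w_start <= end and w_stop >= start
    ]
    if not overlapping:
        return None
    return AGGREGATORS[how](overlapping)


# Hypothetical deletion covering 10781652-10781656, touching all three windows
highest = aggregate_scores(WINDOWS, 10781652, 10781656, how="max")
lowest = aggregate_scores(WINDOWS, 10781652, 10781656, how="min")
average = aggregate_scores(WINDOWS, 10781652, 10781656, how="mean")
```

Which aggregate is most useful is an open question here; the point is only that any of them keeps the column searchable as a plain number.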

davmlaw commented Jun 15, 2022

// the colors used to indicate the tolerance
var toleranceColorGradient = [ {
	offset : "0%",
	color : "#d7191c"
}, {
	offset : "12.5%",
	color : "#e76818"
}, {
	offset : "25%",
	color : "#f29e2e"
}, {
	offset : "37.5%",
	color : "#f9d057"
}, {
	offset : "50%",
	color : "#ffff8c"
}, {
	offset : "62.5%",
	color : "#90eb9d"
}, {
	offset : "75%",
	color : "#00ccbc"
}, {
	offset : "87.5%",
	color : "#00a6ca"
}, {
	offset : "100%",
	color : "#2c7bb6"
} ]

// the color coding for specific tolerance scores
// color #f29e2e indicates the average dn/ds tolerance score over all genes
function tolerance_color(score) {
	if (score <= 0.175) {
		return toleranceColorGradient[0].color;
	} else if (score <= 0.35) {
		return toleranceColorGradient[1].color;
	} else if (score <= 0.525) {
		return toleranceColorGradient[2].color;
	} else if (score <= 0.7) {
		return toleranceColorGradient[3].color;
	} else if (score <= 0.875) {
		return toleranceColorGradient[4].color;
	} else if (score <= 1.025) {
		return toleranceColorGradient[5].color;
	} else if (score <= 1.2) {
		return toleranceColorGradient[6].color;
	} else if (score <= 1.375) {
		return toleranceColorGradient[7].color;
	} else {
		return toleranceColorGradient[8].color;
	}
}
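
If we want the same colour bands on the VariantGrid side, the if/else chain above could be ported to Python roughly as follows. This is a sketch: the thresholds and colours are copied from the JavaScript above, while the `THRESHOLDS` and `COLORS` names are my own.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the first eight bands, copied from the JavaScript
THRESHOLDS = (0.175, 0.35, 0.525, 0.7, 0.875, 1.025, 1.2, 1.375)

# One colour per band, in the same order as toleranceColorGradient
COLORS = (
    "#d7191c", "#e76818", "#f29e2e", "#f9d057", "#ffff8c",
    "#90eb9d", "#00ccbc", "#00a6ca", "#2c7bb6",
)


def tolerance_color(score):
    """Return the hex colour for a sw_dn_ds score.

    bisect_left gives the index of the first threshold >= score, which matches
    the chain of `score <= threshold` checks; scores above the last threshold
    fall through to the final colour.
    """
    return COLORS[bisect_left(THRESHOLDS, score)]


first_sample_color = tolerance_color(0.1875)  # first row of the sample data
```

The first sample row (0.1875) lands in the second band because it is just above the 0.175 cut-off.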

davmlaw commented Jun 20, 2022

I've requested the GRCh38 data.

After talking to Peer, I think the most value will come from just having the highest number per transcript.

It's designed for SNVs, so there's no need to worry about indels.

davmlaw commented Jun 15, 2023

There is a Metadome issue from 2020 about GRCh38 support: cmbi/metadome#62. I pinged it asking for an update.

davmlaw commented Jun 17, 2024

Another option is to run the data through a liftover ourselves, or just make it GRCh37-only.
