Transcript Versions and Python HGVS library discussion
Both Ensembl and RefSeq transcripts are versioned: the version is the number after the dot in the transcript ID, e.g. "NM_000124.2".
For example, the transcript NM_000124 (ERCC6) changed from NM_000124.2 in GRCh37.p9 to NM_000124.3 in GRCh37.p10.
Using versions is critical for resolving HGVS back to variant coordinates, as the exon boundaries may have changed between versions.
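As a small illustration of keeping the version attached to the ID, here is a minimal sketch that splits an accession into its identifier and version. The names `Accession` and `split_accession` are made up for this example and are not part of either HGVS library.

```python
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    identifier: str          # e.g. "NM_000124"
    version: Optional[int]   # e.g. 2, or None if no version was given


def split_accession(accession: str) -> Accession:
    """Split "NM_000124.2" into ("NM_000124", 2).

    Hypothetical helper for illustration only.
    """
    identifier, dot, version = accession.partition(".")
    if not dot:
        # No dot at all: an unversioned ID
        return Accession(identifier, None)
    if not version.isdigit():
        raise ValueError(f"Bad transcript version in {accession!r}")
    return Accession(identifier, int(version))


old = split_accession("NM_000124.2")
new = split_accession("NM_000124.3")
# Same transcript, different versions: coordinates must not be shared blindly
same_transcript = old.identifier == new.identifier
same_version = old.version == new.version
```

Here `same_transcript` is true but `same_version` is false, which is exactly the case where reusing exon coordinates from one version for the other could give wrong results.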
A transcript version is a sequence (plus annotations marking coding start/end). These sequences are aligned to reference genomes (e.g. using Splign) to produce exon coordinates for that build (in GTF/GFF).
Not all transcript versions have been officially released by RefSeq/Ensembl for all genome builds.
Ensembl transcripts always match the reference sequence, but some RefSeq transcript sequences differ from the reference genome, so the alignment can have gaps.
We can adjust for these gaps if we have "cDNA_match" alignment information, but cannot currently handle partial alignments.
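To show why gaps matter, here is a minimal sketch of mapping a transcript position to a genomic position through a list of ungapped aligned blocks. It assumes the alignment has already been broken into blocks (the `Block` layout and `tx_to_genome` are hypothetical, not the data model of either library, and not a parser for the "cDNA_match" format itself).

```python
from typing import List, Optional, Tuple

# Hypothetical block layout: (genome_start, genome_end, tx_start, tx_end),
# all 1-based inclusive, each block aligned without gaps.
Block = Tuple[int, int, int, int]


def tx_to_genome(tx_pos: int, blocks: List[Block], strand: str = "+") -> Optional[int]:
    """Return the genomic position for a transcript position.

    Returns None if the position falls in a gap, i.e. the transcript
    base has no counterpart in the reference genome.
    """
    for g_start, g_end, t_start, t_end in blocks:
        if t_start <= tx_pos <= t_end:
            offset = tx_pos - t_start
            # On the minus strand, transcript start maps to the genomic end
            return g_start + offset if strand == "+" else g_end - offset
    return None


# Made-up example: one exon at genome 1000-1099 where the transcript
# carries 2 extra bases (transcript positions 51-52) absent from the genome.
blocks = [
    (1000, 1049, 1, 50),
    (1050, 1099, 53, 102),
]

before_gap = tx_to_genome(50, blocks)   # last base before the gap
in_gap = tx_to_genome(51, blocks)       # inserted base, no genomic position
after_gap = tx_to_genome(53, blocks)    # first base after the gap
```

Ignoring the gap and using a plain offset from the exon start would place transcript position 53 at 1052 rather than 1050, so every position downstream of the gap would be shifted by two bases.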
https://github.com/biocommons/hgvs/
Makes use of the Universal Transcript Archive (UTA), a collection of transcript versions aligned to different genomes (so may contain )
Pros:
- Handles alignment gaps
- UTA does its own alignments, meaning there are many transcript versions aligned to genome builds not in the "official" releases
Cons:
- Doesn't support Ensembl (very old data pre-versions), see issue
- UTA is missing many transcripts including the latest ones
- Local install didn't work on my machine (see issue), and the hosted database uses Postgres, so it is firewalled in many of our intranet installations
- UTA is complicated, so it seems difficult to generate the annotation data ourselves
https://github.com/counsyl/hgvs/
Pros:
- Simpler: annotations use a simple dictionary structure, so we can easily generate them from the latest GTFs
Cons:
- Doesn't support alignment gaps
- Doesn't support a few HGVS features (n./m.)
Critically, we needed Ensembl transcripts, so had to go with PyHGVS (counsyl/hgvs).
We fixed a number of bugs, added features, and have generated as many transcripts as we can.
We intend to push our fixes and generated transcripts back out to the community (both Python libraries above).
Later, we decided to distribute transcripts as gzipped JSON and wrote loader programs for both libraries; see the https://github.com/SACGF/cdot/ project.
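The general idea of shipping transcripts as gzipped JSON can be sketched with the standard library alone. The record layout below is a made-up placeholder (including the exon coordinates) and is not the actual cdot schema; `write_transcripts` and `read_transcripts` are hypothetical helpers, not cdot functions.

```python
import gzip
import json
import os
import tempfile


def write_transcripts(path, transcripts):
    """Write a dict of transcript records as gzipped JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(transcripts, f)


def read_transcripts(path):
    """Load a dict of transcript records from gzipped JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Placeholder record keyed by versioned accession; coordinates are invented.
transcripts = {
    "NM_000124.3": {
        "gene_name": "ERCC6",
        "exons": [[100, 200], [300, 400]],
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "transcripts.json.gz")
    write_transcripts(path, transcripts)
    loaded = read_transcripts(path)

# Look up by the full versioned ID, not just "NM_000124"
record = loaded.get("NM_000124.3")
```

Keying records by the versioned accession means a lookup for a version that was never generated returns nothing instead of silently falling back to a different version.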