
Transcript Versions and Python HGVS library discussion

Dave Lawrence edited this page Apr 27, 2022 · 4 revisions

Background

Both Ensembl and RefSeq transcripts are versioned: the version is the number after the dot in the transcript ID, e.g. "NM_000124.2".

For example, the transcript NM_000124 (ERCC6) changed from NM_000124.2 in GRCh37.p9 to NM_000124.3 in GRCh37.p10.

Using versions is critical for resolving HGVS back to variant coordinates, as the exon boundaries may have changed between versions.
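As a minimal sketch of keeping the version explicit, the helper below splits a versioned ID into accession and version. The function name `split_transcript_version` is made up for illustration and is not part of either library discussed on this page.

```python
def split_transcript_version(transcript_id):
    """Split 'NM_000124.2' into ('NM_000124', 2).

    Returns (transcript_id, None) when there is no numeric version suffix.
    """
    accession, dot, version = transcript_id.rpartition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return transcript_id, None


accession, version = split_transcript_version("NM_000124.2")
```

Keeping the version as a separate integer makes it easy to notice when an HGVS string refers to a version other than the one you have coordinates for.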

Transcript version coordinates

A transcript version is a sequence (plus annotations marking the coding start/end). It is aligned to a reference genome (e.g. using Splign) to produce exon coordinates for that build (in GTF/GFF).
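To illustrate what "exon coordinates for that build" look like in practice, here is a rough sketch that collects exon rows per transcript from GTF lines. It assumes the standard 9-column tab-separated GTF layout and that the `transcript_id` attribute already includes the version (some GTFs store the version in a separate attribute instead). The function name `exons_by_transcript` is hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def exons_by_transcript(gtf_lines):
    """Return {transcript_id: [(contig, start, end, strand), ...]} sorted by start.

    Coordinates are kept as written in the GTF (1-based, inclusive).
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "exon":
            continue  # only exon features carry the boundaries we need
        match = TRANSCRIPT_ID_RE.search(columns[8])
        if match:
            contig, start, end, strand = columns[0], int(columns[3]), int(columns[4]), columns[6]
            exons[match.group(1)].append((contig, start, end, strand))
    return {tid: sorted(rows, key=lambda row: row[1]) for tid, rows in exons.items()}
```

Because the same transcript version can align differently to each build, a dictionary like this would need to be built per genome build.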

Not all transcript versions have been officially released by RefSeq/Ensembl for all genome builds.

Alignment gaps

Ensembl transcripts always match the reference sequence, but some RefSeq transcript sequences differ from the reference genome, so the alignment can have gaps.

We can adjust for these gaps if we have "cDNA_match" alignment information, but cannot currently handle partial alignments.
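The gap adjustment can be sketched as follows. This is an illustration only, not the code used by either library. It assumes the alignment is described by a GFF3-style `Gap` string such as "M5 I2 M3 D4 M2", where `M` is a match, `I` marks bases present only in the transcript, and `D` marks bases present only in the genome. It also assumes a plus-strand alignment and 0-based offsets within one aligned segment. The function name is hypothetical.

```python
import re


def transcript_to_genome_offset(gap, transcript_offset):
    """Map a 0-based transcript offset to a 0-based genome offset within one segment.

    Returns None when the transcript base is an insertion with no genomic position.
    """
    if transcript_offset < 0:
        raise ValueError("offset must be non-negative")
    genome_pos = 0
    transcript_pos = 0
    for op, length_text in re.findall(r"([MID])(\d+)", gap):
        length = int(length_text)
        if op == "M":  # consumes both transcript and genome
            if transcript_offset < transcript_pos + length:
                return genome_pos + (transcript_offset - transcript_pos)
            genome_pos += length
            transcript_pos += length
        elif op == "I":  # consumes transcript only
            if transcript_offset < transcript_pos + length:
                return None
            transcript_pos += length
        else:  # "D" consumes genome only
            genome_pos += length
    raise ValueError("offset is beyond the aligned segment")
```

Without this kind of adjustment, every position downstream of a gap would be shifted by the gap length, which is why ignoring gaps gives wrong coordinates for the affected RefSeq transcripts.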

Python HGVS library comparison

BioCommons HGVS

https://github.com/biocommons/hgvs/

Makes use of the Universal Transcript Archive (UTA) - a collection of transcript versions aligned to different genomes (so may contain )

Pros:

  • Handles alignment gaps
  • UTA does its own alignments, meaning there are many transcript versions aligned to genome builds not in the "official" releases

Cons:

  • Doesn't support Ensembl (only very old, pre-version data); see issue
  • UTA is missing many transcripts, including the latest ones
  • Local install didn't work on my machine (see issue), and the hosted database uses Postgres, so it is firewalled in many of our intranet installations
  • UTA is complicated and thus it seems difficult to generate the annotation data ourselves

Counsyl PyHGVS

https://github.com/counsyl/hgvs/

Pros:

  • Simpler - annotations use simple dictionary structure so we can easily generate from latest GTFs

Cons:

  • Doesn't support alignment gaps
  • Doesn't support a few HGVS features (n./m.)

Decision

Critically, we needed Ensembl transcripts, so had to go with PyHGVS.

We fixed a number of bugs, added features, and have generated as many transcripts as we can.

We intend to push our fixes and generated transcripts back out to the community (both Python libraries above).

cdot

Later, we decided to distribute transcripts as gzipped JSON and wrote loaders for both libraries; see the https://github.com/SACGF/cdot/ project.
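For readers unfamiliar with the idea, gzipped JSON can be written and read with the Python standard library alone. The sketch below is generic: the function names and the dictionary layout in the usage note are made up for illustration and are not the cdot file schema or API.

```python
import gzip
import json


def write_transcripts(path, transcripts):
    """Write a JSON-serialisable dict of transcripts to a gzipped JSON file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(transcripts, handle)


def read_transcripts(path):
    """Load a dict of transcripts back from a gzipped JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

A hypothetical call would be `write_transcripts("transcripts.json.gz", {"NM_000124.3": {"exons": [[100, 150], [200, 300]]}})`. A single compressed file like this needs no database server, which sidesteps the firewall problem noted for hosted UTA.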
