Gene and Transcript Version Data

Dave Lawrence edited this page Aug 15, 2024 · 14 revisions

Background

See Transcript Versions and Python HGVS library discussion. We use a modified version of PyHGVS for HGVS resolution and maintain a spin-off project, cdot (https://github.com/SACGF/cdot/), which collects transcripts in JSON.gz format and provides loaders for both Python HGVS libraries.

Obtaining annotation files

  • Download Transcript Version Data - quicker install from pre-generated files
  • Generating Transcript Version Data - generate from public data

Gene Annotation - install

The gene/transcript and gene/symbol relationships (and the HGNC symbol for a gene) change over time. When we merge data from historical files, we always keep the most recent relationship. See the cdot instructions for creating transcript JSON.gz files.

Run the following to download the latest cdot data and insert into the database:

python3 manage.py import_cdot_latest

Gene Annotation Release

In the step above, we keep the latest versions after merging many GTFs, eg the final transcript/gene/symbol relationship will be from the most recent file containing the transcript.
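The "latest file wins" merge described above can be sketched as follows. This is a minimal illustration, not the actual VariantGrid or cdot code: the function name `merge_keep_latest`, the tuple layout and the accessions and symbols are all hypothetical placeholders.

```python
def merge_keep_latest(files):
    """Merge per-file transcript -> gene symbol maps.

    `files` is a list of (release_number, mapping) tuples (a hypothetical
    layout). The most recent file containing a transcript decides its symbol.
    """
    merged = {}
    # Process oldest to newest so that later releases overwrite earlier ones.
    for _release, mapping in sorted(files, key=lambda item: item[0]):
        merged.update(mapping)
    return merged


# Made-up accessions and symbols, for illustration only.
files = [
    (100, {"TX_0001.1": "NEW_SYMBOL"}),
    (87, {"TX_0001.1": "OLD_SYMBOL", "TX_0002.1": "GENE_B"}),
]
merged = merge_keep_latest(files)
print(merged)
```

A transcript that only appears in an older file (here `TX_0002.1`) keeps the relationship from that file, since no newer file overrides it.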

For consistent analyses, we need to keep a snapshot of a certain state, eg the gene/symbol mappings for Ensembl release 100 (the release used by a GRCh38 Ensembl variant annotation) so a gene list filter always returns the same results.

To do this, we create a GeneAnnotationRelease which is loaded from a single RefSeq/Ensembl GTF for each Variant Annotation Version. You need to do this AFTER the normal annotations have been imported (eg the merged file that also contains the individual release)
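To see why a snapshot matters, here is a small sketch comparing a gene list lookup against the merged "latest" data and against a fixed release. The dictionaries, symbols and the helper `genes_for_symbols` are hypothetical and only stand in for the real GeneAnnotationRelease tables.

```python
from types import MappingProxyType


def genes_for_symbols(symbol_to_gene, symbols):
    """Resolve a list of gene symbols to gene IDs using one mapping."""
    return sorted(symbol_to_gene[s] for s in symbols if s in symbol_to_gene)


# Snapshot loaded once from a single release file, then treated as read-only.
release_snapshot = MappingProxyType({"SYMBOL_A": "GENE_1", "SYMBOL_B": "GENE_2"})

# Merged data keeps changing as newer files are imported.
latest = {"SYMBOL_A": "GENE_1", "SYMBOL_B": "GENE_2"}
latest["SYMBOL_A"] = "GENE_3"  # a later import re-assigns the symbol

gene_list = ["SYMBOL_A", "SYMBOL_B"]
print(genes_for_symbols(latest, gene_list))            # follows the newest import
print(genes_for_symbols(release_snapshot, gene_list))  # stable for this release
```

The lookup against the snapshot returns the same genes no matter how many newer files are merged afterwards, which is the property an analysis tied to one annotation version needs.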

Gene Annotation Release - find version

In VariantGrid, go to the annotation page, then version. In the grid at the bottom, the 2nd column "VEP" will be eg 100, 103 etc and the 3rd column "Annotation Consortium" will be Ensembl or RefSeq. You need to do the following once for each VEP annotation version you use (if you are using both 37/38 you will need to do 2 gene annotation releases).

ENSEMBL:

RefSeq:

  • In "Annotation Version" page, expand version and look for "Refseq" entry, eg: 105.20220307 - GCF_000001405.25_GRCh37.p13_genomic.gff

Gene Annotation Release - install for VEP 110

# RefSeq GRCh37
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh37 --json-file /tmp/cdot-0.2.26.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz --release=GRCh37_refseq_105_20220307

# RefSeq GRCh38
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh38 --json-file /tmp/cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz --release=GRCh38_refseq_40_RS_2023_10

# Ensembl GRCh37 - 37 has been staying on release 87
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh37 --json-file /tmp/cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz --release=GRCh37_ensembl_87

# Ensembl GRCh38 - 38 updates with each Ensembl Release
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.ensembl.Homo_sapiens.GRCh38.112.gff3.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh38 --json-file /tmp/cdot-0.2.26.ensembl.Homo_sapiens.GRCh38.112.gff3.json.gz --release=GRCh38_ensembl_112
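The four imports above differ only in consortium, genome build, file and release name, so they could be scripted. The sketch below only builds the argument list, using the flags shown in the commands above; the helper name `import_command` is hypothetical, and it assumes the script is run from the directory containing manage.py.

```python
import shlex


def import_command(consortium, genome_build, json_file, release):
    """Build the import_gene_annotation command line as an argument list."""
    return [
        "python3", "manage.py", "import_gene_annotation",
        f"--annotation-consortium={consortium}",
        f"--genome-build={genome_build}",
        "--json-file", json_file,
        f"--release={release}",
    ]


cmd = import_command(
    "Ensembl",
    "GRCh38",
    "/tmp/cdot-0.2.26.ensembl.Homo_sapiens.GRCh38.112.gff3.json.gz",
    "GRCh38_ensembl_112",
)
# Print the command for review before running it.
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` once the JSON.gz file has been downloaded.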

Django Admin

In Django Admin, open "variant_annotation_version" then set "gene_annotation_release" to what was uploaded in the previous step, then save.
