Gene and Transcript Version Data
See the Transcript Versions and Python HGVS library discussion. We use a modified version of PyHGVS for HGVS resolution, and have a spin-off project, cdot (https://github.com/SACGF/cdot/), which collects transcripts in JSON.gz format and provides loaders for both Python HGVS libraries.
- Download Transcript Version Data - quicker install from pre-generated files
- Generating Transcript Version Data - generate from public data
The gene/transcript and gene/symbol relationships (and the symbols assigned by HGNC) change over time. When we merge data from historical files, we always keep the latest one. See the cdot instructions for creating transcript JSON.gz files.
Run the following to download the latest cdot data and insert into the database:
python3 manage.py import_cdot_latest
In the step above, we keep the latest versions after merging many GTFs, eg the final transcript/gene/symbol relationship will be from the most recent file containing the transcript.
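The "latest file wins" rule described above can be illustrated with a minimal Python sketch. This is not the cdot or VariantGrid implementation; the function name `merge_releases`, the record layout, and the transcript/gene identifiers are all hypothetical and exist only to show the idea.

```python
def merge_releases(releases):
    """Merge per-file transcript data, oldest file first.

    Later files overwrite earlier ones, so each transcript ends up with the
    gene/symbol from the most recent file that contains it.
    """
    merged = {}
    for source_name, transcripts in releases:
        for transcript_id, info in transcripts.items():
            merged[transcript_id] = {**info, "source": source_name}
    return merged


# Hypothetical data: two files, listed oldest first
releases = [
    ("older_file", {
        "TX_1.1": {"gene": "GENE_1", "symbol": "OLD_SYMBOL"},
        "TX_2.1": {"gene": "GENE_2", "symbol": "SYMBOL_2"},
    }),
    ("newer_file", {
        "TX_1.1": {"gene": "GENE_1", "symbol": "NEW_SYMBOL"},
    }),
]

merged = merge_releases(releases)
print(merged["TX_1.1"]["symbol"])  # NEW_SYMBOL (taken from newer_file)
print(merged["TX_2.1"]["source"])  # older_file (the only file containing it)
```

A transcript that appears only in an older file keeps that file's gene/symbol, which is why the merged data does not correspond to any single release.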
For consistent analyses, we need to keep a snapshot of a certain state, eg the gene/symbol mappings for Ensembl release 100 (the release used by a GRCh38 Ensembl variant annotation) so a gene list filter always returns the same results.
To do this, we create a GeneAnnotationRelease which is loaded from a single RefSeq/Ensembl GTF for each Variant Annotation Version. You need to do this AFTER the normal annotations have been imported (eg the merged file that also contains the individual release)
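As a toy illustration of why a pinned release matters, the sketch below resolves the same gene list against a frozen per-release mapping and against the merged "latest" mapping. It does not use VariantGrid's models; `freeze_release`, `resolve_gene_list`, and all symbols and gene IDs are hypothetical.

```python
from types import MappingProxyType


def freeze_release(symbol_to_gene):
    """Return a read-only copy of one release's symbol -> gene mapping."""
    return MappingProxyType(dict(symbol_to_gene))


def resolve_gene_list(symbols, symbol_to_gene):
    """Resolve symbols to gene IDs, skipping symbols unknown to this mapping."""
    return sorted(symbol_to_gene[s] for s in symbols if s in symbol_to_gene)


# Hypothetical state of the merged data when the release was loaded
latest = {"SYM_A": "GENE_1", "SYM_B": "GENE_2"}
release_snapshot = freeze_release(latest)

# Later, a newer file reassigns SYM_B in the merged data
latest["SYM_B"] = "GENE_3"

gene_list = ["SYM_A", "SYM_B"]
pinned = resolve_gene_list(gene_list, release_snapshot)
current = resolve_gene_list(gene_list, latest)
print(pinned)   # ['GENE_1', 'GENE_2']
print(current)  # ['GENE_1', 'GENE_3']
```

The pinned result is unchanged by later imports, which is the property a gene list filter needs for repeatable analyses.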
In VariantGrid, go to the annotation page, then version. In the grid at the bottom, the 2nd column "VEP" will be eg 100, 103 etc and 3rd column "Annotation Consortium" will be Ensembl or RefSeq. You only need to do the following for each VEP annotation version you use (if you are using both 37/38 you will need to do 2 gene annotation releases)
Ensembl:
- GRCh37: will probably remain frozen at v87, eg see GRCh37 release 112
- GRCh38: the release equals the VEP version, eg see GRCh38 release 112
RefSeq:
- In the "Annotation Version" page, expand the version and look for the "Refseq" entry, eg:
105.20220307 - GCF_000001405.25_GRCh37.p13_genomic.gff
# RefSeq GRCh37
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh37 --json-file cdot-0.2.26.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz --release=GRCh37_refseq_105_20220307
# RefSeq GRCh38
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh38 --json-file /tmp/cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz --release=GRCh38_refseq_40_RS_2023_10
# Ensembl GRCh37 - 37 has been staying on release 87
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh37 --json-file /tmp/cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz --release=GRCh37_ensembl_87
# Ensembl GRCh38 - 38 updates with each Ensembl Release
cd /tmp
wget https://github.com/SACGF/cdot/releases/download/data_v0.2.26/cdot-0.2.26.ensembl.Homo_sapiens.GRCh38.112.gff3.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh38 --json-file /tmp/cdot-0.2.26.ensembl.Homo_sapiens.GRCh38.112.gff3.json.gz --release=GRCh38_ensembl_112
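The four download/import pairs above follow one pattern, so they can be generated from a small table. The sketch below only prints the commands; it does not run them. The file names, release names and URL prefix are copied from the commands above, while `build_commands` and the `RELEASES` table are hypothetical helpers, not part of VariantGrid or cdot. Note it downloads with `wget -O` to an explicit path rather than using `cd /tmp`.

```python
import shlex

CDOT_RELEASE_URL = "https://github.com/SACGF/cdot/releases/download/data_v0.2.26"

# (annotation consortium, genome build, cdot file name, release name)
RELEASES = [
    ("RefSeq", "GRCh37",
     "cdot-0.2.26.GCF_000001405.25_GRCh37.p13_genomic.105.20220307.gff.json.gz",
     "GRCh37_refseq_105_20220307"),
    ("RefSeq", "GRCh38",
     "cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz",
     "GRCh38_refseq_40_RS_2023_10"),
    ("Ensembl", "GRCh37",
     "cdot-0.2.26.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz",
     "GRCh37_ensembl_87"),
    ("Ensembl", "GRCh38",
     "cdot-0.2.26.ensembl.Homo_sapiens.GRCh38.112.gff3.json.gz",
     "GRCh38_ensembl_112"),
]


def build_commands(consortium, genome_build, filename, release, download_dir="/tmp"):
    """Return (wget command, import command) as argument lists."""
    local_path = f"{download_dir}/{filename}"
    wget_cmd = ["wget", "-O", local_path, f"{CDOT_RELEASE_URL}/{filename}"]
    import_cmd = [
        "python3", "manage.py", "import_gene_annotation",
        f"--annotation-consortium={consortium}",
        f"--genome-build={genome_build}",
        "--json-file", local_path,
        f"--release={release}",
    ]
    return wget_cmd, import_cmd


# Print the commands so they can be reviewed before running them by hand
for row in RELEASES:
    for cmd in build_commands(*row):
        print(shlex.join(cmd))
```

Only the rows matching the VEP annotation versions you actually use need to be kept in the table.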
Django Admin
In Django Admin, open "variant_annotation_version" then set "gene_annotation_release" to what was uploaded in the previous step, then save.