|
| 1 | +Release Notes |
| 2 | +============= |
| 3 | + |
| 4 | +This is a summary of changes to the internals of the database and how we |
| 5 | +process data. The format is based on `Keep a Changelog |
| 6 | +<http://keepachangelog.com/en/1.0.0/>`_. reStructuredText is used because |
| 7 | +it provides more structure to the text. |
| 8 | + |
| 9 | +Release 9 |
| 10 | +--------- |
| 11 | + |
| 12 | +Added |
| 13 | +````` |
| 14 | + |
| 15 | +- Imported RGD data. |
| 16 | + |
| 17 | + This import may be a one-off task. One issue is that RGD does not |
| 18 | + provide sequences on its FTP site. The sequences can be extracted |
| 19 | + from the GFF files and chromosomes they provide; however, I didn't want to add |
| 20 | + that complexity. I asked them to provide sequences on an ongoing basis, but |
| 21 | + they may have considered it a one-off task. We will have to follow up with |
| 22 | + them on this for long-term updates. |
| 23 | + |
| 24 | +- Created Rfam annotation export. This is now officially part of our FTP export. |
| 25 | + |
| 26 | +- Added ``rnc_genomic_mapping`` table. This table stores the inferred locations |
| 27 | + for particular sequences. |
| 28 | + |
| 29 | +- Added a ``has_coordinates`` column to ``rnc_rna_precomputed``. This column |
| 30 | + reflects whether a UPI/taxid pair has any known genomic locations. It |
| 31 | + summarizes whether there are any entries in ``rnc_coordinates`` and |
| 32 | + ``rnc_genomic_mapping``. It defaults to ``false`` and isn't currently updated |
| 33 | + for entries where taxid is null. |
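The meaning of ``has_coordinates`` can be illustrated with a minimal sketch. It assumes that a location in either table is enough to set the flag and that the two tables have been reduced to in-memory sets of (UPI, taxid) pairs; the function name and the sample identifiers are hypothetical and are not taken from the pipeline.

```python
def has_coordinates(upi, taxid, coordinate_pairs, mapping_pairs):
    """Return True if the UPI/taxid pair has a location in either table.

    Assumption: an entry in either rnc_coordinates or rnc_genomic_mapping
    is sufficient to set the flag.
    """
    if taxid is None:
        # Entries without a taxid are not updated, so they keep the default.
        return False
    key = (upi, taxid)
    return key in coordinate_pairs or key in mapping_pairs


# Hypothetical table contents, reduced to (upi, taxid) pairs.
coordinate_pairs = {("URS0000000001", 9606)}
mapping_pairs = {("URS0000000002", 10090)}

print(has_coordinates("URS0000000001", 9606, coordinate_pairs, mapping_pairs))
print(has_coordinates("URS0000000003", 9606, coordinate_pairs, mapping_pairs))
```

In the database itself this would more naturally be a single ``UPDATE`` over the two tables; the Python form is only meant to make the rule explicit.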
| 34 | + |
| 35 | +Changed |
| 36 | +``````` |
| 37 | + |
| 38 | +- GENCODE accessions are now namespaced: ``GENCODE:`` is prefixed to the |
| 39 | + corresponding Ensembl id to produce the GENCODE accession. |
| 40 | + |
| 41 | +- All ENA parsing is done in Python. This has several effects: |
| 42 | + |
| 43 | + 1. We do not write out both the original ENA entry and the expert database |
| 44 | + entry for sequences that come from an expert database. We modify the |
| 45 | + original ENA entry and only produce data for the expert database. |
| 46 | + |
| 47 | + 2. We determine which sequences come from expert databases via the `XREF |
| 48 | + service <https://www.ebi.ac.uk/ena/browse/xref-service-rest>`_ instead of |
| 49 | + parsing ``DR`` lines. We think the XREF service is more reliable than ``DR`` |
| 50 | + lines, as it seems to be updated automatically and much faster, while |
| 51 | + historically ``DR`` lines have not been. |
| 52 | + |
| 53 | + 3. This may change how some columns are populated. The ``locus_tag`` field |
| 54 | + will now be populated with data from the ``locus`` qualifier from the EMBL |
| 55 | + file. For TAIR the value in the ``locus`` qualifier was used for |
| 56 | + ``optional_id`` in the database. |
| 57 | + |
| 58 | + 4. Fields in ``rnc_accessions`` which used to be the empty string are now |
| 59 | + ``NULL``. |
| 60 | + |
| 61 | + 5. The data in the ``note`` and ``db_xref`` fields are now JSON data structures. For |
| 62 | + ``note`` the data structure will have ``text`` and ``ontology`` fields. The |
| 63 | + ``ontology`` field is a mapping from a selected ontology (GO, SO, ECO) to a |
| 64 | + list of the terms that have been extracted. The ``text`` field contains a |
| 65 | + list of all strings that could not be recognized as terms from an |
| 66 | + ontology. As an example: |
| 67 | + |
| 68 | + .. code:: json |
| 69 | + |
| 70 | + { |
| 71 | +   "ontology": [ |
| 72 | +     "ECO:0000202", |
| 73 | +     "GO:0030533", |
| 74 | +     "SO:0000253" |
| 75 | +   ], |
| 76 | +   "text": [ |
| 77 | +     "Covariance Model: Bacteria; CM Score: 87.61", |
| 78 | +     "Legacy ID: chr.trna3-GlyGCC" |
| 79 | +   ] |
| 80 | + } |
| 81 | + |
| 82 | + The ``db_xref`` field will contain data from the ``db_xref`` qualifier of |
| 83 | + the record, as well as data from the comment field. Additionally, there |
| 84 | + will be an ``ena_refs`` field which contains the information from the ``DR`` |
| 85 | + lines, excluding the MD5 xref. The keys will be the database in upper case, |
| 86 | + and each value will be a list of the primary id and the secondary id, which will be |
| 87 | + ``null`` if the database has no secondary id. |
| 88 | + |
| 89 | + .. code:: json |
| 90 | + |
| 91 | + { |
| 92 | +   "ena_refs": { |
| 93 | +     "WORMBASE": ["WBGene00196009", "F19C6.9"], |
| 94 | +     "BIOSAMPLE": ["SAMEA3138177", null] |
| 95 | +   } |
| 96 | + } |
| 97 | + |
| 98 | + 6. More publications are extracted. This will pull publication data from the |
| 99 | + ``experiment`` qualifier. Sometimes this qualifier contains a string like |
| 100 | + ``PMID: 10, 12``, in which case those numbers are extracted and then used to |
| 101 | + look up the publication information from Europe PMC. |
| 102 | + |
| 103 | + 7. The ``anticodon`` field is populated more often. This will try to pull the |
| 104 | + sequence from the ``anticodon`` qualifier, or the ``gene`` qualifier, or |
| 105 | + from the ``note`` qualifier, using a variety of patterns. |
| 106 | + |
| 107 | +- Database mappings are generated by this pipeline instead of the webapp. There |
| 108 | + is only one intentional change from the previous format, and that is adding |
| 109 | + chain ids to PDBe mappings as per `rnacentral-webcode/288 |
| 110 | + <https://github.com/RNAcentral/rnacentral-webcode/issues/288>`_. |
| 111 | + |
| 112 | +- Other exports are now done by this pipeline and are expected to produce the |
| 113 | + same results. They are: |
| 114 | + |
| 115 | + - MD5 mapping export |
| 116 | + - FASTA export |
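
The ``DR``-line cross-reference structure described under the ENA parsing changes can be sketched as below. It assumes the usual EMBL flat-file layout of ``DR   database; primary id; secondary id.`` and keeps only the last line seen for each database; the function name is hypothetical and is not the pipeline's API.

```python
def parse_dr_lines(lines):
    """Build a mapping of database name to [primary id, secondary id]."""
    refs = {}
    for line in lines:
        if not line.startswith("DR"):
            continue
        # Drop the "DR" line code and the trailing period, then split fields.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        fields = [f for f in fields if f]
        if len(fields) < 2:
            continue
        database = fields[0].upper()
        if database == "MD5":
            # The MD5 xref is excluded from the stored structure.
            continue
        secondary = fields[2] if len(fields) > 2 else None
        # Assumption: a later line for the same database replaces an earlier one.
        refs[database] = [fields[1], secondary]
    return refs


dr_lines = [
    "DR   MD5; 0123456789abcdef0123456789abcdef.",
    "DR   WormBase; WBGene00196009; F19C6.9.",
    "DR   BioSample; SAMEA3138177.",
]
print({"ena_refs": parse_dr_lines(dr_lines)})
```

A missing secondary id becomes ``None`` here, which serializes to ``null`` in JSON.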