RNAcentral
diff --git a/RELEASE.rst b/RELEASE.rst
diff --git a/bin/ftp-export b/bin/ftp-export
diff --git a/bin/import-data b/bin/import-data
diff --git a/bin/pipeline b/bin/pipeline
diff --git a/bin/scheduler b/bin/scheduler
diff --git a/bin/search-export b/bin/search-export
diff --git a/bin/update b/bin/update
diff --git a/bin/update-config b/bin/update-config
diff --git a/data/ena/wgs_acnt01_pro.ncr b/data/ena/ncr/ex/wgs_acnt01_pro.ncr
diff --git a/data/ena/wgs_aacd01_fun.ncr b/data/ena/ncr/wgs/aa/wgs_aacd01_fun.ncr
@@ -0,0 +1,116 @@
+Release Notes
+=============
+
+This is a summary of changes to the internals of the database and how we
+process data. The format is based on `Keep a Changelog
+<http://keepachangelog.com/en/1.0.0/>`_. reStructuredText is used because
+it provides more structure to the text.
+
+Release 9
+---------
+
+Added
+`````
+
+- Imported RGD data.
+
+  This import may be a one-off task. One issue is that RGD does not provide
+  sequences on its FTP site. The sequences can be extracted from the GFF
+  files and chromosomes RGD provides, but I didn't want to add that
+  complexity. I asked RGD to provide sequences on an ongoing basis, but they
+  may have considered it a one-off request. We will have to follow up with
+  them on this for long-term updates.
+
+- Created an Rfam annotation export. This is now officially part of our FTP export.
+
+- Added ``rnc_genomic_mapping`` table. This table stores the inferred locations
+  for particular sequences.
+
+- Added a ``has_coordinates`` column to ``rnc_rna_precomputed``. This column
+  reflects whether a UPI/taxid pair has any known genomic locations. It
+  summarizes whether there are any entries in ``rnc_coordinates`` and
+  ``rnc_genomic_mapping``. It defaults to ``false`` and isn't currently updated
+  for entries where taxid is null.
+
+Changed
+```````
+
+- GENCODE accessions are now namespaced: ``GENCODE:`` is prefixed to the
+  corresponding Ensembl id to produce the GENCODE accession.
+
+- All ENA parsing is done in Python. This has several effects:
+
+  1. We do not write out both the original ENA entry and the expert database
+     entry for sequences that come from an expert database. We modify the
+     original ENA entry and only produce data for the expert database.
+
+  2. We determine what sequences come from expert databases via the `XREF
+     service <https://www.ebi.ac.uk/ena/browse/xref-service-rest>`_ instead of
+     parsing ``DR`` lines. We think the XREF service is more reliable than
+     ``DR`` lines, as it seems to be updated automatically and much faster,
+     while historically ``DR`` lines have not been.
+
+  3. This may change how some columns are populated. The ``locus_tag`` field
+     will now be populated with data from the ``locus`` qualifier from the EMBL
+     file. For TAIR the value in the ``locus`` qualifier was used for
+     ``optional_id`` in the database.
+
+  4. Fields in ``rnc_accessions`` which used to be the empty string are now
+     ``NULL``.
+
+  5. The data in the ``note`` and ``db_xref`` fields are now JSON data
+     structures. For ``note`` the data structure will have ``text`` and
+     ``ontology`` fields. The ``ontology`` field is a mapping from a selected
+     ontology (GO, SO, ECO) to a list of the terms that have been extracted.
+     The ``text`` field contains a list of all strings that could not be
+     recognized as ontology terms. As an example:
+
+     .. code:: json
+
+       {
+         "ontology": [
+             "ECO:0000202",
+             "GO:0030533",
+             "SO:0000253"
+         ],
+         "text": [
+             "Covariance Model: Bacteria; CM Score: 87.61",
+             "Legacy ID: chr.trna3-GlyGCC"
+         ]
+       }
+
+     The ``db_xref`` field will contain data from the ``db_xref`` qualifier of
+     the record, as well as data from the comment field. Additionally, there
+     will be a field (``ena_refs`` in the example below) which contains the
+     information from the ``DR`` lines, excluding the MD5 xref. The keys will be
+     the database name in upper case, and each value will be a list of the
+     primary id followed by the secondary id, or ``null`` if there is none.
+
+     .. code:: json
+
+       {
+         "ena_refs": {
+           "WORMBASE": ["WBGene00196009", "F19C6.9"],
+           "BIOSAMPLE": ["SAMEA3138177", null]
+         }
+       }
+
+  6. More publications are extracted. This will pull publication data from the
+     ``experiment`` qualifier. Sometimes this qualifier contains a string like
+     ``PMID: 10, 12``, in which case those numbers are extracted and then used
+     to look up the publication information from Europe PMC.
+
+  7. The ``anticodon`` field is populated more often. The parser tries to pull
+     the sequence from the ``anticodon`` qualifier, the ``gene`` qualifier, or
+     the ``note`` qualifier, using a variety of patterns.
+
+- Database mappings are generated by this pipeline instead of the webapp. There
+  is only one intentional change from the previous format: chain ids are added
+  to PDBe mappings as per `rnacentral-webcode/288
+  <https://github.com/RNAcentral/rnacentral-webcode/issues/288>`_.
+
+- Other exports are now done by this pipeline and are expected to produce the
+  same results. They are:
+
+  - MD5 mapping export
+  - FASTA export
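A minimal sketch, in Python, of how the `note` splitting (item 5) and the PMID extraction (item 6) described in these release notes might look. The function names `split_note` and `extract_pmids` and both regular expressions are assumptions made for illustration, not the pipeline's actual code. The output follows the flat `ontology` list shown in the JSON example, and the Europe PMC lookup is left out.

```python
import re

# Assumed pattern for the ontologies named in the notes (GO, SO, ECO).
ONTOLOGY_TERM = re.compile(r"\b(?:GO|SO|ECO):\d+\b")

# Assumed pattern for strings such as "PMID: 10, 12".
PMID_LIST = re.compile(r"PMID:\s*(\d+(?:\s*,\s*\d+)*)")


def split_note(notes):
    """Split raw note strings into ontology terms and leftover text."""
    ontology = []
    text = []
    for note in notes:
        ontology.extend(ONTOLOGY_TERM.findall(note))
        # Whatever remains after removing the terms is kept as free text.
        rest = ONTOLOGY_TERM.sub("", note).strip()
        if rest:
            text.append(rest)
    return {"ontology": sorted(set(ontology)), "text": text}


def extract_pmids(experiment):
    """Return every PMID found in an ``experiment`` qualifier string."""
    pmids = []
    for match in PMID_LIST.finditer(experiment):
        pmids.extend(int(p) for p in re.split(r"\s*,\s*", match.group(1)))
    return pmids
```

For example, `extract_pmids("PMID: 10, 12")` yields the two ids as integers, which a caller could then pass to whatever publication lookup it uses.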
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+WORKERS="${1:-10}"
+
+scheduler=$(bin/scheduler)
+bsub \
+  -J "/rnc/export/ftp[1-$WORKERS]" \
+  -w "started($scheduler)" \
+  python -m luigi --module tasks.export.ftp FtpExport
@@ -0,0 +1,22 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+WORKERS=${1:-80}
+
+scheduler=$(bin/scheduler)
+
+# Create all CSV's
+bsub \
+  -M 4000 \
+  -J "/rnc/process[1-$WORKERS]" \
+  -w "started($scheduler)" \
+  python -m luigi --module tasks ProcessData
+
+# Update database with CSV data
+bsub \
+  -M 4000 \
+  -J "rnc-update" \
+  -w "done(/rnc/process)" \
+  bin/update
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+error()
+{
+   echo "$@" 1>&2; exit 1
+}
+
+workers=${1:-80}
+virtual_env=${2:-import-env}
+
+activate_sh="$virtual_env/bin/activate"
+[ -e "$activate_sh" ] || error "No virtual_env to activate"
+
+module load pgsql/pgsql-95
+VIRTUAL_ENV_DISABLE_PROMPT=1 source "$activate_sh"
+export PYTHONPATH="${PYTHONPATH:-}:luigi"
+
+# Start the scheduler
+bsub \
+  -J "/rnc/scheduler" \
+  'bin/update-config && luigid'
+
+# Create all CSV's
+bsub \
+  -M 4000 \
+  -J "rnc-process[1-$workers]" \
+  -w 'started(/rnc/scheduler)' \
+  python -m luigi --module tasks ProcessData
+
+# Update database with CSV data
+bsub \
+  -M 4000 \
+  -J "rnc-update" \
+  -w "done(rnc-process)" \
+  bin/update
@@ -0,0 +1,14 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+job_is_running() {
+  # Prints 1 if a running job has this name (bjobs header seen), else 0.
+  bjobs -ru "$USER" -J "$1" | grep -c '^JOBID' || true
+}
+
+name="/rnc/scheduler"
+
+[ "$(job_is_running "$name")" -ne 0 ] || bsub -J "$name" 'bin/update-config && luigid' 1>&2
+echo "$name"
@@ -0,0 +1,13 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+WORKERS="${1:-6}"
+
+scheduler=$(bin/scheduler)
+bsub \
+  -J "/rnc/export/search[1-$WORKERS]" \
+  -w "started($scheduler)" \
+  -M 60000 \
+  python -m luigi --module tasks.export.search Search
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+python -m luigi --module tasks TruncateLoadTables --local-scheduler
+python -m luigi --module tasks Update --local-scheduler
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+IFS=$'\n\t'
+
+error()
+{
+   echo "$@" 1>&2; exit 1
+}
+
+[ -e luigi.cfg ] || error "Must have existing config to modify"
+
+tmp=$(mktemp)
+mv luigi.cfg "$tmp"
+sed "s|default-scheduler-url=.*|default-scheduler-url=http://$HOSTNAME.ebi.ac.uk:8082|" "$tmp" > luigi.cfg
+rm "$tmp"