Commit a77f9bb

Merge pull request #29 from RNAcentral/release-9
Release 9
2 parents 91d1474 + 4e32c64 commit a77f9bb

184 files changed

+43723
-28968
lines changed

RELEASE.rst

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
Release Notes
=============

This is a summary of changes to the internals of the database and how we
process data. The format is based on `Keep a Changelog
<http://keepachangelog.com/en/1.0.0/>`_. reStructuredText is used because
it provides more structure to the text.

Release 9
---------

Added
`````

- Imported RGD data.

  This import may be a one-off task. One issue is that RGD does not
  provide sequences on their FTP site. The sequences can be extracted
  from the GFF files and chromosomes they provide; however, I didn't want to add
  that complexity. I asked them to provide sequences on an ongoing basis, but
  they may have considered it a one-off task. We will have to follow up with
  them on this for long-term updates.

- Created Rfam annotation export. This is now officially part of our FTP export.

- Added ``rnc_genomic_mapping`` table. This table stores the inferred locations
  for particular sequences.

- Added a ``has_coordinates`` column to ``rnc_rna_precomputed``. This column is
  meant to reflect whether a UPI/taxid pair has any known genomic locations. It
  summarizes whether there are any entries in ``rnc_coordinates`` and
  ``rnc_genomic_mapping``. It defaults to ``false`` and isn't currently updated
  for entries where taxid is null.
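
As a rough illustration of the ``has_coordinates`` summary described above, the sketch below computes the flag in memory. It is not the pipeline's implementation. The function name, the use of Python sets of ``(upi, taxid)`` pairs to stand in for ``rnc_coordinates`` and ``rnc_genomic_mapping``, the made-up identifiers, and the choice to treat a pair as located when it appears in either table are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Pair = Tuple[str, Optional[int]]  # (UPI, taxid); taxid may be missing


def summarize_has_coordinates(
    pairs: Iterable[Pair],
    coordinates: Set[Pair],
    genomic_mapping: Set[Pair],
) -> Dict[Pair, bool]:
    """Return a has_coordinates flag for each UPI/taxid pair.

    Assumption: a pair counts as located if it appears in either input set.
    Pairs without a taxid keep the default of False, mirroring the note that
    such entries are not currently updated.
    """
    located = coordinates | genomic_mapping
    flags: Dict[Pair, bool] = {}
    for upi, taxid in pairs:
        if taxid is None:
            flags[(upi, taxid)] = False  # default value, never updated
        else:
            flags[(upi, taxid)] = (upi, taxid) in located
    return flags


# Hypothetical identifiers, used only to exercise the function.
example = summarize_has_coordinates(
    pairs=[("URS0000000001", 9606), ("URS0000000002", 10090), ("URS0000000003", None)],
    coordinates={("URS0000000001", 9606)},
    genomic_mapping={("URS0000000003", None)},
)
print(example)
```

In this sketch the third pair stays ``False`` even though it appears in the mapping set, because its taxid is null.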

Changed
```````

- GENCODE accessions are now namespaced, with ``GENCODE:`` prefixed to the
  corresponding Ensembl id to produce the GENCODE accession.

- All ENA parsing is done in Python. This has several effects:

  1. We do not write out both the original ENA entry and the expert database
     entry for sequences that come from an expert database. We modify the
     original ENA entry and only produce data for the expert database.

  2. We determine which sequences come from expert databases via the `XREF
     service <https://www.ebi.ac.uk/ena/browse/xref-service-rest>`_ instead of
     parsing ``DR`` lines. We think the XREF service is more reliable than ``DR``
     lines, as it seems to be updated automatically and much faster, while
     historically ``DR`` lines have not been.

  3. This may change how some columns are populated. The ``locus_tag`` field
     will now be populated with data from the ``locus`` qualifier from the EMBL
     file. For TAIR the value in the ``locus`` qualifier was used for
     ``optional_id`` in the database.

  4. Fields in ``rnc_accessions`` which used to be the empty string are now
     ``NULL``.

  5. The data in ``note`` and ``db_xref`` are now JSON data structures. For
     ``note`` the data structure will have ``text`` and ``ontology`` fields. The
     ``ontology`` field is a mapping from a selected ontology (GO, SO, ECO) to a
     list of the terms that have been extracted. The ``text`` field contains a
     list of all strings that could not be recognized as coming from an
     ontology. As an example:

     .. code:: json

        {
          "ontology": [
            "ECO:0000202",
            "GO:0030533",
            "SO:0000253"
          ],
          "text": [
            "Covariance Model: Bacteria; CM Score: 87.61",
            "Legacy ID: chr.trna3-GlyGCC"
          ]
        }

     The ``db_xref`` field will contain data from the ``db_xref`` qualifier of
     the record, as well as data from the comment field. Additionally, there
     will be an ``ena_refs`` field which contains the information from the ``DR``
     lines, excluding the MD5 xref. The keys will be the database in upper case,
     and the values will be a list of the primary id and the secondary id, which
     will be ``null`` if the database has no secondary id.

     .. code:: json

        {
          "ena_refs": {
            "WORMBASE": ["WBGene00196009", "F19C6.9"],
            "BIOSAMPLE": ["SAMEA3138177", null]
          }
        }

  6. More publications are extracted. This will pull publication data from the
     ``experiment`` qualifier. Sometimes this qualifier contains a string like
     ``PMID: 10, 12``, in which case those numbers are extracted and then used to
     look up the publication information from Europe PMC.

  7. The ``anticodon`` field is populated more often. This will try to pull the
     sequence from the ``anticodon`` qualifier, or the ``gene`` qualifier, or
     from the ``note`` qualifier, using a variety of patterns.
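
To make the ``note`` structure from item 5 concrete, here is a minimal sketch of how raw note strings could be split into ontology terms and free text. It is an illustration only, not the parser used by the pipeline. The function name, the regular expression (seven-digit GO, SO and ECO ids), the flat sorted list used for ``ontology`` (chosen to match the JSON example rather than the mapping described in the prose), and the input strings are all assumptions.

```python
import json
import re

# Assumption: ontology terms look like GO:0030533, SO:0000253 or ECO:0000202.
ONTOLOGY_TERM = re.compile(r"\b(?:GO|SO|ECO):\d{7}\b")


def parse_note(raw_notes):
    """Split raw note strings into ontology terms and leftover text."""
    terms = set()
    text = []
    for raw in raw_notes:
        terms.update(ONTOLOGY_TERM.findall(raw))
        # Whatever remains after removing the terms is kept as free text.
        leftover = ONTOLOGY_TERM.sub("", raw).strip(" ;,")
        if leftover:
            text.append(leftover)
    return {"ontology": sorted(terms), "text": text}


# Hypothetical raw notes, shaped to reproduce the JSON example above.
note = parse_note([
    "ECO:0000202 GO:0030533 SO:0000253",
    "Covariance Model: Bacteria; CM Score: 87.61",
    "Legacy ID: chr.trna3-GlyGCC",
])
print(json.dumps(note, indent=2))
```

A real parser would probably need to handle notes that mix terms and text in one string more carefully than the simple strip used here.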

- Database mappings are generated by this pipeline instead of the webapp. There
  is only one intentional change from the previous format: chain ids are added
  to PDBe mappings, as per `rnacentral-webcode/288
  <https://github.com/RNAcentral/rnacentral-webcode/issues/288>`_.

- Other exports are now done by this pipeline and are expected to produce the
  same results. They are:

  - MD5 mapping export
  - FASTA export
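
The publication extraction described in item 6 of the ENA parsing changes can be sketched as follows. This is an assumption-laden example rather than the pipeline's code: the function name and the regular expression are invented here, and the Europe PMC lookup itself is left out because it needs network access.

```python
import re

# Assumption: ids appear as "PMID:" followed by comma-separated integers.
PMID_LIST = re.compile(r"PMID:\s*(\d+(?:\s*,\s*\d+)*)")


def extract_pmids(experiment):
    """Return the PubMed ids found in an ``experiment`` qualifier value."""
    pmids = []
    for match in PMID_LIST.finditer(experiment):
        for number in match.group(1).split(","):
            pmid = int(number)  # int() ignores surrounding whitespace
            if pmid not in pmids:  # keep first occurrence, preserve order
                pmids.append(pmid)
    return pmids


print(extract_pmids("PMID: 10, 12"))
```

Each id returned could then be passed to whatever client performs the Europe PMC lookup.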

bin/ftp-export

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

WORKERS="${1:-10}"  # number of array workers; defaults to 10

scheduler=$(bin/scheduler)
bsub \
  -J "/rnc/export/ftp[1-$WORKERS]" \
  -w "started($scheduler)" \
  python -m luigi --module tasks.export.ftp FtpExport

bin/import-data

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

WORKERS=${1:-80}

scheduler=$(bin/scheduler)

# Create all CSVs
bsub \
  -M 4000 \
  -J "/rnc/process[1-$WORKERS]" \
  -w "started($scheduler)" \
  python -m luigi --module tasks ProcessData

# Update database with CSV data
bsub \
  -M 4000 \
  -J "rnc-update" \
  -w "done(/rnc/process)" \
  bin/update

bin/pipeline

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

error()
{
  echo "$@" 1>&2 && exit 1  # report on stderr and exit non-zero
}

workers=${1:-80}
virtual_env=${2:-import-env}

activate_sh="$virtual_env/bin/activate"
[ -e "$activate_sh" ] || error "No virtual_env to activate"

module load pgsql/pgsql-95
VIRTUAL_ENV_DISABLE_PROMPT=1 source "$activate_sh"
export PYTHONPATH="${PYTHONPATH:-}:luigi"

# Start the scheduler
bsub \
  -J "/rnc/scheduler" \
  'bin/update-config && luigid'

# Create all CSVs
bsub \
  -M 4000 \
  -J "rnc-process[1-$workers]" \
  -w 'started(/rnc/scheduler)' \
  python -m luigi --module tasks ProcessData

# Update database with CSV data
bsub \
  -M 4000 \
  -J "rnc-update" \
  -w "done(rnc-process)" \
  bin/update

bin/scheduler

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

running_job_count() {
  name="$1"
  bjobs -ru "$USER" -J "$name" 2>/dev/null | grep -c '^JOBID' || true
}

name="/rnc/scheduler"

[ "$(running_job_count "$name")" -gt 0 ] || bsub -J "$name" 'bin/update-config && luigid' 1>&2
echo "$name"
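
The check in ``bin/scheduler`` depends on ``bjobs`` printing a header line that starts with ``JOBID`` only when at least one job matches. The sketch below restates that test in Python on captured output. The function name and the sample output are made up for illustration, and the exact column layout of ``bjobs`` is an assumption.

```python
def job_is_listed(bjobs_output):
    """Return True if the captured bjobs output contains a job table header."""
    return any(line.startswith("JOBID") for line in bjobs_output.splitlines())


# Hypothetical output for a running scheduler job.
listed = (
    "JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME\n"
    "1234    rnc     RUN   research   host-a      host-b      /rnc/scheduler Jan  1 00:00\n"
)

print(job_is_listed(listed))
print(job_is_listed(""))
```

With this reading, the scheduler should be submitted only when the function returns ``False``.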

bin/search-export

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

WORKERS="${1:-6}"  # number of array workers; defaults to 6

scheduler=$(bin/scheduler)
bsub \
  -J "/rnc/export/search[1-$WORKERS]" \
  -w "started($scheduler)" \
  -M 60000 \
  python -m luigi --module tasks.export.search Search

bin/update

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

python -m luigi --module tasks TruncateLoadTables --local-scheduler
python -m luigi --module tasks Update --local-scheduler

bin/update-config

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
#!/usr/bin/env bash

set -euo pipefail
IFS=$'\n\t'

error()
{
  echo "$@" 1>&2 && exit 1  # report on stderr and exit non-zero
}

[ -e luigi.cfg ] || error "Must have existing config to modify"

tmp=$(mktemp)
mv luigi.cfg "$tmp"
sed "s|default-scheduler-url=.*|default-scheduler-url=http://$HOSTNAME.ebi.ac.uk:8082|" "$tmp" > luigi.cfg
rm "$tmp"
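
For readers less familiar with ``sed``, the substitution in ``bin/update-config`` can be sketched in Python as a pure function over the config text. The function name, the sample config and the hostname below are invented for the example. The script itself rewrites ``luigi.cfg`` on disk rather than working on a string.

```python
import re


def set_scheduler_url(config_text, hostname, port=8082):
    """Point every default-scheduler-url line at the given host."""
    url = "http://{}.ebi.ac.uk:{}".format(hostname, port)
    # '.' does not match newlines, so only the rest of the line is replaced.
    return re.sub(
        r"default-scheduler-url=.*",
        lambda _match: "default-scheduler-url=" + url,
        config_text,
    )


# Hypothetical config contents and hostname.
config = "[core]\ndefault-scheduler-url=http://old-host:8082\nother=1\n"
updated = set_scheduler_url(config, "node-01")
print(updated)
```

Other lines in the config are left untouched, which matches the behaviour of the ``sed`` expression.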
File renamed without changes.
File renamed without changes.
