3742 create data backend #64

Open · wants to merge 15 commits into base: otter
85 changes: 85 additions & 0 deletions README.md
@@ -0,0 +1,85 @@
# POS — Open Targets Pipeline Output Stage

Creates the Open Targets Platform backend (OpenSearch and Clickhouse) and releases the data.


## Summary

This application uses the [Otter](http://github.com/opentargets/otter) library to
define the pipeline steps that consume the output of the data generation pipeline
and build the backend for the Open Targets Platform. There are also steps for
releasing data to BigQuery.

Check out the [config.yaml](config/config.yaml) file to see the steps and the tasks that
make them up.

TODO:
- [X] croissant
- [X] prep data for loading
- [X] load Clickhouse
- [X] load OpenSearch
- [X] create google disk snapshots for ch and os
- [X] create data tarballs
- [X] load BigQuery
- [ ] GCS release
- [ ] FTP release

## Installation and running

POS uses [uv](https://docs.astral.sh/uv/) as its package manager. The project is
pip-compatible, so you can fall back to pip if you feel more comfortable.

To see the available commands and options, run:


```bash
uv run pos -h
```

### Configuration

All configuration lives in the [config](config) folder, which contains the following:

- Main Otter config: [config.yaml](config/config.yaml)
- Dataset configuration (data sources, table names, settings) for Clickhouse, OpenSearch and BigQuery: [datasets.yaml](config/datasets.yaml)
- Clickhouse configuration, schemas and SQL: [clickhouse](config/clickhouse/)
- OpenSearch Dockerfile and index settings: [opensearch](config/opensearch/)

By default, it is configured to load all the necessary datasets, but this can be modified. If you do, make sure that every dataset name referenced in config.yaml has a corresponding entry in datasets.yaml.
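As a rough sketch of how the two files relate: a step in config.yaml references a dataset that must be described in datasets.yaml. The keys below are illustrative assumptions, not the real schema (which is defined by Otter and the linked files); only the table name `ot.literature_log` comes from this repository's schema files.

```yaml
# config/config.yaml (hypothetical fragment)
steps:
  clickhouse_load_all:
    - name: load literature
      dataset: literature          # must have a matching entry in datasets.yaml

# config/datasets.yaml (hypothetical fragment)
literature:
  clickhouse_table: ot.literature_log   # created by config/clickhouse/schema/literature_log.sql
```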

### Create the OT Platform backend
1. Start a Google VM and clone this repo (see installation above).
   1. Ideally something like an `n2-highmem-96`; reserve half the memory for the JVM.
   2. Attach an external disk for OpenSearch.
   3. Attach an external disk for Clickhouse.
2. OpenSearch (each step must complete before starting the next):
   1. `uv run pos -p 300 -c config/config.yaml -s open_search_prep_all`
   2. `uv run pos -p 100 -c config/config.yaml -s open_search_load_all`
   3. `uv run pos -c config/config.yaml -s open_search_stop`
   4. `uv run pos -c config/config.yaml -s open_search_disk_snapshot`
   5. `uv run pos -c config/config.yaml -s open_search_tarball`
3. Clickhouse (each step must complete before starting the next):
   1. `uv run pos -c config/config.yaml -s clickhouse_load_all`
   2. `uv run pos -c config/config.yaml -s clickhouse_stop`
   3. `uv run pos -c config/config.yaml -s clickhouse_disk_snapshot`
   4. `uv run pos -c config/config.yaml -s clickhouse_tarball`
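The "each step must complete before the next" constraint can be expressed as a small wrapper script. This is a sketch, not part of this repository; `run_steps` is a hypothetical helper, and the defaults match the documented `uv run pos` invocation.

```shell
# Sketch: run pipeline steps strictly in order, stopping at the first failure.
# POS_CMD and CONFIG are overridable (e.g. POS_CMD="echo pos" for a dry run).
set -euo pipefail

run_steps() {
  local step
  for step in "$@"; do
    echo ">>> ${step}"
    ${POS_CMD:-uv run pos} -c "${CONFIG:-config/config.yaml}" -s "${step}"
  done
}

# Usage, once the VM and disks are in place:
#   run_steps clickhouse_load_all clickhouse_stop clickhouse_disk_snapshot clickhouse_tarball
```

Because of `set -e`, a failing step aborts the chain instead of letting a later step run against incomplete data.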



## Copyright

Copyright 2014-2025 EMBL - European Bioinformatics Institute, Genentech, GSK,
MSD, Pfizer, Sanofi and Wellcome Sanger Institute

This software was developed as part of the Open Targets project. For more
information please see: http://www.opentargets.org

Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the
License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
38 changes: 0 additions & 38 deletions config.yaml

This file was deleted.

12 changes: 12 additions & 0 deletions config/clickhouse/Dockerfile
@@ -0,0 +1,12 @@
ARG TAG=latest

FROM clickhouse/clickhouse-server:${TAG}

ARG UID=1000
ARG GID=1000

USER root

RUN chown -R ${UID}:${GID} /var/lib/clickhouse /var/log/clickhouse-server /etc/clickhouse-server /docker-entrypoint-initdb.d

USER ${UID}:${GID}
43 changes: 43 additions & 0 deletions config/clickhouse/config.d/config.xml
@@ -0,0 +1,43 @@
<?xml version="1.0"?>
<yandex>
<logger>
<level>warning</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>1000M</size>
<count>7</count>
</logger>
<display_name>ot-genetics</display_name>
<http_port>8123</http_port>
<tcp_port>9000</tcp_port>
<interserver_http_port>9009</interserver_http_port>
<listen_host>0.0.0.0</listen_host>
<listen_try>0</listen_try>
<listen_reuse_port>1</listen_reuse_port>
<listen_backlog>256</listen_backlog>
<max_connections>2048</max_connections>
<keep_alive_timeout>60</keep_alive_timeout>
<max_concurrent_queries>256</max_concurrent_queries>
<max_open_files>262144</max_open_files>
<uncompressed_cache_size>17179869184</uncompressed_cache_size>
<mark_cache_size>17179869184</mark_cache_size>
<!-- Path to data directory, with trailing slash. -->
<path>/var/lib/clickhouse/</path>
<tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
<user_files_path>/var/lib/clickhouse/user_files/</user_files_path>
<users_config>users.xml</users_config>
<default_profile>default</default_profile>
<!-- <system_profile>default</system_profile> -->
<default_database>default</default_database>
<umask>022</umask>
<zookeeper incl="zookeeper-servers" optional="true" />
<macros incl="macros" optional="true" />
<dictionaries_config>*_dictionary.xml</dictionaries_config>
<builtin_dictionaries_reload_interval>3600</builtin_dictionaries_reload_interval>
<max_session_timeout>3600</max_session_timeout>
<default_session_timeout>60</default_session_timeout>
<distributed_ddl>
<path>/clickhouse/task_queue/ddl</path>
</distributed_ddl>
<format_schema_path>/var/lib/clickhouse/format_schemas/</format_schema_path>
</yandex>
12 changes: 12 additions & 0 deletions config/clickhouse/schema/aotf_log.sql
@@ -0,0 +1,12 @@
create database if not exists ot;

create table if not exists ot.associations_otf_log (
row_id String,
disease_id String,
target_id String,
disease_data Nullable (String),
target_data Nullable (String),
datasource_id String,
datatype_id String,
row_score Float64
) engine = Log;
13 changes: 13 additions & 0 deletions config/clickhouse/schema/literature_log.sql
@@ -0,0 +1,13 @@
create database if not exists ot;

create table if not exists ot.literature_log (
pmid String,
pmcid Nullable (String),
date Date,
year UInt16,
month UInt8,
day UInt8,
keywordId String,
relevance Float64,
keywordType FixedString (2)
) engine = Log;
14 changes: 14 additions & 0 deletions config/clickhouse/schema/sentences_log.sql
@@ -0,0 +1,14 @@
create database if not exists ot;

create table if not exists ot.sentences_log (
pmid String,
pmcid Nullable (String),
section String,
endInSentence UInt16,
label Nullable (String),
sectionEnd UInt16,
sectionStart UInt16,
startInSentence UInt16,
keywordType FixedString (2),
keywordId String
) engine = Log;
8 changes: 8 additions & 0 deletions config/clickhouse/schema/w2v_log.sql
@@ -0,0 +1,8 @@
create database if not exists ot;

create table if not exists ot.ml_w2v_log (
category String,
word String,
norm Float64,
vector Array(Float64)
) engine = Log;
55 changes: 55 additions & 0 deletions config/clickhouse/scripts/aotf.sql
@@ -0,0 +1,55 @@
create table if not exists ot.associations_otf_disease engine = MergeTree ()
order by
(A, B, datasource_id) primary key (A) as
select
row_id,
A,
B,
datatype_id,
datasource_id,
row_score,
A_search,
B_search
from
(
select
row_id,
disease_id as A,
target_id as B,
datatype_id,
datasource_id,
row_score,
lower(disease_data) as A_search,
lower(target_data) as B_search
from
ot.associations_otf_log
);

create table if not exists ot.associations_otf_target engine = MergeTree ()
order by
(A, B, datasource_id) primary key (A) as
select
row_id,
A,
B,
datatype_id,
datasource_id,
row_score,
A_search,
B_search
from
(
select
row_id,
disease_id as B,
target_id as A,
datatype_id,
datasource_id,
row_score,
lower(disease_data) as B_search,
lower(target_data) as A_search
from
ot.associations_otf_log
);

drop table ot.associations_otf_log;
36 changes: 36 additions & 0 deletions config/clickhouse/scripts/literature.sql
@@ -0,0 +1,36 @@
-- cat part-00* | clickhouse-client -h localhost --query="insert into ot.literature_log format JSONEachRow "
create database if not exists ot;

create table if not exists ot.literature_index engine = MergeTree ()
order by
(keywordId, SHA512 (pmid), year, month, day) as (
select
pmid,
pmcid,
keywordId,
relevance,
date,
year,
month,
day
from
ot.literature_log
);

create table if not exists ot.literature engine = MergeTree ()
order by
(SHA512 (pmid)) as (
select
pmid,
any (pmcid) as pmcid,
any (date) as date,
any (year) as year,
any (month) as month,
any (day) as day
from
ot.literature_log
group by
pmid
);

drop table ot.literature_log;
20 changes: 20 additions & 0 deletions config/clickhouse/scripts/sentences.sql
@@ -0,0 +1,20 @@
create database if not exists ot;

create table if not exists ot.sentences engine = MergeTree ()
order by
(keywordId, SHA512 (pmid)) as (
select
pmid,
section,
label,
sectionEnd,
sectionStart,
startInSentence,
endInSentence,
keywordType,
keywordId
from
ot.sentences_log
);

drop table ot.sentences_log;
20 changes: 20 additions & 0 deletions config/clickhouse/scripts/w2v.sql
@@ -0,0 +1,20 @@
create table if not exists ot.ml_w2v engine = MergeTree ()
order by
(word) primary key (word) as
select
category,
word,
norm,
vector
from
(
select
category,
word,
norm,
vector
from
ot.ml_w2v_log
);

drop table ot.ml_w2v_log;