In most cases Tanagra can query the source data directly, but for improved performance, Tanagra generates indexed tables and queries them instead. The indexer config specifies where Tanagra can write generated index tables.
However, there are a few scenarios where indexing is strictly required, such as calculating the ancestors of every item in a hierarchy from the parent-child input data. These steps use Dataflow because they cannot reasonably be expressed as SQL. Most other indexing steps exist for performance reasons, though some of those (e.g. calculating rollup counts) would be slow enough without indexing to be completely unusable.
Another consideration is that performance often correlates directly with cost, because indexing simplifies queries and allows the BQ tables to be optimized (e.g. clustering on common columns).
Generating index tables is part of the deployment process; it is not managed by the service. There is a basic command-line interface to run the indexing jobs. Currently, this CLI just uses Gradle's application plugin, so the commands are actually Gradle commands.
Before running the indexing jobs, you need to specify the data mapping and indexer config files.
There are 4 steps to generating the index tables:
- Set up credentials with read permissions on the source data, and read-write permissions on the index data.
- Create the index dataset.
- Build the indexer code.
- Run the jobs.
Below you can see an example of the commands for an OMOP dataset.
You need to set up 2 types of credentials to run all parts of indexing:
- Default application credentials to allow Tanagra to talk to BigQuery.
- gcloud credentials to allow you to create a new BigQuery dataset using the bq CLI.
Set the default application credentials to an account that has read permissions on the source data, and read-write permissions on the index data. Best practice is to use a service account that has indexing-specific permissions, but for debugging, end-user credentials can be very useful.
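For example, here is a minimal sketch of granting indexing-specific permissions to a service account. The project and service account names are placeholders, and the exact roles you need depend on your setup (Dataflow jobs typically require additional permissions beyond these):
# Illustrative only: read access on the source data project, read-write access on the index data project.
gcloud projects add-iam-policy-binding source-project-id --member=serviceAccount:indexer-sa@my-project.iam.gserviceaccount.com --role=roles/bigquery.dataViewer
gcloud projects add-iam-policy-binding index-project-id --member=serviceAccount:indexer-sa@my-project.iam.gserviceaccount.com --role=roles/bigquery.dataEditor
gcloud projects add-iam-policy-binding index-project-id --member=serviceAccount:indexer-sa@my-project.iam.gserviceaccount.com --role=roles/bigquery.jobUser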
To use a service account key file, you can set an environment variable:
export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/rendered/tanagra_sa.json
To use end-user credentials, you can run:
gcloud auth application-default login
More information and other ways to set up application default credentials can be found in the GCP docs.
To use a service account key file:
gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS
To use end-user credentials:
gcloud auth login
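As an optional sanity check, you can confirm which gcloud account is active before creating the dataset:
gcloud auth list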
Create a new BigQuery dataset to hold the indexed tables. Change the location, project, and dataset name below to your own.
bq mk --location=location project_id:dataset_id
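To confirm the dataset was created (using the same placeholder names as above):
bq show --format=prettyjson project_id:dataset_id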
Build the indexer code once.
source indexer/tools/local-dev.sh
If you make a change to any config file, there is no need to re-build. If you make a change to any Java code, you should re-build.
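To re-build after a Java change, just re-source the same script:
source indexer/tools/local-dev.sh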
Documentation for the indexing commands is generated from annotations in the Java classes as a manpage. You can also see usage help for the commands by omitting arguments, e.g.:
> tanagra
Tanagra command-line interface.
Usage: tanagra [COMMAND]
Commands:
index Commands to run indexing.
clean Commands to clean up indexing outputs.
Exit codes:
0 Successful program execution
1 User-actionable error (e.g. missing parameter)
2 System or internal error (e.g. validation error from within the Tanagra
code)
3 Unexpected error (e.g. Java exception)
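Per the note above, omitting arguments prints usage help for the subcommands as well, e.g.:
> tanagra index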
Do a dry run first for validation. This provides a sanity check that the indexing job inputs, especially the SQL query inputs, are valid. This step is not required, but it is highly recommended to help catch errors and bugs sooner, without first running a lot of computation. The dry run also validates that the attribute data types specified match those returned by running the SQL query.
tanagra index underlay --indexer-config=cmssynpuf_verily --dry-run
Kick off all jobs for the underlay.
tanagra index underlay --indexer-config=cmssynpuf_verily
This can take a long time to complete. If, for example, your computer goes to sleep or you need to kill the process, you can re-run the same command. Before kicking it off again, check that there are no in-progress Dataflow jobs in the project: the jobs decide whether they need to run by checking for the existence of the output BQ table, not by checking for in-progress Dataflow jobs.
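One way to check for in-progress Dataflow jobs is with the gcloud CLI. The project and region here are taken from the example later in this page and are assumptions; substitute your own:
gcloud dataflow jobs list --project=verily-tanagra-dev --region=us-central1 --status=active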
You can also kick off the jobs for a single entity or entity group. This is helpful for testing and debugging. To kick off all the jobs for a single entity:
tanagra index entity --names=person --indexer-config=cmssynpuf_verily --dry-run
tanagra index entity --names=person --indexer-config=cmssynpuf_verily
all entities:
tanagra index entity --all --indexer-config=cmssynpuf_verily --dry-run
tanagra index entity --all --indexer-config=cmssynpuf_verily
a single entity group:
tanagra index group --names=conditionPerson --indexer-config=cmssynpuf_verily --dry-run
tanagra index group --names=conditionPerson --indexer-config=cmssynpuf_verily
all entity groups:
tanagra index group --all --indexer-config=cmssynpuf_verily --dry-run
tanagra index group --all --indexer-config=cmssynpuf_verily
All the entities in a group must be indexed before the group. The tanagra index underlay command ensures this ordering, but keep this in mind if you're running the jobs for each entity or entity group separately.
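For example, a manual ordering when indexing a single entity group might look like this (the condition entity name is an assumption; use the member entities of your own group):
tanagra index entity --names=person --indexer-config=cmssynpuf_verily
tanagra index entity --names=condition --indexer-config=cmssynpuf_verily
tanagra index group --names=conditionPerson --indexer-config=cmssynpuf_verily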
By default, the indexing jobs are run concurrently as much as possible. You can force it to run jobs serially by overriding the default job executor:
tanagra index underlay --indexer-config=cmssynpuf_verily --job-executor=SERIAL
Indexing jobs will not overwrite existing index tables. If you want to re-run indexing, either for a single entity/group or for everything, you need to delete any existing index tables. You can either do that manually or use the clean commands below. Similar to the indexing commands, the clean commands also allow dry runs.
To clean the generated index tables for everything:
tanagra clean underlay --indexer-config=cmssynpuf_verily --dry-run
tanagra clean underlay --indexer-config=cmssynpuf_verily
a single entity:
tanagra clean entity --names=person --indexer-config=cmssynpuf_verily --dry-run
tanagra clean entity --names=person --indexer-config=cmssynpuf_verily
a single entity group:
tanagra clean group --names=conditionPerson --indexer-config=cmssynpuf_verily --dry-run
tanagra clean group --names=conditionPerson --indexer-config=cmssynpuf_verily
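Putting those together, a typical re-index cycle for a single entity is to clean its existing index table and then re-run the indexing job:
tanagra clean entity --names=person --indexer-config=cmssynpuf_verily
tanagra index entity --names=person --indexer-config=cmssynpuf_verily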
While developing a job, running locally is faster. It also lets you use the IntelliJ debugger.
Add to BigQueryIndexingJob.buildDataflowPipelineOptions():
import org.apache.beam.runners.direct.DirectRunner;

// Run the pipeline with the local direct runner instead of on Dataflow.
dataflowOptions.setRunner(DirectRunner.class);
dataflowOptions.setTempLocation("gs://dataflow-staging-us-central1-694046000181/temp");
Filter your queries on one person, e.g.:
.where(
    new BinaryFilterVariable(
        idFieldVar,
        BinaryOperator.EQUALS,
        new Literal.Builder().dataType(DataType.INT64).int64Val(1107050).build()))
The cmssynpuf underlay is a data mapping for a public dataset that uses the OMOP schema. You can see the top-level underlay config file for this dataset here.
bq mk --location=US verily-tanagra-dev:cmssynpuf_index
tanagra index underlay --indexer-config=cmssynpuf_verily --dry-run
tanagra index underlay --indexer-config=cmssynpuf_verily