dbt-impala-example

This repo provides an example project for the dbt-impala adapter for dbt.

dbt_impala_demo

This directory is a dbt project.

util

This directory contains some utilities for generating fake data.

Getting started

We recommend running this demo in Cloudera Machine Learning (CML) on CDP.

Requirements

dbt >= 1.0.1

dbt-impala >= 1.0.1

impyla

Install

Start by creating a new project in CML.

Use the following git repository:

https://github.com/gregkosi/dbt-impala-example.git

Next, install the requirements from your home folder (/home/cdsw/) by running one of the following:

Option 1: pip3 install -r requirements.txt

Option 2: ./cdsw-build.sh

Configure

Edit your profile in ~/.dbt/profiles.yml

For a Cloudera Data Platform cluster (CDW or DataHub), it should look like this:

dbt_impala_demo:
  outputs:
    dev:
      type: impala
      host: coordinator-default-impala.dw-go02-demo-aws.ylcu-atmi.cloudera.site
      port: 443
      dbname: xx_dbt_demo  # change "xx" to your initials, lower case
      schema: xx_dbt_demo  # change "xx" to your initials, lower case
      user: xx             # your CDP username
      password: xx         # your CDP workload password
      auth_type: ldap
      use_http_transport: true
      use_ssl: true
      http_path: cliservice
  target: dev

Test the profile with dbt debug

Generate fake raw data

To generate fake data, we must first add our Impala details to util/data_gen/write_data.py

Modify the following section to reflect your environment:

impala_conf = {
    'host': 'coordinator-default-impala.dw-go02-demo-aws.ylcu-atmi.cloudera.site',
    'port': '443',
    'user': 'xx',      # change "xx" to your CDP username
    'password': 'xx',  # change "xx" to your CDP workload password
    'auth_mechanism': 'ldap',
    'use_ssl': True,
    'use_http_transport': True,
    'http_path': 'cliservice'
}

# Add initials
initials = 'xx'  # change "xx" to your initials, lower case

Modify a test in dbt_impala_demo/models/raw/covid/raw_covid.yml

sources:
  - name: raw_covid
    schema: xx_dbt_demo_raw_covid  # change "xx" to your initials, lower case
    tables:
      - name: raw_covid__vaccines
      - name: raw_covid__cases
        tests:
          - not_null
          - length:
              len: 2

Next, move to the util/data_gen directory

cd util/data_gen

Run the start.sh helper to generate the first set of fake data

./start.sh 1

This generates 2 days of fake data for the dates 01/01/2022 and 02/01/2022 and writes it to Impala. It will create a database xx_dbt_demo_raw_covid (where xx are your initials) with 2 tables.
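As a sketch of how the batch number maps to dates, each run of start.sh <n> covers the next block of two calendar days (batch_dates is a hypothetical helper written for illustration, not part of the repo):

```python
from datetime import date, timedelta

def batch_dates(batch, days_per_batch=2, start=date(2022, 1, 1)):
    """Hypothetical helper mirroring how each run of start.sh <n>
    produces the next block of days of fake data."""
    first = start + timedelta(days=(batch - 1) * days_per_batch)
    return [first + timedelta(days=i) for i in range(days_per_batch)]

# Batch 1 covers 2022-01-01 and 2022-01-02; batch 2 covers 2022-01-03 and 2022-01-04.
```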

Using dbt

With our fake data loaded, we can start using dbt.

dbt seed

First, run the Seeds to load some reference data. Two Seeds are included in the demo, populations and country_codes.

Move to the dbt_impala_demo dbt project directory.

Run the seeds with

dbt seed
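A dbt Seed is simply a CSV file in the project that dbt loads into the warehouse as a table. As an illustration only (not the actual file contents of this repo), a country-codes seed might look like:

```csv
country,alpha2,alpha3
Germany,DE,DEU
France,FR,FRA
```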

dbt test

Our Seeds are configured with a couple of Tests.

We also have a custom test in dbt_impala_demo/tests/generic/test_length.sql, which checks the character length of a column.

Our reference data has columns that include ISO Alpha2 and Alpha3 country codes - we know that these values should always be 2 or 3 characters long, respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results.

We expect that Alpha2 and Alpha3 columns are the correct length, and that no fields should be null.
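A generic length test of this kind typically looks like the following (a sketch of the usual dbt generic-test pattern; the actual tests/generic/test_length.sql in this repo may differ):

```sql
-- Sketch of a dbt generic test: it selects the rows that FAIL the check
{% test length(model, column_name, len) %}

select {{ column_name }}
from {{ model }}
where length({{ column_name }}) != {{ len }}

{% endtest %}
```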

Run the tests with

dbt test

dbt run

We have 3 sets of models in this demo project.

Firstly, we have raw. Our raw models make use of Sources. This is data that already exists in our database that dbt needs to refer to. This is the fake data we loaded earlier.

Our raw models are defined in models/raw/covid/.

Next, we have staging. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate views created over our raw data to handle some basic type conversion. These are materialized as views, and we don't expect our end users to query the staging models.

Our staging models are defined in models/staging/covid
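A staging model of this shape typically selects from the source and handles the type casting, e.g. (a sketch; the column names are assumptions, not the repo's actual code):

```sql
-- Sketch of a staging view over the raw source (hypothetical columns)
select
    cast(date_rep as timestamp) as date_rep,
    cast(cases as bigint)       as cases
from {{ source('raw_covid', 'raw_covid__cases') }}
```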

Lastly, we have mart. These are Models. Our mart models use the ref() method to refer to the staging models and reference seeds we created using dbt. We use the staging views to handle most of the logic for type casting, but we do some renaming here to make our models easier for users to understand. These models are materialized as tables, as this gives greater performance for user queries. We can use incremental models to make the building of the model more performant.

Our mart models are defined in models/mart/covid

Run the models with

dbt run

Review the data

You should now have the following databases & tables:

  • reference (database)
    • ref__population (table)
    • ref__country_codes (table)
  • raw_covid (database)
    • raw_covid__vaccines (table)
    • raw_covid__cases (table)
  • staging_covid (database)
    • stg_covid__cases (view)
  • mart_covid (database)
    • covid_cases (table)

In the raw, staging and mart tables, you should see 2 days' worth of data for the dates 01/01/2022 and 02/01/2022.

Useful SQL queries:

---- Raw data ----
select * from xx_dbt_demo_raw_covid.raw_covid__cases;
select * from xx_dbt_demo_raw_covid.raw_covid__vaccines;
select date_rep from xx_dbt_demo_raw_covid.raw_covid__cases group by date_rep;
select year_week_iso from xx_dbt_demo_raw_covid.raw_covid__vaccines group by year_week_iso;

---- staging Data ----
select * from xx_dbt_demo_staging_covid.stg_covid__cases;
select date_rep from xx_dbt_demo_staging_covid.stg_covid__cases group by date_rep;

---- Mart data ----
select * from xx_dbt_demo_mart_covid.covid_cases;
select report_date from xx_dbt_demo_mart_covid.covid_cases group by report_date;
show create table xx_dbt_demo_mart_covid.covid_cases;
show partitions xx_dbt_demo_mart_covid.covid_cases;

Generate more fake data

To demonstrate how we can handle new data arriving, let's generate some more data.

As before, move to the util/data_gen dir and generate the next 2 days of data with

./start.sh 2

This will generate fake data for 03/01/2022 and 04/01/2022 and write it into the raw tables in Impala.

Select the data in raw_covid.raw_covid__cases and you should see that you now have 4 days of data.

Select the data in staging_covid.stg_covid__cases and you should see that you also have 4 days of data, as this is a view on top of the raw table.

Selecting the data from mart_covid.covid_cases will show that you still only have 2 days worth of data in the mart model. This is because we have not yet updated the model.

Run the models again

To get the latest data into our mart model, we must run the models again.

Run the models again with

dbt run

This will trigger the incremental update of the mart model, by selecting only the days that are later than the maximum date we already have (our current maximum is 02/01/2022, so it will only select the dates 03/01/2022 and 04/01/2022).
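In dbt, that incremental filter is typically expressed with is_incremental() and {{ this }} (a sketch under assumed column names, not the repo's exact model):

```sql
-- Sketch of an incremental mart model (hypothetical column names)
{{ config(materialized='incremental') }}

select
    date_rep as report_date,
    cases
from {{ ref('stg_covid__cases') }}

{% if is_incremental() %}
-- on incremental runs, only pull days newer than what the table already holds
where date_rep > (select max(report_date) from {{ this }})
{% endif %}
```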

When complete, review the mart_covid.covid_cases table and you should see that you now have 4 days' worth of data.

Useful SQL queries:

---- Mart data ----
select * from xx_dbt_demo_mart_covid.covid_cases;
select report_date from xx_dbt_demo_mart_covid.covid_cases group by report_date;
show create table xx_dbt_demo_mart_covid.covid_cases;
show partitions xx_dbt_demo_mart_covid.covid_cases;

Post-workshop cleanup

Warning! Please make sure you have replaced the "xx" with your initials before running these.

------- DB environment cleaning --------
drop database xx_dbt_demo_mart_covid cascade;
drop database xx_dbt_demo_raw_covid cascade;
drop database xx_dbt_demo_reference cascade;
drop database xx_dbt_demo_staging_covid cascade;
