Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorial batches in incremental models #11324

Open
3 tasks done
ehrete opened this issue Feb 19, 2025 · 0 comments
Open
3 tasks done

Categorial batches in incremental models #11324

ehrete opened this issue Feb 19, 2025 · 0 comments
Labels
enhancement New feature or request triage

Comments

@ehrete
Copy link

ehrete commented Feb 19, 2025

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Situation

We want to combine data from different source systems (e.g. SAP, SALESFORCE, ABC) with different transformation logic in a single target table

The current handling of this requirement --> One query with union all operators:

Model definition:

-- logic for source system SAP
SELECT
    *
FROM "SAP"."......."

-- logic for source system SALESFORCE
UNION ALL
SELECT
    *
FROM "SALESFORCE"."......."

-- logic for source system ABC
UNION ALL
SELECT
    *
FROM "ABC"."......."

Disadvantage of such approach:

Proposal / Feature request

  • New function Categorical batches
  • Should use a similar logic as the microbatch materialisation
  • Only active with incremental materialisation
  • Additional properties in the incremental model configuration
    • categorial_batch_column
    • categorial_batch_values

Example - Model definition

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key = ["KEY_COLUMN"],

        -- Additional properties
        categorial_batch_column = 'SOURCE_SYSTEM',  --> the column after which is filtered
        categorial_batch_values = ["SAP", "SALESFORCE", "ABC"] --> 3 possible batches
    )
}}


with 

...

-- logic that transforms the sap data (potentially ephemeral model)
sap_data as (

    select
        A,
        B,
        C
        'SAP'  AS "SOURCE_SYSTEM"
    from ...

),

-- logic that transforms the salesforce data (potentially ephemeral model)
salesforce_data as (

    select
        A,
        B,
        C
        'SALESFORCE'  AS "SOURCE_SYSTEM"
    from ...

),

-- logic that transforms the abc data (potentially ephemeral model)
abc_data as (

    select
        A,
        B,
        C
        'ABC'  AS "SOURCE_SYSTEM"
    from ...    
),


-- concatenation of all sources
final as (

    select * from sap_data

    union all

    select * from salesforce_data

    union all

    select * from abc_data

)


select 
    *
from final

Compiled queries

There is one batch for each value in the variable categorial_batch_values. The propsed categorial batch feature packs the compiled model definition into a sub-query and adds the value as a filter. The database pushes the filter down and prunes out the irrelevant values (other source systems).

Batch 1: SAP Data

select 
    *
from 
(

        with 

        ...


        sap_data as (

            select
                A,
                B,
                C
                'SAP'  AS "SOURCE_SYSTEM"
            from ...

        ),

        salesforce_data as (

            select
                A,
                B,
                C
                'SALESFORCE'  AS "SOURCE_SYSTEM"
            from ...

        ),

        abc_data as (

            select
                A,
                B,
                C
                'ABC'  AS "SOURCE_SYSTEM"
            from ...    
        ),


        final as (

            select * from sap_data

            union all

            select * from salesforce_data

            union all

            select * from abc_data

        )


        select 
            *
        from final

) query

where SOURCE_SYSTEM = 'SAP'

Batch 2: Salesforce Data

select 
    *
from 
(

        with 

        ...


        final as (

            select * from sap_data

            union all

            select * from salesforce_data

            union all

            select * from abc_data

        )


        select 
            *
        from final

) query

where SOURCE_SYSTEM = 'SALESFORCE'

Batch 3: ABC Data

select 
    *
from 
(

        with 

        ...


        final as (

            select * from sap_data

            union all

            select * from salesforce_data

            union all

            select * from abc_data

        )


        select 
            *
        from final

) query

where SOURCE_SYSTEM = 'ABC'

Describe alternatives you've considered

Combine different transformation logic with Union all statement (see feature description)

Who will this benefit?

See feature description

Additional benefit: You can divide a demanding sql query into multiple small batches based on a categorial/textual column.

Are you interested in contributing this feature?

No response

Anything else?

No response

@ehrete ehrete added enhancement New feature or request triage labels Feb 19, 2025
@ehrete ehrete changed the title [Feature] <title>Categorial batches in incremental models Categorial batches in incremental models Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant