@@ -14,7 +14,7 @@ To configure an attribute, you will need to set:

Attribute calculation starts when the definitions are applied, and isn't backdated.

All configuration is defined using the Signals Python SDK.
All configuration is defined using the [Signals Python SDK](https://pypi.org/project/snowplow-signals/).

## Minimal example

@@ -20,19 +20,19 @@ To configure a table for batch attributes, you may choose to set up an attribute

For stream attributes, you can choose to configure and apply attribute groups that don't calculate their attribute values.

This means that configuration, calculation, materialization, and retrieval are fully decoupled.
This means that configuration, calculation, syncing, and retrieval are fully decoupled.

## Versioning

TODO

## Types of attribute groups

Signals includes three types of attribute groups. Choose which one to use depending on how you want to calculate and materialize the attributes:
Signals includes three types of attribute groups. Choose which one to use depending on how you want to calculate and sync the attributes:

- `StreamAttributeGroup`: processed from the real-time event stream
- `BatchAttributeGroup`: processed using the batch engine
- `ExternalBatchAttributeGroup`: uses precalculated attributes from an existing warehouse table that's materialized into Signals
- `ExternalBatchAttributeGroup`: uses precalculated attributes from an existing warehouse table that's synced into Signals

### StreamAttributeGroup

@@ -76,7 +76,7 @@ my_batch_attribute_group = BatchAttributeGroup(

### ExternalBatchAttributeGroup

Use an `ExternalBatchAttributeGroup` to materialize attributes from an existing warehouse table.
Use an `ExternalBatchAttributeGroup` to sync attributes from an existing warehouse table.


```python
@@ -124,7 +124,7 @@ Below is a summary of all options available for configuring attribute groups in
| `tags` | Metadata key-value pairs | `dict` | | ❌ | All |
| `attributes` | List of attributes to calculate | list of `Attribute` | | ✅ | `StreamAttributeGroup`, `BatchAttributeGroup` |
| `batch_source` | The batch data source for the attribute group | `BatchSource` | | ✅/❌ | `BatchAttributeGroup`/`ExternalBatchAttributeGroup` |
| `fields` | Table columns for materialization | `Field` | | ✅ | `ExternalBatchAttributeGroup` |
| `fields` | Table columns for syncing | `Field` | | ✅ | `ExternalBatchAttributeGroup` |
| `offline` | Calculate in warehouse (`True`) or real-time (`False`) | `bool` | varies | ❌ | All |
| `online` | Enable online retrieval (`True`) or not (`False`) | `bool` | `True` | ❌ | All |
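
To make these options concrete, here's a minimal sketch of a `StreamAttributeGroup` that combines several of them. The import path, the `name` and `version` parameters, and the `page_view_count` attribute are illustrative assumptions rather than values taken from this page.

```python
from snowplow_signals import StreamAttributeGroup  # assumed import path

# page_view_count is an Attribute defined elsewhere (hypothetical placeholder)
engagement_group = StreamAttributeGroup(
    name="user_engagement",            # assumed parameter
    version=1,                         # assumed parameter
    owner="analytics@example.com",
    tags={"team": "personalization"},
    attributes=[page_view_count],      # list of Attribute definitions to calculate
    online=True,                       # make values retrievable from the Profiles Store
)
```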

@@ -213,4 +213,4 @@ Some attributes will only be relevant for a certain amount of time, and eventual

To avoid stale attributes staying in your Profiles Store forever, you can configure TTL lifetimes for attribute keys and attribute groups. When none of the attributes for an attribute key or attribute group have been updated for the defined lifespan, the attribute key or attribute group expires. Any attribute values for this attribute key or attribute group will be deleted: fetching them will return `None` values.

If Signals then processes a new event that calculates the attribute again, or materializes the attribute from the warehouse again, the expiration timer is reset.
If Signals then processes a new event that calculates the attribute again, or syncs the attribute from the warehouse again, the expiration timer is reset.
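
As an illustration only, a TTL might be attached to an attribute group along these lines. The `ttl` parameter name and the `timedelta` type are assumptions, since the exact API isn't shown in this excerpt.

```python
from datetime import timedelta

from snowplow_signals import StreamAttributeGroup  # assumed import path

# Hypothetical sketch: assumes the attribute group accepts a `ttl` parameter.
# If none of the group's attributes are updated within this window, the
# attribute values expire and are deleted from the Profiles Store.
engagement_group = StreamAttributeGroup(
    name="user_engagement",          # assumed parameter
    version=1,                       # assumed parameter
    attributes=[page_view_count],    # Attribute defined elsewhere (hypothetical placeholder)
    ttl=timedelta(days=30),          # assumed parameter name
)
```
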
@@ -28,7 +28,7 @@ flowchart TD
K -->|No| L[Debug model issues]
L --> J
K -->|Yes| M[Update batch source config]
M --> N[Materialize tables to Signals]
M --> N[Sync tables to Signals]
N --> O[Attributes are available to use]
```

@@ -42,7 +42,7 @@ Choose where your new Signals dbt projects will live. Install the CLI tool there
pip install 'snowplow-signals[batch-engine]'
```

This adds the `snowplow-batch-autogen` tool to your environment.
This adds the `snowplow-batch-engine` tool to your environment.

### CLI commands

@@ -51,7 +51,7 @@ The available options are:
```
init # Initialize dbt project structure and base configuration
generate # Generate dbt project assets
materialize # Registers the attribute table as a data source with Signals
sync # Registers the attribute table as a data source with Signals and publishes the Attribute Group so that syncing can begin
test_connection # Test the connection to the authentication and API services
```

@@ -60,7 +60,7 @@ A `--verbose` flag is available for every command.
Here's an example of using the CLI:

```bash
snowplow-batch-autogen init --verbose
snowplow-batch-engine init --verbose
```

## Creating and registering tables
@@ -85,8 +85,6 @@ You will need to update the variables for each attribute group individually, by
| `snowplow__backfill_limit_days` | Limit backfill increments for the `filtered_events_table` | `1` |
| `snowplow__late_event_lookback_days` | The number of days to allow for late arriving data to be reprocessed during daily aggregation | `5` |
| `snowplow__min_late_events_to_process` | The threshold number of skipped daily events to process during daily aggregation | `1` |
| `snowplow__allow_refresh` | If set to true, the incremental manifest will be dropped when running with a `--full-refresh` flag | `false` |
| `snowplow__dev_target_name` | The target name of your development environment as defined in your dbt `profiles.yml` file | `dev` |
| `snowplow__atomic_schema` | Change this if you aren't using `atomic` schema for Snowplow event data | `'atomic'` |
| `snowplow__database` | Change this if you aren't using `target.database` for Snowplow event data | |
| `snowplow__events_table` | Change this if you aren't using `events` table for Snowplow event data | `"events"` |
@@ -8,7 +8,7 @@ You can use existing attributes that are already in your warehouse, or use the S
To use historical warehouse attributes in your real-time use cases, you will need to sync the data to the Profiles Store. Signals includes a sync engine to do this.

:::note Warehouse support
Only Snowflake is supported currently.
Only Snowflake and BigQuery are supported currently.
:::

## Existing or new attributes?
@@ -53,17 +53,16 @@ The table below lists all available arguments for a `BatchSource`:
| `database` | The database where the attributes are stored | `string` | ✅ |
| `schema` | The schema for the table of interest | `string` | ✅ |
| `table` | The table where the attributes are stored | `string` | ✅ |
| `timestamp_field` | The timestamp field to use for point-in-time joins of attribute values | `string` | ❌ |
| `created_timestamp_column` | A timestamp column indicating when the row was created, used for deduplicating rows | `string` | ❌ |
| `date_partition_column` | A timestamp column used for partitioning data | `string` | ❌ |
| `timestamp_field` | Primary timestamp of the attribute value; the sync engine uses this to incrementally process only the rows that have changed since the last run | `string` | ❌ |
| `owner` | The owner of the source, typically the email of the primary maintainer | `string` | ❌ |
| `tags` | String key-value pairs of arbitrary metadata | dictionary | ❌ |

The `timestamp_field` is optional but recommended for incremental or snapshot-based tables. It should show the last modified time of a record. It's used during materialization to identify which rows have changed since the last sync. The sync engine only sends those with a newer timestamp to the Profiles Store.
The sync engine only sends rows with a newer timestamp to the Profiles Store, based on the `timestamp_field`. For each attribute key, make sure there is only one row per timestamp — otherwise, one value may be discarded arbitrarily.
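
For reference, a minimal `BatchSource` built from the arguments above might look like the sketch below. The import path and the example values are assumptions.

```python
from snowplow_signals import BatchSource  # assumed import path

# A sketch of a source pointing at an existing attributes table in the warehouse.
user_attributes_source = BatchSource(
    database="ANALYTICS",               # example values, not real identifiers
    schema="DERIVED",
    table="USER_ATTRIBUTES",
    timestamp_field="LAST_UPDATED_AT",  # last-modified column used for incremental syncs
    owner="analytics@example.com",
    tags={"modeled_by": "dbt"},
)
```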


### Defining an attribute group with fields

Pass your source to a new `ExternalBatchAttributeGroup` so that Signals does not materialize the attributes. This will be done later, once Signals has connected to the table.
Pass your source to a new `ExternalBatchAttributeGroup` so that Signals does not sync the attributes. This will be done later, once Signals has connected to the table.

For stream or batch attributes that are calculated by Signals, an attribute group contains references to your attribute definitions. In this case, the attributes are already defined elsewhere and pre-calculated in the warehouse. Instead of `attributes`, this attribute group will have `fields`.

@@ -112,15 +111,15 @@ Apply the attribute group configuration to Signals.
sp_signals.publish([attribute_group])
```

Signals will connect to the table, but the attributes will not be materialized into Signals yet because the attribute group has `online=False`.
Signals will connect to the table, but the attributes will not be synced into Signals yet because the attribute group has `online=False`.

To send the attributes to the Profiles Store, change the `online` parameter to `True`, and apply the attribute group again.
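
For example, assuming the attribute group object defined earlier can simply be updated in place, the change might look like this sketch:

```python
# Assumption: the attribute group object from the earlier definition
# can be updated in place before re-applying it.
attribute_group.online = True
```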

```python
sp_signals.publish([attribute_group])
```

The sync will begin: the sync engine will look for new records at a given interval, based on the `timestamp_field` and the last time it ran. The default time interval is 5 minutes.
The sync will begin: the sync engine will look for new records at a given interval, based on the `timestamp_field` and the last time it ran. The default time interval is 1 hour.

## Creating new attribute tables

@@ -165,6 +164,6 @@ The sync engine is a cron job that sends warehouse attributes to the Profiles St

The engine will be enabled when you either:
* Apply an `ExternalBatchAttributeGroup` for an existing table
* Run the batch engine `materialize` command after creating new attribute tables
* Run the batch engine `sync` command after creating new attribute tables

Once enabled, syncs begin at a fixed interval. By default, this is every hour. Only the records that have changed since the last sync are sent to the Profiles Store.
8 changes: 6 additions & 2 deletions tutorials/signals-batch-engine/conclusion.md
@@ -6,12 +6,16 @@ title: Conclusion
In this tutorial you've learned how to calculate attributes from your warehouse data, and apply them to Signals.

This is the process workflow:
* Define batch view configurations and apply them to Signals
* Define batch attribute group configurations and apply them to Signals
* Initialize dbt projects
* Generate models
* Configure the projects with dbt
* Create tables by running the models
* Connect the tables to Signals by materialization
* Connect the tables to Signals by syncing

Supported warehouses:
* Snowflake
* BigQuery

## Next steps

19 changes: 11 additions & 8 deletions tutorials/signals-batch-engine/generate-models.md
@@ -8,11 +8,11 @@ Each project will have its own set of models generated based on its specific sch
For each project, the generation process will:

1. Create dbt configuration files
2. Generate SQL models based on the batch view's schema
2. Generate SQL models based on the batch attribute group's schema
3. Set up necessary macros and functions
4. Update any existing files if needed

For each batch view, the generated models are specifically designed for batch processing:
For each batch attribute group, the generated models are specifically designed for batch processing:

* Base models: raw data transformations
* Filtered events: event filtering and cleaning
@@ -23,22 +23,25 @@ Depending on how you initialized your projects, you can generate models in two ways.

Depending on how you initialized your projects, you can generate models in two ways.

If you created projects for all views, you can generate models for all of them at once:
If you created projects for all attribute groups, you can generate models for all of them at once:

```bash
# For all views
snowplow-batch-autogen generate --verbose
# For all attribute groups
snowplow-batch-engine generate --verbose
```

To generate models for a specific project:

```bash
snowplow-batch-autogen generate \
snowplow-batch-engine generate \
--project-name "user_attributes_1" \
--target-type snowflake \
--verbose
```

Remember that project names follow the format `{view_name}_{view_version}`.
Adjust the `--target-type` to `bigquery` if relevant.

Remember that project names follow the format `{attribute_group_name}_{attribute_group_version}`.

## Project structure

@@ -59,7 +62,7 @@ my_snowplow_repo/
│ │ └── dbt_config.json
│ │ └── batch_source_config.json
│ └── macros/ # Reusable SQL functions
├── product_views_2/
├── product_attribute_groups_2/
│ └── ... (same structure)
└── user_segments_1/
└── ... (same structure)
27 changes: 15 additions & 12 deletions tutorials/signals-batch-engine/initialize-project.md
@@ -7,27 +7,30 @@ Having tested the connection, you can now initialize your projects.

When you run the initialization command, the CLI will:

1. Create a separate project directory for each relevant view
1. Create a separate project directory for each relevant attribute group
2. Set up the basic configuration files for each project
3. Initialize the necessary folder structure for each project
4. Prepare each project for model generation

## Run initialize

You can generate projects for all the relevant views in Signals at once, or one at a time.
You can generate projects for all the relevant attribute groups in Signals at once, or one at a time. Change the `--target-type` to `bigquery` if relevant.

```bash
# For all views
snowplow-batch-autogen init --verbose
# For all attribute groups
snowplow-batch-engine init \
--target-type snowflake \
--verbose

# For a specific view
snowplow-batch-autogen init \
--view-name "user_attributes" \
--view-version 1 \
# For a specific attribute group
snowplow-batch-engine init \
--attribute-group-name "user_attributes" \
--attribute-group-version 1 \
--target-type snowflake \
--verbose
```

Each view will have its own separate dbt project, with the project name following the format `{view_name}_{view_version}`.
Each attribute group will have its own separate dbt project, with the project name following the format `{attribute_group_name}_{attribute_group_version}`.

The files will be generated at the path specified in your `SNOWPLOW_REPO_PATH` environment variable.

@@ -37,20 +40,20 @@ After initialization, your repository will have a structure like this:

```
my_repo/
├── my_view_1/
├── my_attribute_group_1/
│ └── configs/
│ └── base_config.json
├── etc.
```

In this example, projects were generated for three views: `user_attributes` v1, `product_views` v2, and `user_segments` v3:
In this example, projects were generated for three attribute groups: `user_attributes` v1, `product_attribute_groups` v2, and `user_segments` v1:

```
my_snowplow_repo/
├── user_attributes_1/
│ └── configs/
│ └── base_config.json
├── product_views_2/
├── product_attribute_groups_2/
│ └── configs/
│ └── base_config.json
└── user_segments_1/
4 changes: 2 additions & 2 deletions tutorials/signals-batch-engine/install.md
@@ -13,7 +13,7 @@ The batch engine is part of the Signals Python SDK. It's not installed by defaul
pip install 'snowplow-signals[batch-engine]'
```

This will install the CLI tool as `snowplow-batch-autogen`, along with the necessary dependencies.
This will install the CLI tool as `snowplow-batch-engine`, along with the necessary dependencies.

## Available commands

@@ -22,7 +22,7 @@ The available options are:
```bash
init # Initialize dbt project structure and base configuration
generate # Generate dbt project assets
materialize # Registers the attribute table as a data source with Signals
sync # Registers the attribute table as a data source with Signals
test_connection # Test the connection to the authentication and API services
```

2 changes: 1 addition & 1 deletion tutorials/signals-batch-engine/run-models.md
@@ -9,7 +9,7 @@ Before running your new models, you'll need to configure their dbt connection pr

During the run process:
* dbt will compile your SQL models
* Tables and views will be created in your data warehouse
* Tables will be created in your data warehouse
* You'll see progress updates in the terminal
* Any errors will be clearly displayed

20 changes: 11 additions & 9 deletions tutorials/signals-batch-engine/start.md
@@ -7,14 +7,15 @@ Welcome to the [Snowplow Signals](/docs/signals/) batch engine tutorial.

Snowplow Signals is a real-time personalization engine for customer intelligence, built on Snowplow's behavioral data pipeline. It allows you to compute, access, and act on in-session stream and historical user data, in near real time.

The Signals batch engine is a CLI tool to help with historical data analysis. It isn't required to use Signals: it's only necessary if you want to:
* Analyze historical data, rather than in real time
The Signals batch engine is a CLI tool that creates attributes in your warehouse, computing over larger volumes of historical data than would be possible or efficient to process in real time. It isn't required to use Signals: it's only necessary if you want to:
* Calculate attributes from Snowplow events in your warehouse
* Sync those attributes to the Profiles Store so they can be served in real time alongside stream attributes

The batch engine helps by:
* Generating separate dbt projects for each batch view definition
* Testing and validating your data pipelines
* Materializing calulated attributes to Signals for production use
* Generating separate dbt projects for each batch attribute group definition
* Building efficient modeled datasets at different aggregation levels, instead of querying directly against large atomic event tables
* Producing attribute tables optimized for downstream use
* Syncing the calculated attributes to Signals, making them available for production use

To use tables of pre-existing, already calculated values, read up on external batch sources in the [Signals documentation](/docs/signals/concepts/).

@@ -25,15 +26,16 @@ This guide will walk you through the steps to set up the batch engine and calcul
This tutorial assumes that you have:

* Python 3.11+ installed in your environment
* Snowflake warehouse with tables of Snowplow events
* Permissions to create tables and views in your warehouse
* [dbt](https://www.getdbt.com/) configured in your warehouse
* Snowflake or BigQuery warehouse with your atomic Snowplow events ready to use as the data source
* [dbt](https://www.getdbt.com/) with your warehouse [target](https://docs.getdbt.com/reference/dbt-jinja-functions/target) set up
* Basic [dbt](https://www.getdbt.com/) knowledge
* Valid API credentials for your Signals account:
* Signals API URL
* Snowplow API key
* Snowplow API key ID
* Snowplow organization ID
* BatchView definitions already applied to Signals
* Batch attribute groups already created for Signals, but not yet published

The batch source can't be configured until the attributes table has been created.
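
As a rough sketch of how the credentials listed above are typically used, a Signals client might be created like this. The `Signals` class name, import path, and constructor arguments are assumptions here; check the Signals documentation for the exact API.

```python
from snowplow_signals import Signals  # assumed import path and class name

# A sketch of creating the client referred to as `sp_signals` elsewhere in these docs.
sp_signals = Signals(
    api_url="https://signals.example.snowplowanalytics.com",  # Signals API URL
    api_key="<snowplow-api-key>",
    api_key_id="<snowplow-api-key-id>",
    org_id="<snowplow-organization-id>",
)
```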

Check out the [Signals configuration](/docs/signals/) documentation to learn where to find these credentials, and how to apply attribute configurations.