176 changes: 176 additions & 0 deletions docs/resources/migration-guides/google-analytics/index.md
@@ -0,0 +1,176 @@
---
title: "Google Analytics to Snowplow"
date: "2025-01-20"
sidebar_position: 10
---

This guide helps technical implementers migrate from Google Analytics 4 (GA4) to Snowplow.

## Platform differences

There are significant differences between Google Analytics and Snowplow as data platforms. For migration, it's important to understand how Snowplow structures events differently from GA4. This affects how you'll implement tracking and how you'll model the warehouse data.

### Google Analytics event structure

GA4 uses an event-based model where all user interactions are tracked as events with parameters. The main tracking method is `gtag('event', ...)` with event names and parameter objects.

GA4 events include:
* Built-in events like `page_view`, `session_start`, and `first_visit`
* Enhanced measurement events like `scroll`, `click`, and `file_download`
* Custom events that you define for specific business actions
* Ecommerce events like `purchase`, `add_to_cart`, and `view_item`

GA4 stores all data in a single `events_YYYYMMDD` table in BigQuery with nested RECORD fields for parameters and user properties. Each parameter is stored as a key-value pair, with separate columns for different value types (`string_value`, `int_value`, `float_value`).

Here's an example of GA4 ecommerce tracking:

```javascript
// GA4 tracking
gtag('event', 'purchase', {
  transaction_id: '12345',
  value: 25.42,
  currency: 'USD',
  items: [{
    item_id: 'SKU_123',
    item_name: 'Product Name',
    item_category: 'Category',
    quantity: 1,
    price: 25.42
  }]
});
```

### Snowplow event structure

Snowplow separates the action that occurred (the [event](/docs/fundamentals/events/index.md)) from the contextual objects involved in the action (the [entities](/docs/fundamentals/entities/index.md)), such as the user, the device, products, etc.

Snowplow SDKs provide methods for tracking page views, self-describing events, and many other kinds of events. All Snowplow events are defined by [JSON schemas](/docs/fundamentals/schemas/index.md) and are validated as they're processed through the pipeline.

Here's the equivalent ecommerce tracking with Snowplow:

```javascript
// Snowplow tracking
snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.snowplowanalytics.snowplow.ecommerce/transaction/jsonschema/1-0-0',
    data: {
      transaction_id: '12345',
      revenue: 25.42,
      currency: 'USD'
    }
  },
  context: [{
    schema: 'iglu:com.snowplowanalytics.snowplow.ecommerce/product/jsonschema/1-0-0',
    data: {
      id: 'SKU_123',
      name: 'Product Name',
      category: 'Category',
      price: 25.42,
      quantity: 1
    }
  }]
});
```

The key difference is that Snowplow separates the transaction event from the product entity, creating reusable data structures that can attach to multiple event types.

### Tracking comparison

This table shows how common GA4 events map to Snowplow tracking:

| GA4 Event | GA4 Example | Snowplow Implementation |
|-----------|-------------|-------------------------|
| `page_view` | `gtag('event', 'page_view', {page_title: 'Home'})` | Use `trackPageView()`. The tracker captures details like `title` and `url` automatically. |
| `purchase` | `gtag('event', 'purchase', {transaction_id: '123', value: 99.99})` | Use built-in ecommerce tracking or define a custom `purchase` schema with `trackSelfDescribingEvent`. |
| `add_to_cart` | `gtag('event', 'add_to_cart', {currency: 'USD', value: 15.25})` | Use ecommerce tracking with `product` entity attached to `add_to_cart` event. |
| Custom events | `gtag('event', 'video_play', {video_title: 'Demo'})` | Define a custom `video_play` schema containing a `video_title` property and track it with `trackSelfDescribingEvent`. |
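
For example, the custom event row above might translate as follows. This is a sketch only: the `com.example` schema URI is a placeholder for a `video_play` schema that you'd create and publish first.

```javascript
// GA4 custom event
gtag('event', 'video_play', {
  video_title: 'Demo'
});

// Snowplow equivalent: a self-describing event validated against your own schema
// (the schema URI below is a placeholder)
snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.example/video_play/jsonschema/1-0-0',
    data: {
      video_title: 'Demo'
    }
  }
});
```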

### Warehouse data structure

GA4 exports all data to a single `events_YYYYMMDD` table in BigQuery with nested structures. This requires complex UNNEST operations to extract parameters and makes joining data difficult.

Snowplow uses a single [`atomic.events`](/docs/fundamentals/canonical-event/index.md) table where events and entities appear as structured columns. This eliminates the need for complex unnesting and simplifies analysis.

### Key architectural differences

| Feature | Google Analytics 4 | Snowplow |
|---------|-------------------|----------|
| **Deployment** | SaaS-only; data processed on Google servers | Private cloud; runs in your AWS/GCP/Azure account |
| **Data ownership** | Data exported to BigQuery; vendor controls processing | Complete data ownership and pipeline control |
| **Data validation** | Optional validation through limited schema enforcement | Mandatory schema validation for all events |
| **Real-time processing** | 1-4 hour batch processing | 2-5 second event processing |
| **Customization** | Limited to predefined parameters and events | Unlimited custom events and entities |
| **Cost model** | Based on BigQuery usage and data export limits | Based on event volume and infrastructure usage |

## Migration phases

We recommend using a parallel-run migration approach. This process can be divided into three phases:
1. Assess and plan
2. Implement and validate
3. Cutover and finalize

### Assess and plan

#### Audit existing implementation
- Audit all GA4 tracking in your application code (gtag, Google Tag Manager)
- Document all custom events, parameters, and conversion definitions
- Export GA4 configuration and identify enhanced ecommerce tracking patterns
- Map existing audiences and segments
- Document downstream data consumers (BI dashboards, reports, ML models)

#### Design Snowplow tracking plan
- Translate GA4 events into Snowplow self-describing events
  - The [Snowplow CLI](/docs/data-product-studio/snowplow-cli/index.md) MCP server can help with this
- Identify reusable entities that can replace repeated parameters
- Create JSON schemas for all events and entities (see the sketch after this list)
- Design enhanced data capture beyond GA4's limitations
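
As a sketch of what one of these schemas could look like, here is a minimal self-describing JSON schema for the hypothetical `video_play` event above. The vendor, properties, and constraints are illustrative only; adapt them to your own tracking plan.

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a video play event",
  "self": {
    "vendor": "com.example",
    "name": "video_play",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "video_title": {
      "type": "string",
      "maxLength": 255
    }
  },
  "required": ["video_title"],
  "additionalProperties": false
}
```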

#### Deploy infrastructure
- Set up Snowplow pipeline components in your cloud environment
- Configure data warehouse destinations
- Publish schemas to your schema registry
- Use Snowplow BDP Console or the Snowplow CLI

### Implement and validate

#### Set up dual tracking
- Add [Snowplow tracking](/docs/sources/index.md) to run in parallel with existing GA4 tracking (see the sketch after this list)
- Use [Snowtype](/docs/data-product-studio/snowtype/index.md) to generate type-safe tracking code
- Configure Google Tag Manager Server-side if needed to avoid client-side changes
- Use [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) for local testing
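
A minimal dual-tracking sketch for a page view might look like the following, with both libraries loaded on the page. The tracker name, collector URL, and app ID are placeholders for your own values.

```javascript
// Snowplow tracker setup (once, after the JavaScript tracker snippet is loaded);
// the collector URL and app ID below are placeholders
snowplow('newTracker', 'sp', 'https://collector.example.com', {
  appId: 'my-web-app'
});

// Existing GA4 page view tracking stays in place...
gtag('event', 'page_view', {
  page_title: document.title
});

// ...and Snowplow tracks the same page view in parallel
snowplow('trackPageView');
```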

#### Data validation
- Compare high-level metrics between systems (daily events, unique users, sessions)
- Validate critical business events and parameter values
- Perform end-to-end data reconciliation in your warehouse
- Monitor data quality and pipeline health

#### Historical data strategy

For historical data, you have two approaches:
- **Coexistence**: leave historical GA4 data in BigQuery. Write queries that combine data from both systems using transformation layers in dbt.
- **Unification**: export GA4 data from BigQuery and transform it into Snowplow format. This requires custom engineering work but provides a unified historical dataset.

#### Gradual rollout
- Start with non-critical pages or events
- Gradually expand tracking coverage
- Update [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/) to use Snowplow data structure
- Test downstream integrations

### Cutover and finalize

#### Update downstream consumers
- Migrate BI dashboards and reports to query Snowplow tables
- Update data pipelines and ML models
- Test all data-dependent workflows

#### Configure integrations
- Set up [event forwarding](https://docs.snowplow.io/docs/destinations/forwarding-events/) for real-time destinations
- Configure reverse ETL workflows if needed
- Integrate with existing marketing and analytics tools

#### Complete transition
- Remove GA4 tracking code from applications
- Archive GA4 configuration and data exports
- Update documentation and team processes
- Monitor system performance and data quality post-migration
22 changes: 22 additions & 0 deletions docs/resources/migration-guides/index.md
@@ -0,0 +1,22 @@
---
title: "Migration guides"
sidebar_position: 1
---

This section contains advice for migrating to Snowplow from other solutions.

There are two possible migration strategies: parallel-run or full re-architecture.

## Parallel-run

A parallel-run approach is the recommended, lowest-risk strategy. It involves running both systems simultaneously (dual-tracking) before switching over to Snowplow entirely. This allows you to test and validate your new Snowplow data in your warehouse, without affecting any existing workflows or production systems.

## Full re-architecture

A "rip-and-replace" approach is faster but riskier, involving a direct switch from your existing system to Snowplow. This is best suited for:

* Major application refactors where the switch can be part of a larger effort
* Teams with high risk tolerance and robust automated testing frameworks
* New projects or applications with minimal legacy systems

A full re-architecture strategy requires thorough testing in a staging environment to prevent data loss.
157 changes: 157 additions & 0 deletions docs/resources/migration-guides/segment/index.md
@@ -0,0 +1,157 @@
---
title: "Segment to Snowplow"
date: "2025-08-04"
sidebar_position: 0
---

This guide helps technical implementers migrate from Segment to Snowplow.

## Platform differences

There are a number of differences between Segment and Snowplow as data platforms. For migration, it's important to be aware of how Snowplow structures events differently from Segment. This affects how you'll implement tracking and how you'll model the warehouse data.

### Segment event structure

Segment's core method for tracking user behavior is `track`. A `track` call contains a name that describes the action taken, and a `properties` object that contains contextual information about the action.

The other Segment tracking methods are:
* `page` and `screen` record page views and screen views
* `identify` describes the user, and associates a `userId` with user `traits`
* `group` associates the user with a group
* `alias` merges user identities, for identity resolution across applications

With Segment, you track data about the user's action separately from data about the user. These are stitched together during data modeling in the warehouse.

Here's an example showing how you can track an ecommerce transaction event on web using Segment:

```javascript
analytics.track('Transaction Completed', {
  order_id: 'T_12345',
  revenue: 99.99,
  currency: 'USD',
  products: [{
    product_id: 'ABC123',
    name: 'Widget',
    price: 99.99,
    quantity: 1
  }]
})
```

Tracked events can optionally be validated against Protocols, defined as part of a tracking plan. Protocols detect violations of your tracking plan, and you can choose to filter out events that fail validation.

### Snowplow event structure

Snowplow separates the action that occurred (the [event](/docs/fundamentals/events/index.md)) from the contextual objects involved in the action (the [entities](/docs/fundamentals/entities/index.md)), such as the user, the device, etc.

Snowplow SDKs also provide methods for tracking page views and screen views, along with many other kinds of events, such as button clicks, form submissions, page pings (activity), media interactions, and so on.

The equivalent to Segment's custom `track` method is `trackSelfDescribingEvent`.

All Snowplow events, whether designed by you or built-in, are defined by [JSON schemas](/docs/fundamentals/schemas/index.md). The events are always validated as they're processed through the Snowplow pipeline, and events that fail validation are separated out for assessment.


Here's an example showing how you could track a Snowplow ecommerce transaction event on web:

```javascript
snowplow('trackTransaction', {
  transaction_id: 'T_12345',
  revenue: 99.99,
  currency: 'USD',
  products: [{
    id: 'ABC123',
    name: 'Widget',
    price: 99.99,
    quantity: 1
  }]
})
```

Superficially, it looks similar to Segment's `track` call. The first key difference is that the `products` property here contains a reusable `product` entity. You'd add this entity to any other relevant event, such as `add_to_cart` or `view_product`.

Secondly, the Snowplow tracking SDKs add multiple entities to all tracked events by default, including information about the specific page or screen view, the user's session, and the device or browser. Many other built-in entities can be configured, and you can define your own custom entities to add to any or all Snowplow events.

### Tracking comparison

This table gives examples of how the different Segment tracking methods map to Snowplow tracking.

| Segment API Call | Segment Example | Snowplow Implementation |
| ---------------- | --------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `track()` | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | Use one of the built-in event types, or define a custom `order_completed` schema containing `revenue` and `currency` properties, and track it with `trackSelfDescribingEvent`. |
| `page()` | `page('Pricing')` | Use `trackPageView`. The tracker SDK will capture details such as `title` and `url`. |
| `screen()` | `screen('Home Screen')` | Use `trackScreenView`. |
| `identify()` | `identify('user123', {plan: 'pro', created_at: '2024-01-15'})` | Call `setUserId('user123')` to track the ID in all events. Attach a custom `user` entity with a schema containing `plan` and `created_at` properties. |
| `group()` | `group('company-123', {name: 'Acme Corp', plan: 'Enterprise'})` | No direct equivalent. Attach a custom `group` entity to your events, or track group membership changes as custom events with `group_joined` or `group_updated` schemas. |
| `alias()` | `alias('new-user-id', 'anonymous-id')` | No direct equivalent. Track identity changes as custom events with `user_alias_created` schema. Use `setUserId` to update the current user identifier for subsequent events. |
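
To illustrate the `identify()` row, here's a minimal sketch, assuming a custom `user` entity schema has already been created and published (the `com.example` schema URI is a placeholder):

```javascript
// Segment
analytics.identify('user123', {
  plan: 'pro',
  created_at: '2024-01-15'
});

// Snowplow: set the user ID once so it's included in all subsequent events
snowplow('setUserId', 'user123');

// Attach the custom user entity (placeholder schema URI) to tracked events
snowplow('trackPageView', {
  context: [{
    schema: 'iglu:com.example/user/jsonschema/1-0-0',
    data: {
      plan: 'pro',
      created_at: '2024-01-15'
    }
  }]
});
```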

### Warehouse data structure

Segment loads each custom event type into a separate table, for example `order_completed` or `product_viewed`. Analysts must `UNION` tables together to reconstruct user journeys.

Snowplow uses a single [`atomic.events`](/docs/fundamentals/canonical-event/index.md) table in warehouses like Snowflake and BigQuery. Events and entities are stored as structured columns within that table, simplifying analysis.

## Migration phases

We recommend using a parallel-run migration approach. This process can be divided into three phases:
1. Assess and plan
2. Implement and validate
3. Cutover and finalize

### Assess and plan

#### Audit existing implementation
- Audit the Segment tracking calls in your application code
- Document all downstream data consumers, such as BI dashboards, dbt models, or ML pipelines
- Export your complete Segment tracking plan, using one of these methods:
- Ideally, use the Segment Public API to obtain the full JSON structure for each event
- Manually download CSVs from the Segment UI
- Infer it from warehouse data

#### Design Snowplow tracking plan
- Translate Segment events into Snowplow self-describing events
  - The [Snowplow CLI](/docs/data-product-studio/snowplow-cli/index.md) MCP server can help with this
- Identify reusable entities that can replace repeated properties
- Create JSON schemas for all events and entities

#### Deploy infrastructure
- Confirm that your Snowplow infrastructure is up and running
- Publish your schemas so they're available to your pipeline
- Use Snowplow BDP Console or the Snowplow CLI

### Implement and validate

#### Set up dual tracking
- Add [Snowplow tracking](/docs/sources/index.md) to run in parallel with existing Segment tracking (see the sketch after this list)
- Use [Snowtype](/docs/data-product-studio/snowtype/index.md) to generate type-safe tracking code
- Use [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) for local testing and validation
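
During dual tracking, the same action fires through both SDKs. A minimal sketch, assuming a hypothetical `order_completed` schema has been published (the `com.example` schema URI is a placeholder):

```javascript
// Existing Segment call stays in place...
analytics.track('Order Completed', {
  revenue: 99.99,
  currency: 'USD'
});

// ...and the equivalent Snowplow event fires alongside it
// (the schema URI below is a placeholder for one you've published)
snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.example/order_completed/jsonschema/1-0-0',
    data: {
      revenue: 99.99,
      currency: 'USD'
    }
  }
});
```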

#### Data validation
- Compare high-level metrics between systems, e.g. daily event counts or unique users
- Validate critical business logic and property values
- Perform end-to-end data reconciliation in your warehouse
- Decide what to do about historical data

For historical data, you have a choice of approaches:
- **Coexistence**: leave historical Segment data in existing tables. Write queries that `UNION` data from both systems, using a transformation layer (for example, in dbt) to create compatible structures.
- **Unification**: transform and backfill historical Segment data into Snowplow format. This requires a custom engineering project to export Segment data, reshape it into the Snowplow enriched event format, and load it into the warehouse. The result is a unified historical dataset.

#### Gradual rollout
- Start with non-critical pages or features
- Gradually expand to cover all tracking points
- Monitor data quality and pipeline health
- Update [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/) to use the new data structure

### Cutover and finalize

#### Update downstream consumers
- Migrate BI dashboards to query Snowplow tables
- Test all data-dependent workflows

#### Configure integrations
- Set up [event forwarding](https://docs.snowplow.io/docs/destinations/forwarding-events/) for real-time destinations
- Configure reverse ETL workflows to use your new modeled data

#### Complete transition
- Remove Segment tracking from codebases
- Decommission Segment sources
- Cancel the Segment subscription once the validation period is complete