Commit 2f7497c

Merge pull request #31 from segmentio/repo-sync
repo sync
2 parents 9bcd7f2 + 81b0c83 commit 2f7497c

File tree: 3 files changed (+42 −42 lines changed)
- src/connections/storage/data-lakes/comparison.md
- src/connections/storage/data-lakes/index.md
- src/guides/duplicate-data.md


src/connections/storage/data-lakes/comparison.md

Lines changed: 3 additions & 8 deletions
@@ -13,18 +13,13 @@ Data Lakes and Warehouses are not identical, but are compatible with a configura
 
 Data Lakes and Warehouses offer different sync frequencies:
 - Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
-- Data Lakes offers 12 syncs in a 24 hour period, and does not offer custom sync schedules or selective sync.
+- Data Lakes offers 12 syncs in a 24 hour period, and doesn't offer custom sync schedules or selective sync.
 
 ## Duplicates
 
-Segment's overall guarantee for duplicate data also applies to data in Data Lakes: 99% guarantee of no duplicates for data within a [24 hour look-back window](https://segment.com/docs/guides/duplicate-data/). The guarantee remains the same for Warehouses.
-
-Both Data Lakes and Warehouses (and all Segment destinations) rely on the [de-duplication process](/docs/guides/duplicate-data/) at time of event ingest, to ensure:
-- The 24 hour look back window duplicate guarantee is met
-- Processing costs for customers are managed appropriately
-
-Warehouses also have a secondary de-duplication system built in to further reduce the volume of duplicates. If you have advanced requirements for duplicates in Data Lakes, you can add de-duplication steps downstream to reduce duplicates outside this look back window.
+Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to data in Data Lakes and Warehouses.
 
+[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
 ## Object vs Event Data
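
The paragraph removed above directs readers with advanced duplicate requirements to add de-duplication steps downstream of Segment, outside the 24 hour look-back window. A minimal sketch of such a step, assuming events arrive as dicts that carry Segment's `message_id` (all other field names here are hypothetical):

```python
# Minimal downstream de-duplication sketch, keyed on Segment's `message_id`.
# Field names other than `message_id` are hypothetical.
from typing import Dict, Iterable, List


def dedupe_by_message_id(records: Iterable[Dict]) -> List[Dict]:
    """Keep the first occurrence of each message_id and drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        message_id = record.get("message_id")
        if message_id in seen:
            continue  # duplicate that slipped past the 24 hour window, drop it
        seen.add(message_id)
        unique.append(record)
    return unique


if __name__ == "__main__":
    rows = [
        {"message_id": "a1", "event": "Order Completed"},
        {"message_id": "a1", "event": "Order Completed"},  # duplicate
        {"message_id": "b2", "event": "Product Viewed"},
    ]
    print(dedupe_by_message_id(rows))  # only the two unique records remain
```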

src/connections/storage/data-lakes/index.md

Lines changed: 26 additions & 30 deletions
@@ -23,6 +23,12 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR
 
 ![](images/dl_vpc.png)
 
+Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
+
+### Data Lake deduplication
+
+In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `message_id` field.
+
 ### Using a Data Lake with a Data Warehouse
 
 The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).
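
The new copy above states that Data Lakes deduplicates data synced within the last 7 days based on the `message_id` field. As an illustration only, not Segment's internal implementation, a look-back-window dedupe over `message_id` could be sketched like this; the timestamps and field names are assumptions:

```python
# Illustration of look-back-window deduplication keyed on `message_id`.
# Not Segment's implementation; field names and timestamps are assumptions.
from datetime import datetime, timedelta
from typing import Dict, Iterable, List

WINDOW = timedelta(days=7)


def dedupe_within_window(events: Iterable[Dict]) -> List[Dict]:
    """Drop an event if its message_id was already kept within the window."""
    last_seen: Dict[str, datetime] = {}
    kept: List[Dict] = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        message_id = event["message_id"]
        previous = last_seen.get(message_id)
        if previous is not None and event["timestamp"] - previous <= WINDOW:
            continue  # duplicate inside the look-back window, drop it
        last_seen[message_id] = event["timestamp"]
        kept.append(event)
    return kept


if __name__ == "__main__":
    events = [
        {"message_id": "m1", "timestamp": datetime(2021, 1, 1)},
        {"message_id": "m1", "timestamp": datetime(2021, 1, 3)},   # dropped
        {"message_id": "m1", "timestamp": datetime(2021, 1, 20)},  # kept, outside window
    ]
    print(len(dedupe_within_window(events)))  # 2
```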
@@ -114,44 +120,36 @@ If Data Lakes sees a bad data type, for example text in place of a number or an
 
 
 ## FAQ
-{% faq %}
-{% faqitem How often is data synced to Data Lakes? %}
-Data Lakes offers 12 syncs in a 24 hour period. Data Lakes does not offer a custom sync schedule, or allow you use Selective Sync to manage what data is sent.
-{% endfaqitem %}
-{% faqitem What should I expect in terms of duplicates in Data Lakes? %}
-Segment's overall guarantee for duplicate data also applies to data in Data Lakes: 99% guarantee of no duplicates for data within a [24 hour look-back window](https://segment.com/docs/guides/duplicate-data/).
-
-If you have advanced requirements for de-duplication, you can add de-duplication steps downstream to reduce duplicates outside this look back window.
-{% endfaqitem %}
-{% faqitem Can I send all of my Segment data into Data Lakes? %}
+
+#### Can I send all of my Segment data into Data Lakes?
 Data Lakes supports data from all event sources, including website libraries, mobile, server and event cloud sources.
 
-Data Lakes does not support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
-{% endfaqitem %}
-{% faqitem Are user deletions and suppression supported? %}
-User deletions are not supported in Data Lakes, however [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users) is supported.
-{% endfaqitem %}
-{% faqitem How does Data Lakes handle schema evolution? %}
+Data Lakes doesn't support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+
+#### Are user deletions and suppression supported?
+Segment doesn't support User deletions in Data Lakes, but supports [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+
+#### How does Data Lakes handle schema evolution?
 As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns. New columns will be appended to the end of the table in the Glue Data Catalog.
-{% endfaqitem %}
-{% faqitem How does Data Lakes work with Protocols? %}
-Data Lakes does not have a direct integration with [Protocols](https://segment.com/docs/protocols/).
+
+#### How does Data Lakes work with Protocols?
+Data Lakes doesn't have a direct integration with [Protocols](https://segment.com/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
 
-- **Mutated events** - If Protocols mutates an event due to a rule set in the Tracking Plan, then that mutation appears in Segment's internal archives and is reflected in your data lake. For example, if you used Protocols to mutate the event `product_id` to be `productID`, then the event appears in both Data Lakes and Warehouses as `productID`.
+- **Mutated events** - If Protocols mutates an event due to a rule set in the Tracking Plan, then that mutation appears in Segment's internal archives and reflects in your data lake. For example, if you use Protocols to mutate the event `product_id` to be `productID`, then the event appears in both Data Lakes and Warehouses as `productID`.
 
-- **Blocked events** - If a Protocols Tracking Plan blocks an event, the event is not forwarded to any downstream Segment destinations, including Data Lakes. However events which are only marked with a violation _are_ passed to Data Lakes.
+- **Blocked events** - If a Protocols Tracking Plan blocks an event, the event isn't forwarded to any downstream Segment destinations, including Data Lakes. However events which are only marked with a violation _are_ passed to Data Lakes.
 
-Data types and labels available in Protocols are not supported by Data Lakes.
+Data types and labels available in Protocols aren't supported by Data Lakes.
 
-- **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems, instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
-- **Labels** - Labels set in Protocols are not sent to Data Lakes.
-{% endfaqitem %}
-{% faqitem What is the cost to use AWS Glue? %}
+- **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
+- **Labels** - Labels set in Protocols aren't sent to Data Lakes.
+
+#### What is the cost to use AWS Glue?
 You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/). For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
-{% endfaqitem %}
-{% faqitem What limits does AWS Glue have? %}
+
+#### What limits does AWS Glue have?
 AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue) for more information.
 
 The most common limits to keep in mind are:
@@ -162,5 +160,3 @@ The most common limits to keep in mind are:
 Segment stops creating new tables for the events after you exceed this limit. However you can contact your AWS account representative to increase these limits.
 
 You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) when using AWS Glue Data Catalog.
-{% endfaqitem %}
-{% endfaq %}
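
The FAQ answers above note that Data Lakes creates one Glue table per event type and one partition per hour, and that AWS Glue enforces per-account limits on databases, tables, and similar resources. A sketch, assuming boto3 with configured AWS credentials, for counting the tables and partitions in a catalog database; the database name is a placeholder, not a documented value:

```python
# Sketch: count the Glue tables and partitions in a database so you can compare
# the totals against your account's Glue limits. Assumes boto3 with configured
# AWS credentials; the database name is a placeholder.
import boto3


def count_tables_and_partitions(database: str) -> None:
    glue = boto3.client("glue")
    table_count = 0
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        for table in page["TableList"]:
            table_count += 1
            partition_pages = glue.get_paginator("get_partitions").paginate(
                DatabaseName=database, TableName=table["Name"]
            )
            partition_count = sum(len(p["Partitions"]) for p in partition_pages)
            print(f'{table["Name"]}: {partition_count} partitions')
    print(f"{table_count} tables in {database}")


if __name__ == "__main__":
    count_tables_and_partitions("segment_data_lake")  # placeholder database name
```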

src/guides/duplicate-data.md

Lines changed: 13 additions & 4 deletions
@@ -2,10 +2,19 @@
 title: Handling Duplicate Data
 ---
 
-Segment has a special de-duplication service that sits just behind the `api.segment.com` endpoint, and attempts to drop duplicate data. However, that de-duplication service has to hold the entire set of events in memory in order to know whether or not it has seen that event already. Segment stores 24 hours worth of event `message_id`s. This means Segment can de-duplicate any data that appears within a 24 hour rolling window.
+Segment guarantees that 99% of your data won't have duplicates within a 24 hour look-back window. Warehouses and Data Lakes also have their own secondary deduplication process to ensure you store clean data.
 
-An important point to remember is that Segment de-duplicates on the event's `message_id`, _not_ on the contents of an event payload. So if you aren't generating `message_id`s for each event, or are trying to de-duplicate data over a longer period than 24 hours, Segment does not have a built-in way to de-duplicate data.
+## 99% deduplication
 
-Since the API layer is de-duplicating during this window, duplicate events that are further than 24 hours apart from one another must be de-duplicated in the Warehouse. Segment also de-duplicates messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse. Note that in these cases you will see duplications in end tools as there is no additional layer prior to sending the event to downstream tools.
+Segment has a special deduplication service that sits behind the `api.segment.com` endpoint and attempts to drop 99% of duplicate data. Segment stores 24 hours worth of event `message_id`s, allowing Segment to deduplicate any data that appears within a 24 hour rolling window.
 
-Keep in mind that Segment's libraries all generate `message_id`s for you for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `message_id` when the message is ingested. You can override these default generated IDs and manually assign a `message_id` if necessary.
+Segment deduplicates on the event's `message_id`, _not_ on the contents of the event payload. Segment doesn't have a built-in way to deduplicate data over periods longer than 24 hours or for events that don't generate `message_id`s.
+
+> info ""
+> Keep in mind that Segment's libraries all generate `message_id`s for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `message_id` when the message is ingested. You can override these default generated IDs and manually assign a `message_id` if necessary.
+
+## Warehouse deduplication
+Duplicate events that are more than 24 hours apart from one another deduplicate in the Warehouse. Segment deduplicates messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse.
+
+## Data Lake deduplication
+To ensure clean data in your Data Lake, Segment removes duplicate events at the time your Data Lake ingests data. The Data Lake deduplication process dedupes the data the Data Lake syncs within the last 7 days with Segment deduping the data based on the `message_id`.
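
The rewritten page above explains that deduplication keys on `message_id` and that you can override the generated IDs with your own. A sketch of assigning a deterministic `message_id` when sending an event, assuming the public Segment HTTP tracking endpoint and write-key basic auth; the write key, IDs, and event names are placeholders:

```python
# Sketch: manually assigning a `message_id` so retries of the same logical event
# deduplicate inside Segment's 24 hour window. The endpoint, auth convention,
# write key, and IDs are assumptions/placeholders, not values from this page.
import requests

WRITE_KEY = "YOUR_WRITE_KEY"  # placeholder


def track_order_completed(user_id: str, order_id: str) -> None:
    # Derive a stable message_id from the order so a retry reuses the same id.
    payload = {
        "userId": user_id,
        "event": "Order Completed",
        "properties": {"order_id": order_id},
        "messageId": f"order-completed-{order_id}",
    }
    response = requests.post(
        "https://api.segment.io/v1/track",
        json=payload,
        auth=(WRITE_KEY, ""),  # write key as username, empty password
    )
    response.raise_for_status()


if __name__ == "__main__":
    track_order_completed(user_id="user-123", order_id="9876")
```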
