Commit 2f7497c

Merge pull request #31 from segmentio/repo-sync
repo sync
2 parents 9bcd7f2 + 81b0c83 commit 2f7497c

File tree: 3 files changed (+42 −42 lines changed)
- src/connections/storage/data-lakes/comparison.md
- src/connections/storage/data-lakes/index.md
- src/guides/duplicate-data.md


src/connections/storage/data-lakes/comparison.md

Lines changed: 3 additions & 8 deletions
@@ -13,18 +13,13 @@ Data Lakes and Warehouses are not identical, but are compatible with a configura
 
 Data Lakes and Warehouses offer different sync frequencies:
 - Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
-- Data Lakes offers 12 syncs in a 24 hour period, and does not offer custom sync schedules or selective sync.
+- Data Lakes offers 12 syncs in a 24 hour period, and doesn't offer custom sync schedules or selective sync.
 
 ## Duplicates
 
-Segment's overall guarantee for duplicate data also applies to data in Data Lakes: 99% guarantee of no duplicates for data within a [24 hour look-back window](https://segment.com/docs/guides/duplicate-data/). The guarantee remains the same for Warehouses.
-
-Both Data Lakes and Warehouses (and all Segment destinations) rely on the [de-duplication process](/docs/guides/duplicate-data/) at time of event ingest, to ensure:
-- The 24 hour look back window duplicate guarantee is met
-- Processing costs for customers are managed appropriately
-
-Warehouses also have a secondary de-duplication system built in to further reduce the volume of duplicates. If you have advanced requirements for duplicates in Data Lakes, you can add de-duplication steps downstream to reduce duplicates outside this look back window.
+Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to data in Data Lakes and Warehouses.
 
+[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
 ## Object vs Event Data
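
The paragraph removed above directs readers with advanced duplicate requirements to add de-duplication steps downstream of Segment, outside the 24 hour look-back window. A minimal sketch of such a step, assuming events arrive as dicts that carry Segment's `message_id` (all other field names here are hypothetical):

```python
# Minimal downstream de-duplication sketch, keyed on Segment's `message_id`.
# Field names other than `message_id` are hypothetical.
from typing import Dict, Iterable, List


def dedupe_by_message_id(records: Iterable[Dict]) -> List[Dict]:
    """Keep the first occurrence of each message_id and drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        message_id = record.get("message_id")
        if message_id in seen:
            continue  # duplicate that slipped past the 24 hour window, drop it
        seen.add(message_id)
        unique.append(record)
    return unique


if __name__ == "__main__":
    rows = [
        {"message_id": "a1", "event": "Order Completed"},
        {"message_id": "a1", "event": "Order Completed"},  # duplicate
        {"message_id": "b2", "event": "Product Viewed"},
    ]
    print(dedupe_by_message_id(rows))  # only the two unique records remain
```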

src/connections/storage/data-lakes/index.md

Lines changed: 26 additions & 30 deletions
@@ -23,6 +23,12 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR
 
 ![](images/dl_vpc.png)
 
+Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
+
+### Data Lake deduplication
+
+In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `message_id` field.
+
 ### Using a Data Lake with a Data Warehouse
 
 The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).
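
The new copy above states that Data Lakes deduplicates data synced within the last 7 days based on the `message_id` field. As an illustration only, not Segment's internal implementation, a look-back-window dedupe over `message_id` could be sketched like this; the timestamps and field names are assumptions:

```python
# Illustration of look-back-window deduplication keyed on `message_id`.
# Not Segment's implementation; field names and timestamps are assumptions.
from datetime import datetime, timedelta
from typing import Dict, Iterable, List

WINDOW = timedelta(days=7)


def dedupe_within_window(events: Iterable[Dict]) -> List[Dict]:
    """Drop an event if its message_id was already kept within the window."""
    last_seen: Dict[str, datetime] = {}
    kept: List[Dict] = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        message_id = event["message_id"]
        previous = last_seen.get(message_id)
        if previous is not None and event["timestamp"] - previous <= WINDOW:
            continue  # duplicate inside the look-back window, drop it
        last_seen[message_id] = event["timestamp"]
        kept.append(event)
    return kept


if __name__ == "__main__":
    events = [
        {"message_id": "m1", "timestamp": datetime(2021, 1, 1)},
        {"message_id": "m1", "timestamp": datetime(2021, 1, 3)},   # dropped
        {"message_id": "m1", "timestamp": datetime(2021, 1, 20)},  # kept, outside window
    ]
    print(len(dedupe_within_window(events)))  # 2
```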
@@ -114,44 +120,36 @@ If Data Lakes sees a bad data type, for example text in place of a number or an
 
 
 ## FAQ
-{% faq %}
-{% faqitem How often is data synced to Data Lakes? %}
-Data Lakes offers 12 syncs in a 24 hour period. Data Lakes does not offer a custom sync schedule, or allow you use Selective Sync to manage what data is sent.
-{% endfaqitem %}
-{% faqitem What should I expect in terms of duplicates in Data Lakes? %}
-Segment's overall guarantee for duplicate data also applies to data in Data Lakes: 99% guarantee of no duplicates for data within a [24 hour look-back window](https://segment.com/docs/guides/duplicate-data/).
-
-If you have advanced requirements for de-duplication, you can add de-duplication steps downstream to reduce duplicates outside this look back window.
-{% endfaqitem %}
-{% faqitem Can I send all of my Segment data into Data Lakes? %}
+
+#### Can I send all of my Segment data into Data Lakes?
 Data Lakes supports data from all event sources, including website libraries, mobile, server and event cloud sources.
 
-Data Lakes does not support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
-{% endfaqitem %}
-{% faqitem Are user deletions and suppression supported? %}
-User deletions are not supported in Data Lakes, however [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users) is supported.
-{% endfaqitem %}
-{% faqitem How does Data Lakes handle schema evolution? %}
+Data Lakes doesn't support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+
+#### Are user deletions and suppression supported?
+Segment doesn't support User deletions in Data Lakes, but supports [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+
+#### How does Data Lakes handle schema evolution?
 As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns. New columns will be appended to the end of the table in the Glue Data Catalog.
-{% endfaqitem %}
-{% faqitem How does Data Lakes work with Protocols? %}
-Data Lakes does not have a direct integration with [Protocols](https://segment.com/docs/protocols/).
+
+#### How does Data Lakes work with Protocols?
+Data Lakes doesn't have a direct integration with [Protocols](https://segment.com/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
 
-- **Mutated events** - If Protocols mutates an event due to a rule set in the Tracking Plan, then that mutation appears in Segment's internal archives and is reflected in your data lake. For example, if you used Protocols to mutate the event `product_id` to be `productID`, then the event appears in both Data Lakes and Warehouses as `productID`.
+- **Mutated events** - If Protocols mutates an event due to a rule set in the Tracking Plan, then that mutation appears in Segment's internal archives and reflects in your data lake. For example, if you use Protocols to mutate the event `product_id` to be `productID`, then the event appears in both Data Lakes and Warehouses as `productID`.
 
-- **Blocked events** - If a Protocols Tracking Plan blocks an event, the event is not forwarded to any downstream Segment destinations, including Data Lakes. However events which are only marked with a violation _are_ passed to Data Lakes.
+- **Blocked events** - If a Protocols Tracking Plan blocks an event, the event isn't forwarded to any downstream Segment destinations, including Data Lakes. However events which are only marked with a violation _are_ passed to Data Lakes.
 
-Data types and labels available in Protocols are not supported by Data Lakes.
+Data types and labels available in Protocols aren't supported by Data Lakes.
 
-- **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems, instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
-- **Labels** - Labels set in Protocols are not sent to Data Lakes.
-{% endfaqitem %}
-{% faqitem What is the cost to use AWS Glue? %}
+- **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
+- **Labels** - Labels set in Protocols aren't sent to Data Lakes.
+
+#### What is the cost to use AWS Glue?
 You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/). For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
-{% endfaqitem %}
-{% faqitem What limits does AWS Glue have? %}
+
+#### What limits does AWS Glue have?
 AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue) for more information.
 
 The most common limits to keep in mind are:
@@ -162,5 +160,3 @@ The most common limits to keep in mind are:
 Segment stops creating new tables for the events after you exceed this limit. However you can contact your AWS account representative to increase these limits.
 
 You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) when using AWS Glue Data Catalog.
-{% endfaqitem %}
-{% endfaq %}
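
The FAQ answers above note that Data Lakes creates one Glue table per event type and one partition per hour, and that AWS Glue enforces per-account limits on databases, tables, and similar resources. A sketch, assuming boto3 with configured AWS credentials, for counting the tables and partitions in a catalog database; the database name is a placeholder, not a documented value:

```python
# Sketch: count the Glue tables and partitions in a database so you can compare
# the totals against your account's Glue limits. Assumes boto3 with configured
# AWS credentials; the database name is a placeholder.
import boto3


def count_tables_and_partitions(database: str) -> None:
    glue = boto3.client("glue")
    table_count = 0
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        for table in page["TableList"]:
            table_count += 1
            partition_pages = glue.get_paginator("get_partitions").paginate(
                DatabaseName=database, TableName=table["Name"]
            )
            partition_count = sum(len(p["Partitions"]) for p in partition_pages)
            print(f'{table["Name"]}: {partition_count} partitions')
    print(f"{table_count} tables in {database}")


if __name__ == "__main__":
    count_tables_and_partitions("segment_data_lake")  # placeholder database name
```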

src/guides/duplicate-data.md

Lines changed: 13 additions & 4 deletions
@@ -2,10 +2,19 @@
 title: Handling Duplicate Data
 ---
 
-Segment has a special de-duplication service that sits just behind the `api.segment.com` endpoint, and attempts to drop duplicate data. However, that de-duplication service has to hold the entire set of events in memory in order to know whether or not it has seen that event already. Segment stores 24 hours worth of event `message_id`s. This means Segment can de-duplicate any data that appears within a 24 hour rolling window.
+Segment guarantees that 99% of your data won't have duplicates within a 24 hour look-back window. Warehouses and Data Lakes also have their own secondary deduplication process to ensure you store clean data.
 
-An important point to remember is that Segment de-duplicates on the event's `message_id`, _not_ on the contents of an event payload. So if you aren't generating `message_id`s for each event, or are trying to de-duplicate data over a longer period than 24 hours, Segment does not have a built-in way to de-duplicate data.
+## 99% deduplication
 
-Since the API layer is de-duplicating during this window, duplicate events that are further than 24 hours apart from one another must be de-duplicated in the Warehouse. Segment also de-duplicates messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse. Note that in these cases you will see duplications in end tools as there is no additional layer prior to sending the event to downstream tools.
+Segment has a special deduplication service that sits behind the `api.segment.com` endpoint and attempts to drop 99% of duplicate data. Segment stores 24 hours worth of event `message_id`s, allowing Segment to deduplicate any data that appears within a 24 hour rolling window.
 
-Keep in mind that Segment's libraries all generate `message_id`s for you for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `message_id` when the message is ingested. You can override these default generated IDs and manually assign a `message_id` if necessary.
+Segment deduplicates on the event's `message_id`, _not_ on the contents of the event payload. Segment doesn't have a built-in way to deduplicate data over periods longer than 24 hours or for events that don't generate `message_id`s.
+
+> info ""
+> Keep in mind that Segment's libraries all generate `message_id`s for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `message_id` when the message is ingested. You can override these default generated IDs and manually assign a `message_id` if necessary.
+
+## Warehouse deduplication
+Duplicate events that are more than 24 hours apart from one another deduplicate in the Warehouse. Segment deduplicates messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse.
+
+## Data Lake deduplication
+To ensure clean data in your Data Lake, Segment removes duplicate events at the time your Data Lake ingests data. The Data Lake deduplication process dedupes the data the Data Lake syncs within the last 7 days with Segment deduping the data based on the `message_id`.
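
The rewritten page above explains that deduplication keys on `message_id` and that you can override the generated IDs with your own. A sketch of assigning a deterministic `message_id` when sending an event, assuming the public Segment HTTP tracking endpoint and write-key basic auth; the write key, IDs, and event names are placeholders:

```python
# Sketch: manually assigning a `message_id` so retries of the same logical event
# deduplicate inside Segment's 24 hour window. The endpoint, auth convention,
# write key, and IDs are assumptions/placeholders, not values from this page.
import requests

WRITE_KEY = "YOUR_WRITE_KEY"  # placeholder


def track_order_completed(user_id: str, order_id: str) -> None:
    # Derive a stable message_id from the order so a retry reuses the same id.
    payload = {
        "userId": user_id,
        "event": "Order Completed",
        "properties": {"order_id": order_id},
        "messageId": f"order-completed-{order_id}",
    }
    response = requests.post(
        "https://api.segment.io/v1/track",
        json=payload,
        auth=(WRITE_KEY, ""),  # write key as username, empty password
    )
    response.raise_for_status()


if __name__ == "__main__":
    track_order_completed(user_id="user-123", order_id="9876")
```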
