
Commit f019fee

Merge pull request #105 from segmentio/repo-sync
repo sync
2 parents 9a9f8e0 + da76045 commit f019fee

3 files changed: 82 additions (+82), 34 deletions (-34)

src/_includes/content/snippet-helper.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 {% codeexampletab Minified %}
 ```js
 <script>
-!function(){var analytics=window.analytics=window.analytics||[];if(!analytics.initialize)if(analytics.invoked)window.console&&console.error&&console.error("Segment snippet included twice.");else{analytics.invoked=!0;analytics.methods=["trackSubmit","trackClick","trackLink","trackForm","pageview","identify","reset","group","track","ready","alias","debug","page","once","off","on","addSourceMiddleware","addIntegrationMiddleware","setAnonymousId","addDestinationMiddleware"];analytics.factory=function(e){return function(){var t=Array.prototype.slice.call(arguments);t.unshift(e);analytics.push(t);return analytics}};for(var e=0;e<analytics.methods.length;e++){var key=analytics.methods[e];analytics[key]=analytics.factory(key)}analytics.load=function(key,e){var t=document.createElement("script");t.type="text/javascript";t.async=!0;t.src="https://cdn.segment.com/analytics.js/v1/" + key + "/analytics.min.js";var n=document.getElementsByTagName("script")[0];n.parentNode.insertBefore(t,n);analytics._loadOptions=e};analytics._writeKey="YOUR_WRITE_KEY";analytics.SNIPPET_VERSION="4.13.2";
+!function(){var analytics=window.analytics=window.analytics||[];if(!analytics.initialize)if(analytics.invoked)window.console&&console.error&&console.error("Segment snippet included twice.");else{analytics.invoked=!0;analytics.methods=["trackSubmit","trackClick","trackLink","trackForm","pageview","identify","reset","group","track","ready","alias","debug","page","once","off","on","addSourceMiddleware","addIntegrationMiddleware","setAnonymousId","addDestinationMiddleware"];analytics.factory=function(e){return function(){var t=Array.prototype.slice.call(arguments);t.unshift(e);analytics.push(t);return analytics}};for(var e=0;e<analytics.methods.length;e++){var key=analytics.methods[e];analytics[key]=analytics.factory(key)}analytics.load=function(key,e){var t=document.createElement("script");t.type="text/javascript";t.async=!0;t.src="https://cdn.segment.com/analytics.js/v1/" + key + "/analytics.min.js";var n=document.getElementsByTagName("script")[0];n.parentNode.insertBefore(t,n);analytics._loadOptions=e};analytics._writeKey="YOUR_WRITE_KEY";analytics.SNIPPET_VERSION="4.15.2";
 analytics.load("YOUR_WRITE_KEY");
 analytics.page();
 }}();
@@ -85,7 +85,7 @@
 };
 analytics._writeKey = 'YOUR_WRITE_KEY'
 // Add a version to keep track of what's in the wild.
-analytics.SNIPPET_VERSION = '4.13.2';
+analytics.SNIPPET_VERSION = '4.15.2';
 // Load Analytics.js with your key, which will automatically
 // load the tools you've enabled for your account. Boosh!
 analytics.load("YOUR_WRITE_KEY");
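
For a quick check that the bumped version is what actually runs on a page, you can read the `SNIPPET_VERSION` marker that the snippet sets on `window.analytics`. A minimal sketch, run from the browser console on a page that embeds the snippet:

```js
// The snippet assigns analytics.SNIPPET_VERSION itself (see the diff above),
// so once this change ships, pages should report "4.15.2".
if (window.analytics && window.analytics.SNIPPET_VERSION) {
  console.log("Segment snippet version:", window.analytics.SNIPPET_VERSION);
} else {
  console.log("Segment snippet not found on this page.");
}
```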

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 78 additions & 31 deletions
@@ -5,60 +5,83 @@ title: Configure the Data Lakes AWS Environment
 {% include content/plan-grid.md name="data-lakes" %}
 
 
-The instructions below will guide you through the process required to configure the environment required to begin loading data into your Segment Data Lake. For a more automated process, see [Step 1 - Configure AWS Resources](#step-1---configure-aws-resources) above.
+The instructions below guide you through configuring the environment required to begin loading data into your Segment Data Lake. For a more automated process, see [Set Up Segment Data Lakes](/docs/connections/storage/catalog/data-lakes/index).
 
+As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
 
-## Step 1 - Create an S3 Bucket
+## Step 1 - Create a VPC and an S3 bucket
 
-In this step, you'll create the S3 bucket that will store both the intermediate and final data.
+In this step, you'll create a Virtual Private Cloud (VPC) to securely launch your AWS resources into, and an S3 bucket that will store both the intermediate and final data.
+
+To create a VPC, follow the instructions outlined in Amazon's documentation, [Create and configure your VPC](https://docs.aws.amazon.com/directoryservice/latest/admin-guide/gsg_create_vpc.html){:target="_blank"}.
+
+To create an S3 bucket, see Amazon's [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html){:target="_blank"} instructions.
 
 > info ""
-> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, the name is `segment-data-lake`.
+> Take note of the S3 bucket name you set in this step, as the rest of the setup flow requires it.
 
-During the set up process, create a Lifecycle rule and set it to expire staging data after **14 days**. For more information, see Amazon's documentation, [How do I create a lifecycle?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html).
+After you create an S3 bucket, configure a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For instructions on configuring lifecycle rules, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html){:target="_blank"}.
 
-![Create a Lifecycle rule to expire staging data after 14 days](images/01_14-day-lifecycle.png)
+Apply the following lifecycle settings to your staging data:
+* **Expire after:** 14 days
+* **Permanently delete after:** 14 days
+* **Clean up incomplete multipart uploads:** after 14 days
 
 ## Step 2 - Configure an EMR cluster
 
-Segment requires access to an EMR cluster to perform necessary data processing. We recommend starting with a small cluster, with the option to add more compute as required.
+Segment requires access to an EMR cluster to perform necessary data processing. For best results, start with a small cluster and add more compute resources as required.
 
 ### Configure the hardware and networking configuration
 
-1. Locate and select EMR from the AWS console.
-2. Click **Create Cluster**, and open the **Advanced Options**.
-3. In the Advanced Options, on Step 1: Software and Steps, ensure you select the following options, along with the defaults:
-- `Use for Hive table metadata`
-- `Use for Spark table metadata` ![Select to use for both Have and Spark table metadata](images/02_hive-spark-table.png)
-4. In the Networking setup section, select to create the cluster in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration. Creating a cluster in a public subnet is accessible from the internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security before you configure your EMR cluster.
-5. In the Hardware Configuration section, create a cluster with the nodes listed below. This configuration uses the default **On demand** purchasing option for the instances.
+1. In the AWS console, navigate to **Services > Analytics > EMR**.
+2. Click **Create Cluster**. On the Create Cluster - Quick Options page, click **Go to advanced options**.
+3. In Advanced Options, on Step 1: Software and Steps, select both the `emr-5.33.0` release and the following applications:
+   - Hadoop 2.10.1
+   - Hive 2.3.7
+   - Hue 4.9.0
+   - Spark 2.4.7
+   - Pig 0.17.0
+4. In the AWS Glue Data Catalog settings, select the following options:
+   - Use for Hive table metadata
+   - Use for Spark table metadata
+5. Select **Next** to proceed to Step 2: Hardware.
+6. In the Networking section, select a Network (the VPC you created in [Step 1](#step-1---create-a-vpc-and-an-s3-bucket)) and EC2 Subnet for your EMR instance.
+
+   Creating the cluster in a private subnet is more secure, but requires additional configuration. Creating the cluster in a public subnet leaves it accessible from the Internet, but requires less up-front configuration.
+
+   If you create clusters in public subnets, you can configure strict security groups to prevent unauthorized inbound EMR cluster access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html){:target="_blank"} for more information.
+
+7. In the Cluster Nodes and Instances section, create a cluster that includes the following on-demand nodes:
 - **1** master node
 - **2** core nodes
-- **2** task nodes ![Configure the number of nodes](images/03_hardware-node-instances.png)
+- **2** task nodes
+
+   Each node should meet or exceed the following specifications:
+   * Instance type: m5.xlarge
+   * Number of vCores: 4
+   * Memory: 16 GiB
+   * EBS Storage: 64 GiB, EBS only storage
 
-For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
+For more information about configuring cluster hardware and networking, see Amazon's documentation, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html){:target="_blank"}.
 
+8. Click **Next** to proceed to Step 3: General Cluster Settings.
 
 ### Configure logging
 
-On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs are to a new prefix, and separated from the final processed data.
-
-Set value of the **vendor** tag to `segment`. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.
+9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data. Once configured, logs are assigned a new prefix and separated from the final processed data.
 
+10. Add a new key-value pair to the Tags section: a `vendor` key with a value of `segment`. The IAM policy uses this tag to provide Segment access to submit jobs in the EMR cluster.
 
-![Configure logging](images/05_logging.png)
+11. Click **Next** to proceed to Step 4: Security.
 
 ### Secure the cluster
+12. On Step 4: Security, in the Security Options section, create or select an **EC2 key pair**.
+13. Choose the appropriate roles in the **EC2 instance profile**.
+14. Expand the EC2 security groups section and select the appropriate security groups for the Master and Core & Task types.
+15. Select **Create cluster**.
 
-On the Security step, be sure to complete the following steps:
-1. Create or select an **EC2 key pair**.
-2. Choose the appropriate roles in the **EC2 instance profile**.
-3. Select the appropriate security groups for the Master and Core & Task types.
-
-![Secure the cluster](images/06_secure-cluster.png)
-
-The image uses the default settings. You can make these settings more restrictive, if required.
-
+> note ""
+> If you update the EMR cluster of an existing Data Lakes instance, take note of the EMR cluster ID on the confirmation page.
 
 ## Step 3 - Create an Access Management role and policy
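
The lifecycle settings added in Step 1 of the hunk above (expire, permanently delete, and clean up incomplete multipart uploads, each after 14 days) can also be applied with the AWS SDK instead of the console. A minimal sketch using the AWS SDK for JavaScript v3; the bucket name, staging prefix, and region are illustrative placeholders, not values from this page:

```js
import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-west-2" }); // region is an assumption

// Mirrors the console settings from Step 1: expire current objects, permanently
// delete noncurrent versions, and abort incomplete multipart uploads after 14 days.
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "my-segment-data-lake", // hypothetical bucket name
  LifecycleConfiguration: {
    Rules: [
      {
        ID: "expire-staging-data",
        Filter: { Prefix: "segment-stage/" }, // hypothetical staging prefix
        Status: "Enabled",
        Expiration: { Days: 14 },
        NoncurrentVersionExpiration: { NoncurrentDays: 14 },
        AbortIncompleteMultipartUpload: { DaysAfterInitiation: 14 },
      },
    ],
  },
}));
```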

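Likewise, the EMR cluster described in Step 2 of the hunk above (release `emr-5.33.0`, the five listed applications, Glue used for Hive and Spark table metadata, and 1 master / 2 core / 2 task m5.xlarge on-demand nodes tagged `vendor: segment`) could be provisioned programmatically. A sketch under the same assumptions; the subnet, key pair, log location, and IAM role names are placeholders:

```js
import { EMRClient, RunJobFlowCommand } from "@aws-sdk/client-emr";

const emr = new EMRClient({ region: "us-west-2" }); // region is an assumption

await emr.send(new RunJobFlowCommand({
  Name: "segment-data-lake-emr", // hypothetical cluster name
  ReleaseLabel: "emr-5.33.0",
  Applications: [
    { Name: "Hadoop" }, { Name: "Hive" }, { Name: "Hue" },
    { Name: "Spark" }, { Name: "Pig" },
  ],
  // Use the Glue Data Catalog for Hive and Spark table metadata, as in the doc.
  Configurations: [
    {
      Classification: "hive-site",
      Properties: {
        "hive.metastore.client.factory.class":
          "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      },
    },
    {
      Classification: "spark-hive-site",
      Properties: {
        "hive.metastore.client.factory.class":
          "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      },
    },
  ],
  Instances: {
    Ec2SubnetId: "subnet-0123456789abcdef0", // the subnet from your VPC (placeholder)
    Ec2KeyName: "my-emr-key-pair",           // placeholder key pair
    InstanceGroups: [
      { InstanceRole: "MASTER", InstanceType: "m5.xlarge", InstanceCount: 1 },
      { InstanceRole: "CORE",   InstanceType: "m5.xlarge", InstanceCount: 2 },
      { InstanceRole: "TASK",   InstanceType: "m5.xlarge", InstanceCount: 2 },
    ],
  },
  LogUri: "s3://my-segment-data-lake/emr-logs/", // same bucket as the final data
  Tags: [{ Key: "vendor", Value: "segment" }],   // tag required by the IAM policy
  ServiceRole: "EMR_DefaultRole",                // default EMR roles (assumption)
  JobFlowRole: "EMR_EC2_DefaultRole",
}));
```
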
@@ -100,7 +123,7 @@ Attach the following trust relationship document to the role to create a `segmen
 ```
 
 > note ""
-> **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
+> Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
 #### IAM Role for Data Lakes created in EU workspaces:

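For context on the `ExternalID` note in the hunk above: in an IAM trust relationship, the workspace ID is enforced through an `sts:ExternalId` condition. The sketch below shows only the general shape of such a document; the principal ARN and workspace ID are placeholders, and the full policy on the docs page is authoritative:

```js
// General shape of a cross-account trust relationship pinned to an external ID.
// Both values below are placeholders, not Segment's actual account or your ID.
const trustRelationship = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Principal: { AWS: "arn:aws:iam::XXXXXXXXXXXX:root" }, // Segment's AWS account (placeholder)
      Action: "sts:AssumeRole",
      Condition: {
        StringEquals: { "sts:ExternalId": ["YOUR_WORKSPACE_ID"] },
      },
    },
  ],
};
```
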
@@ -225,7 +248,7 @@ Add a policy to the role created above to give Segment access to the relevant Gl
 ```
 
 > note ""
-> **NOTE:** The policy above grants full access to Athena, but the individual Glue and S3 policies decide which table is queryable. Segment queries for debugging purposes, and will notify you be for running any queries.
+> The policy above grants full access to Athena, but the individual Glue and S3 policies determine which table is queried. Segment queries for debugging purposes, and notifies you before running any queries.
 
 ## Debugging

@@ -235,3 +258,27 @@ Segment requires access to the data and schema for debugging data quality issues
 - Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
 ![Debugging](images/dl_setup_glueerror.png)
 - An easier alternative is to create a new account that has Athena backed by Glue as the default.
+
+## Updating EMR Clusters
+You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.
+
+When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
+
+> info ""
+> Your Segment Data Lake does not need to be disabled during the update process, and any ongoing syncs will complete on the old cluster. Any syncs that fail while you are updating the cluster ID field will be restarted on the new cluster.
+
+## Prerequisites
+* An EMR v5.33.0 cluster
+* An existing Segment Data Lakes destination
+
+## Procedure
+1. Open your Segment app workspace and select the Data Lakes destination.
+2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html). You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
+3. Click **Save**.
+4. In the AWS EMR console, view the Events tab for your cluster to verify it is receiving data.
+
+You can delete the old EMR cluster from AWS after the following conditions have been met:
+* You have updated all Data Lakes to use the new EMR cluster
+* A sync has successfully completed in the new cluster
+* Data is synced into the new cluster
+* There are no ongoing jobs in the old cluster
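
For step 2 of the procedure above, the EMR cluster ID (a `j-XXXXXXXXXXXXX` identifier) can also be looked up with the AWS SDK rather than the console. A minimal sketch, assuming the new cluster is already running and the AWS SDK for JavaScript v3 is available:

```js
import { EMRClient, ListClustersCommand } from "@aws-sdk/client-emr";

const emr = new EMRClient({ region: "us-west-2" }); // region is an assumption

// List active clusters; the Id field (for example "j-1K48XXXXXXHCB") is the
// value to paste into the Segment EMR Cluster ID setting.
const { Clusters } = await emr.send(new ListClustersCommand({
  ClusterStates: ["RUNNING", "WAITING"],
}));
for (const cluster of Clusters ?? []) {
  console.log(cluster.Id, cluster.Name, cluster.Status?.State);
}
```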

vale-styles/Vocab/Docs/accept.txt

Lines changed: 2 additions & 1 deletion
@@ -55,4 +55,5 @@ Smartly
 Hubspot
 Friendbuy
 Chargebee
-(?:L|l)ookback
+(?:L|l)ookback
+Subnet
