Commit 7032ade

Merge pull request #5431 from segmentio/databricks-betaupdates
Databricks Delta Lake beta docs [DOC-734]
2 parents d08e477 + 2d76bae

3 files changed: +317 −0 lines changed

@@ -0,0 +1,173 @@
---
title: Databricks Delta Lake Destination (AWS Setup)
beta: true
hidden: true
---

{% comment %}

With the Databricks Delta Lake Destination, you can ingest event data from Segment into the bronze layer of your Databricks Delta Lake.

This page will help you use the Databricks Delta Lake Destination to sync Segment events into your Databricks Delta Lake built on S3.

> info "Databricks Delta Lake Destination in Public Beta"
> The Databricks Delta Lake Destination is in public beta, and Segment is actively working on this integration. [Contact Segment](https://segment.com/help/contact/){:target="_blank"} with any feedback or questions.

## Overview

Before getting started, use the overview below to familiarize yourself with Segment's Databricks Delta Lake Destination.

1. Segment writes directly to your Delta Lake in cloud storage (S3).
   - Segment manages the creation and evolution of Delta tables.
   - Segment uses IAM role assumption to write Delta tables to AWS S3.
2. Segment supports both OAuth and personal access tokens (PATs) for API authentication.
3. Segment creates and updates the table's metadata in Unity Catalog by running queries on a small, single-node Databricks SQL warehouse in your environment.
4. If a table already exists and no new columns are introduced, Segment appends data to the table (no SQL required).
5. For new data types or columns, Segment reads the table's current schema from Unity Catalog and uses the SQL warehouse to update the schema accordingly.

## Prerequisites

Note the following prerequisites for setup:

1. The target Databricks workspace must be Unity Catalog enabled. Segment doesn't support the Hive metastore. See the Databricks guide to [enabling Unity Catalog](https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html){:target="_blank"} for more information.
2. You'll need the following permissions for setup:
   - **AWS**: The ability to create an S3 bucket and IAM role.
   - **Databricks**: Admin access at the account and workspace level.

## Authentication

Segment supports both OAuth and personal access tokens (PATs) for authentication. Segment recommends OAuth because it's easier to set up and manage. Throughout this guide, some instructions are marked as *OAuth only* or *PAT only*. You can skip any instructions that don't apply to your authentication method.

## Key terms

As you set up Databricks, keep the following key terms in mind.

- **Databricks Workspace URL**: The base URL for your Databricks workspace.
- **Service Principal Application ID**: The ID tied to the service principal you'll create for Segment.
- **Service Principal Secret/Token**: The client secret or PAT you'll create for the service principal.
- **Target Unity Catalog**: The catalog where Segment lands your data.
- **Workspace Admin Token** (*PAT only*): The access token you'll generate for your Databricks workspace admin.

## Setup for Databricks Delta Lake (S3)

### Step 1: Find your Databricks Workspace URL

Segment uses the Databricks workspace URL to access your workspace API.

Check your browser's address bar when inside the workspace. The workspace URL looks something like `https://<workspace-deployment-name>.cloud.databricks.com`. Remove any characters after this portion and note the URL for later use.

### Step 2: Create a service principal

Segment uses the service principal to access your Databricks workspace and associated APIs.

1. Follow the Databricks guide for [adding a service principal to your account](https://docs.databricks.com/en/administration-guide/users-groups/service-principals.html#manage-service-principals-in-your-account){:target="_blank"}. This name can be anything, but Segment recommends something that identifies the purpose (for example, "Segment Storage Destinations"). Note the Application ID that Databricks generates for later use. Segment doesn't require the Account admin or Marketplace admin roles.
2. (*OAuth only*) Follow the Databricks instructions to [generate an OAuth secret](https://docs.databricks.com/en/dev-tools/authentication-oauth.html#step-2-create-an-oauth-secret-for-a-service-principal){:target="_blank"}. Note the secret generated by Databricks for later use. Once you navigate away from this page, the secret is no longer visible. If you lose or forget the secret, delete the existing secret and create a new one.

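If you prefer to script this step, service principals can also be created through the Databricks Account SCIM API. A minimal sketch, assuming you have your Databricks account ID and an account admin credential (both placeholders here); the display name is an example you can change:

```
# Hypothetical values: replace <ACCOUNT_ID> and <ACCOUNT_ADMIN_TOKEN> with your own.
# Creates a service principal at the account level; the response includes an
# "applicationId" field -- note it for later use.
curl --request POST \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<ACCOUNT_ID>/scim/v2/ServicePrincipals' \
  --header 'Authorization: Bearer <ACCOUNT_ADMIN_TOKEN>' \
  --header 'Content-Type: application/scim+json' \
  --data '{
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
    "displayName": "Segment Storage Destinations",
    "active": true
  }'
```
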
### Step 3: Enable entitlements for the service principal on the workspace

This step allows the Segment service principal to create and use a small SQL warehouse, which is used for creating and updating table schemas in Unity Catalog.

To enable entitlements for the service principal you just created, follow the Databricks [guide for managing workspace entitlements for a service principal](https://docs.databricks.com/en/administration-guide/users-groups/service-principals.html#manage-workspace-entitlements-for-a-service-principal){:target="_blank"}. Segment requires the `Allow cluster creation` and `Databricks SQL access` entitlements.

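The same entitlements can be granted over the workspace SCIM API instead of the UI. A sketch, assuming a workspace admin token and the service principal's workspace SCIM ID (not its application ID); both values below are placeholders:

```
# Hypothetical values: replace the workspace URL, <SP_SCIM_ID>, and <WORKSPACE_ADMIN_TOKEN>.
# <SP_SCIM_ID> is the numeric SCIM ID from GET /api/2.0/preview/scim/v2/ServicePrincipals.
curl --request PATCH \
  '<DATABRICKS_WORKSPACE_URL>/api/2.0/preview/scim/v2/ServicePrincipals/<SP_SCIM_ID>' \
  --header 'Authorization: Bearer <WORKSPACE_ADMIN_TOKEN>' \
  --header 'Content-Type: application/scim+json' \
  --data '{
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [
      {
        "op": "add",
        "path": "entitlements",
        "value": [
          {"value": "allow-cluster-create"},
          {"value": "databricks-sql-access"}
        ]
      }
    ]
  }'
```
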
### Step 4: Create an external location and storage credentials

This step creates the storage location where Segment lands your Delta Lake and the associated credentials Segment uses to access the storage.

1. Follow the Databricks guide for [managing external locations and storage credentials](https://docs.databricks.com/en/data-governance/unity-catalog/manage-external-locations-and-credentials.html){:target="_blank"}. This guide assumes the target S3 bucket already exists. If not, follow the AWS guide for [creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html){:target="_blank"}.
2. Once the external location and storage credentials are created in your Databricks workspace, update the permissions to grant access to the Segment service principal.
   1. In your workspace, navigate to **Data > External Data > Storage Credentials**.
   2. Click the name of the credentials created above and go to the Permissions tab.
   3. Click **Grant**, then select the Segment service principal from the drop-down.
   4. Select the **CREATE EXTERNAL TABLE**, **READ FILES**, and **WRITE FILES** checkboxes.
   5. Click **Grant**.
   6. Click **External Locations**.
   7. Click the name of the location created above and go to the Permissions tab.
   8. Click **Grant**, then select the Segment service principal from the drop-down.
   9. Select the **CREATE EXTERNAL TABLE**, **READ FILES**, and **WRITE FILES** checkboxes.
   10. Click **Grant**.
3. In AWS, update the trust policy for the IAM role created when setting up the storage credentials.
   1. Add `arn:aws:iam::595280932656:role/segment-storage-destinations-production-access` to the Principal list.
   2. Convert the `sts:ExternalId` field to a list and add the Segment workspace ID. You'll find the Segment workspace ID in the Segment app (**Settings > Workspace settings > ID**).

The trust policy should look like:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
          "arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>",
          "arn:aws:iam::595280932656:role/segment-storage-destinations-production-access"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": [
            "<DATABRICKS-ACCOUNT-ID>",
            "<SEGMENT-WORKSPACE-ID>"
          ]
        }
      }
    }
  ]
}
```

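If you manage the role from the command line, one way to apply the updated policy is with the AWS CLI. A sketch, assuming the JSON above is saved as `trust-policy.json` (a hypothetical filename) and the role name matches the one backing your storage credentials:

```
# Replaces the role's existing trust policy with the contents of trust-policy.json.
aws iam update-assume-role-policy \
  --role-name <THIS-ROLE-NAME> \
  --policy-document file://trust-policy.json
```
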
### Step 5: Create a workspace admin access token (PAT only)

Your Databricks workspace admin uses the workspace admin access token to generate a personal access token for the service principal.

To create your token, follow the Databricks guide for [generating personal access tokens](https://docs.databricks.com/en/dev-tools/auth.html#databricks-personal-access-tokens-for-workspace-users){:target="_blank"} for workspace users. Note the generated token for later use.

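Admin tokens can also be minted through the workspace Token API rather than the UI. A minimal sketch, assuming you already have some admin credential to authenticate the call; the lifetime and comment are example values:

```
# Creates a PAT for the calling (admin) user; the response includes "token_value".
curl --request POST \
  '<DATABRICKS_WORKSPACE_URL>/api/2.0/token/create' \
  --header 'Authorization: Bearer <EXISTING_ADMIN_CREDENTIAL>' \
  --header 'Content-Type: application/json' \
  --data '{"lifetime_seconds": 7776000, "comment": "Segment setup admin token"}'
```
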
### Step 6: Enable personal access tokens for the workspace (PAT only)

This step allows the creation and use of personal access tokens for the workspace admin and the service principal.

1. Follow the Databricks guide for [enabling personal access token authentication](https://docs.databricks.com/en/administration-guide/access-control/tokens.html#enable-or-disable-personal-access-token-authentication-for-the-workspace){:target="_blank"} for the workspace.
2. Follow the Databricks docs to [grant Can Use permission](https://docs.databricks.com/en/security/auth-authz/api-access-permissions.html#manage-token-permissions-using-the-admin-settings-page){:target="_blank"} to the Segment service principal created earlier.

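Both settings are also exposed over REST, if that suits your setup better. A sketch, assuming a workspace admin token; the service principal application ID is the one noted in Step 2:

```
# Enable PAT authentication for the workspace.
curl --request PATCH \
  '<DATABRICKS_WORKSPACE_URL>/api/2.0/workspace-conf' \
  --header 'Authorization: Bearer <WORKSPACE_ADMIN_TOKEN>' \
  --data '{"enableTokensConfig": "true"}'

# Grant the service principal CAN_USE on token authorization.
curl --request PATCH \
  '<DATABRICKS_WORKSPACE_URL>/api/2.0/permissions/authorization/tokens' \
  --header 'Authorization: Bearer <WORKSPACE_ADMIN_TOKEN>' \
  --data '{
    "access_control_list": [
      {
        "service_principal_name": "<SERVICE_PRINCIPAL_APPLICATION_ID>",
        "permission_level": "CAN_USE"
      }
    ]
  }'
```
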
### Step 7: Generate a personal access token for the service principal (PAT only)

Segment uses the personal access token to access the Databricks workspace API. The Databricks UI doesn't allow for the creation of service principal tokens, so tokens must be generated with either the Databricks workspace API (*recommended*) or the Databricks CLI.

Generating a token requires the following values:

- **Databricks Workspace URL**: The base URL to your Databricks workspace.
- **Workspace Admin Token**: The token generated for your Databricks admin user.
- **Service Principal Application ID**: The ID generated for the Segment service principal.
- **Lifetime Seconds**: The number of seconds before the token expires. Segment doesn't prescribe a specific token lifetime. Using the instructions below, you'll need to generate and update a new token in the Segment app before the existing token expires. Segment's general guidance is 90 days (7776000 seconds).
- **Comment**: A comment that describes the purpose of the token (for example, "Grants Segment access to this workspace until 12/21/2023").

1. (*Recommended option*) To create the token with the API, execute the following command in a terminal or command line tool. Be sure to update the placeholders with the relevant details from above. For more information about the API, check out the [Databricks API docs](https://docs.databricks.com/api/workspace/tokenmanagement/createobotoken){:target="_blank"}.

   ```
   curl --location '<DATABRICKS_WORKSPACE_URL>/api/2.0/token-management/on-behalf-of/tokens' \
     --header 'Content-Type: application/json' \
     --header 'Authorization: Bearer <WORKSPACE_ADMIN_TOKEN>' \
     --data '{"application_id": "<SERVICE_PRINCIPAL_APPLICATION_ID>", "lifetime_seconds": <LIFETIME_SECONDS>, "comment": "<COMMENT>"}'
   ```

   The response from the API contains a `token_value` field. Note this value for later use.

2. (*Alternative option*) If you prefer to use the Databricks CLI, execute the following command in a terminal or command line tool. Be sure to update the placeholders with the relevant details from above. You'll also need to [set up a profile](https://docs.databricks.com/en/dev-tools/cli/databricks-cli-ref.html#databricks-personal-access-token-authentication){:target="_blank"} for the CLI. For more info, check out the [Databricks CLI docs](https://docs.databricks.com/en/dev-tools/cli/databricks-cli-ref.html){:target="_blank"}.

   ```
   databricks token-management create-obo-token <SERVICE_PRINCIPAL_APPLICATION_ID> <LIFETIME_SECONDS> --comment <COMMENT> -p <PROFILE_NAME>
   ```

   The response from the CLI will contain a `token_value` field. Note this value for later use.

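Before entering the token in Segment, you can sanity-check it by calling an authenticated endpoint as the service principal. A sketch using the workspace SCIM `Me` endpoint, which returns the identity of the caller:

```
# Should return the service principal's identity if the token is valid.
curl --request GET \
  '<DATABRICKS_WORKSPACE_URL>/api/2.0/preview/scim/v2/Me' \
  --header 'Authorization: Bearer <SERVICE_PRINCIPAL_TOKEN>'
```
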
### Step 8: Create a new catalog in Unity Catalog and grant Segment permissions

This catalog is the target catalog where Segment lands your schemas and tables.

1. Follow the Databricks guide for [creating a catalog](https://docs.databricks.com/en/data-governance/unity-catalog/create-catalogs.html#create-a-catalog){:target="_blank"}. Be sure to select the storage location created earlier. You can use any valid catalog name (for example, "Segment"). Note this name for later use.
2. Select the catalog you've just created.
   1. Select the Permissions tab, then click **Grant**.
   2. Select the Segment service principal from the dropdown, and check `ALL PRIVILEGES`.
   3. Click **Grant**.

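If you script your Unity Catalog setup, the catalog can also be created through the Unity Catalog REST API. A sketch, assuming an admin token and the S3 path of the external location from Step 4; the catalog name and bucket path are placeholders:

```
# Creates a catalog backed by the external location's S3 path.
curl --request POST \
  '<DATABRICKS_WORKSPACE_URL>/api/2.1/unity-catalog/catalogs' \
  --header 'Authorization: Bearer <WORKSPACE_ADMIN_TOKEN>' \
  --header 'Content-Type: application/json' \
  --data '{"name": "segment", "storage_root": "s3://<YOUR-BUCKET>/<OPTIONAL-PREFIX>"}'
```
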
### Step 9: Set up the Databricks Delta Lake destination in Segment

This step links a Segment events source to your Databricks workspace and catalog.

1. From the Segment app, navigate to **Connections > Catalog**, then click **Destinations**.
2. Search for and select the "Databricks Delta Lake" destination.
3. Click **Add Destination**, select a source, then click **Next**.
4. Enter the name for your destination, then click **Create destination**.
5. Enter the connection settings for the destination.

{% endcomment %}
