180 changes: 180 additions & 0 deletions recipes/etl-lakehouse-sync-autoscaling.md
# ETL: Sync Lakebase Tables to Unity Catalog (Autoscaling — Lakehouse Sync)

Replicate your Lakebase Autoscaling Postgres tables into Unity Catalog as managed Delta tables. **Lakehouse Sync** captures every row-level change via change data capture (CDC) and writes it as **SCD Type 2 history** — giving you a full audit trail of how your operational data changed over time, queryable from the lakehouse.

> This recipe is for **Lakebase Autoscaling** (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned, see the [Provisioned ETL recipe](./etl-register-uc-provisioned.md).

## When to use this

- You want to analyze operational data (orders, user activity, support tickets) in the lakehouse
- You need a historical record of every insert, update, and delete from your Postgres tables
- You want to join operational data with analytics data in Spark, SQL, or BI tools
- You need to feed Lakebase data into downstream pipelines or ML models

## Prerequisites

- A Databricks workspace with Lakebase enabled (**AWS only** — Lakehouse Sync Beta is not yet available on Azure)
- A **Lakebase Autoscaling project** with at least one branch containing tables with data
- The Lakebase database must be **registered in Unity Catalog** ([register it](https://docs.databricks.com/aws/en/oltp/projects/register-uc))
- Tables you want to sync must have a **primary key** or `REPLICA IDENTITY FULL` set

> **Note:** Lakehouse Sync is currently in **Beta on AWS only** (all Autoscaling regions). Azure support is not yet available. It is a native Lakebase feature — no external compute, pipelines, or jobs required, and there is no incremental charge for replication beyond the underlying Lakebase compute and storage costs.

## How it works

Lakehouse Sync uses Change Data Capture (CDC) to stream changes from Lakebase Postgres into Unity Catalog. For each synced table, a Delta history table is created:

```
lb_<table_name>_history
```

Each row includes metadata columns:
- `_change_type` — `insert`, `update_preimage`, `update_postimage`, or `delete`
- `_lsn` — Log Sequence Number for ordering changes
- `_commit_timestamp` — When the change was captured
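The "latest state" logic that these metadata columns enable (used in Step 5 below) can be sketched in plain Python over sample change rows. The sample data is hypothetical; in practice this runs as SQL over the `lb_<table_name>_history` Delta table:

```python
# Minimal sketch: reconstruct the latest state of each row from an
# SCD Type 2 change log using the metadata columns described above.
# Sample rows are hypothetical illustrations.
history = [
    {"id": 1, "name": "alice",  "_change_type": "insert",           "_lsn": 10},
    {"id": 1, "name": "alicia", "_change_type": "update_postimage", "_lsn": 20},
    {"id": 2, "name": "bob",    "_change_type": "insert",           "_lsn": 15},
    {"id": 2, "name": "bob",    "_change_type": "delete",           "_lsn": 30},
]

def latest_state(rows):
    """Keep the highest-_lsn change per id, then drop deleted rows."""
    latest = {}
    for row in rows:
        if row["_change_type"] == "update_preimage":
            continue  # a preimage records the old value, not a new state
        cur = latest.get(row["id"])
        if cur is None or row["_lsn"] > cur["_lsn"]:
            latest[row["id"]] = row
    return {k: v for k, v in latest.items() if v["_change_type"] != "delete"}

print(latest_state(history))  # id 1 survives with name "alicia"; id 2 was deleted
```

This is the same dedup-then-filter shape as the `ROW_NUMBER()` query in Step 5: pick the newest change per key, then exclude deletes.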

## Step 1 — Verify table replica identity

Lakehouse Sync requires the right replica identity for capturing changes. Connect to your Lakebase database and check:

```sql
SELECT n.nspname AS table_schema,
       c.relname AS table_name,
       CASE c.relreplident
         WHEN 'd' THEN 'default'
         WHEN 'n' THEN 'nothing'
         WHEN 'f' THEN 'full'
         WHEN 'i' THEN 'index'
       END AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
ORDER BY n.nspname, c.relname;
```

If a table shows `default` or `nothing`, set it to `FULL`:

```sql
ALTER TABLE <table_name> REPLICA IDENTITY FULL;
```
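If many tables need fixing, the `ALTER` statements can be scripted from the output of the check above. A sketch — the table names are hypothetical examples:

```python
# Sketch: turn the replica-identity report from the query above into
# ALTER statements for any table not already set to 'full' or 'index'.
# Table names are hypothetical examples.
report = [
    ("public", "orders", "default"),
    ("public", "users", "full"),
    ("public", "events", "nothing"),
]

def fix_statements(rows):
    return [
        f'ALTER TABLE "{schema}"."{table}" REPLICA IDENTITY FULL;'
        for schema, table, identity in rows
        if identity in ("default", "nothing")
    ]

for stmt in fix_statements(report):
    print(stmt)
```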

## Step 2 — Check for unsupported data types

```sql
SELECT c.table_schema, c.table_name, c.column_name, c.udt_name AS data_type
FROM information_schema.columns c
JOIN pg_catalog.pg_type t ON t.typname = c.udt_name
WHERE c.table_schema = 'public'
  AND c.table_name IN (
    SELECT tablename FROM pg_tables WHERE schemaname = c.table_schema
  )
  AND NOT (
    c.udt_name IN (
      'bool', 'int2', 'int4', 'int8', 'text', 'varchar', 'bpchar',
      'jsonb', 'numeric', 'date', 'timestamp', 'timestamptz',
      'real', 'float4', 'float8'
    )
    OR t.typcategory = 'E'  -- enum types are also supported
  )
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
```

If unsupported types appear, restructure those columns before enabling sync.
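The same check can be done client-side against a column listing. The supported-type set below mirrors the query above — treat it as an assumption drawn from that query, not an exhaustive authoritative list:

```python
# Sketch: flag columns whose Postgres type falls outside the supported
# set used in the SQL check above. Enum types (typcategory 'E') are
# handled by the SQL query and omitted here for brevity.
SUPPORTED = {
    "bool", "int2", "int4", "int8", "text", "varchar", "bpchar",
    "jsonb", "numeric", "date", "timestamp", "timestamptz",
    "real", "float4", "float8",
}

def unsupported_columns(columns):
    """columns: iterable of (table, column, udt_name) tuples."""
    return [(t, c) for t, c, udt in columns if udt not in SUPPORTED]

# Hypothetical column listing for illustration; '_text' is a text[] array:
cols = [("orders", "id", "int8"),
        ("orders", "tags", "_text"),
        ("orders", "total", "numeric")]
print(unsupported_columns(cols))  # [('orders', 'tags')]
```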

## Step 3 — Enable Lakehouse Sync

> **Note:** Lakehouse Sync is configured through the Databricks UI. There is no CLI or REST API support for enabling it yet.

1. Navigate to **Catalog** in the workspace sidebar
2. Open your **Lakebase Autoscaling project** and select the **branch**
3. Click **Lakehouse Sync**
4. Click **Start Sync**
5. Select the **source Postgres database and schema**
6. Select the **destination Unity Catalog catalog and schema**
7. Choose which **tables** to sync
8. Click **Start**

## Step 4 — Monitor sync status

Check active syncs from Postgres:

```sql
SELECT * FROM wal2delta.tables;
```

In the Databricks UI, the **Lakehouse Sync** tab on your branch shows sync status, lag, and errors.
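One simple programmatic lag check is to compare the newest `_commit_timestamp` seen in a history table against the current time. A sketch — the five-minute threshold is an arbitrary example, not a documented limit:

```python
from datetime import datetime, timedelta, timezone

# Sketch: estimate sync lag from the newest _commit_timestamp in a
# history table. The alert threshold is an arbitrary example.
def sync_lag(latest_commit_ts, now=None):
    now = now or datetime.now(timezone.utc)
    return now - latest_commit_ts

def is_lagging(latest_commit_ts, threshold=timedelta(minutes=5), now=None):
    return sync_lag(latest_commit_ts, now=now) > threshold

ts = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, 12, 10, tzinfo=timezone.utc)
print(is_lagging(ts, now=now))  # True: 10 minutes behind a 5-minute threshold
```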

## Step 5 — Query the history tables

### Latest state of each row

```sql
-- `id` is this table's primary key; substitute your table's key column(s)
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
  FROM <catalog>.<schema>.lb_<table_name>_history
  WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
  AND _change_type != 'delete';
```

### Full change history for a record

```sql
SELECT *
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE id = 12345
ORDER BY _lsn;
```

## Handling schema changes

If you need to change a synced table's schema in Postgres, use the rename-and-swap pattern:

```sql
-- 1. Create the replacement table with the desired schema
CREATE TABLE users_v2 (
  id INT PRIMARY KEY,
  name TEXT,
  new_column TEXT
);

-- 2. Match the replica identity before the swap
ALTER TABLE users_v2 REPLICA IDENTITY FULL;

-- 3. Backfill existing rows (NULL for the new column)
INSERT INTO users_v2 SELECT *, NULL FROM users;

-- 4. Swap names atomically in a single transaction
BEGIN;
ALTER TABLE users RENAME TO users_backup;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
```
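If you apply this pattern to several tables, the swap choreography can be templated. A sketch — only the fixed statements are generated; the new table's DDL and backfill remain yours to write:

```python
# Sketch: generate the rename-and-swap statements shown above for an
# arbitrary table name. Order matters: the two renames must share one
# transaction so readers never see a missing table.
def swap_script(table: str) -> str:
    return "\n".join([
        f"ALTER TABLE {table}_v2 REPLICA IDENTITY FULL;",
        "BEGIN;",
        f"ALTER TABLE {table} RENAME TO {table}_backup;",
        f"ALTER TABLE {table}_v2 RENAME TO {table};",
        "COMMIT;",
    ])

print(swap_script("users"))
```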

## What you end up with

- **Delta history tables** in Unity Catalog (`lb_<table_name>_history`) with full SCD Type 2 change tracking
- **Continuous replication** — changes stream from Postgres to Delta automatically
- **No external compute** — Lakehouse Sync is a native Lakebase feature
- Operational data queryable in Spark SQL, notebooks, BI tools, and downstream pipelines

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Table not appearing in sync | Ensure it has a primary key or `REPLICA IDENTITY FULL` |
| Unsupported data type error | Check column types with the query in Step 2 |
| Sync lag increasing | Check Lakebase endpoint health and compute scaling |
| Missing changes on update/delete | Verify `REPLICA IDENTITY FULL` — `default` only captures PK columns |

## Limitations and notes

- **AWS only** — Lakehouse Sync Beta is available in all Autoscaling regions on AWS; Azure support is not yet available.
- **No incremental charge** — replication cost is included in your Lakebase compute and storage.
- **Works alongside synced tables** — you can use Lakehouse Sync in a project/schema that also has Reverse ETL synced tables.

## Learn more

- [Lakehouse Sync (Autoscaling)](https://docs.databricks.com/aws/en/oltp/projects/lakehouse-sync)
- [Register Lakebase in Unity Catalog](https://docs.databricks.com/aws/en/oltp/projects/register-uc)
- [SCD Type 2 in Databricks](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/scd)
152 changes: 152 additions & 0 deletions recipes/etl-register-uc-provisioned.md
# ETL: Access Lakebase Data from Unity Catalog (Provisioned — Database Catalog)

Register your Lakebase Provisioned database as a **read-only Unity Catalog catalog** so you can query operational data from the lakehouse, join it with analytics tables, and feed it into dashboards, pipelines, and ML models — all through standard SQL.

> This recipe is for **Lakebase Provisioned** (manually scaled instances). For Lakebase Autoscaling (projects with scale-to-zero and native CDC replication), see the [Autoscaling ETL recipe](./etl-lakehouse-sync-autoscaling.md).
>
> **Note:** New Lakebase instances are created as Autoscaling projects by default as of March 12, 2026. If you're starting fresh, consider using Autoscaling with Lakehouse Sync for full CDC-based replication.

## When to use this

- You want to query Lakebase operational data from Spark SQL, notebooks, or BI tools
- You need to join Lakebase tables with lakehouse data (federated queries)
- You want Unity Catalog governance (permissions, lineage, audit logs) on your Lakebase data
- You want to browse Lakebase schemas and tables in Catalog Explorer

## How it differs from Autoscaling Lakehouse Sync

| | Provisioned (this recipe) | Autoscaling (Lakehouse Sync) |
|---|---|---|
| **Mechanism** | Registers Postgres as a read-only UC catalog; queries go to Postgres via federation | Native CDC replication into Delta tables (SCD Type 2 history) |
| **Data location** | Data stays in Postgres; queries are federated | Data is copied into Delta tables in Unity Catalog |
| **History tracking** | No change history — queries reflect current Postgres state | Full SCD Type 2 change history |
| **Compute** | Requires a serverless SQL warehouse for queries | No external compute — native Lakebase feature |
| **Latency** | Real-time (reads from live Postgres) | Near real-time (CDC streaming lag) |

> **Note on Lakeflow Connect Plugin:** Provisioned previously had a Lakeflow Connect Plugin for Lakebase (Private Preview) that provided CDC-based replication similar to Lakehouse Sync. This plugin will **not** be taken to Public Preview or GA. If you need CDC replication from Lakebase to Unity Catalog, migrate to Autoscaling and use native Lakehouse Sync.

## Prerequisites

- A Databricks workspace with Lakebase enabled
- A **Lakebase Provisioned instance** with at least one database
- `CREATE CATALOG` privileges on the Unity Catalog metastore
- A **serverless SQL warehouse** (required to query the registered catalog)

## Step 1 — Register the database in Unity Catalog

### Via CLI

```bash
databricks database create-database-catalog <CATALOG_NAME> <INSTANCE_NAME> <DATABASE_NAME> --profile <PROFILE>
```

| Placeholder | Value |
|-------------|-------|
| `<CATALOG_NAME>` | Name for the new Unity Catalog catalog |
| `<INSTANCE_NAME>` | Your Lakebase Provisioned instance name |
| `<DATABASE_NAME>` | Postgres database to register (e.g., `databricks_postgres`) |

If the database doesn't exist yet, add `--create-database-if-not-exists`.

### Via UI

1. Click **Apps** in the top right → select **Lakebase Postgres**
2. Click **Provisioned** → select your instance
3. Go to **Catalogs** → click **Add catalog**
4. Enter a **Catalog name** and select the **Postgres database** (or enter a new name to create one)
5. Click **Create**

### Via REST API

```
POST /api/2.0/database/catalogs
```

```json
{
  "name": "<CATALOG_NAME>",
  "database_instance_name": "<INSTANCE_NAME>",
  "database_name": "<DATABASE_NAME>"
}
```
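From a script, the same REST call can be made with any HTTP client. A sketch using Python's standard library — the workspace URL and token are placeholders you supply; sending the request is left commented out:

```python
import json
import urllib.request

# Sketch: register a database catalog via the REST endpoint shown above.
# The workspace URL and token below are placeholders, not real values.
def build_request(workspace_url, token, catalog, instance, database):
    body = json.dumps({
        "name": catalog,
        "database_instance_name": instance,
        "database_name": database,
    }).encode()
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/database/catalogs",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_request("https://example.cloud.databricks.com", "<TOKEN>",
                    "lakebase_cat", "my-instance", "databricks_postgres")
# urllib.request.urlopen(req) would send it; omitted here.
```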

## Step 2 — Verify registration

```bash
databricks database get-database-catalog <CATALOG_NAME> --profile <PROFILE>
```

Then browse to **Catalog** in the workspace sidebar — your Lakebase catalog should appear alongside other UC catalogs.

## Step 3 — Query your Lakebase data

Make sure you have a serverless SQL warehouse running as your compute resource. Then query directly:

```sql
-- Query a Lakebase table through Unity Catalog
SELECT * FROM <catalog_name>.<schema>.<table>
WHERE created_at > current_date - INTERVAL 7 DAYS;
```

```sql
-- Join Lakebase operational data with lakehouse analytics
SELECT
  o.order_id,
  o.status,
  o.total_amount,
  c.lifetime_value,
  c.segment
FROM <catalog_name>.public.orders o
JOIN analytics.gold.customers c
  ON o.customer_id = c.customer_id
WHERE o.created_at > current_date - INTERVAL 30 DAYS;
```

## Step 4 — Manage access

Grant read access to teams:

```sql
GRANT USE CATALOG ON CATALOG <catalog_name> TO `<group_name>`;
GRANT SELECT ON CATALOG <catalog_name> TO `<group_name>`;
```
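To grant several groups at once, the pair of statements can be generated per group. A sketch — the catalog and group names are hypothetical examples:

```python
# Sketch: emit the two GRANT statements shown above for each reader group.
# Catalog and group names are hypothetical.
def grant_statements(catalog, groups):
    stmts = []
    for g in groups:
        stmts.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{g}`;")
        stmts.append(f"GRANT SELECT ON CATALOG {catalog} TO `{g}`;")
    return stmts

for s in grant_statements("lakebase_cat", ["analysts", "data-science"]):
    print(s)
```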

## Delete the catalog

```bash
databricks database delete-database-catalog <CATALOG_NAME> --profile <PROFILE>
```

> Delete all synced tables from the catalog first. Each source table supports a maximum of 20 synced tables, and pending deletions count toward this limit (cleanup can take up to 3 days).

## What you end up with

- A **read-only Unity Catalog catalog** mirroring your Lakebase database structure
- **Federated queries** — query Lakebase data from Spark SQL alongside lakehouse tables
- **Unity Catalog governance** — permissions, lineage, and audit on Lakebase data
- **Catalog Explorer** — browse Lakebase schemas, tables, and views alongside other data sources

## Limitations

- The catalog is **read-only** — modify data through Lakebase directly
- One catalog per Postgres database — register each database separately
- Metadata is cached — click refresh in Catalog Explorer to see new objects
- Database instances are **single-workspace scoped** — no cross-workspace access to table contents
- Database names can only contain alphanumeric characters and underscores (no hyphens)
- **No CDC replication** — unlike Autoscaling's Lakehouse Sync, this approach does not replicate data or track change history; queries are federated to live Postgres
- **Lakeflow Connect Plugin (deprecated path)** — the Private Preview plugin for Provisioned CDC will not go to GA; migrate to Autoscaling for CDC-based ETL

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Catalog not appearing | Ensure you have `CREATE CATALOG` metastore privileges |
| Tables not visible | Attach a serverless SQL warehouse and refresh the catalog view |
| Query errors | Confirm the SQL warehouse is running and can reach the instance |
| Permission denied | Grant `USE CATALOG` + `SELECT` to the querying user/group |

## Learn more

- [Register your database in Unity Catalog (Provisioned)](https://docs.databricks.com/aws/en/oltp/instances/register-uc)
- [Database Instances CLI](https://docs.databricks.com/aws/en/dev-tools/cli/reference/database-commands)
- [Upgrade to Autoscaling](https://docs.databricks.com/aws/en/oltp/upgrade-to-autoscaling)