# ETL: Sync Lakebase Tables to Unity Catalog (Autoscaling — Lakehouse Sync)

Replicate your Lakebase Autoscaling Postgres tables into Unity Catalog as managed Delta tables. **Lakehouse Sync** captures every row-level change via CDC and writes it as **SCD Type 2 history**, giving you a full audit trail of how your operational data changed over time, queryable from the lakehouse.

> This recipe is for **Lakebase Autoscaling** (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned, see the Provisioned ETL recipe (coming soon).

## When to use this

- You want to analyze operational data (orders, user activity, support tickets) in the lakehouse
- You need a historical record of every insert, update, and delete from your Postgres tables
- You want to join operational data with analytics data in Spark, SQL, or BI tools
- You need to feed Lakebase data into downstream pipelines or ML models

## How it works

> **Note:** Lakehouse Sync is currently in **Beta on AWS only** (all Autoscaling regions). Azure support is not yet available. It is a native Lakebase feature — no external compute, pipelines, or jobs required, and there is no incremental charge for replication beyond the underlying Lakebase compute and storage costs.

Lakehouse Sync uses Change Data Capture (CDC) to stream changes from Lakebase Postgres into Unity Catalog. For each synced table, a Delta history table is created:

```
lb_<table_name>_history
```

Each row includes metadata columns:
- `_change_type` — `insert`, `update_preimage`, `update_postimage`, or `delete`
- `_lsn` — Log Sequence Number for ordering changes
- `_commit_timestamp` — When the change was captured

## Step 1 — Verify table replica identity

Lakehouse Sync needs updates and deletes to carry complete before-images of each row, which requires the correct replica identity. Connect to your Lakebase database and check:

```sql
SELECT n.nspname AS table_schema,
c.relname AS table_name,
CASE c.relreplident
WHEN 'd' THEN 'default'
WHEN 'n' THEN 'nothing'
WHEN 'f' THEN 'full'
WHEN 'i' THEN 'index'
END AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname = 'public'
ORDER BY n.nspname, c.relname;
```

If a table shows `default` or `nothing`, set it to `FULL`:

```sql
ALTER TABLE <table_name> REPLICA IDENTITY FULL;
```
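
To apply the fix across many tables, the report from the query above can be turned into the required `ALTER` statements. This is a hypothetical helper, not part of Lakebase; `replica_identity_fixes` and its tuple shape are illustrative:

```python
# Hypothetical helper: emit ALTER statements for every table in the
# replica-identity report that still shows 'default' or 'nothing'.
def replica_identity_fixes(report):
    """report: iterable of (schema, table, replica_identity) tuples."""
    return [
        f'ALTER TABLE "{schema}"."{table}" REPLICA IDENTITY FULL;'
        for schema, table, identity in report
        if identity in ("default", "nothing")
    ]

for stmt in replica_identity_fixes([
    ("public", "orders", "default"),
    ("public", "users", "full"),
]):
    print(stmt)  # emits the ALTER for "orders" only; "users" is already FULL
```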

## Step 2 — Check for unsupported data types

```sql
SELECT c.table_schema, c.table_name, c.column_name, c.udt_name AS data_type
FROM information_schema.columns c
JOIN pg_catalog.pg_type t ON t.typname = c.udt_name
WHERE c.table_schema = 'public'
AND c.table_name IN (
SELECT tablename FROM pg_tables WHERE schemaname = c.table_schema
)
AND NOT (
c.udt_name IN (
'bool', 'int2', 'int4', 'int8', 'text', 'varchar', 'bpchar',
'jsonb', 'numeric', 'date', 'timestamp', 'timestamptz',
'real', 'float4', 'float8'
)
OR t.typcategory = 'E'
)
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
```

If unsupported types appear, cast or restructure those columns (for example, store the value as `text`) before enabling sync.
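
The same supported-type check can be run locally over a column list. This is a sketch: the type list mirrors the SQL above, and enum types (`typcategory = 'E'`, also supported) are passed in by name since Python cannot see `pg_type`:

```python
# Postgres udt names the Step 2 query treats as supported.
SUPPORTED_UDT = {
    "bool", "int2", "int4", "int8", "text", "varchar", "bpchar",
    "jsonb", "numeric", "date", "timestamp", "timestamptz",
    "real", "float4", "float8",
}

def unsupported_columns(columns, enum_types=frozenset()):
    """columns: iterable of (column_name, udt_name) pairs."""
    return [
        name for name, udt in columns
        if udt not in SUPPORTED_UDT and udt not in enum_types
    ]

print(unsupported_columns([("id", "int4"), ("payload", "bytea")]))  # ['payload']
```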

## Step 3 — Enable Lakehouse Sync

> **Note:** This step is not yet available via CLI or REST API and must be completed through the Databricks UI:
>
> In **Catalog**, open your Autoscaling project → branch → **Lakehouse Sync** → **Start Sync**, then select the source database/schema, destination catalog/schema, and tables.

## Step 4 — Monitor sync status

Check active syncs from Postgres (the `wal2delta` schema only exists after Lakehouse Sync has been enabled in Step 3):

```sql
SELECT * FROM wal2delta.tables;
```

## Step 5 — Query the history tables

### Latest state of each row

```sql
-- Replace "id" with your table's primary key column
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
AND _change_type != 'delete';
```
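
For intuition, the same latest-state logic sketched in plain Python (rows here are illustrative `(pk, lsn, change_type, value)` tuples): `update_preimage` rows are skipped, and keys whose newest change is a delete are dropped.

```python
def latest_state(history):
    """history: iterable of (pk, lsn, change_type, value) change rows."""
    latest = {}
    for row in history:
        pk, lsn, change_type = row[0], row[1], row[2]
        if change_type not in ("insert", "update_postimage", "delete"):
            continue  # ignore update_preimage rows
        if pk not in latest or lsn > latest[pk][1]:
            latest[pk] = row  # keep the change with the highest _lsn per key
    # drop keys whose most recent change is a delete
    return {pk: row for pk, row in latest.items() if row[2] != "delete"}

rows = [
    (1, 10, "insert", "alice"),
    (1, 20, "update_postimage", "alicia"),
    (2, 15, "insert", "bob"),
    (2, 30, "delete", None),
]
print(latest_state(rows))  # {1: (1, 20, 'update_postimage', 'alicia')}
```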

### Full change history for a record

```sql
SELECT *
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE id = 12345
ORDER BY _lsn;
```

## Handling schema changes

If you need to change a synced table's schema in Postgres, use the rename-and-swap pattern. Pause writes to the table during the copy, or rows written between the `INSERT` and the rename will be lost:

```sql
-- 1. Create the replacement table with the new schema
CREATE TABLE users_v2 (
  id INT PRIMARY KEY,
  name TEXT,
  new_column TEXT
);

-- 2. New tables default to REPLICA IDENTITY DEFAULT; sync needs FULL
ALTER TABLE users_v2 REPLICA IDENTITY FULL;

-- 3. Copy existing rows; new_column starts out NULL
INSERT INTO users_v2 SELECT *, NULL FROM users;

-- 4. Swap names atomically
BEGIN;
ALTER TABLE users RENAME TO users_backup;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
```

## What you end up with

- **Delta history tables** in Unity Catalog (`lb_<table_name>_history`) with full SCD Type 2 change tracking
- **Continuous replication** — changes stream from Postgres to Delta automatically
- **No external compute** — Lakehouse Sync is a native Lakebase feature
- Operational data queryable in Spark SQL, notebooks, BI tools, and downstream pipelines

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Table not appearing in sync | Ensure it has a primary key or `REPLICA IDENTITY FULL` |
| Unsupported data type error | Check column types with the query in Step 2 |
| Sync lag increasing | Check Lakebase endpoint health and compute scaling |
| Missing changes on update/delete | Verify `REPLICA IDENTITY FULL` — `default` only captures PK columns |

## Limitations and notes

- **AWS only** — Lakehouse Sync Beta is available in all Autoscaling regions on AWS. Azure support is not yet available.
- **No incremental charge** — replication cost is included in your Lakebase compute and storage.
- **Works alongside synced tables** — you can use Lakehouse Sync in a project/schema that also has Reverse ETL synced tables.

## Learn more

- [Lakehouse Sync (Autoscaling)](https://docs.databricks.com/aws/en/oltp/projects/lakehouse-sync)
- [Register Lakebase in Unity Catalog](https://docs.databricks.com/aws/en/oltp/projects/register-uc)
- [SCD Type 2 in Databricks](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/scd)
---
# Reverse ETL: Sync a Unity Catalog Table to Lakebase (Autoscaling)

Serve lakehouse data through Lakebase Autoscaling Postgres so your applications can query it with sub-10ms latency. This creates a **synced table** — a managed copy of your Unity Catalog table in Lakebase that stays up to date automatically.

> This recipe is for **Lakebase Autoscaling** (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned (manually scaled instances), see the Provisioned Reverse ETL recipe (coming soon).

## When to use this

- Your app needs fast lookup-style queries against analytics data (user profiles, feature values, risk scores)
- You want to serve gold tables, ML outputs, or enriched records through a standard Postgres connection
- You need ACID transactions and sub-10ms reads alongside your operational state

## Choose a sync mode


| Mode | Behavior | Best for |
| -------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------- |
| **Snapshot** | One-time full copy | Source changes >10% of rows per cycle, or source doesn't support CDF (views, Iceberg) |
| **Triggered** | Incremental updates on demand or on a schedule | Known cadence of changes, good cost/freshness balance |
| **Continuous** | Real-time streaming (seconds of latency) | Changes must appear in Lakebase near-instantly |


> **Triggered** and **Continuous** modes require [Change Data Feed (CDF)](https://docs.databricks.com/aws/en/delta/delta-change-data-feed) enabled on the source table. If it's not enabled, run:
>
> ```sql
> ALTER TABLE <catalog>.<schema>.<table> SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
> ```
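
The decision table above can be condensed into a small helper. This is an illustration only, not a Databricks API; the 10% threshold comes from the table, and the function names and parameters are assumptions:

```python
def choose_sync_mode(change_fraction, source_supports_cdf=True,
                     needs_realtime=False):
    """Pick a sync mode per the guidance table above."""
    if not source_supports_cdf or change_fraction > 0.10:
        return "SNAPSHOT"   # full copy is cheaper, or CDF unavailable
    if needs_realtime:
        return "CONTINUOUS"
    return "TRIGGERED"      # incremental on a known cadence

print(choose_sync_mode(0.05))                       # TRIGGERED
print(choose_sync_mode(0.50))                       # SNAPSHOT
print(choose_sync_mode(0.01, needs_realtime=True))  # CONTINUOUS
```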

## Sync throughput

Autoscaling CUs are physically 8x smaller than Provisioned CUs, so per-CU throughput differs:


| Mode | Rows/sec per CU |
| ---------------------------------------- | --------------- |
| **Snapshot** (initial + full refresh) | ~2,000 |
| **Triggered / Continuous** (incremental) | ~150 |


> A 10x speedup for large-table snapshot sync (writing Postgres pages directly, leveraging separation of storage and compute) is coming for Autoscaling only.
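
For planning, these per-CU rates give a rough sync-time estimate. A minimal sketch, assuming the table's throughput numbers hold for your workload:

```python
def sync_hours(rows, rows_per_sec_per_cu, cus=1):
    """Pipeline hours to move `rows` at the table's per-CU rate."""
    return rows / (rows_per_sec_per_cu * cus * 3600)

# e.g. an initial snapshot of 181M rows on 1 CU at ~2,000 rows/sec:
print(round(sync_hours(181_000_000, 2_000), 1))  # 25.1
```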

## Create a synced table

```bash
databricks database create-synced-database-table \
--json '{
"name": "<CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME>",
"database_instance_name": "<INSTANCE_NAME>",
"logical_database_name": "<POSTGRES_DATABASE>",
"spec": {
"source_table_full_name": "<CATALOG>.<SCHEMA>.<SOURCE_TABLE>",
"primary_key_columns": ["<PRIMARY_KEY_COLUMN>"],
"scheduling_policy": "<SNAPSHOT|TRIGGERED|CONTINUOUS>",
"create_database_objects_if_missing": true
}
}' --profile <PROFILE>
```

> If your Lakebase database is **registered as a Unity Catalog catalog**, you can omit `database_instance_name` and `logical_database_name`.

Verify:

```bash
databricks database get-synced-database-table <CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME> --profile <PROFILE>
```

> **Important:** If your Autoscaling project was created via the `/postgres/` API (not `/database/`), programmatic synced table creation is not yet available via CLI — use the Databricks UI as a fallback. In **Catalog**, select the source table → **Create synced table**, then choose your Lakebase project, branch, sync mode, and pipeline. This gap is expected to close soon.

## Pipeline reuse guidance

How you set up pipelines depends on your sync mode:


| Sync mode | Recommendation | Why |
| ------------------------ | -------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Continuous** | **Reuse** a pipeline across ~10 tables | Cost-advantageous — e.g., 1 pipeline for 10 tables ≈ $204/table/month vs $2,044/table/month for individual pipelines |
| **Snapshot / Triggered** | **Separate** pipelines per table | Allows re-snapshotting individual tables without impacting others |


## Schedule ongoing syncs

The initial snapshot runs automatically on creation. For **Snapshot** and **Triggered** modes, subsequent syncs need to be triggered.

> **Note:** Table-update triggers for sync pipelines are not yet available via CLI and must be configured through the Databricks UI: **Workflows** → create/open a job → add a **Database Table Sync pipeline** task → **Schedules & Triggers** → add a **Table update** trigger pointing to your source table.

Trigger a sync update programmatically via the Databricks SDK:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

table = w.database.get_synced_database_table(
name="<CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME>"
)
pipeline_id = table.data_synchronization_status.pipeline_id

w.pipelines.start_update(pipeline_id=pipeline_id)
```

## Query the synced data in Postgres

Once synced, the table is available in Lakebase Postgres. The Unity Catalog schema becomes the Postgres schema:

```sql
SELECT * FROM "<schema>"."<synced_table_name>" WHERE "user_id" = 12345;
```

Connect with any standard Postgres client (psql, DBeaver, your application's Postgres driver).

## What you end up with

- A **synced table** in Unity Catalog that tracks the sync pipeline
- A **read-only Postgres table** in Lakebase that your apps can query with sub-10ms latency
- A **managed Lakeflow pipeline** that keeps the data in sync based on your chosen mode
- Up to **16 connections** per sync to your Lakebase database

## Important constraints

- **Primary key is mandatory.** Synced tables always require a primary key — it enables efficient point lookups and incremental updates. Rows with nulls in PK columns are excluded from the sync.
- **Duplicate primary keys fail the sync** unless you configure a `timeseries_key` for deduplication (latest value wins per PK). Using a timeseries key has a performance penalty.
- **Schema changes**: For Triggered/Continuous mode, only **additive** changes (e.g., adding a column) propagate. Dropping or renaming columns requires recreating the synced table.
- **FGAC tables**: Direct sync of Fine-Grained Access Control tables fails. **Workaround**: create a view (`SELECT * FROM table`), then sync the view in Snapshot mode. Caveat: runs as the sync creator and only sees their visible rows.
- **Connection limits**: Autoscaling supports up to 4,000 concurrent connections (varies by compute size). Each sync uses up to 16 connections.
- **Read-only in Postgres**: Synced tables should only be read from Postgres. Writing to them interferes with the sync pipeline.
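
The "latest value wins" deduplication that a `timeseries_key` performs can be pictured like this. A sketch for intuition only, using hypothetical column names; the actual dedup happens inside the sync pipeline:

```python
def dedupe_by_timeseries_key(rows, pk, ts):
    """Among rows sharing a primary key, keep the highest timeseries value."""
    best = {}
    for row in rows:
        key = row[pk]
        if key not in best or row[ts] > best[key][ts]:
            best[key] = row
    return list(best.values())

rows = [
    {"user_id": 1, "updated_at": 1, "score": 0.2},
    {"user_id": 1, "updated_at": 5, "score": 0.9},  # wins for user_id=1
    {"user_id": 2, "updated_at": 3, "score": 0.5},
]
result = dedupe_by_timeseries_key(rows, "user_id", "updated_at")
```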

## Cost guidance

Cost per sync run: `[Rows / (Speed × CUs × 3600)] × DLT hourly rate`, where the bracketed term is pipeline hours.

Example costs (181M rows, 1 CU, $2.80/hr DLT rate):


| Mode | Monthly cost |
| ---------------------------------- | ------------ |
| Snapshot (daily) | ~$2,110 |
| Triggered (daily, 5% changes) | ~$1,407 |
| Continuous (10 tables, 1 pipeline) | ~$204/table |
| Continuous (1 table, 1 pipeline) | ~$2,044 |
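
These figures can be reproduced from the formula. A sketch under the example's assumptions (181M rows, 1 CU, a $2.80/hr DLT rate, 30-day month); the continuous figure is simply 730 pipeline-hours per month at the hourly rate:

```python
def run_cost(rows, rows_per_sec_per_cu, cus, dlt_hourly_rate):
    """Cost of one sync run: pipeline hours times the DLT hourly rate."""
    hours = rows / (rows_per_sec_per_cu * cus * 3600)
    return hours * dlt_hourly_rate

RATE, ROWS = 2.80, 181_000_000

snapshot_monthly = run_cost(ROWS, 2_000, 1, RATE) * 30        # daily full copy
triggered_monthly = run_cost(ROWS * 0.05, 150, 1, RATE) * 30  # 5% changed daily
continuous_monthly = 730 * RATE                               # runs 24/7

print(round(snapshot_monthly))    # 2112 (table rounds to ~$2,110)
print(round(triggered_monthly))   # 1408 (table rounds to ~$1,407)
print(round(continuous_monthly))  # 2044; ~$204/table when 10 tables share it
```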


## Troubleshooting


| Issue | Fix |
| ----------------------------------- | ----------------------------------------------------------------------------------------- |
| CDF not enabled warning | Run `ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true)` on the source |
| Schema not visible in create dialog | Confirm you have `USE_SCHEMA` and `CREATE_TABLE` on the target schema |
| Null bytes in string columns | Clean source data: `SELECT REPLACE(col, CAST(CHAR(0) AS STRING), '') AS col FROM table` |
| Sync failing | Check the pipeline in the synced table's Overview tab for error details |
| FGAC table sync fails | Create a view over the table and sync the view in Snapshot mode |
| Duplicate primary key failure | Add a `timeseries_key` to deduplicate (latest wins) |


## Learn more

- [Synced tables (Autoscaling)](https://docs.databricks.com/aws/en/oltp/projects/sync-tables)
- [Change Data Feed](https://docs.databricks.com/aws/en/delta/delta-change-data-feed)
- [Lakebase Autoscaling](https://docs.databricks.com/aws/en/oltp/projects/)