180 changes: 180 additions & 0 deletions recipes/etl-lakehouse-sync-autoscaling.md
# ETL: Sync Lakebase Tables to Unity Catalog (Autoscaling — Lakehouse Sync)

Replicate your Lakebase Autoscaling Postgres tables into Unity Catalog as managed Delta tables. **Lakehouse Sync** captures every row-level change via change data capture (CDC) and writes it as **SCD Type 2 history** — giving you a full audit trail of how your operational data changed over time, queryable from the lakehouse.

> This recipe is for **Lakebase Autoscaling** (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned, see the [Provisioned ETL recipe](./etl-register-uc-provisioned.md).

## When to use this

- You want to analyze operational data (orders, user activity, support tickets) in the lakehouse
- You need a historical record of every insert, update, and delete from your Postgres tables
- You want to join operational data with analytics data in Spark, SQL, or BI tools
- You need to feed Lakebase data into downstream pipelines or ML models

## Prerequisites

- A Databricks workspace with Lakebase enabled (**AWS only** — Lakehouse Sync Beta is not yet available on Azure)
- A **Lakebase Autoscaling project** with at least one branch containing tables with data
- The Lakebase database must be **registered in Unity Catalog** ([register it](https://docs.databricks.com/aws/en/oltp/projects/register-uc))
- Tables you want to sync must have a **primary key** or `REPLICA IDENTITY FULL` set

> **Note:** Lakehouse Sync is currently in **Beta on AWS only** (all Autoscaling regions). Azure support is not yet available. It is a native Lakebase feature — no external compute, pipelines, or jobs required, and there is no incremental charge for replication beyond the underlying Lakebase compute and storage costs.

## How it works

Lakehouse Sync uses Change Data Capture (CDC) to stream changes from Lakebase Postgres into Unity Catalog. For each synced table, a Delta history table is created:

```
lb_<table_name>_history
```

Each row includes metadata columns:
- `_change_type` — `insert`, `update_preimage`, `update_postimage`, or `delete`
- `_lsn` — Log Sequence Number for ordering changes
- `_commit_timestamp` — When the change was captured
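The "latest state" logic that these metadata columns enable (used in Step 5 below) can be sketched in plain Python over sample change rows. The sample data is hypothetical; in practice this runs as SQL over the `lb_<table_name>_history` Delta table:

```python
# Minimal sketch: reconstruct the latest state of each row from an
# SCD Type 2 change log using the metadata columns described above.
# Sample rows are hypothetical illustrations.
history = [
    {"id": 1, "name": "alice",  "_change_type": "insert",           "_lsn": 10},
    {"id": 1, "name": "alicia", "_change_type": "update_postimage", "_lsn": 20},
    {"id": 2, "name": "bob",    "_change_type": "insert",           "_lsn": 15},
    {"id": 2, "name": "bob",    "_change_type": "delete",           "_lsn": 30},
]

def latest_state(rows):
    """Keep the highest-_lsn change per id, then drop deleted rows."""
    latest = {}
    for row in rows:
        if row["_change_type"] == "update_preimage":
            continue  # a preimage records the old value, not a new state
        cur = latest.get(row["id"])
        if cur is None or row["_lsn"] > cur["_lsn"]:
            latest[row["id"]] = row
    return {k: v for k, v in latest.items() if v["_change_type"] != "delete"}

print(latest_state(history))  # id 1 survives with name "alicia"; id 2 was deleted
```

This is the same dedup-then-filter shape as the `ROW_NUMBER()` query in Step 5: pick the newest change per key, then exclude deletes.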

## Step 1 — Verify table replica identity

Lakehouse Sync requires the right replica identity for capturing changes. Connect to your Lakebase database and check:

```sql
SELECT n.nspname AS table_schema,
       c.relname AS table_name,
       CASE c.relreplident
         WHEN 'd' THEN 'default'
         WHEN 'n' THEN 'nothing'
         WHEN 'f' THEN 'full'
         WHEN 'i' THEN 'index'
       END AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
ORDER BY n.nspname, c.relname;
```

If a table shows `default` or `nothing`, set it to `FULL`:

```sql
ALTER TABLE <table_name> REPLICA IDENTITY FULL;
```
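If many tables need fixing, the `ALTER` statements can be scripted from the output of the check above. A sketch — the table names are hypothetical examples:

```python
# Sketch: turn the replica-identity report from the query above into
# ALTER statements for any table not already set to 'full' or 'index'.
# Table names are hypothetical examples.
report = [
    ("public", "orders", "default"),
    ("public", "users", "full"),
    ("public", "events", "nothing"),
]

def fix_statements(rows):
    return [
        f'ALTER TABLE "{schema}"."{table}" REPLICA IDENTITY FULL;'
        for schema, table, identity in rows
        if identity in ("default", "nothing")
    ]

for stmt in fix_statements(report):
    print(stmt)
```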

## Step 2 — Check for unsupported data types

```sql
SELECT c.table_schema, c.table_name, c.column_name, c.udt_name AS data_type
FROM information_schema.columns c
JOIN pg_catalog.pg_type t ON t.typname = c.udt_name
WHERE c.table_schema = 'public'
  AND c.table_name IN (
    SELECT tablename FROM pg_tables WHERE schemaname = c.table_schema
  )
  AND NOT (
    c.udt_name IN (
      'bool', 'int2', 'int4', 'int8', 'text', 'varchar', 'bpchar',
      'jsonb', 'numeric', 'date', 'timestamp', 'timestamptz',
      'real', 'float4', 'float8'
    )
    OR t.typcategory = 'E'  -- enum types are also supported
  )
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
```

If unsupported types appear, restructure those columns before enabling sync.
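The same check can be done client-side against a column listing. The supported-type set below mirrors the query above — treat it as an assumption drawn from that query, not an exhaustive authoritative list:

```python
# Sketch: flag columns whose Postgres type falls outside the supported
# set used in the SQL check above. Enum types (typcategory 'E') are
# handled by the SQL query and omitted here for brevity.
SUPPORTED = {
    "bool", "int2", "int4", "int8", "text", "varchar", "bpchar",
    "jsonb", "numeric", "date", "timestamp", "timestamptz",
    "real", "float4", "float8",
}

def unsupported_columns(columns):
    """columns: iterable of (table, column, udt_name) tuples."""
    return [(t, c) for t, c, udt in columns if udt not in SUPPORTED]

# Hypothetical column listing for illustration; '_text' is a text[] array:
cols = [("orders", "id", "int8"),
        ("orders", "tags", "_text"),
        ("orders", "total", "numeric")]
print(unsupported_columns(cols))  # [('orders', 'tags')]
```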

## Step 3 — Enable Lakehouse Sync

> **Note:** Lakehouse Sync is configured through the Databricks UI. There is no CLI or REST API support for enabling it yet.

1. Navigate to **Catalog** in the workspace sidebar
2. Open your **Lakebase Autoscaling project** and select the **branch**
3. Click **Lakehouse Sync**
4. Click **Start Sync**
5. Select the **source Postgres database and schema**
6. Select the **destination Unity Catalog catalog and schema**
7. Choose which **tables** to sync
8. Click **Start**

## Step 4 — Monitor sync status

Check active syncs from Postgres:

```sql
SELECT * FROM wal2delta.tables;
```

In the Databricks UI, the **Lakehouse Sync** tab on your branch shows sync status, lag, and errors.
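One simple programmatic lag check is to compare the newest `_commit_timestamp` seen in a history table against the current time. A sketch — the five-minute threshold is an arbitrary example, not a documented limit:

```python
from datetime import datetime, timedelta, timezone

# Sketch: estimate sync lag from the newest _commit_timestamp in a
# history table. The alert threshold is an arbitrary example.
def sync_lag(latest_commit_ts, now=None):
    now = now or datetime.now(timezone.utc)
    return now - latest_commit_ts

def is_lagging(latest_commit_ts, threshold=timedelta(minutes=5), now=None):
    return sync_lag(latest_commit_ts, now=now) > threshold

ts = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, 12, 10, tzinfo=timezone.utc)
print(is_lagging(ts, now=now))  # True: 10 minutes behind a 5-minute threshold
```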

## Step 5 — Query the history tables

### Latest state of each row

```sql
-- `id` is this table's primary key; substitute your table's key column(s)
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
  FROM <catalog>.<schema>.lb_<table_name>_history
  WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
  AND _change_type != 'delete';
```

### Full change history for a record

```sql
SELECT *
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE id = 12345
ORDER BY _lsn;
```

## Handling schema changes

If you need to change a synced table's schema in Postgres, use the rename-and-swap pattern:

```sql
-- 1. Create the replacement table with the desired schema
CREATE TABLE users_v2 (
  id INT PRIMARY KEY,
  name TEXT,
  new_column TEXT
);

-- 2. Match the replica identity before the swap
ALTER TABLE users_v2 REPLICA IDENTITY FULL;

-- 3. Backfill existing rows (NULL for the new column)
INSERT INTO users_v2 SELECT *, NULL FROM users;

-- 4. Swap names atomically in a single transaction
BEGIN;
ALTER TABLE users RENAME TO users_backup;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
```
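If you apply this pattern to several tables, the swap choreography can be templated. A sketch — only the fixed statements are generated; the new table's DDL and backfill remain yours to write:

```python
# Sketch: generate the rename-and-swap statements shown above for an
# arbitrary table name. Order matters: the two renames must share one
# transaction so readers never see a missing table.
def swap_script(table: str) -> str:
    return "\n".join([
        f"ALTER TABLE {table}_v2 REPLICA IDENTITY FULL;",
        "BEGIN;",
        f"ALTER TABLE {table} RENAME TO {table}_backup;",
        f"ALTER TABLE {table}_v2 RENAME TO {table};",
        "COMMIT;",
    ])

print(swap_script("users"))
```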

## What you end up with

- **Delta history tables** in Unity Catalog (`lb_<table_name>_history`) with full SCD Type 2 change tracking
- **Continuous replication** — changes stream from Postgres to Delta automatically
- **No external compute** — Lakehouse Sync is a native Lakebase feature
- Operational data queryable in Spark SQL, notebooks, BI tools, and downstream pipelines

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Table not appearing in sync | Ensure it has a primary key or `REPLICA IDENTITY FULL` |
| Unsupported data type error | Check column types with the query in Step 2 |
| Sync lag increasing | Check Lakebase endpoint health and compute scaling |
| Missing changes on update/delete | Verify `REPLICA IDENTITY FULL` — `default` only captures PK columns |

## Limitations and notes

- **AWS only** — Lakehouse Sync Beta is available in all Autoscaling regions on AWS; Azure support is not yet available.
- **No incremental charge** — replication cost is included in your Lakebase compute and storage.
- **Works alongside synced tables** — you can use Lakehouse Sync in a project/schema that also has Reverse ETL synced tables.

## Learn more

- [Lakehouse Sync (Autoscaling)](https://docs.databricks.com/aws/en/oltp/projects/lakehouse-sync)
- [Register Lakebase in Unity Catalog](https://docs.databricks.com/aws/en/oltp/projects/register-uc)
- [SCD Type 2 in Databricks](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/scd)
152 changes: 152 additions & 0 deletions recipes/etl-register-uc-provisioned.md
# ETL: Access Lakebase Data from Unity Catalog (Provisioned — Database Catalog)

Register your Lakebase Provisioned database as a **read-only Unity Catalog catalog** so you can query operational data from the lakehouse, join it with analytics tables, and feed it into dashboards, pipelines, and ML models — all through standard SQL.

> This recipe is for **Lakebase Provisioned** (manually scaled instances). For Lakebase Autoscaling (projects with scale-to-zero and native CDC replication), see the [Autoscaling ETL recipe](./etl-lakehouse-sync-autoscaling.md).
>
> **Note:** New Lakebase instances are created as Autoscaling projects by default as of March 12, 2026. If you're starting fresh, consider using Autoscaling with Lakehouse Sync for full CDC-based replication.

## When to use this

- You want to query Lakebase operational data from Spark SQL, notebooks, or BI tools
- You need to join Lakebase tables with lakehouse data (federated queries)
- You want Unity Catalog governance (permissions, lineage, audit logs) on your Lakebase data
- You want to browse Lakebase schemas and tables in Catalog Explorer

## How it differs from Autoscaling Lakehouse Sync

| | Provisioned (this recipe) | Autoscaling (Lakehouse Sync) |
|---|---|---|
| **Mechanism** | Registers Postgres as a read-only UC catalog; queries go to Postgres via federation | Native CDC replication into Delta tables (SCD Type 2 history) |
| **Data location** | Data stays in Postgres; queries are federated | Data is copied into Delta tables in Unity Catalog |
| **History tracking** | No change history — queries reflect current Postgres state | Full SCD Type 2 change history |
| **Compute** | Requires a serverless SQL warehouse for queries | No external compute — native Lakebase feature |
| **Latency** | Real-time (reads from live Postgres) | Near real-time (CDC streaming lag) |

> **Note on Lakeflow Connect Plugin:** Provisioned previously had a Lakeflow Connect Plugin for Lakebase (Private Preview) that provided CDC-based replication similar to Lakehouse Sync. This plugin will **not** be taken to Public Preview or GA. If you need CDC replication from Lakebase to Unity Catalog, migrate to Autoscaling and use native Lakehouse Sync.

## Prerequisites

- A Databricks workspace with Lakebase enabled
- A **Lakebase Provisioned instance** with at least one database
- `CREATE CATALOG` privileges on the Unity Catalog metastore
- A **serverless SQL warehouse** (required to query the registered catalog)

## Step 1 — Register the database in Unity Catalog

### Via CLI

```bash
databricks database create-database-catalog <CATALOG_NAME> <INSTANCE_NAME> <DATABASE_NAME> --profile <PROFILE>
```

| Placeholder | Value |
|-------------|-------|
| `<CATALOG_NAME>` | Name for the new Unity Catalog catalog |
| `<INSTANCE_NAME>` | Your Lakebase Provisioned instance name |
| `<DATABASE_NAME>` | Postgres database to register (e.g., `databricks_postgres`) |

If the database doesn't exist yet, add `--create-database-if-not-exists`.

### Via UI

1. Click **Apps** in the top right → select **Lakebase Postgres**
2. Click **Provisioned** → select your instance
3. Go to **Catalogs** → click **Add catalog**
4. Enter a **Catalog name** and select the **Postgres database** (or enter a new name to create one)
5. Click **Create**

### Via REST API

```
POST /api/2.0/database/catalogs
```

```json
{
  "name": "<CATALOG_NAME>",
  "database_instance_name": "<INSTANCE_NAME>",
  "database_name": "<DATABASE_NAME>"
}
```
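From a script, the same REST call can be made with any HTTP client. A sketch using Python's standard library — the workspace URL and token are placeholders you supply; sending the request is left commented out:

```python
import json
import urllib.request

# Sketch: register a database catalog via the REST endpoint shown above.
# The workspace URL and token below are placeholders, not real values.
def build_request(workspace_url, token, catalog, instance, database):
    body = json.dumps({
        "name": catalog,
        "database_instance_name": instance,
        "database_name": database,
    }).encode()
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/database/catalogs",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_request("https://example.cloud.databricks.com", "<TOKEN>",
                    "lakebase_cat", "my-instance", "databricks_postgres")
# urllib.request.urlopen(req) would send it; omitted here.
```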

## Step 2 — Verify registration

```bash
databricks database get-database-catalog <CATALOG_NAME> --profile <PROFILE>
```

Then browse to **Catalog** in the workspace sidebar — your Lakebase catalog should appear alongside other UC catalogs.

## Step 3 — Query your Lakebase data

Make sure you have a serverless SQL warehouse running as your compute resource. Then query directly:

```sql
-- Query a Lakebase table through Unity Catalog
SELECT * FROM <catalog_name>.<schema>.<table>
WHERE created_at > current_date - INTERVAL 7 DAYS;
```

```sql
-- Join Lakebase operational data with lakehouse analytics
SELECT
  o.order_id,
  o.status,
  o.total_amount,
  c.lifetime_value,
  c.segment
FROM <catalog_name>.public.orders o
JOIN analytics.gold.customers c
  ON o.customer_id = c.customer_id
WHERE o.created_at > current_date - INTERVAL 30 DAYS;
```

## Step 4 — Manage access

Grant read access to teams:

```sql
GRANT USE CATALOG ON CATALOG <catalog_name> TO `<group_name>`;
GRANT SELECT ON CATALOG <catalog_name> TO `<group_name>`;
```
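To grant several groups at once, the pair of statements can be generated per group. A sketch — the catalog and group names are hypothetical examples:

```python
# Sketch: emit the two GRANT statements shown above for each reader group.
# Catalog and group names are hypothetical.
def grant_statements(catalog, groups):
    stmts = []
    for g in groups:
        stmts.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{g}`;")
        stmts.append(f"GRANT SELECT ON CATALOG {catalog} TO `{g}`;")
    return stmts

for s in grant_statements("lakebase_cat", ["analysts", "data-science"]):
    print(s)
```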

## Delete the catalog

```bash
databricks database delete-database-catalog <CATALOG_NAME> --profile <PROFILE>
```

> Delete all synced tables from the catalog first. Each source table supports a maximum of 20 synced tables, and pending deletions count toward this limit (cleanup can take up to 3 days).

## What you end up with

- A **read-only Unity Catalog catalog** mirroring your Lakebase database structure
- **Federated queries** — query Lakebase data from Spark SQL alongside lakehouse tables
- **Unity Catalog governance** — permissions, lineage, and audit on Lakebase data
- **Catalog Explorer** — browse Lakebase schemas, tables, and views alongside other data sources

## Limitations

- The catalog is **read-only** — modify data through Lakebase directly
- One catalog per Postgres database — register each database separately
- Metadata is cached — click refresh in Catalog Explorer to see new objects
- Database instances are **single-workspace scoped** — no cross-workspace access to table contents
- Database names can only contain alphanumeric characters and underscores (no hyphens)
- **No CDC replication** — unlike Autoscaling's Lakehouse Sync, this approach does not replicate data or track change history; queries are federated to live Postgres
- **Lakeflow Connect Plugin (deprecated path)** — the Private Preview plugin for Provisioned CDC will not go to GA; migrate to Autoscaling for CDC-based ETL

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Catalog not appearing | Ensure you have `CREATE CATALOG` metastore privileges |
| Tables not visible | Attach a serverless SQL warehouse and refresh the catalog view |
| Query errors | Confirm the SQL warehouse is running and can reach the instance |
| Permission denied | Grant `USE CATALOG` + `SELECT` to the querying user/group |

## Learn more

- [Register your database in Unity Catalog (Provisioned)](https://docs.databricks.com/aws/en/oltp/instances/register-uc)
- [Database Instances CLI](https://docs.databricks.com/aws/en/dev-tools/cli/reference/database-commands)
- [Upgrade to Autoscaling](https://docs.databricks.com/aws/en/oltp/upgrade-to-autoscaling)