# ETL: Sync Lakebase Tables to Unity Catalog (Autoscaling — Lakehouse Sync)

Replicate your Lakebase Autoscaling Postgres tables into Unity Catalog as managed Delta tables. **Lakehouse Sync** captures every row-level change via CDC and writes it as **SCD Type 2 history**, giving you a full audit trail of how your operational data changed over time, queryable from the lakehouse.

> This recipe is for **Lakebase Autoscaling** (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned, see the Provisioned ETL recipe (coming soon).

## When to use this

- You want to analyze operational data (orders, user activity, support tickets) in the lakehouse
- You need a historical record of every insert, update, and delete from your Postgres tables
- You want to join operational data with analytics data in Spark, SQL, or BI tools
- You need to feed Lakebase data into downstream pipelines or ML models

## How it works

> **Note:** Lakehouse Sync is currently in **Beta on AWS only** (all Autoscaling regions). Azure support is not yet available. It is a native Lakebase feature — no external compute, pipelines, or jobs required, and there is no incremental charge for replication beyond the underlying Lakebase compute and storage costs.

Lakehouse Sync uses Change Data Capture (CDC) to stream changes from Lakebase Postgres into Unity Catalog. For each synced table, a Delta history table is created:

```
lb_<table_name>_history
```

Each row includes metadata columns:
- `_change_type` — `insert`, `update_preimage`, `update_postimage`, or `delete`
- `_lsn` — Log Sequence Number for ordering changes
- `_commit_timestamp` — When the change was captured

## Step 1 — Verify table replica identity

Lakehouse Sync needs updates and deletes to carry complete before-images of each row, which requires the correct replica identity. Connect to your Lakebase database and check:

```sql
SELECT n.nspname AS table_schema,
c.relname AS table_name,
CASE c.relreplident
WHEN 'd' THEN 'default'
WHEN 'n' THEN 'nothing'
WHEN 'f' THEN 'full'
WHEN 'i' THEN 'index'
END AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname = 'public'
ORDER BY n.nspname, c.relname;
```

If a table shows `default` or `nothing`, set it to `FULL`:

```sql
ALTER TABLE <table_name> REPLICA IDENTITY FULL;
```
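
To apply the fix across many tables, the report from the query above can be turned into the required `ALTER` statements. This is a hypothetical helper, not part of Lakebase; `replica_identity_fixes` and its tuple shape are illustrative:

```python
# Hypothetical helper: emit ALTER statements for every table in the
# replica-identity report that still shows 'default' or 'nothing'.
def replica_identity_fixes(report):
    """report: iterable of (schema, table, replica_identity) tuples."""
    return [
        f'ALTER TABLE "{schema}"."{table}" REPLICA IDENTITY FULL;'
        for schema, table, identity in report
        if identity in ("default", "nothing")
    ]

for stmt in replica_identity_fixes([
    ("public", "orders", "default"),
    ("public", "users", "full"),
]):
    print(stmt)  # emits the ALTER for "orders" only; "users" is already FULL
```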

## Step 2 — Check for unsupported data types

```sql
SELECT c.table_schema, c.table_name, c.column_name, c.udt_name AS data_type
FROM information_schema.columns c
JOIN pg_catalog.pg_type t ON t.typname = c.udt_name
WHERE c.table_schema = 'public'
AND c.table_name IN (
SELECT tablename FROM pg_tables WHERE schemaname = c.table_schema
)
AND NOT (
c.udt_name IN (
'bool', 'int2', 'int4', 'int8', 'text', 'varchar', 'bpchar',
'jsonb', 'numeric', 'date', 'timestamp', 'timestamptz',
'real', 'float4', 'float8'
)
OR t.typcategory = 'E'
)
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
```

If unsupported types appear, cast or restructure those columns (for example, store the value as `text`) before enabling sync.
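
The same supported-type check can be run locally over a column list. This is a sketch: the type list mirrors the SQL above, and enum types (`typcategory = 'E'`, also supported) are passed in by name since Python cannot see `pg_type`:

```python
# Postgres udt names the Step 2 query treats as supported.
SUPPORTED_UDT = {
    "bool", "int2", "int4", "int8", "text", "varchar", "bpchar",
    "jsonb", "numeric", "date", "timestamp", "timestamptz",
    "real", "float4", "float8",
}

def unsupported_columns(columns, enum_types=frozenset()):
    """columns: iterable of (column_name, udt_name) pairs."""
    return [
        name for name, udt in columns
        if udt not in SUPPORTED_UDT and udt not in enum_types
    ]

print(unsupported_columns([("id", "int4"), ("payload", "bytea")]))  # ['payload']
```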

## Step 3 — Enable Lakehouse Sync

> **Note:** This step is not yet available via CLI or REST API and must be completed through the Databricks UI:
>
> In **Catalog**, open your Autoscaling project → branch → **Lakehouse Sync** → **Start Sync**, then select the source database/schema, destination catalog/schema, and tables.

## Step 4 — Monitor sync status

Check active syncs from Postgres (the `wal2delta` schema only exists after Lakehouse Sync has been enabled in Step 3):

```sql
SELECT * FROM wal2delta.tables;
```

## Step 5 — Query the history tables

### Latest state of each row

```sql
-- Replace "id" with your table's primary key column
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
AND _change_type != 'delete';
```
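
For intuition, the same latest-state logic sketched in plain Python (rows here are illustrative `(pk, lsn, change_type, value)` tuples): `update_preimage` rows are skipped, and keys whose newest change is a delete are dropped.

```python
def latest_state(history):
    """history: iterable of (pk, lsn, change_type, value) change rows."""
    latest = {}
    for row in history:
        pk, lsn, change_type = row[0], row[1], row[2]
        if change_type not in ("insert", "update_postimage", "delete"):
            continue  # ignore update_preimage rows
        if pk not in latest or lsn > latest[pk][1]:
            latest[pk] = row  # keep the change with the highest _lsn per key
    # drop keys whose most recent change is a delete
    return {pk: row for pk, row in latest.items() if row[2] != "delete"}

rows = [
    (1, 10, "insert", "alice"),
    (1, 20, "update_postimage", "alicia"),
    (2, 15, "insert", "bob"),
    (2, 30, "delete", None),
]
print(latest_state(rows))  # {1: (1, 20, 'update_postimage', 'alicia')}
```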

### Full change history for a record

```sql
SELECT *
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE id = 12345
ORDER BY _lsn;
```

## Handling schema changes

If you need to change a synced table's schema in Postgres, use the rename-and-swap pattern. Pause writes to the table during the copy, or rows written between the `INSERT` and the rename will be lost:

```sql
-- 1. Create the replacement table with the new schema
CREATE TABLE users_v2 (
  id INT PRIMARY KEY,
  name TEXT,
  new_column TEXT
);

-- 2. New tables default to REPLICA IDENTITY DEFAULT; sync needs FULL
ALTER TABLE users_v2 REPLICA IDENTITY FULL;

-- 3. Copy existing rows; new_column starts out NULL
INSERT INTO users_v2 SELECT *, NULL FROM users;

-- 4. Swap names atomically
BEGIN;
ALTER TABLE users RENAME TO users_backup;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
```

## What you end up with

- **Delta history tables** in Unity Catalog (`lb_<table_name>_history`) with full SCD Type 2 change tracking
- **Continuous replication** — changes stream from Postgres to Delta automatically
- **No external compute** — Lakehouse Sync is a native Lakebase feature
- Operational data queryable in Spark SQL, notebooks, BI tools, and downstream pipelines

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Table not appearing in sync | Ensure it has a primary key or `REPLICA IDENTITY FULL` |
| Unsupported data type error | Check column types with the query in Step 2 |
| Sync lag increasing | Check Lakebase endpoint health and compute scaling |
| Missing changes on update/delete | Verify `REPLICA IDENTITY FULL` — `default` only captures PK columns |

## Limitations and notes

- **AWS only** — Lakehouse Sync Beta is available in all Autoscaling regions on AWS. Azure support is not yet available.
- **No incremental charge** — replication cost is included in your Lakebase compute and storage.
- **Works alongside synced tables** — you can use Lakehouse Sync in a project/schema that also has Reverse ETL synced tables.

## Learn more

- [Lakehouse Sync (Autoscaling)](https://docs.databricks.com/aws/en/oltp/projects/lakehouse-sync)
- [Register Lakebase in Unity Catalog](https://docs.databricks.com/aws/en/oltp/projects/register-uc)
- [SCD Type 2 in Databricks](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/scd)
---
# Reverse ETL: Sync a Unity Catalog Table to Lakebase (Autoscaling)

Serve lakehouse data through Lakebase Autoscaling Postgres so your applications can query it with sub-10ms latency. This creates a **synced table** — a managed copy of your Unity Catalog table in Lakebase that stays up to date automatically.

> This recipe is for **Lakebase Autoscaling** (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned (manually scaled instances), see the Provisioned Reverse ETL recipe (coming soon).

## When to use this

- Your app needs fast lookup-style queries against analytics data (user profiles, feature values, risk scores)
- You want to serve gold tables, ML outputs, or enriched records through a standard Postgres connection
- You need ACID transactions and sub-10ms reads alongside your operational state

## Choose a sync mode


| Mode | Behavior | Best for |
| -------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------- |
| **Snapshot** | One-time full copy | Source changes >10% of rows per cycle, or source doesn't support CDF (views, Iceberg) |
| **Triggered** | Incremental updates on demand or on a schedule | Known cadence of changes, good cost/freshness balance |
| **Continuous** | Real-time streaming (seconds of latency) | Changes must appear in Lakebase near-instantly |


> **Triggered** and **Continuous** modes require [Change Data Feed (CDF)](https://docs.databricks.com/aws/en/delta/delta-change-data-feed) enabled on the source table. If it's not enabled, run:
>
> ```sql
> ALTER TABLE <catalog>.<schema>.<table> SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
> ```
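
The decision table above can be condensed into a small helper. This is an illustration only, not a Databricks API; the 10% threshold comes from the table, and the function names and parameters are assumptions:

```python
def choose_sync_mode(change_fraction, source_supports_cdf=True,
                     needs_realtime=False):
    """Pick a sync mode per the guidance table above."""
    if not source_supports_cdf or change_fraction > 0.10:
        return "SNAPSHOT"   # full copy is cheaper, or CDF unavailable
    if needs_realtime:
        return "CONTINUOUS"
    return "TRIGGERED"      # incremental on a known cadence

print(choose_sync_mode(0.05))                       # TRIGGERED
print(choose_sync_mode(0.50))                       # SNAPSHOT
print(choose_sync_mode(0.01, needs_realtime=True))  # CONTINUOUS
```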

## Sync throughput

Autoscaling CUs are physically 8x smaller than Provisioned CUs, so per-CU throughput differs:


| Mode | Rows/sec per CU |
| ---------------------------------------- | --------------- |
| **Snapshot** (initial + full refresh) | ~2,000 |
| **Triggered / Continuous** (incremental) | ~150 |


> A 10x speedup for large-table snapshot sync (writing Postgres pages directly, leveraging separation of storage and compute) is coming for Autoscaling only.
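
For planning, these per-CU rates give a rough sync-time estimate. A minimal sketch, assuming the table's throughput numbers hold for your workload:

```python
def sync_hours(rows, rows_per_sec_per_cu, cus=1):
    """Pipeline hours to move `rows` at the table's per-CU rate."""
    return rows / (rows_per_sec_per_cu * cus * 3600)

# e.g. an initial snapshot of 181M rows on 1 CU at ~2,000 rows/sec:
print(round(sync_hours(181_000_000, 2_000), 1))  # 25.1
```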

## Create a synced table

```bash
databricks database create-synced-database-table \
--json '{
"name": "<CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME>",
"database_instance_name": "<INSTANCE_NAME>",
"logical_database_name": "<POSTGRES_DATABASE>",
"spec": {
"source_table_full_name": "<CATALOG>.<SCHEMA>.<SOURCE_TABLE>",
"primary_key_columns": ["<PRIMARY_KEY_COLUMN>"],
"scheduling_policy": "<SNAPSHOT|TRIGGERED|CONTINUOUS>",
"create_database_objects_if_missing": true
}
}' --profile <PROFILE>
```

> If your Lakebase database is **registered as a Unity Catalog catalog**, you can omit `database_instance_name` and `logical_database_name`.

Verify:

```bash
databricks database get-synced-database-table <CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME> --profile <PROFILE>
```

> **Important:** If your Autoscaling project was created via the `/postgres/` API (not `/database/`), programmatic synced table creation is not yet available via CLI — use the Databricks UI as a fallback. In **Catalog**, select the source table → **Create synced table**, then choose your Lakebase project, branch, sync mode, and pipeline. This gap is expected to close soon.

## Pipeline reuse guidance

How you set up pipelines depends on your sync mode:


| Sync mode | Recommendation | Why |
| ------------------------ | -------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Continuous** | **Reuse** a pipeline across ~10 tables | Cost-advantageous — e.g., 1 pipeline for 10 tables ≈ $204/table/month vs $2,044/table/month for individual pipelines |
| **Snapshot / Triggered** | **Separate** pipelines per table | Allows re-snapshotting individual tables without impacting others |


## Schedule ongoing syncs

The initial snapshot runs automatically on creation. For **Snapshot** and **Triggered** modes, subsequent syncs need to be triggered.

> **Note:** Table-update triggers for sync pipelines are not yet available via CLI and must be configured through the Databricks UI: **Workflows** → create/open a job → add a **Database Table Sync pipeline** task → **Schedules & Triggers** → add a **Table update** trigger pointing to your source table.

Trigger a sync update programmatically via the Databricks SDK:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

table = w.database.get_synced_database_table(
name="<CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME>"
)
pipeline_id = table.data_synchronization_status.pipeline_id

w.pipelines.start_update(pipeline_id=pipeline_id)
```

## Query the synced data in Postgres

Once synced, the table is available in Lakebase Postgres. The Unity Catalog schema becomes the Postgres schema:

```sql
SELECT * FROM "<schema>"."<synced_table_name>" WHERE "user_id" = 12345;
```

Connect with any standard Postgres client (psql, DBeaver, your application's Postgres driver).

## What you end up with

- A **synced table** in Unity Catalog that tracks the sync pipeline
- A **read-only Postgres table** in Lakebase that your apps can query with sub-10ms latency
- A **managed Lakeflow pipeline** that keeps the data in sync based on your chosen mode
- Up to **16 connections** per sync to your Lakebase database

## Important constraints

- **Primary key is mandatory.** Synced tables always require a primary key — it enables efficient point lookups and incremental updates. Rows with nulls in PK columns are excluded from the sync.
- **Duplicate primary keys fail the sync** unless you configure a `timeseries_key` for deduplication (latest value wins per PK). Using a timeseries key has a performance penalty.
- **Schema changes**: For Triggered/Continuous mode, only **additive** changes (e.g., adding a column) propagate. Dropping or renaming columns requires recreating the synced table.
- **FGAC tables**: Direct sync of Fine-Grained Access Control tables fails. **Workaround**: create a view (`SELECT * FROM table`), then sync the view in Snapshot mode. Caveat: runs as the sync creator and only sees their visible rows.
- **Connection limits**: Autoscaling supports up to 4,000 concurrent connections (varies by compute size). Each sync uses up to 16 connections.
- **Read-only in Postgres**: Synced tables should only be read from Postgres. Writing to them interferes with the sync pipeline.
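
The "latest value wins" deduplication that a `timeseries_key` performs can be pictured like this. A sketch for intuition only, using hypothetical column names; the actual dedup happens inside the sync pipeline:

```python
def dedupe_by_timeseries_key(rows, pk, ts):
    """Among rows sharing a primary key, keep the highest timeseries value."""
    best = {}
    for row in rows:
        key = row[pk]
        if key not in best or row[ts] > best[key][ts]:
            best[key] = row
    return list(best.values())

rows = [
    {"user_id": 1, "updated_at": 1, "score": 0.2},
    {"user_id": 1, "updated_at": 5, "score": 0.9},  # wins for user_id=1
    {"user_id": 2, "updated_at": 3, "score": 0.5},
]
result = dedupe_by_timeseries_key(rows, "user_id", "updated_at")
```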

## Cost guidance

Cost per sync run: `[Rows / (Speed × CUs × 3600)] × DLT hourly rate`, where the bracketed term is pipeline hours.

Example costs (181M rows, 1 CU, $2.80/hr DLT rate):


| Mode | Monthly cost |
| ---------------------------------- | ------------ |
| Snapshot (daily) | ~$2,110 |
| Triggered (daily, 5% changes) | ~$1,407 |
| Continuous (10 tables, 1 pipeline) | ~$204/table |
| Continuous (1 table, 1 pipeline) | ~$2,044 |
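
These figures can be reproduced from the formula. A sketch under the example's assumptions (181M rows, 1 CU, a $2.80/hr DLT rate, 30-day month); the continuous figure is simply 730 pipeline-hours per month at the hourly rate:

```python
def run_cost(rows, rows_per_sec_per_cu, cus, dlt_hourly_rate):
    """Cost of one sync run: pipeline hours times the DLT hourly rate."""
    hours = rows / (rows_per_sec_per_cu * cus * 3600)
    return hours * dlt_hourly_rate

RATE, ROWS = 2.80, 181_000_000

snapshot_monthly = run_cost(ROWS, 2_000, 1, RATE) * 30        # daily full copy
triggered_monthly = run_cost(ROWS * 0.05, 150, 1, RATE) * 30  # 5% changed daily
continuous_monthly = 730 * RATE                               # runs 24/7

print(round(snapshot_monthly))    # 2112 (table rounds to ~$2,110)
print(round(triggered_monthly))   # 1408 (table rounds to ~$1,407)
print(round(continuous_monthly))  # 2044; ~$204/table when 10 tables share it
```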


## Troubleshooting


| Issue | Fix |
| ----------------------------------- | ----------------------------------------------------------------------------------------- |
| CDF not enabled warning | Run `ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true)` on the source |
| Schema not visible in create dialog | Confirm you have `USE_SCHEMA` and `CREATE_TABLE` on the target schema |
| Null bytes in string columns | Clean source data: `SELECT REPLACE(col, CAST(CHAR(0) AS STRING), '') AS col FROM table` |
| Sync failing | Check the pipeline in the synced table's Overview tab for error details |
| FGAC table sync fails | Create a view over the table and sync the view in Snapshot mode |
| Duplicate primary key failure | Add a `timeseries_key` to deduplicate (latest wins) |


## Learn more

- [Synced tables (Autoscaling)](https://docs.databricks.com/aws/en/oltp/projects/sync-tables)
- [Change Data Feed](https://docs.databricks.com/aws/en/delta/delta-change-data-feed)
- [Lakebase Autoscaling](https://docs.databricks.com/aws/en/oltp/projects/)