
WIP: Data Lakehouse AidboxTopicDestination tutorial#15

Draft
spicyfalafel wants to merge 12 commits into main from svt/data-lakehouse-tutorial

@spicyfalafel
Collaborator

Summary

Draft tutorial for the new `Data Lakehouse AidboxTopicDestination` (kind `data-lakehouse-at-least-once`), shipping with the module PR https://github.com/HealthSamurai/topic-destination-deltalake/pull/1 in Aidbox 2605.

Covers:

  • Concept primer (Databricks, data lakehouse, Delta Lake, managed vs external tables — for readers new to the stack)
  • Both write modes side-by-side (`managed` → SQL warehouse INSERT; `external-direct` → direct Delta commit on customer bucket)
  • Auth flow sequence diagram (OAuth M2M → bearer → per-mode dispatch)
  • Required grants in mode-tabs
  • Stepped Patient export walkthrough
  • Idempotency / schema evolution / cost / troubleshooting sections

Status

Draft — opens for early review while:

Test plan

  • Render preview on GitBook staging — mermaid + tabs + stepper widgets render correctly
  • Cross-link from BigQuery / ClickHouse sibling tutorials once this lands
  • Verify the version-compat banner once the module JAR hits `gs://aidbox-modules/topic-destination-deltalake/topic-destination-deltalake-2605.0.jar`

🤖 Generated with Claude Code

spicyfalafel and others added 12 commits May 15, 2026 16:13
Covers both writeModes (managed → Databricks SQL warehouse INSERT;
external-direct → direct Delta commit on customer's bucket). Includes a
concept primer (Databricks / lakehouse / Delta / managed vs external),
an auth-flow sequence diagram, side-by-side privilege grants in tabs,
and a stepped Patient export walkthrough.

Companion to the module PR
HealthSamurai/topic-destination-deltalake#1
which ships in Aidbox 2605.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etails

- Overview / Auth flow / Compaction / Cost notes — break dense paragraphs
  into per-mode bullet lists so the differences read at a glance.
- Strip user-irrelevant implementation details (Kernel, Hadoop S3A,
  TransactionBuilder, library URLs). Replace with user-facing concepts
  (the module writes Parquet + Delta commits, stable transaction id,
  etc.). Keep only the Delta protocol spec link in Related Documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the version-compat warning block (table + "older versions don't
  support this module"). Keep just the simple "available from 2605" hint.
- Drop the "four-minute primer" sentence — let the section header speak.
- Drop the "Common confusion: External Location vs external table" hint
  block — over-explained for the audience.
- Background: don't mention Spark or paste a REST URL into the SQL
  warehouse explanation; describe it functionally. Reword the "two-things
  bundled" closing line so it doesn't reference `managed` mode before
  it's introduced.
- Overview: split the single mode-dispatch mermaid into two diagrams,
  one per writeMode. Easier to read; matches the user's mental model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Overview: drop the "Mode summary" heading; put the explanation bullets
  immediately under each writeMode diagram with a one-line lead-in.
- Replace the "Hot path" jargon — describe what actually happens per batch.
- Promote the Delivery-guarantee {% hint %} block into a real top-level
  section. Remove the duplicate `## Idempotency` section that said the
  same thing in different words. Soft Deletes now references the single
  dedup query example instead of repeating it. Old "Delivery Guarantees
  and Retry" section becomes plain "Retry behavior" (the guarantees half
  is hoisted up).
- Required Databricks privileges: split into "Common to both modes" +
  per-mode tabs so the shared grants live in one place. external-direct
  tab now spells out the EXTERNAL LOCATION grants too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub's mermaid renderer rejects `<` (treated as arrow syntax start) and
unescaped HTML like `<br/>` inside sequenceDiagram lines. Replace with
plain prose ("under 5 min remain", "about 1h TTL") and drop the inline
endpoint paths from arrow labels — they were implementation detail anyway.

Also drop the PAT-not-supported note: user-facing docs shouldn't argue
against a feature we don't ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
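
The diagram labels in this commit describe the token lifecycle only in prose ("under 5 min remain", "about 1h TTL"). A minimal sketch of that refresh rule follows. `TokenCache` and the `fetch_token` callable are hypothetical names for illustration and are not taken from the module; only the 5-minute margin and the roughly 1h TTL come from the text above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Refresh when under 5 minutes remain (value taken from the diagram label).
REFRESH_MARGIN_S = 5 * 60


@dataclass
class TokenCache:
    # Hypothetical callable: performs the OAuth M2M exchange and
    # returns (bearer_token, ttl_seconds). Not the module's real API.
    fetch_token: Callable[[], Tuple[str, float]]
    # Injectable clock so the behaviour can be tested without sleeping.
    clock: Callable[[], float] = time.monotonic
    _token: Optional[str] = None
    _expires_at: float = 0.0

    def get(self) -> str:
        """Return a cached bearer token, refreshing it when it is close to expiry."""
        now = self.clock()
        if self._token is None or self._expires_at - now < REFRESH_MARGIN_S:
            token, ttl = self.fetch_token()
            self._token = token
            self._expires_at = now + ttl
        return self._token
```

With a TTL of about one hour, a cache like this would call `fetch_token` roughly once every 55 minutes regardless of how many batches are sent in between.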
The staging table is what makes managed-mode initial bulk export work
at all — Databricks-managed tables refuse direct writes, so the module
writes Parquet to a temporary external Delta table, asks the SQL
warehouse to copy from it into the managed target, and drops it. The
previous version mentioned this in prose but never drew it.

- Add a numbered-flow mermaid in "How it works — managed mode" showing
  the staging relay (sof.view → staging → INSERT SELECT → DROP).
- Add a one-step diagram in "How it works — external-direct mode"
  so both paths are visually symmetric.
- Cross-link from the Overview managed bullets to the staging section
  so the staging concept lands before the configuration parameters.

Avoids GitHub mermaid issues by using `<br/>` only inside node labels
(supported in flowchart, unlike sequenceDiagram).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rst language

- Add a top-level architecture diagram in Overview showing how
  Aidbox / PostgreSQL / the module / Databricks / cloud storage fit
  together — readers want this before per-mode flow diagrams.
- Hoist "append-only output" (CREATE / UPDATE / DELETE → new rows with
  is_deleted flag) to Overview. Used to be buried in the Data
  Transformation section near the bottom. The Soft Deletes subsection
  there now just cross-links.
- Collapse the artificial split between "Data lakehouse" and
  "Delta Lake" subsections — Delta is what gives the lake ACID, so
  introducing them as if they were two separate concepts was misleading.
- Replace customer / we / us terminology with Aidbox / User / the
  module. The doc is for the User; talking about them in third person
  is the wrong voice.
- Choosing-modes table: drop the confusing "Target" row (both modes
  target Delta tables); add "Table type" + "Storage backends" rows
  that actually tell the reader something different per mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round of correctness + UX polish:

- Costs: we don't actually know a 2X-Small Serverless warehouse costs
  $2k/month, and "Free via Predictive Optimization" conflated "you
  don't think about it" with "you don't pay". Replace specific dollar
  numbers with "see Databricks pricing"; rephrase Predictive
  Optimization rows to talk about who runs OPTIMIZE/VACUUM, not money.
- Drop the "Best for" row from Choosing — the recommendation was a
  guess. Drop the "Choosing batch parameters" hint for the same reason.
- Architecture diagram: User is a rectangle, not a circle; module label
  trimmed; remove duplicate trailing paragraph.
- Background → SQL warehouse: don't tell readers they "send statements
  over REST" — they don't; describe it functionally and note the module
  drives it programmatically.
- External vs managed tables table: add explicit row "Can Aidbox write
  directly?" so the consequence for the module is visible alongside
  the conceptual difference.
- Initial export row in Choosing now spells out staging → INSERT SELECT
  → drop, with link to the staging diagram.
- Configuration → tabs (per-mode), with common parameters duplicated
  inside each tab so readers see one self-contained config table.
- Before you begin: drop the service-principal prerequisite — it gets
  created in the Authentication setup section, not before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Larger pass driven by review:

Structure:
- Move Authentication + Required Privileges above Configuration —
  Authentication is conceptual; Configuration is the User filling in
  values, which only makes sense after the concepts are in place.
- Add an "In a hurry? Jump to Usage Example" link at the top so readers
  who don't need the background can jump straight to setup.
- Drop the Key Features bullet list — pure filler; each bullet was
  already explained in its own section.
- Add numbered annotations under the high-level architecture diagram
  explaining what each arrow means + a forward link to Initial Export.

Accuracy:
- Costs: drop dollar figures (we don't know them); frame the section
  as "where the costs come from" with links to vendor pricing.
- Initial Export Retry: drop the misleading "capped at 30s" — with
  3 attempts the actual max delay is 4s.
- Storage backends listed for external-direct: drop MinIO + Garage
  from user-facing copy. They're test-only S3-compatible emulators;
  doc readers shouldn't think we ship a product against them.
- "writes Parquet + commits" → "writes the Delta files (Parquet + a
  transaction-log commit)" — Delta is the umbrella; Parquet alone
  isn't accurate.
- Bucket scheme example shows s3:// / gs:// / abfss:// instead of
  just s3://.

Other:
- Move the dedup SQL out of Delivery guarantees (was dense code in a
  conceptual section) and into Soft Deletes and Updates where it has
  always belonged. Delivery guarantees now describes the dedup
  pattern in words + cross-links.

Verified against module source: retry counts, backoff math, error
codes in Troubleshooting all match the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
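
The "capped at 30s" correction above is easy to check by hand. The sketch below assumes a doubling backoff with a 1s base delay and one sleep per attempt; the base delay and the exact meaning of "3 attempts" are assumptions, since the commit message only states the 30s cap and the 4s maximum. `backoff_delays` is a hypothetical helper, not a function from the module.

```python
from typing import List


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 30.0) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... limited to cap_s."""
    # base_s is an assumed value; only cap_s (30s) appears in the text above.
    return [min(base_s * 2 ** i, cap_s) for i in range(attempts)]


# With 3 attempts the schedule never gets near the 30s cap.
print(backoff_delays(3))       # [1.0, 2.0, 4.0]
print(max(backoff_delays(3)))  # 4.0
```

Under these assumptions the cap only matters from the sixth attempt onward, which is why mentioning it for a three-attempt policy was misleading.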
… sections

Larger restructure addressing four review items:

1) Drop the standalone "Required Databricks privileges" section + the
   "Setting up a service principal" / "Storing the secret in Vault"
   subsections inside Authentication. All the actual setup steps now
   live in one place — Usage Example Step 4 (a-g):
     a-c: catalog / schema / table / warehouse
     d:   create the service principal
     e:   GRANT SQL (common + per-mode)
     f:   (optional) External Location for managed-mode initial export
     g:   (optional, prod) wire up the SP secret through vault
   Authentication is now purely conceptual; readers don't have to
   triangulate between three sections to figure out the setup order.

2) Merge Append-only output / Delivery guarantees / Soft Deletes-and-dedup
   into a single "Output semantics" section right after Overview:
     - Append-only (Create/Update/Delete → new row)
     - At-least-once delivery + per-mode dedup story
     - One window-function dedup query (was duplicated across two spots)
   The "Soft Deletes and Updates" subsection in Data Transformation is
   now a single cross-link.

3) Slim the Choosing-modes table from 7 rows to 3 essential ones:
   Table type / Who runs maintenance / Databricks compute cost surface.
   The rest (schema drift handling, initial-export path, storage
   backends) moves to bullets below the table.

4) Configuration tables: split per-mode tab into "Required" + collapsible
   "Advanced parameters" details block (and "Authentication parameters"
   for external-direct, since auth there is conditional). Required-first
   makes the surface area scannable.

Also drop the "Local Testing with OSS Unity Catalog" section — it
referenced the (private) module repo's docker-compose.yaml from this
public doc, which couldn't help readers. Local-test setup is module-
developer info, not user-facing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
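
The "Output semantics" item above combines append-only rows, at-least-once delivery, and a single window-function dedup query. The logic of such a query (keep the latest row per `id`, then drop rows flagged `is_deleted`) can be sketched in plain Python. The ordering column `ts` and the function name `current_state` are hypothetical; the tutorial's actual query and column names may differ.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def current_state(rows: Iterable[Row]) -> List[Row]:
    """Collapse an append-only log to the current state of each resource.

    Roughly what a ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) = 1
    filter followed by WHERE NOT is_deleted would compute.
    """
    latest: Dict[object, Row] = {}
    for row in rows:
        seen = latest.get(row["id"])
        # `ts` is an assumed ordering column; redelivered duplicates share it.
        if seen is None or row["ts"] >= seen["ts"]:
            latest[row["id"]] = row
    return [r for r in latest.values() if not r["is_deleted"]]


log = [
    {"id": "p1", "ts": 1, "is_deleted": False, "name": "A"},
    {"id": "p1", "ts": 2, "is_deleted": False, "name": "B"},
    {"id": "p1", "ts": 2, "is_deleted": False, "name": "B"},  # redelivered batch
    {"id": "p2", "ts": 1, "is_deleted": False, "name": "C"},
    {"id": "p2", "ts": 3, "is_deleted": True, "name": "C"},   # soft delete
]
print(current_state(log))
```

Here the duplicate delivery of `p1` collapses to one row and the soft-deleted `p2` drops out, which is the behaviour the dedup-on-read pattern relies on.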
Those two bullets repeated what the Overview section's per-mode
diagrams + bullets already say, and what the "Can Aidbox write
directly?" row in the table above them encodes. Replace with a
one-line bridge to the Overview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects the module change: initial bulk materialization is now an
idempotent MERGE keyed on the resource `id` natural key, not a plain
INSERT SELECT. Sequence diagrams updated; the staging-flow narrative
gains a {% hint %} block explaining the idempotency rationale and the
implicit ViewDefinition contract (must have `id`).

Output semantics also updated: managed mode is now half-idempotent —
initial bulk safe on replay, hot path still at-least-once and still
deduplicated on read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
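
The difference between a plain INSERT SELECT and a MERGE keyed on `id` shows up when the initial bulk step is replayed. The sketch below models both against in-memory structures. The function names are hypothetical, and overwriting matched rows is an assumption about the MERGE clauses, which this commit message does not spell out.

```python
from typing import Dict, List

Row = Dict[str, object]


def insert_select(target: List[Row], staging: List[Row]) -> None:
    """Plain append: every replay adds the staging rows again."""
    target.extend(dict(r) for r in staging)


def merge_on_id(target: Dict[object, Row], staging: List[Row]) -> None:
    """Upsert keyed on `id`: a replay overwrites rows instead of duplicating them."""
    for r in staging:
        # Requires every staging row to carry `id`, mirroring the
        # ViewDefinition contract mentioned above.
        target[r["id"]] = dict(r)


staging = [{"id": "p1", "name": "A"}, {"id": "p2", "name": "B"}]

appended: List[Row] = []
merged: Dict[object, Row] = {}
for _ in range(2):  # simulate the bulk step running twice
    insert_select(appended, staging)
    merge_on_id(merged, staging)

print(len(appended), len(merged))  # 4 2
```

A replay doubles the appended rows but leaves the merged result unchanged, which is the sense in which the initial bulk path becomes safe to retry.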