WIP: Data Lakehouse AidboxTopicDestination tutorial #15
Draft
spicyfalafel wants to merge 12 commits into
Covers both writeModes (managed → Databricks SQL warehouse INSERT; external-direct → direct Delta commit on customer's bucket). Includes a concept primer (Databricks / lakehouse / Delta / managed vs external), an auth-flow sequence diagram, side-by-side privilege grants in tabs, and a stepped Patient export walkthrough. Companion to the module PR HealthSamurai/topic-destination-deltalake#1, which ships in Aidbox 2605.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etails
- Overview / Auth flow / Compaction / Cost notes: break dense paragraphs into per-mode bullet lists so the differences read at a glance.
- Strip user-irrelevant implementation details (Kernel, Hadoop S3A, TransactionBuilder, library URLs). Replace with user-facing concepts (the module writes Parquet + Delta commits, stable transaction id, etc.). Keep only the Delta protocol spec link in Related Documentation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the version-compat warning block (table + "older versions don't support this module"). Keep just the simple "available from 2605" hint.
- Drop the "four-minute primer" sentence; let the section header speak.
- Drop the "Common confusion: External Location vs external table" hint block, which was over-explained for the audience.
- Background: don't mention Spark or paste a REST URL into the SQL warehouse explanation; describe it functionally. Reword the "two-things bundled" closing line so it doesn't reference `managed` mode before it's introduced.
- Overview: split the single mode-dispatch mermaid into two diagrams, one per writeMode. Easier to read; matches the user's mental model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Overview: drop the "Mode summary" heading; put the explanation bullets
immediately under each writeMode diagram with a one-line lead-in.
- Replace the "Hot path" jargon — describe what actually happens per batch.
- Promote the Delivery-guarantee {% hint %} block into a real top-level
section. Remove the duplicate `## Idempotency` section that said the
same thing in different words. Soft Deletes now references the single
dedup query example instead of repeating it. Old "Delivery Guarantees
and Retry" section becomes plain "Retry behavior" (the guarantees half
is hoisted up).
- Required Databricks privileges: split into "Common to both modes" +
per-mode tabs so the shared grants live in one place. external-direct
tab now spells out the EXTERNAL LOCATION grants too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub's mermaid renderer rejects `<` (treated as arrow syntax start) and
unescaped HTML like `<br/>` inside sequenceDiagram lines. Replace with
plain prose ("under 5 min remain", "about 1h TTL") and drop the inline
endpoint paths from arrow labels — they were implementation detail anyway.
Also drop the PAT-not-supported note: user-facing docs shouldn't argue
against a feature we don't ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The staging table is what makes managed-mode initial bulk export work at all: Databricks-managed tables refuse direct writes, so the module writes Parquet to a temporary external Delta table, asks the SQL warehouse to copy from it into the managed target, and drops it. The previous version mentioned this in prose but never drew it.
- Add a numbered-flow mermaid in "How it works — managed mode" showing the staging relay (sof.view → staging → INSERT SELECT → DROP).
- Add a one-step diagram in "How it works — external-direct mode" so both paths are visually symmetric.
- Cross-link from the Overview managed bullets to the staging section so the staging concept lands before the configuration parameters.
Avoids GitHub mermaid issues by using `<br/>` only inside node labels (supported in flowchart, unlike sequenceDiagram).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rst language
- Add a top-level architecture diagram in Overview showing how Aidbox / PostgreSQL / the module / Databricks / cloud storage fit together; readers want this before per-mode flow diagrams.
- Hoist "append-only output" (CREATE / UPDATE / DELETE → new rows with is_deleted flag) to Overview. Used to be buried in the Data Transformation section near the bottom. The Soft Deletes subsection there now just cross-links.
- Collapse the artificial split between "Data lakehouse" and "Delta Lake" subsections. Delta is what gives the lake ACID, so introducing them as if they were two separate concepts was misleading.
- Replace customer / we / us terminology with Aidbox / User / the module. The doc is for the User; talking about them in third person is the wrong voice.
- Choosing-modes table: drop the confusing "Target" row (both modes target Delta tables); add "Table type" + "Storage backends" rows that actually tell the reader something different per mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round of correctness + UX polish:
- Costs: we don't actually know a 2X-Small Serverless warehouse costs $2k/month, and "Free via Predictive Optimization" conflated "you don't think about it" with "you don't pay". Replace specific dollar numbers with "see Databricks pricing"; rephrase Predictive Optimization rows to talk about who runs OPTIMIZE/VACUUM, not money.
- Drop the "Best for" row from Choosing, since the recommendation was a guess. Drop the "Choosing batch parameters" hint for the same reason.
- Architecture diagram: User is a rectangle, not a circle; module label trimmed; remove duplicate trailing paragraph.
- Background → SQL warehouse: don't tell readers they "send statements over REST" (they don't); describe it functionally and note the module drives it programmatically.
- External vs managed tables table: add explicit row "Can Aidbox write directly?" so the consequence for the module is visible alongside the conceptual difference.
- Initial export row in Choosing now spells out staging → INSERT SELECT → drop, with link to the staging diagram.
- Configuration → tabs (per-mode), with common parameters duplicated inside each tab so readers see one self-contained config table.
- Before you begin: drop the service-principal prerequisite. It gets created in the Authentication setup section, not before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Larger pass driven by review:
Structure:
- Move Authentication + Required Privileges above Configuration. Authentication is conceptual; Configuration is the User filling in values, which only makes sense after the concepts are in place.
- Add a "In a hurry? Jump to Usage Example" link at the top so readers who don't need the background can jump straight to setup.
- Drop the Key Features bullet list (pure filler; each bullet was already explained in its own section).
- Add numbered annotations under the high-level architecture diagram explaining what each arrow means + a forward link to Initial Export.
Accuracy:
- Costs: drop dollar figures (we don't know them); frame the section as "where the costs come from" with links to vendor pricing.
- Initial Export Retry: drop the misleading "capped at 30s"; with 3 attempts the actual max delay is 4s.
- Storage backends listed for external-direct: drop MinIO + Garage from user-facing copy. They're test-only S3-compatible emulators; doc readers shouldn't think we ship a product against them.
- "writes Parquet + commits" → "writes the Delta files (Parquet + a transaction-log commit)". Delta is the umbrella; Parquet alone isn't accurate.
- Bucket scheme example shows s3:// / gs:// / abfss:// instead of just s3://.
Other:
- Move the dedup SQL out of Delivery guarantees (was dense code in a conceptual section) and into Soft Deletes and Updates where it has always belonged. Delivery guarantees now describes the dedup pattern in words + cross-links.
Verified against module source: retry counts, backoff math, error codes in Troubleshooting all match the implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
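To illustrate the retry note above (a 30s cap that can never be reached within 3 attempts), here is a minimal sketch of capped exponential backoff. The 1-second base, the doubling factor, and the function name `backoff_delays` are assumptions chosen only to reproduce the 4s figure; the module's real schedule may be defined differently.

```python
def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 30.0) -> list[float]:
    """Delay scheduled after each failed attempt.

    Hypothetical schedule: the delay starts at base_s, doubles after every
    failure, and is clamped to cap_s. These defaults are assumptions, not
    values taken from the module.
    """
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


if __name__ == "__main__":
    delays = backoff_delays(3)
    print("delays for 3 attempts:", delays)
    print("largest delay:", max(delays), "seconds")
```

Under these assumptions the schedule for 3 attempts is 1s, 2s, 4s, so the cap only starts to matter from the sixth delay onward. That is why quoting the cap in user-facing docs would be misleading.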
… sections
Larger restructure addressing four review items:
1) Drop the standalone "Required Databricks privileges" section + the
"Setting up a service principal" / "Storing the secret in Vault"
subsections inside Authentication. All the actual setup steps now
live in one place — Usage Example Step 4 (a-g):
a-c: catalog / schema / table / warehouse
d: create the service principal
e: GRANT SQL (common + per-mode)
f: (optional) External Location for managed-mode initial export
g: (optional, prod) wire up the SP secret through vault
Authentication is now purely conceptual; readers don't have to
triangulate between three sections to figure out the setup order.
2) Merge Append-only output / Delivery guarantees / Soft Deletes-and-dedup
into a single "Output semantics" section right after Overview:
- Append-only (Create/Update/Delete → new row)
- At-least-once delivery + per-mode dedup story
- One window-function dedup query (was duplicated across two spots)
The "Soft Deletes and Updates" subsection in Data Transformation is
now a single cross-link.
3) Slim the Choosing-modes table from 7 rows to 3 essential ones:
Table type / Who runs maintenance / Databricks compute cost surface.
The rest (schema drift handling, initial-export path, storage
backends) moves to bullets below the table.
4) Configuration tables: split per-mode tab into "Required" + collapsible
"Advanced parameters" details block (and "Authentication parameters"
for external-direct, since auth there is conditional). Required-first
makes the surface area scannable.
Also drop the "Local Testing with OSS Unity Catalog" section — it
referenced the (private) module repo's docker-compose.yaml from this
public doc, which couldn't help readers. Local-test setup is module-
developer info, not user-facing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
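The "Output semantics" item above combines append-only rows, at-least-once delivery, and a single window-function dedup query. As a rough illustration of what such a query computes, the sketch below keeps the newest row per `id` and then drops rows flagged `is_deleted`. The ordering column `ts` and the function name `latest_live_rows` are hypothetical; the tutorial's actual query and column names may differ.

```python
from typing import Any

Row = dict[str, Any]


def latest_live_rows(rows: list[Row], order_col: str = "ts") -> list[Row]:
    """Emulate ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) = 1,
    followed by a filter on is_deleted.

    `order_col` is an assumed ordering column, not a name from the tutorial.
    """
    latest: dict[str, Row] = {}
    for row in rows:
        current = latest.get(row["id"])
        # Keep the newest version; exact duplicates from redelivery collapse here.
        if current is None or row[order_col] >= current[order_col]:
            latest[row["id"]] = row
    return [r for r in latest.values() if not r["is_deleted"]]


if __name__ == "__main__":
    appended = [
        {"id": "p1", "ts": 1, "is_deleted": False, "family": "Smith"},
        {"id": "p1", "ts": 2, "is_deleted": False, "family": "Smythe"},
        {"id": "p1", "ts": 2, "is_deleted": False, "family": "Smythe"},  # redelivered
        {"id": "p2", "ts": 1, "is_deleted": False, "family": "Jones"},
        {"id": "p2", "ts": 3, "is_deleted": True, "family": "Jones"},    # soft delete
    ]
    for row in latest_live_rows(appended):
        print(row)
```

In this example only the latest `p1` row survives: the redelivered duplicate collapses into it, and `p2` disappears because its newest row is a soft delete.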
Those two bullets repeated what the Overview section's per-mode diagrams + bullets already say, and what the "Can Aidbox write directly?" row in the table above them encodes. Replace with a one-line bridge to the Overview.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects the module change: initial bulk materialization is now an
idempotent MERGE keyed on the resource `id` natural key, not a plain
INSERT SELECT. Sequence diagrams updated; the staging-flow narrative
gains a {% hint %} block explaining the idempotency rationale and the
implicit ViewDefinition contract (must have `id`).
Output semantics also updated. Managed mode is now half-idempotent:
the initial bulk load is safe on replay, while the hot path is still
at-least-once and still deduplicated on read.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
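The difference between a plain INSERT SELECT and a MERGE keyed on `id` shows up when the same staging batch is replayed. The sketch below models both with in-memory structures; `insert_select` and `merge_by_id` are hypothetical names for illustration, not the module's code or the SQL it issues.

```python
from typing import Any

Row = dict[str, Any]


def insert_select(target: list[Row], staging: list[Row]) -> None:
    """Plain append: every replay adds the staging rows again."""
    target.extend(staging)


def merge_by_id(target: dict[str, Row], staging: list[Row]) -> None:
    """Upsert keyed on the resource `id`: a replay overwrites matching rows.

    A row without an `id` raises KeyError, mirroring the implicit contract
    that the ViewDefinition must expose an `id` column.
    """
    for row in staging:
        target[row["id"]] = row


if __name__ == "__main__":
    staging = [{"id": "p1", "family": "Smith"}, {"id": "p2", "family": "Jones"}]

    appended: list[Row] = []
    merged: dict[str, Row] = {}
    for _ in range(2):  # run the initial export twice, e.g. after a retry
        insert_select(appended, staging)
        merge_by_id(merged, staging)

    print("rows after replayed INSERT SELECT:", len(appended))
    print("rows after replayed MERGE:", len(merged))
```

Replaying the append doubles the row count, while the keyed merge leaves the target unchanged. This is the property that makes the initial bulk load safe to retry.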
Summary
Draft tutorial for the new `Data Lakehouse AidboxTopicDestination` (kind `data-lakehouse-at-least-once`), shipping with the module PR https://github.com/HealthSamurai/topic-destination-deltalake/pull/1 in Aidbox 2605.
Covers:
Status
Draft — opens for early review while:
Test plan
🤖 Generated with Claude Code