
initial iceberg table sink design #32460


Open · wants to merge 8 commits into main

Conversation

martykulma (Contributor)

An initial design for adding an Iceberg table sink to Materialize.

Motivation

Much interest in this topic there is.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@bkirwi (Contributor) left a comment:

Exciting times!

- Support multiple object store implementations
  - S3, GCS, ABS
- Support control over output format
  - File Type: Parquet, Avro, ORC
Contributor:

This feels like a lot of things to implement before this counts as successful! Is there any way to thin down this list further?

Contributor (Author):

Oh yes - it absolutely is! The design doc just calls for success criteria and an MVP, nothing about future work, which a lot of these are. I'll create a Possible Futures section for some of these.

Contributor (Author):

moved to out of scope

```
*Appending data to iceberg requires uploading metadata and data files, and reading data requires accessing metadata to know which data files need to be retrieved. It is inefficient to write very small updates, and very resource intensive for readers to perform the reads when there are many small updates. To give users control over the dimensions of the appended data, Materialize will allow users to optionally specify a minimum size and a maximum period for the sink. Materialize will append data to the iceberg table if either:*
- *the size of the append is above the minimum size*
- *the maximum period of time has passed since the last append*
```
Contributor:

In my past life integrating this sort of thing, it was typically most useful to have a fixed period, i.e. uploading every hour on the hour. This was because it made it easier to schedule downstream batch stuff... if the downstream cron kicks off every hour on the hour as well, aligning the periods minimizes delays and wasted work.

I can imagine it being annoying to support both styles, though. Do we have a sense of how our early users might hope to integrate this?

Contributor (Author):

I wrote this up based on competitor analysis (PG WAL also happens to work this way). I don't have info on early users yet, but I'll be speaking with a potential consumer of this today. Will bring it up!

Contributor (Author):

removed size as a constraint and refined this as just COMMIT INTERVAL, which for right now is wall clock time, but I'm getting some additional feedback.


To manage transactions, the coordinator will track the size of the current iceberg append as well as the last successful append time. Once an append has reached either the minimum size or the maximum period, if set, writes are flushed and the append is committed to iceberg. Materialize will respect the MZ timestamp. All updates that happen within the same tick will be committed together, even if that exceeds the time.

If neither minimum size nor maximum period is set, Materialize will perform appends according to the MZ timestamp.
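
To make the quoted policy concrete, here is a minimal sketch of the flush decision. All type and field names are illustrative, not actual Materialize code, and `tick_complete` stands in for "all updates for the current MZ timestamp have arrived" (see the frontier clarification below):

```rust
use std::time::{Duration, Instant};

// Hypothetical per-sink state and options; names are illustrative only.
struct AppendState {
    pending_bytes: u64,      // size of the not-yet-committed append
    last_append_at: Instant, // time of the last successful append
}

struct SinkOptions {
    min_append_bytes: Option<u64>,
    max_append_interval: Option<Duration>,
}

/// Should the coordinator flush and commit the pending append?
/// `tick_complete` ensures a single MZ timestamp is never split across commits.
fn should_commit(state: &AppendState, opts: &SinkOptions, tick_complete: bool) -> bool {
    if !tick_complete {
        return false;
    }
    match (opts.min_append_bytes, opts.max_append_interval) {
        // Neither option set: append according to the MZ timestamp.
        (None, None) => true,
        (min, max) => {
            min.is_some_and(|m| state.pending_bytes >= m)
                || max.is_some_and(|p| state.last_append_at.elapsed() >= p)
        }
    }
}
```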
Contributor:

Can you expand on "according to the MZ timestamp"? Is that every time the frontier advances, or a separate append for every timestamp in the data, or?

Contributor (Author):

yes, I was intending to have this be when the frontier advances! Will update.

Contributor (Author):

done

Appending data includes insert, update, and delete row operations. To perform deletes, Materialize will generate [equality delete files](https://iceberg.apache.org/spec/#equality-delete-files) that match the fields of the primary key. Inserts are performed via data files. Updates contain delete files and data files in the same commit. Materialize will enforce a 512MB limit (which I borrowed from S3Tables) on the size of the parquet files. Appends that are larger than 512MB will be composed of multiple files.
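
A rough sketch of the multi-file splitting described above, using placeholder types rather than a real Parquet writer (the 512MB figure is the limit quoted in the doc):

```rust
const MAX_DATA_FILE_BYTES: u64 = 512 * 1024 * 1024; // limit quoted above

// Placeholder types; a real implementation would stream Parquet row groups.
struct EncodedRow {
    bytes: Vec<u8>,
}

struct DataFilePlan {
    rows: Vec<EncodedRow>,
    size_bytes: u64,
}

/// Pack encoded rows into data files, starting a new file whenever the next
/// row would push the current file past the per-file size limit.
fn plan_data_files(rows: Vec<EncodedRow>) -> Vec<DataFilePlan> {
    let mut files = Vec::new();
    let mut current = DataFilePlan { rows: Vec::new(), size_bytes: 0 };
    for row in rows {
        let row_size = row.bytes.len() as u64;
        if current.size_bytes > 0 && current.size_bytes + row_size > MAX_DATA_FILE_BYTES {
            files.push(std::mem::replace(
                &mut current,
                DataFilePlan { rows: Vec::new(), size_bytes: 0 },
            ));
        }
        current.size_bytes += row_size;
        current.rows.push(row);
    }
    if !current.rows.is_empty() {
        files.push(current);
    }
    files
}
```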


Appends to the iceberg table will utilize `Fast Append`, which avoids rewrites of manifest files (see [here](https://iceberg.apache.org/spec/#snapshots)). As part of the write, Materialize will store information in the snapshot [Summary properties](https://iceberg.apache.org/spec/#optional-snapshot-summary-fields). The properties field is a `HashMap<String, String>`, in which Materialize will store the timestamp and sink version number. To determine the latest append performed by Materialize, retrieve the most recent snapshots and examine their properties.
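
A small sketch of the summary-property bookkeeping, assuming illustrative key names (`mz_timestamp`, `mz_sink_version`) and a placeholder snapshot type rather than a real catalog API:

```rust
use std::collections::HashMap;

// Placeholder for a snapshot's summary; not an iceberg-rust type.
struct SnapshotSummary {
    properties: HashMap<String, String>,
}

/// Properties Materialize would attach to a Fast Append snapshot.
fn mz_snapshot_properties(mz_timestamp: u64, sink_version: u64) -> HashMap<String, String> {
    HashMap::from([
        ("mz_timestamp".to_string(), mz_timestamp.to_string()),
        ("mz_sink_version".to_string(), sink_version.to_string()),
    ])
}

/// Walk snapshots newest-first and return the most recent MZ append timestamp,
/// skipping snapshots written by other clients (e.g. compaction), which will
/// not carry the mz_ keys.
fn latest_mz_append(snapshots_newest_first: &[SnapshotSummary]) -> Option<u64> {
    snapshots_newest_first
        .iter()
        .find_map(|s| s.properties.get("mz_timestamp")?.parse().ok())
}
```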
Contributor:

Definitely nicer than trying to encode the metadata in the ID!

Contributor:

Is there a risk of external maintenance destroying this metadata? (eg. if the latest snapshot we've written is rewritten.)

Contributor (Author):

It can be lost if the snapshot expiration is very aggressive and MZ doesn't write anything to that iceberg table over a long enough period (e.g. setting the replication factor to 0 for the cluster for a while).

The snapshot and data files are immutable once written, so a compaction would not affect us. My understanding from reading is that every client creates the snapshot summary metadata separately, so a compaction is expected to yield a snapshot with no mz_ keys.

If we end up in a situation with no MZ snapshots, the user will have to recreate the sink. This is expected to be documented; I'll add a blurb to the design.


### Iceberg Table Maintenance

Iceberg, being a table specification, provides [guidance on iceberg table maintenance](https://iceberg.apache.org/docs/latest/maintenance/) tasks. It is up to the engine (Spark, Flink, etc.) to provide the implementation. S3Tables provides this functionality, whereas using a REST catalog alone would not. Implementing this functionality in Materialize would require a long-running, asynchronous service. This could run on a replica, but would also require some method of fencing to ensure that multiple instances of this service are not actively trying to perform maintenance tasks on the same table. Because this service would need to read, compact, and write back data to the iceberg table, it would compete with dataflows in the same replica, making capacity planning more complex and affecting the performance characteristics of the sink.
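
For illustration only, one way the fencing mentioned above could look, assuming the catalog offers an atomic compare-and-swap on table metadata (all names here are hypothetical, not a committed design):

```rust
// Hypothetical catalog operations; not a real Materialize or iceberg-rust API.
struct MaintenanceLease {
    epoch: u64,
}

trait CatalogFencing {
    /// Atomically bump the maintenance epoch iff it still equals `expected`;
    /// returns a lease on success, or None if another instance won the race.
    fn try_acquire_lease(&self, table: &str, expected: u64) -> Option<MaintenanceLease>;

    /// Commit maintenance results only if the lease's epoch is still current,
    /// so a stale instance's commit is rejected.
    fn commit_if_current(&self, table: &str, lease: &MaintenanceLease) -> bool;
}

fn run_maintenance(catalog: &dyn CatalogFencing, table: &str, observed_epoch: u64) {
    let Some(lease) = catalog.try_acquire_lease(table, observed_epoch) else {
        return; // another instance holds the lease; skip this round
    };
    // ... compaction / snapshot expiration would happen here ...
    if !catalog.commit_if_current(table, &lease) {
        // Fenced out mid-run: discard the work and retry later.
    }
}
```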
Contributor:

If we wanted to have materialize implement this maintenance work, it seems reasonable to me to have the sink itself complete it, since it is already a long running asynchronous service that's connected to the catalog, and could order writes to avoid conflicts...

(Would still take additional resources, though. If we think we can punt this work to someone else to start, that seems very convenient!)
