feat(puffin): add basic data structures and constants#588
feat(puffin): add basic data structures and constants#588zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds foundational C++ types and helpers for Iceberg Puffin file support, plus unit tests, and wires them into the build.
Changes:
- Introduces core Puffin data structures (
Blob,BlobMetadata,FileMetadata) and standard constants (StandardBlobTypes,StandardPuffinProperties). - Adds
PuffinCompressionCodecwith name conversion/parsing helpers andToString()functions. - Adds a dedicated
puffin_testsuite and integrates Puffin sources into the Iceberg library build.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/puffin/blob.h | Defines Blob structure for uncompressed blob payload + metadata. |
| src/iceberg/puffin/blob.cc | Implements ToString(const Blob&). |
| src/iceberg/puffin/blob_metadata.h | Defines BlobMetadata footer metadata structure. |
| src/iceberg/puffin/blob_metadata.cc | Implements ToString(const BlobMetadata&). |
| src/iceberg/puffin/file_metadata.h | Defines FileMetadata footer metadata structure. |
| src/iceberg/puffin/file_metadata.cc | Implements ToString(const FileMetadata&). |
| src/iceberg/puffin/puffin_compression_codec.h | Declares codec enum + name conversion APIs. |
| src/iceberg/puffin/puffin_compression_codec.cc | Implements codec name conversion/parsing + ToString. |
| src/iceberg/puffin/types.h | Adds standard Puffin blob type/property string constants. |
| src/iceberg/puffin/CMakeLists.txt | Installs Puffin headers via CMake. |
| src/iceberg/CMakeLists.txt | Adds Puffin sources to ICEBERG_SOURCES and adds puffin subdirectory. |
| src/iceberg/test/puffin_test.cc | Adds unit tests covering the new Puffin public APIs. |
| src/iceberg/test/CMakeLists.txt | Registers puffin_test in the CMake test suite. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
82d145c to
8a7ed69
Compare
Add the foundational types for Puffin file format support: - Blob, BlobMetadata, FileMetadata structs - PuffinCompressionCodec enum with codec name conversion - StandardBlobTypes and StandardPuffinProperties constants - ToString functions for all types - 24 unit tests covering all public APIs
8a7ed69 to
7e5818d
Compare
evindj
left a comment
There was a problem hiding this comment.
Overall looks good just curious about the choice of using string literals instead of enums in some place.
| /// \brief Standard blob types defined by the Iceberg specification. | ||
| struct StandardBlobTypes { | ||
| /// A serialized form of a "compact" Theta sketch produced by the | ||
| /// Apache DataSketches library. | ||
| static constexpr std::string_view kApacheDatasketchesThetaV1 = | ||
| "apache-datasketches-theta-v1"; | ||
|
|
||
| /// A serialized deletion vector according to the Iceberg spec. | ||
| static constexpr std::string_view kDeletionVectorV1 = "deletion-vector-v1"; | ||
| }; |
There was a problem hiding this comment.
Should we do enum + toString ?
There was a problem hiding this comment.
Per the Puffin spec, type is a "JSON string" field. The spec lists known blob types under "Blob types" but doesn't restrict the field to only those values — new types can be added as the spec evolves. Both the Java and Rust implementations use String for this field rather than an enum. StandardBlobTypes provides well-known constants to avoid typos while keeping the field open for future extensibility.
| blob.input_fields); | ||
| std::format_to(std::back_inserter(repr), "snapshotId={},sequenceNumber={},", | ||
| blob.snapshot_id, blob.sequence_number); | ||
| std::format_to(std::back_inserter(repr), "dataSize={}", blob.data.size()); |
There was a problem hiding this comment.
Nit: may be add a checksum if available for fast comparison(I do understand checksum is not part of the puffin spec).
There was a problem hiding this comment.
That's an interesting idea! Currently Blob doesn't carry a checksum field (the Puffin spec doesn't include one either), so there's nothing available to display here. For equality comparison, Blob already has operator==. If we want to add checksum support down the road, it would probably be a separate discussion about the struct design itself — happy to explore that if there's a use case!
| /// including the blob's location within the file. | ||
| struct ICEBERG_EXPORT BlobMetadata { | ||
| /// Type of the blob. See StandardBlobTypes for known types. | ||
| std::string type; |
There was a problem hiding this comment.
It is weird that the type is declared as string. it makes it harder to understand what type of blobs are accepted but harder to catch typos.
There was a problem hiding this comment.
Same reasoning as StandardBlobTypes — the spec requires this to be an open string.
| int64_t sequence_number; | ||
| /// Offset in the file where the blob data starts. | ||
| int64_t offset; | ||
| /// Length of the blob data in the file (after compression, if compressed). |
There was a problem hiding this comment.
How to check if the blob is compressed, Is it if the compression_codec is set?
There was a problem hiding this comment.
Yes, exactly. If compression_codec has a value, the blob is compressed; if it's std::nullopt, it's uncompressed. I'll improve the comment to make this clearer.
| /// Length of the blob data in the file (after compression, if compressed). | ||
| int64_t length; | ||
| /// Compression codec name, or std::nullopt if uncompressed. | ||
| std::optional<std::string> compression_codec; |
There was a problem hiding this comment.
ditto same pattern string or enum
There was a problem hiding this comment.
BlobMetadata is a direct mapping of the Puffin footer JSON, where compression-codec is a JSON string (or absent). The Java implementation also uses @nullable String here. The conversion from string to PuffinCompressionCodec enum happens at a higher level via PuffinCompressionCodecFromName(), which returns Result<> to properly handle unknown codecs. This layering keeps the metadata faithful to the on-disk format and provides forward compatibility — if a future spec version adds a new codec, we can still deserialize the footer successfully and fail gracefully at decompression time rather than at parsing time.
| case PuffinCompressionCodec::kZstd: | ||
| return kZstdCodecName; | ||
| } | ||
| std::unreachable(); |
There was a problem hiding this comment.
consider returning Result<> so we can avoid throwing an exception here
There was a problem hiding this comment.
std::unreachable() here isn't actually throwing — it's a C++23 compile-time hint that this code path is impossible, since the switch already covers all enum values. This is a common pattern in the codebase (e.g., transform.cc, type.cc, expression.cc, etc.). Returning Result<> would require callers to handle an error that can't happen in practice. That said, PuffinCompressionCodecFromName does return Result<> since its string input can genuinely be invalid — so the two functions intentionally use different strategies based on their input types.
Add the foundational types for Puffin file format support: