Skip to content

feat(puffin): add basic data structures and constants#588

Open
zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
zhaoxuan1994:feat/puffin-basic-types
Open

feat(puffin): add basic data structures and constants#588
zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
zhaoxuan1994:feat/puffin-basic-types

Conversation

@zhaoxuan1994
Copy link

Add the foundational types for Puffin file format support:

  • Blob, BlobMetadata, FileMetadata structs
  • PuffinCompressionCodec enum with codec name conversion
  • StandardBlobTypes and StandardPuffinProperties constants
  • ToString functions for all types
  • 24 unit tests covering all public APIs

Copilot AI review requested due to automatic review settings March 11, 2026 08:23
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds foundational C++ types and helpers for Iceberg Puffin file support, plus unit tests, and wires them into the build.

Changes:

  • Introduces core Puffin data structures (Blob, BlobMetadata, FileMetadata) and standard constants (StandardBlobTypes, StandardPuffinProperties).
  • Adds PuffinCompressionCodec with name conversion/parsing helpers and ToString() functions.
  • Adds a dedicated puffin_test suite and integrates Puffin sources into the Iceberg library build.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/iceberg/puffin/blob.h Defines Blob structure for uncompressed blob payload + metadata.
src/iceberg/puffin/blob.cc Implements ToString(const Blob&).
src/iceberg/puffin/blob_metadata.h Defines BlobMetadata footer metadata structure.
src/iceberg/puffin/blob_metadata.cc Implements ToString(const BlobMetadata&).
src/iceberg/puffin/file_metadata.h Defines FileMetadata footer metadata structure.
src/iceberg/puffin/file_metadata.cc Implements ToString(const FileMetadata&).
src/iceberg/puffin/puffin_compression_codec.h Declares codec enum + name conversion APIs.
src/iceberg/puffin/puffin_compression_codec.cc Implements codec name conversion/parsing + ToString.
src/iceberg/puffin/types.h Adds standard Puffin blob type/property string constants.
src/iceberg/puffin/CMakeLists.txt Installs Puffin headers via CMake.
src/iceberg/CMakeLists.txt Adds Puffin sources to ICEBERG_SOURCES and adds puffin subdirectory.
src/iceberg/test/puffin_test.cc Adds unit tests covering the new Puffin public APIs.
src/iceberg/test/CMakeLists.txt Registers puffin_test in the CMake test suite.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-basic-types branch from 82d145c to 8a7ed69 Compare March 11, 2026 09:04
Add the foundational types for Puffin file format support:
- Blob, BlobMetadata, FileMetadata structs
- PuffinCompressionCodec enum with codec name conversion
- StandardBlobTypes and StandardPuffinProperties constants
- ToString functions for all types
- 24 unit tests covering all public APIs
@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-basic-types branch from 8a7ed69 to 7e5818d Compare March 11, 2026 09:29
Copy link
Contributor

@evindj evindj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good just curious about the choice of using string literals instead of enums in some place.

Comment on lines +29 to +38
/// \brief Standard blob types defined by the Iceberg specification.
struct StandardBlobTypes {
/// A serialized form of a "compact" Theta sketch produced by the
/// Apache DataSketches library.
static constexpr std::string_view kApacheDatasketchesThetaV1 =
"apache-datasketches-theta-v1";

/// A serialized deletion vector according to the Iceberg spec.
static constexpr std::string_view kDeletionVectorV1 = "deletion-vector-v1";
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do enum + toString ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Puffin spec, type is a "JSON string" field. The spec lists known blob types under "Blob types" but doesn't restrict the field to only those values — new types can be added as the spec evolves. Both the Java and Rust implementations use String for this field rather than an enum. StandardBlobTypes provides well-known constants to avoid typos while keeping the field open for future extensibility.

blob.input_fields);
std::format_to(std::back_inserter(repr), "snapshotId={},sequenceNumber={},",
blob.snapshot_id, blob.sequence_number);
std::format_to(std::back_inserter(repr), "dataSize={}", blob.data.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: may be add a checksum if available for fast comparison(I do understand checksum is not part of the puffin spec).

Copy link
Author

@zhaoxuan1994 zhaoxuan1994 Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea! Currently Blob doesn't carry a checksum field (the Puffin spec doesn't include one either), so there's nothing available to display here. For equality comparison, Blob already has operator==. If we want to add checksum support down the road, it would probably be a separate discussion about the struct design itself — happy to explore that if there's a use case!

/// including the blob's location within the file.
struct ICEBERG_EXPORT BlobMetadata {
/// Type of the blob. See StandardBlobTypes for known types.
std::string type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is weird that the type is declared as string. it makes it harder to understand what type of blobs are accepted but harder to catch typos.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reasoning as StandardBlobTypes — the spec requires this to be an open string.

int64_t sequence_number;
/// Offset in the file where the blob data starts.
int64_t offset;
/// Length of the blob data in the file (after compression, if compressed).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to check if the blob is compressed, Is it if the compression_codec is set?

Copy link
Author

@zhaoxuan1994 zhaoxuan1994 Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. If compression_codec has a value, the blob is compressed; if it's std::nullopt, it's uncompressed. I'll improve the comment to make this clearer.

/// Length of the blob data in the file (after compression, if compressed).
int64_t length;
/// Compression codec name, or std::nullopt if uncompressed.
std::optional<std::string> compression_codec;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto same pattern string or enum

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlobMetadata is a direct mapping of the Puffin footer JSON, where compression-codec is a JSON string (or absent). The Java implementation also uses @nullable String here. The conversion from string to PuffinCompressionCodec enum happens at a higher level via PuffinCompressionCodecFromName(), which returns Result<> to properly handle unknown codecs. This layering keeps the metadata faithful to the on-disk format and provides forward compatibility — if a future spec version adds a new codec, we can still deserialize the footer successfully and fail gracefully at decompression time rather than at parsing time.

case PuffinCompressionCodec::kZstd:
return kZstdCodecName;
}
std::unreachable();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider returning Result<> so we can avoid throwing an exception here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::unreachable() here isn't actually throwing — it's a C++23 compile-time hint that this code path is impossible, since the switch already covers all enum values. This is a common pattern in the codebase (e.g., transform.cc, type.cc, expression.cc, etc.). Returning Result<> would require callers to handle an error that can't happen in practice. That said, PuffinCompressionCodecFromName does return Result<> since its string input can genuinely be invalid — so the two functions intentionally use different strategies based on their input types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants