-
Notifications
You must be signed in to change notification settings - Fork 299
Finish Adding Default Implementations for V1→V2 #1579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes apache#1576 ## Summary This PR implements `Default` trait implementations for `DataContentType` and `ManifestContentType` to enable proper V1→V2 manifest projection as specified in the Apache Iceberg format specification. As part of solving ManifestList projection, I did a full audit of ManifestData, ManifestList, ManifestEntry and Snapshot files. All five file types are now covered for V1 to V2 projection. ## Changes Made - Added `Default` for `DataContentType`**: Returns `DataContentType::Data` (value 0) for V1 compatibility - Added `Default` for `ManifestContentType`**: Returns `ManifestContentType::Data` (value 0) for V1 compatibility ### New Tests - ManifestFile V1→V2 projection test**: Validates `ManifestFileV1::try_into()` properly sets defaults for missing V2 fields - ManifestEntry V1→V2 projection test**: Validates `ManifestEntryV1::try_into()` correctly applies sequence number defaults - DataFile V1→V2 projection test**: Validates `DataFileSerde::try_into()` handles V1 field defaults correctly - Snapshot V1→V2 projection tests**: Two tests covering both scenarios (with/without optional summary field) ## Technical Details ### V1→V2 Field Mapping According to the Iceberg spec, when reading V1 manifests as V2, the following defaults are applied: | Component | Field | V1 Value | V2 Default | |-----------|-------|----------|------------| | ManifestFile | `content` | absent | `0` (Data) | | ManifestFile | `sequence_number` | absent | `0` | | ManifestFile | `min_sequence_number` | absent | `0` | | ManifestEntry | `sequence_number` | absent | `0` | | ManifestEntry | `file_sequence_number` | absent | `0` | | DataFile | `content` | absent | `0` (Data) | | DataFile | `equality_ids` | absent | `[]` (empty) | | Snapshot | `sequence_number` | absent | `0` | ### Existing Infrastructure The V1→V2 projection was already working through existing conversion logic: - `ManifestFileV1::try_into()` - converts V1 manifest list entries to V2 - `ManifestEntryV1::try_into()` - converts V1 manifest entries to V2 - `DataFileSerde` with `#[serde(default)]` - handles V1 data file defaults - `SnapshotV1::try_into()` - converts V1 snapshots to V2 This PR adds the missing `Default` implementations and comprehensive tests to validate the behavior. ## Testing ## All new tests pass and validate: - ✅ V1→V2 conversion applies correct defaults for missing fields - ✅ Existing V1 fields are preserved during conversion - ✅ Different scenarios (with/without optional fields) work correctly - ✅ Round-trip compatibility is maintained
Working on trying to figure out the failing check. Can't recreate on my Linux computer, but seems to be failing on my Mac. Will send a new commit once I have resolved the issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Kurtiscwright for this pr, LGTM!
Thnks for working on this @Kurtiscwright. The changes make sense to me, but I would also expect seting more |
Thank you for taking the time to review the PR @liurenjie1024 and @Fokko. My understanding (could be completely incorrect) is that with the annotations of #default where the schema is declared always provides that value unless overridden. For example, whenever ManifestContentType is called you always get a value of Documentation: https://doc.rust-lang.org/std/default/trait.Default.html |
Hi, @Fokko I'm also confusing about this question. Are you talking about scanning table with fields that has |
Thanks @liurenjie1024 for pinging me here. I think I did not a great job at explaining my original intent at the issue. This PR is great, and allows converting the V1 structs into V2, which I think is also valuable. My goal was to have an Avro reader that both can read V1 and V2, and read them into a V2 struct right away. This way we don't have to go through the hassle of looking up the format-version in the Avro metadata, and then pick the right reader. For example, if the I'm okay merging this one, and then we can close the gap in a separate PR. WDYT? |
Thanks for working on this @Kurtiscwright, and thanks @liurenjie1024 for the review. I'll do another pass on #2004 later today to identify the missing parts 👍 |
Thank you both for the reviews. I started a slack thread to continue the discussion around a 1 Avro reader to rule them all task. I would like to pick it up, at minimum adding more unit tests to make sure it works. |
Which issue does this PR close?
What changes are included in this PR?
This PR implements
Default
trait implementations forDataContentType
andManifestContentType
to enable proper V1→V2 manifest projection as specified in the Apache Iceberg format specification. As part of solving ManifestList projection, I did a full audit of ManifestData, ManifestList, ManifestEntry and Snapshot files. All five file types are now covered for V1 to V2 projection.Changes Made
Default
forDataContentType
**: ReturnsDataContentType::Data
(value 0) for V1 compatibilityDefault
forManifestContentType
**: ReturnsManifestContentType::Data
(value 0) for V1 compatibilityAre these changes tested?
ManifestFileV1::try_into()
properly sets defaults for missing V2 fieldsManifestEntryV1::try_into()
correctly applies sequence number defaultsDataFileSerde::try_into()
handles V1 field defaults correctly