Skip to content

Finish Adding Default Implementations for V1→V2 #1579

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 6, 2025

Conversation

Kurtiscwright
Copy link
Contributor

Which issue does this PR close?

What changes are included in this PR?

This PR implements Default trait implementations for DataContentType and ManifestContentType to enable proper V1→V2 manifest projection as specified in the Apache Iceberg format specification. As part of solving ManifestList projection, I did a full audit of ManifestData, ManifestList, ManifestEntry and Snapshot files. All five file types are now covered for V1 to V2 projection.

Changes Made

  • Added Default for DataContentType**: Returns DataContentType::Data (value 0) for V1 compatibility
  • Added Default for ManifestContentType**: Returns ManifestContentType::Data (value 0) for V1 compatibility

Are these changes tested?

  • ManifestFile V1→V2 projection test**: Validates ManifestFileV1::try_into() properly sets defaults for missing V2 fields
  • ManifestEntry V1→V2 projection test**: Validates ManifestEntryV1::try_into() correctly applies sequence number defaults
  • DataFile V1→V2 projection test**: Validates DataFileSerde::try_into() handles V1 field defaults correctly
  • Snapshot V1→V2 projection tests**: Two tests covering both scenarios (with/without optional summary field)

Fixes apache#1576

 ## Summary

This PR implements `Default` trait implementations for `DataContentType`
and `ManifestContentType` to enable proper V1→V2 manifest projection as
specified in the Apache Iceberg format specification. As part of solving
ManifestList projection, I did a full audit of ManifestData, ManifestList,
ManifestEntry and Snapshot files. All five file types are now covered for
V1 to V2 projection.

 ## Changes Made
- Added `Default` for `DataContentType`**: Returns `DataContentType::Data`
    (value 0) for V1 compatibility
- Added `Default` for `ManifestContentType`**: Returns `ManifestContentType::Data`
    (value 0) for V1 compatibility

 ### New Tests
- ManifestFile V1→V2 projection test**: Validates `ManifestFileV1::try_into()`
    properly sets defaults for missing V2 fields
- ManifestEntry V1→V2 projection test**: Validates `ManifestEntryV1::try_into()`
    correctly applies sequence number defaults
- DataFile V1→V2 projection test**: Validates `DataFileSerde::try_into()` handles V1
    field defaults correctly
- Snapshot V1→V2 projection tests**: Two tests covering both scenarios (with/without
    optional summary field)

 ## Technical Details

 ### V1→V2 Field Mapping
According to the Iceberg spec, when reading V1 manifests as V2, the following defaults are applied:

| Component | Field | V1 Value | V2 Default |
|-----------|-------|----------|------------|
| ManifestFile | `content` | absent | `0` (Data) |
| ManifestFile | `sequence_number` | absent | `0` |
| ManifestFile | `min_sequence_number` | absent | `0` |
| ManifestEntry | `sequence_number` | absent | `0` |
| ManifestEntry | `file_sequence_number` | absent | `0` |
| DataFile | `content` | absent | `0` (Data) |
| DataFile | `equality_ids` | absent | `[]` (empty) |
| Snapshot | `sequence_number` | absent | `0` |

 ### Existing Infrastructure
The V1→V2 projection was already working through existing conversion logic:
- `ManifestFileV1::try_into()` - converts V1 manifest list entries to V2
- `ManifestEntryV1::try_into()` - converts V1 manifest entries to V2
- `DataFileSerde` with `#[serde(default)]` - handles V1 data file defaults
- `SnapshotV1::try_into()` - converts V1 snapshots to V2

This PR adds the missing `Default` implementations and comprehensive tests to validate the behavior.

 ## Testing ##

All new tests pass and validate:
 - ✅ V1→V2 conversion applies correct defaults for missing fields
 - ✅ Existing V1 fields are preserved during conversion
 - ✅ Different scenarios (with/without optional fields) work correctly
 - ✅ Round-trip compatibility is maintained
@Kurtiscwright
Copy link
Contributor Author

Working on trying to figure out the failing check. Can't recreate on my Linux computer, but seems to be failing on my Mac. Will send a new commit once I have resolved the issue.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Kurtiscwright for this pr, LGTM!

@Fokko
Copy link
Contributor

Fokko commented Aug 4, 2025

Thnks for working on this @Kurtiscwright. The changes make sense to me, but I would also expect seting more initial-default's like here: https://github.com/apache/iceberg-rust/pull/1482/files#diff-2bde106cc8d1cc2e8c7a327cb117fb8dc5e4b143898360504f6e585d1904acbbR238

@Kurtiscwright
Copy link
Contributor Author

Thnks for working on this @Kurtiscwright. The changes make sense to me, but I would also expect seting more initial-default's like here: https://github.com/apache/iceberg-rust/pull/1482/files#diff-2bde106cc8d1cc2e8c7a327cb117fb8dc5e4b143898360504f6e585d1904acbbR238

Thank you for taking the time to review the PR @liurenjie1024 and @Fokko. My understanding (could be completely incorrect) is that with the annotations of #default where the schema is declared always provides that value unless overridden. For example, whenever ManifestContentType is called you always get a value of 0 unless its overridden. I am open to adding more initial-default's if my explanation above is incorrect. I tried adding extremely explicit tests to make sure that the assumption holds. If we agree on this approach I would be happy to work on a fast follow PR to align all the other default values to this same implementation.

Documentation: https://doc.rust-lang.org/std/default/trait.Default.html

@liurenjie1024
Copy link
Contributor

Thnks for working on this @Kurtiscwright. The changes make sense to me, but I would also expect seting more initial-default's like here: https://github.com/apache/iceberg-rust/pull/1482/files#diff-2bde106cc8d1cc2e8c7a327cb117fb8dc5e4b143898360504f6e585d1904acbbR238

Hi, @Fokko I'm also confusing about this question. Are you talking about scanning table with fields that has initial-default value, which is added by schema evolution, and the field is missing in data file? If so, I agree that we need to add some tests for it, but it should not happen in this pr.

@Fokko
Copy link
Contributor

Fokko commented Aug 6, 2025

Thanks @liurenjie1024 for pinging me here. I think I did not a great job at explaining my original intent at the issue. This PR is great, and allows converting the V1 structs into V2, which I think is also valuable.

My goal was to have an Avro reader that both can read V1 and V2, and read them into a V2 struct right away. This way we don't have to go through the hassle of looking up the format-version in the Avro metadata, and then pick the right reader. For example, if the sequence-number is missing because it is a V1 entry, we can just set it to 0.

I'm okay merging this one, and then we can close the gap in a separate PR. WDYT?

@Fokko Fokko merged commit 299cfed into apache:main Aug 6, 2025
26 of 27 checks passed
@Fokko
Copy link
Contributor

Fokko commented Aug 6, 2025

Thanks for working on this @Kurtiscwright, and thanks @liurenjie1024 for the review. I'll do another pass on #2004 later today to identify the missing parts 👍

@Kurtiscwright
Copy link
Contributor Author

Thank you both for the reviews. I started a slack thread to continue the discussion around a 1 Avro reader to rule them all task. I would like to pick it up, at minimum adding more unit tests to make sure it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read V1 ManifestList with V2 projection
3 participants