Skip to content

perf(reader): Avoid second create_parquet_record_batch_stream_builder() call for migrated tables#2176

Open
mbutrovich wants to merge 10 commits intoapache:mainfrom
mbutrovich:double_open_fix
Open

perf(reader): Avoid second create_parquet_record_batch_stream_builder() call for migrated tables#2176
mbutrovich wants to merge 10 commits intoapache:mainfrom
mbutrovich:double_open_fix

Conversation

@mbutrovich
Copy link
Collaborator

@mbutrovich mbutrovich commented Feb 24, 2026

Which issue does this PR close?

What changes are included in this PR?

Introduces open_parquet_file(), which opens the file once and returns both the ArrowFileReader and ArrowReaderMetadata. The caller inspects the metadata in-memory for field IDs, optionally rebuilds ArrowReaderMetadata with a custom schema for migrated tables, then passes the original ArrowFileReader to ParquetRecordBatchStreamBuilder::new_with_metadata(). This eliminates the redundant file open that previously occurred for migrated tables.

Are these changes tested?

Existing tests. Also ran full Iceberg Java suite via Comet.

@mbutrovich mbutrovich changed the title Double open fix perf(reader): Avoid second create_parquet_record_batch_stream_builder() call for migrated tables Feb 24, 2026
@mbutrovich mbutrovich self-assigned this Feb 24, 2026
@mbutrovich mbutrovich marked this pull request as ready for review February 26, 2026 02:17
@mbutrovich mbutrovich requested a review from blackmwk February 26, 2026 02:17
Copy link
Contributor

@jdockerty jdockerty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM!

You'll have to wait for a more authoritative review before merging, but I thought I'd at least give this a look 😄

Some non-blocking questions below.

# Conflicts:
#	crates/iceberg/src/arrow/delete_file_loader.rs
#	crates/iceberg/src/arrow/reader.rs
@mbutrovich
Copy link
Collaborator Author

Re-running Iceberg Java tests via Comet at apache/datafusion-comet#3642. I ran this change previously as part of apache/datafusion-comet#3551 which has a lot of the changes discussed in #2172, but those have mostly merged/been refactored as individual PRs, so I'll run just this PR.

@mbutrovich
Copy link
Collaborator Author

It passed all the Iceberg Java tests on Comet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants