feat(datafusion): implement the project node to add the partition columns #1602
let field_path = Self::find_field_path(&self.table_schema, source_field.id)?;
let index_path = Self::resolve_arrow_index_path(batch_schema.as_ref(), &field_path)?;

let source_column = Self::extract_column_by_index_path(batch, &index_path)?;
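The calls in this hunk turn a field ID into a name path, the name path into positional indices, and the indices into a column. The last two steps might look roughly like the sketch below. It uses simplified stand-in types: `Field`, `Column`, `resolve_index_path`, and `extract_by_index_path` are hypothetical names for illustration, not the arrow or iceberg-rust APIs, and the PR's actual helpers may differ.

```rust
/// Hypothetical stand-in for an Arrow field: a name plus child fields for structs.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    children: Vec<Field>,
}

/// Hypothetical stand-in for an Arrow array: leaf values or a struct of child columns.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Ints(Vec<i64>),
    Struct(Vec<Column>),
}

/// Resolve a name path such as ["address", "zip"] into positional indices.
fn resolve_index_path(fields: &[Field], path: &[&str]) -> Result<Vec<usize>, String> {
    let mut indices = Vec::with_capacity(path.len());
    let mut current = fields;
    for name in path {
        let idx = current
            .iter()
            .position(|f| f.name == *name)
            .ok_or_else(|| format!("field '{}' not found", name))?;
        indices.push(idx);
        // Descend into the children of the matched field for the next path segment.
        current = current[idx].children.as_slice();
    }
    Ok(indices)
}

/// Walk an index path through nested struct columns and return the target column.
fn extract_by_index_path<'a>(
    columns: &'a [Column],
    path: &[usize],
) -> Result<&'a Column, String> {
    let (first, rest) = path.split_first().ok_or("empty index path")?;
    let mut current = columns.get(*first).ok_or("index out of bounds")?;
    for idx in rest {
        match current {
            Column::Struct(children) => {
                current = children.get(*idx).ok_or("index out of bounds")?;
            }
            Column::Ints(_) => {
                return Err("cannot descend into a leaf column".to_string());
            }
        }
    }
    Ok(current)
}
```

Resolving names to indices once per schema and reusing the index path for every batch avoids repeated string comparisons, which is presumably why the hunk separates the two steps.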
This looks very interesting! I actually came across a similar issue when implementing the sort node, and I was leaning toward implementing a new SchemaWithPartnerVisitor, wdyt?
Perfect 👌
I was initially thinking this was needed just for this implementation, but it seems the right place would be closer to the Schema definition. Since this is a standard method for accessing column values by index, it makes sense to generalize!
I drafted a PartitionValueVisitor here to help extract partition values from a record batch in tree-traversal style.
Please let me know what you think!
I just saw this implementation to extract partition values, and it actually makes more sense to me that it leverages the existing RecordBatchProjector: #1040
Good, thanks for sharing. I will use #1040 when merged!
}

/// Find the path to a field by its ID (e.g., ["address", "city"]) in the Iceberg schema
fn find_field_path(table_schema: &Schema, field_id: i32) -> DFResult<Vec<String>> {
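The body of `find_field_path` is not shown in this excerpt. As a rough illustration of the lookup its doc comment describes, here is a minimal depth-first sketch. `NestedField` is a hypothetical stand-in rather than the iceberg-rust type, and the sketch returns `Option` instead of the `DFResult` used in the PR.

```rust
/// Hypothetical stand-in for an Iceberg nested field: an ID, a name, and child fields.
struct NestedField {
    id: i32,
    name: String,
    children: Vec<NestedField>,
}

/// Depth-first search for `field_id`, returning the names from the root to the match.
fn find_field_path(fields: &[NestedField], field_id: i32) -> Option<Vec<String>> {
    for field in fields {
        if field.id == field_id {
            return Some(vec![field.name.clone()]);
        }
        // Search the children; on a hit, prepend this field's name to the path.
        if let Some(mut tail) = find_field_path(&field.children, field_id) {
            tail.insert(0, field.name.clone());
            return Some(tail);
        }
    }
    None
}
```

Searching by field ID rather than by name matters because Iceberg field IDs stay stable across renames, while names can change over the life of a table.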
…n containing all the partitions values
Which issue does this PR close?
What changes are included in this PR?
Implement a physical execution plan node that projects Iceberg partition columns from source data, supporting nested fields and all Iceberg transforms.
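To make the idea of projecting a partition column concrete, here is a minimal sketch that derives partition values from a source column by applying a transform to each row. It is an illustration only: the `Transform` enum covers a hypothetical subset of transforms, `apply` and `project_partition_column` are made-up names, and values are plain `i64` rather than Arrow arrays, so this is not the node implemented in the PR.

```rust
/// Hypothetical subset of Iceberg partition transforms, for illustration only.
#[derive(Debug, Clone, Copy)]
enum Transform {
    /// Use the source value unchanged.
    Identity,
    /// Round integers down to a multiple of the width (the width must be positive).
    Truncate(i64),
    /// Days since 1970-01-01 for a timestamp given in microseconds.
    Day,
}

const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Apply a transform to a single non-null value.
fn apply(transform: Transform, value: i64) -> i64 {
    match transform {
        Transform::Identity => value,
        // rem_euclid keeps the remainder non-negative, so negatives round down.
        Transform::Truncate(width) => value - value.rem_euclid(width),
        // div_euclid floors for a positive divisor, so pre-epoch values map to negative days.
        Transform::Day => value.div_euclid(MICROS_PER_DAY),
    }
}

/// Derive a partition column from a source column; null inputs stay null.
fn project_partition_column(source: &[Option<i64>], transform: Transform) -> Vec<Option<i64>> {
    source
        .iter()
        .map(|v| v.map(|x| apply(transform, x)))
        .collect()
}
```

A real node would append the derived column to each record batch and would need the remaining transforms (bucket, year, month, hour, and others) along with their proper result types.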
Are these changes tested?
Yes, with unit tests