Skip to content

Conversation

fvaleye
Copy link
Contributor

@fvaleye fvaleye commented Aug 13, 2025

Which issue does this PR close?

What changes are included in this PR?

Implement a physical execution plan node that projects Iceberg partition columns from source data, supporting nested fields and all Iceberg transforms.

Are these changes tested?

Yes, with unit tests

@fvaleye fvaleye force-pushed the feature/implement-project-node-for-insert-into-datafusion branch from b3a8601 to 40a225a Compare August 13, 2025 13:17
…umns defined in Iceberg.

Implement physical execution plan node that projects Iceberg partition columns from source
data, supporting nested fields and all Iceberg transforms.
@fvaleye fvaleye force-pushed the feature/implement-project-node-for-insert-into-datafusion branch from 40a225a to 4d59f87 Compare August 13, 2025 14:50
Comment on lines +148 to +151
let field_path = Self::find_field_path(&self.table_schema, source_field.id)?;
let index_path = Self::resolve_arrow_index_path(batch_schema.as_ref(), &field_path)?;

let source_column = Self::extract_column_by_index_path(batch, &index_path)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very interesting! I actually came across the similar issue when implementing the sort node, and I was leaning toward implementing a new SchemaWithPartnerVisitor, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect 👌
I was initially thinking this was needed just for this implementation, but it seems the right place would be closer to the Schema definition. Since this is a standard method for accessing column values by index, it makes sense to generalize!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I drafted a PartitionValueVisitor here to help extract partition values from a record batch in tree-traversal style

Pleast let me know what you think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw this implementation to extract partition values and it actually makes more sense to me that it leverages the existing RecordBatchProjector: #1040

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, thanks for sharing. I will use #1040 when merged!

}

/// Find the path to a field by its ID (e.g., ["address", "city"]) in the Iceberg schema
fn find_field_path(table_schema: &Schema, field_id: i32) -> DFResult<Vec<String>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to consider this function as well @CTTY following our discussion here.
It may not be the right place at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Project Node: Caculate partition value
2 participants