feat(datafusion): implement the partitioning node for DataFusion to define the partitioning #1620
Conversation
…the best partition strategy for Iceberg for writing
- Implement hash partitioning for partitioned/bucketed tables
- Use round-robin partitioning for unpartitioned tables
- Support range distribution mode approximation via sort columns
/// - Automatically detects optimal partition count from DataFusion's SessionConfig
/// - Preserves column order (partitions first, then buckets) for consistent file layout
#[derive(Debug)]
pub struct IcebergRepartitionExec {
There already exists a RepartitionExec; why do we need to create a new one?
I think we just need to extend PhysicalExpr?
Thank you for your comments!
RepartitionExec is a generic operator: it reshuffles rows based on a distribution requirement. Iceberg has stricter requirements for how data must be partitioned, bucketed, sorted, and grouped before writing, so we must select the relevant partitioning strategy.
We have special requirements before writing:
- Use Iceberg table metadata:
- Partition specifications (identity transforms, bucket transforms)
- Sort orders
- Write distribution mode (hash, range, none)
- Select the appropriate partitioning strategy:
- Hash partitioning on partition/bucket columns for partitioned tables
- Round-robin for unpartitioned tables
- Range approximation using sort order columns
Some other requirements to preserve the "partition–bucket–range ordering" semantics required by Iceberg:
- Partition columns must be respected in the physical file layout
- Bucketing/range partitioning needs to be reproducible and consistent
- File grouping must align with Iceberg metadata expectations
Repartitioning is a plan-level operator, not an expression (see the sketch below):
- PhysicalExpr can help compute the partition/bucket key for a row.
- Reshuffling rows into partitions is still an execution node (ExecutionPlan).
- If we only extend PhysicalExpr, we'll have an expression that can calculate partition/bucket values, but we still need an Exec node to do the actual shuffle/repartitioning.
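A minimal sketch of that distinction, assuming a recent DataFusion version (the `into_array` signature has changed across releases); the column name `id` and the toy batch are placeholders, not code from this PR:

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::physical_expr::expressions::col;
use datafusion::physical_expr::PhysicalExpr;

fn main() -> datafusion::error::Result<()> {
    // A toy batch standing in for the rows flowing into the writer.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // A PhysicalExpr evaluates to a per-row value (here, the raw key column),
    // but it never moves rows between partitions: the shuffle itself is the
    // job of an ExecutionPlan node such as RepartitionExec.
    let expr = col("id", &schema)?;
    let keys = expr.evaluate(&batch)?.into_array(batch.num_rows())?;
    println!("partition keys: {keys:?}");
    Ok(())
}
```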
So, in a nutshell: we need our Iceberg-aware strategy (IcebergRepartitionExec) to determine the best partitioning, we delegate the actual shuffle to DataFusion (calling RepartitionExec with our selection), and we use PhysicalExpr to compute the partition/bucket keys:
IcebergRepartitionExec (strategy selection, Iceberg-aware)
↳ chooses partitioning (hash/round-robin/range)
↳ uses Iceberg metadata (partition spec, sort order, mode)
↓
DataFusion RepartitionExec (generic shuffle operator)
↳ actually reshuffles rows into partitions
↓
PhysicalExpr (partition/bucket key computation)
↳ hash/range/bucket expressions evaluated per row
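To make that layering concrete, here is a minimal sketch (not this PR's actual code) of how an Iceberg-aware node could pick a Partitioning and delegate the shuffle to DataFusion's stock RepartitionExec. The `hash_columns` parameter is a hypothetical input that would be derived from the table's partition spec and sort order:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_expr::expressions::col;
use datafusion::physical_plan::repartition::RepartitionExec;
use datafusion::physical_plan::{ExecutionPlan, Partitioning};

/// Sketch: choose an Iceberg-appropriate Partitioning, then let the generic
/// RepartitionExec perform the actual row shuffle.
fn plan_repartition(
    input: Arc<dyn ExecutionPlan>,
    hash_columns: &[String], // hypothetical: derived from partition spec/sort order
    target_partitions: usize,
) -> Result<Arc<dyn ExecutionPlan>> {
    let partitioning = if hash_columns.is_empty() {
        // Unpartitioned table: spread batches evenly.
        Partitioning::RoundRobinBatch(target_partitions)
    } else {
        // Partitioned/bucketed table: co-locate rows that share a key so each
        // output partition maps cleanly onto Iceberg data files.
        let exprs = hash_columns
            .iter()
            .map(|name| col(name, &input.schema()))
            .collect::<Result<Vec<_>>>()?;
        Partitioning::Hash(exprs, target_partitions)
    };
    Ok(Arc::new(RepartitionExec::try_new(input, partitioning)?))
}
```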
Of course, if we decide to rely 100% on DataFusion, we need to consider:
- RepartitionExec implements generic distributions without understanding the Iceberg specifics (buckets, partitions, range vs. sort)
- Iceberg requires that bucketing and range partitioning be reproducible and consistent across writers
- Iceberg expects hierarchical ordering: partition → bucket → range
- Data inconsistency risk: writes may not be reproducible
- If Iceberg semantics aren't enforced at write time, we will need extra cleanup/repair jobs later (e.g., repartitioning files offline or rewriting manifests for metadata) or a custom implementation
///
/// If no suitable hash columns are found (e.g., unpartitioned, non-bucketed table),
/// falls back to round-robin batch partitioning for even load distribution.
fn determine_partitioning_strategy(
This is interesting to see. At first I thought there were just two cases:
- If it's a partitioned table, we should just hash partition.
- If it's not partitioned, we should just use round-robin partitioning.
However, this reminds me of another case: range-only partitioning, e.g. we only have partitions like date or time. I think in this case we should also use round-robin partitioning, since most data is concentrated in a few partitions.
Also, I don't think we should take write.distribution-mode into account for now. The examples you use are for Spark, but are not applicable to DataFusion.
> However, this reminds me of another case: range-only partitioning, e.g. we only have partitions like date or time. I think in this case we should also use round-robin partitioning, since most data is concentrated in a few partitions.
Hum, you are right. Range partitions concentrate data in the most recent partitions, making hash partitioning counterproductive (consider a date column with a temporal partition). Since DataFusion doesn't provide range partitioning, the fallback is round-robin rather than hashing.
Briefly (see the sketch below):
- Hash partitioning: only on bucket columns (partition spec + sort order)
- Round-robin: everything else (unpartitioned, range, identity, temporal transforms)
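A self-contained sketch of that decision rule, with a simplified stand-in for Iceberg's transform types (the real iceberg-rust enum has more variants and different shapes, and this ignores the sort-order side for brevity):

```rust
/// Simplified stand-in for Iceberg partition transforms (assumption: the
/// real iceberg-rust types differ).
#[allow(dead_code)]
enum Transform {
    Identity,
    Bucket(u32),
    Day,
    Month,
    Year,
}

struct PartitionField {
    source_column: String,
    transform: Transform,
}

/// Hash only on bucket-transformed columns; `None` means fall back to
/// round-robin (unpartitioned, identity, or temporal transforms, where data
/// skews into a few recent partitions).
fn hash_columns(spec: &[PartitionField]) -> Option<Vec<String>> {
    let cols: Vec<String> = spec
        .iter()
        .filter(|f| matches!(f.transform, Transform::Bucket(_)))
        .map(|f| f.source_column.clone())
        .collect();
    (!cols.is_empty()).then_some(cols)
}

fn main() {
    let spec = vec![
        PartitionField { source_column: "event_date".into(), transform: Transform::Day },
        PartitionField { source_column: "user_id".into(), transform: Transform::Bucket(16) },
    ];
    // Only the bucketed column qualifies for hash partitioning; the temporal
    // (day) partition alone would route to round-robin.
    assert_eq!(hash_columns(&spec), Some(vec!["user_id".to_string()]));
}
```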
> Also, I don't think we should take write.distribution-mode into account for now. The examples you use are for Spark, but are not applicable to DataFusion.
Oh, good point, I misunderstood this. I thought it was an iceberg-rust table property.
…robin for range partitions
Signed-off-by: Florian Valeye <[email protected]>
Which issue does this PR close?
What changes are included in this PR?
Implement a physical execution repartition node that determines the relevant DataFusion partitioning strategy based on the Iceberg table schema and metadata.
Minor change: I created a new schema_ref() helper method.
Are these changes tested?
Yes, with unit tests.