Add statistics_by_partition API to ExecutionPlan #15503


Closed
wants to merge 21 commits

Conversation

xudong963
Member

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@xudong963 xudong963 marked this pull request as draft March 31, 2025 11:22
@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 31, 2025
@berkaysynnada
Contributor

I suggest modifying the existing API, not a new one.

@xudong963
Member Author

> I suggest modifying the existing API, not a new one.

Which one do you think is suitable to modify, statistics()?

@berkaysynnada
Contributor

berkaysynnada commented Apr 1, 2025

> I suggest modifying the existing API, not a new one.
>
> Which one do you think is suitable to modify, statistics()?

yes, is there any blocker?

@xudong963
Member Author

> I suggest modifying the existing API, not a new one.
>
> Which one do you think is suitable to modify, statistics()?
>
> yes, is there any blocker?

I think keeping statistics_by_partition as a separate API from statistics is the better approach for several reasons:

  1. Clear separation of concerns: The two methods serve different purposes - one provides global statistics for the entire execution plan, while the other provides partition-level details.

  2. Backward compatibility: Modifying statistics() to handle both cases would likely be a breaking change.

  3. API clarity: Having separate methods makes the intent clearer when calling the API.

@xudong963 xudong963 force-pushed the statistic_per_partition_api branch from e7aa6fa to 4d18715 Compare April 1, 2025 11:15
@github-actions github-actions bot added the core Core DataFusion crate label Apr 1, 2025
@xudong963 xudong963 force-pushed the statistic_per_partition_api branch from 6531341 to f5f9f2c Compare April 2, 2025 04:33
@xudong963 xudong963 force-pushed the statistic_per_partition_api branch from ac5ae71 to d2112b2 Compare April 3, 2025 07:58
@xudong963 xudong963 changed the title WIP: Add statistics_by_partition API to ExecutionPlan Add statistics_by_partition API to ExecutionPlan Apr 3, 2025
@xudong963 xudong963 marked this pull request as ready for review April 3, 2025 09:07
@xudong963 xudong963 requested a review from alamb April 3, 2025 09:08
@alamb
Contributor

alamb commented Apr 3, 2025

Eventually I agree with @berkaysynnada that it would be nice to have a single API that returns information about both the overall and per-partition statistics. However, I also agree with @xudong963 that there is no clear way to do that at the moment that isn't a major API change.

How about we proceed with a new API and maybe plan some future work to unify them 🤔

Contributor

@alamb alamb left a comment


Thanks @xudong963 -- I think this is a good start. I left some comments on the basic API setup but I think overall this is looking good.

One thing that would be really nice is to find some way to avoid cloning statistics. We already clone a lot of information on each call to ExecutionPlan::statistics() and I worry the new statistics_by_partition API will only make this worse

/// Returns statistics for each partition of this `ExecutionPlan` node.
/// If statistics are not available, returns an array of
/// [`Statistics::new_unknown`] for each partition.
fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
Contributor


Can we please make a structure rather than directly using Vec<Statistics> here?

I think doing so will make it easier / less breaking if we want to evolve how these statistics are handled. This was a lesson learned from our work with LexOrdering / EquivalenceProperties.

Something like the following

/// Statistics for each partition
struct PartitionedStatistics {
  inner: Vec<Statistics>
}

impl PartitionedStatistics {
  fn len(&self) -> usize {
   self.inner.len()
  }

  /// return the statistics for the specified partition
  fn statistics(&self, partition: usize) -> &Statistics {
    &self.inner[partition]
  }
}

Member Author


Love it!

@@ -344,6 +345,26 @@ impl ExecutionPlan for CrossJoinExec {
))
}

fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
Contributor


I don't think I saw a test for this code. Maybe I missed it

Member Author


I missed it lol

Member Author


Added in 910454a

Comment on lines 142 to 271
assert_eq!(statistics[0].num_rows, Precision::Exact(4));
assert_eq!(statistics[0].column_statistics.len(), 2);
assert_eq!(statistics[0].total_byte_size, Precision::Exact(220));
assert_eq!(
statistics[0].column_statistics[0].null_count,
Precision::Exact(0)
);
assert_eq!(
statistics[0].column_statistics[0].max_value,
Precision::Exact(ScalarValue::Int32(Some(4)))
);
assert_eq!(
statistics[0].column_statistics[0].min_value,
Precision::Exact(ScalarValue::Int32(Some(1)))
);
Contributor


I found it hard to read here what the expected statistics were

What do you think about a pattern like this (to create the expected statistics)?

```suggestion
        let expected_statistics = Statistics {
          num_rows: Precision::Exact(4),
          total_byte_size: Precision::Exact(220),
          column_statistics: vec![
            ColumnStatistics {...
            }]
          };
        assert_eq!(statistics, expected_statistics);
```

Member Author


Thanks for the suggestion, I believe after 667337b, it'll be clear.

let filter: Arc<dyn ExecutionPlan> =
Arc::new(FilterExec::try_new(predicate, scan)?);
let _full_statistics = filter.statistics()?;
// The full statistics are invalid; at a minimum, we can improve the selectivity estimation of the filter
Contributor


I don't understand this comment. Should we file a ticket to track whatever the expected result is?

Member Author

@xudong963 xudong963 Apr 3, 2025


I think the result of full_statistics can be improved; I'll open an issue to track the further improvement.

}
}

/// If the given value is numerically greater than the original maximum value,
Contributor


This seems somewhat duplicated with Precision::max 🤔

Member Author


Good find. I think we can only keep one. Will open a separate PR to do it.

Member Author


Recorded an issue: #15615

{
for (idx, file_group) in file_config.file_groups.iter().enumerate() {
if let Some(stat) = file_group.statistics() {
statistics[idx] = stat.clone();
Contributor


I am also growing worried about the amount of cloning happening for each Statistics object... they are deep clones at the moment

Member Author


Yes, I have some thoughts.

Contributor


I believe we should fix this as soon as possible, since the Statistics part of the codebase is currently under heavy focus by many people, and in the near future I expect we will have many Statistics-related PRs and developments. So, to avoid introducing an inherent regression, we should bring the infra to a safely extensible state.

Contributor


Could wrapping the Statistics in Arcs be a solution?

Some structs, like FileGroup and PartitionedData, cache the Statistics, so source operators that can access those should return them directly. However, for other intermediate operators, perhaps we can utilize PlanProperties? The Statistics would be initialized once and cached like other planning properties

Member Author


Yes, at least we can use Arc now. I've opened a preparatory PR to wrap the statistics in FileGroup with Arc: #15564

Member Author


> we can utilize PlanProperties? The Statistics will be initiated once and cached like other planning properties

I've had a general look at it and it should work.

Member Author


Recorded an issue: #15614

@xudong963
Member Author

@alamb Sorry for the confusion about the tests, I'll refactor and document them

@xudong963
Member Author

@alamb @berkaysynnada My thoughts on unifying the two methods:

/// Specifies what statistics to compute
pub enum StatisticsType {
    /// Only compute global statistics
    Global,
    /// Only compute per-partition statistics
    Partition,
    /// Compute both global and per-partition statistics
    Both,
}

/// Holds both global and per-partition statistics
pub struct ExecutionPlanStatistics {
    /// Global statistics for the entire plan
    pub global: Option<Statistics>,
    /// Statistics broken down by partition
    pub partition: Option<Vec<Statistics>>,
}

/// Returns statistics for this `ExecutionPlan` node based on the requested type.
/// Only computes what is requested to avoid unnecessary calculations.
fn statistics(&self, stat_type: StatisticsType) -> Result<ExecutionPlanStatistics> {
    match stat_type {
        StatisticsType::Global => Ok(ExecutionPlanStatistics {
            global: Some(Statistics::new_unknown(&self.schema())),
            partition: None,
        }),
        StatisticsType::Partition => {
            let partition_stats = vec![
                Statistics::new_unknown(&self.schema());
                self.properties().partitioning.partition_count()
            ];
            
            Ok(ExecutionPlanStatistics {
                global: None,
                partition: Some(partition_stats),
            })
        },
        StatisticsType::Both => {
            let partition_stats = vec![
                Statistics::new_unknown(&self.schema());
                self.properties().partitioning.partition_count()
            ];
            
            // Could merge partition stats here for global stats if needed
            let global_stats = Statistics::new_unknown(&self.schema());
            
            Ok(ExecutionPlanStatistics {
                global: Some(global_stats),
                partition: Some(partition_stats),
            })
        }
    }
}

@xudong963
Member Author

Also, cc @suremarc to join the discussion

Contributor

@suremarc suremarc left a comment


The statistics_by_partition impls look mostly correct to me with a couple of exceptions, and the tests look good 👍 I added a comment about a potential improvement there.

On the topic of the proposed API for a new ExecutionPlan::statistics() signature & method:

fn statistics(&self, stat_type: StatisticsType) -> Result<ExecutionPlanStatistics>

I'm going to take it for granted that computing and cloning lots of statistics is expensive, as seems to be implied, though if anyone could point me to prior discussion on that issue that would be appreciated.

I see that this avoids computing partition-level statistics if it doesn't need to, which is nice, but I don't love it, mainly because someone could specify StatisticsType::Global or StatisticsType::Partition, and then send the ExecutionPlanStatistics to downstream code that is expecting both types, or the wrong type. Then that code will just fail. So you might just end up always specifying StatisticsType::Both, which is inefficient.

IMO the nicest API for the user that still avoids unnecessary computation would be something like this:

// Lazily compute statistics on-demand
// Could potentially cache results
// Each `ExecutionPlan` puts the logic for computing its statistics inside an implementation of `ExecutionPlanStatistics`.
pub trait ExecutionPlanStatistics {
    fn global(&self) -> Result<Statistics>;
    fn by_partition(&self) -> Result<PartitionStatistics>;
}

pub trait ExecutionPlan {
    // ...
    fn statistics(&self) -> Result<&dyn ExecutionPlanStatistics>;
}

Though I will admit it puts more burden on the implementor, and it might be a new pattern in the codebase.

I feel like I am discussing hypotheticals so honestly I am not married to either approach, but just thought I'd offer my 2 cents since @xudong963 asked 😅 Anyway this probably warrants further discussion on an issue.

Comment on lines 198 to 221
fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
Ok(vec![self.statistics()?])
}
Contributor


Seeing as CoalesceBatchesExec doesn't change the partitioning of its input, I think this is the incorrect number of partitions? It should be repeated N times, once for each partition.

Member Author


Good find. I was misled by the coalesce : )

Comment on lines 434 to 437
Ok(vec![
Statistics::new_unknown(&self.schema());
self.properties().partitioning.partition_count()
])
Contributor


Previously I had considered repeating the global statistics for each partition, but I am not sure if this is correct. At a minimum we would need to relax any exact statistics, and I guess we would also need to consider how the node's partitioning could affect how data is distributed.

I guess it is safe no matter what to return unknown statistics and there aren't any real benefits yet to trying to implement a better default. Curious if you have any thoughts to add.

Member Author


I noticed the original default implementation from you; it's a bit risky if a node doesn't have a specific implementation for the API, and I also think most nodes' partition statistics don't follow "repeating the global statistics for each partition". So for safety, I changed it to unknown statistics.

Comment on lines 112 to 207
async fn test_statistics_by_partition_of_data_source() -> datafusion_common::Result<()>
{
let scan = generate_listing_table_with_statistics(Some(2)).await;
let statistics = scan.statistics_by_partition()?;
let expected_statistic_partition_1 =
create_partition_statistics(2, 110, 3, 4, true);
let expected_statistic_partition_2 =
create_partition_statistics(2, 110, 1, 2, true);
// Check the statistics of each partition
assert_eq!(statistics.len(), 2);
assert_eq!(statistics[0], expected_statistic_partition_1);
assert_eq!(statistics[1], expected_statistic_partition_2);
Ok(())
}
Contributor


What do you think about adding a function that runs all partitions of an ExecutionPlan (using execute_stream_partitioned) and checks if the min/max/etc. of each partition actually is consistent with the statistics_by_partition? It would be a useful way to check that the implementation matches what statistics_by_partition predicts, and I think we should be able to write a single function to check this in all cases here.

Member Author


The idea makes a lot of sense, thank you!

Member Author


Done in d94c149, now the tests look very promising!

Contributor

@berkaysynnada berkaysynnada left a comment


Thank you @xudong963 for this big effort. I have some suggestions, and I'm sure we will find the best design after a few iterations

/// Returns statistics for each partition of this `ExecutionPlan` node.
/// If statistics are not available, returns an array of
/// [`Statistics::new_unknown`] for each partition.
fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
Contributor


I'm still on the side of unifying these two APIs. Maybe you have a better proposal, but this is what I have in mind now:

If we don't prefer dealing with new enums or structs for Statistics, we can modify the API like so:

fn statistics(&self, partition: Option<usize>) -> Result<Statistics>

This would give us the option of not implementing it for all operators, express clearly what we have, and bring everything closer together.

If we prefer using the same API but enriching the Statistics definition, then:

  1. Rename Statistics to TableStatistics in the statistics() API.
struct TableStatistics {
    first_partition: Statistics,
    others: Vec<Statistics>,
}

This option requires some convenience methods for TableStatistics, as you can imagine

Contributor


Another alternative is completely separating table statistics from partition statistics, if their methods are quite distinct and not often used together

@xudong963
Member Author

Thanks for your thoughts about unifying the two APIs @berkaysynnada @suremarc.

As @alamb said, how about we proceed with a new API and maybe plan some future work to unify them?

To be honest, I don't have a very strong preference among these thoughts. I want to open a new issue to discuss how to unify them. Currently, we have three different proposals, and I hope we can attract more users to participate in the discussion. Then we can make the final decision from the combined perspective of developers and users.

I believe once we decide on the unified way, the actual unification work will be easy: just move the current separate API code into the new one, and update the API calls in the tests.

What do you think?

@berkaysynnada
Contributor

berkaysynnada commented Apr 8, 2025

> What do you think?

I still think we should unify the APIs. We've discussed the issue with our team and reached a common opinion:

If I give an example of my concerns, there will be many duplications like this:

    fn statistics(&self) -> Result<Statistics> {
        let stats = Self::statistics_helper(
            self.schema(),
            self.input().statistics()?,
            self.predicate(),
            self.default_selectivity,
        )?;
        Ok(stats.project(self.projection.as_ref()))
    }

    fn statistics_by_partition(&self) -> Result<PartitionedStatistics> {
        let input_stats = self.input.statistics_by_partition()?;

        let stats: Result<Vec<Arc<Statistics>>> = input_stats
            .iter()
            .map(|stat| {
                let stat = Self::statistics_helper(
                    self.schema(),
                    stat.clone(),
                    self.predicate(),
                    self.default_selectivity,
                )
                .map(|stat| stat.project(self.projection.as_ref()))?;
                Ok(Arc::new(stat))
            })
            .collect();

        Ok(PartitionedStatistics::new(stats?))
    }

There are/will be duplications of the statistics() logic in each operator like this, because the calculations are the same whether the stats come from the whole table or from just one partition. We can avoid the duplication and write efficient, functional statistics() implementations if we adopt the

fn statistics(&self, partition: Option<usize>) -> Result<Statistics>

style. So that alternative clearly wins for me against the others. It also does not modify other existing structs/APIs, and it proposes an extensible way of enabling statistics access for any partition.

TL;DR: updating the API to fn statistics(&self, partition: Option<usize>) -> Result<Statistics> is a minimal change, doesn't force us down an immature design path, reduces duplication, and enables partition-based stats access, which is the main goal.

@xudong963
Member Author

> fn statistics(&self, partition: Option<usize>) -> Result<Statistics>

Thanks @berkaysynnada! I'm a little confused about the API; the original statistics_by_partition collects all partitions' statistics. IIUC, the new statistics API works like this:

fn statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    match partition {
        Some(idx) => {
            // Validate partition index
            if idx >= self.properties().partitioning.partition_count() {
                return exec_err!("Invalid partition index: {}", idx);
            }
            // Default implementation: return unknown statistics for the specific partition
            Ok(Statistics::new_unknown(&self.schema()))
        }
        None => {
            // Return statistics for the entire plan (existing behavior)
            Ok(Statistics::new_unknown(&self.schema()))
        }
    }
}

How does it return all partitions' statistics?

@alamb
Contributor

alamb commented Apr 8, 2025

> How does it return all partitions' statistics?

I think the idea is something like

// get statistics for the entire plan
let overall_statistics = plan.statistics(None);

// get only statistics for partition 3
let partition_statistics = plan.statistics(Some(3));

@xudong963
Member Author

xudong963 commented Apr 19, 2025

I may start a new branch based on this branch to experiment with @berkaysynnada's suggestion and see if there are any challenges; then we can decide the next direction. /cc @alamb @suremarc @wiedld (Hope we can get the optimized SPM across the finish line and include it in DF48 🚀)

@berkaysynnada
Contributor

> I may start a new branch based on this branch to experiment with @berkaysynnada's suggestion and see if there are any challenges; then we can decide the next direction. /cc @alamb @suremarc @wiedld (Hope we can get the optimized SPM across the finish line and include it in DF48 🚀)

Sounds good @xudong963. I'm looking forward to seeing your findings

@xudong963 xudong963 force-pushed the statistic_per_partition_api branch from e122df0 to 0ba5ef5 Compare April 25, 2025 05:58
@alamb
Contributor

alamb commented Apr 29, 2025

I believe this is superseded by #15852 so marking as a draft

@alamb alamb marked this pull request as draft April 29, 2025 01:35
@xudong963 xudong963 closed this Apr 29, 2025
Labels
core Core DataFusion crate datasource Changes to the datasource crate optimizer Optimizer rules
Successfully merging this pull request may close these issues.

Add statistics_by_partition API to ExecutionPlan
5 participants