Feat: introduce ExecutionPlan::partition_statistics API #15852

Merged
merged 26 commits into from
Apr 29, 2025

Conversation

xudong963
Member

@xudong963 xudong963 commented Apr 25, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate labels Apr 25, 2025
@xudong963
Member Author

cc @berkaysynnada PTAL. I didn't run into any challenges during the refactoring; the process was smooth.

The tests are failing due to #15689

Contributor

@berkaysynnada berkaysynnada left a comment

Great work @xudong963, thank you! Before merging, I'd just ask that we add deprecation notices for the obsolete API.

@xudong963 xudong963 force-pushed the support_partition_statistics branch from b943802 to 9f28472 Compare April 28, 2025 06:15
@github-actions github-actions bot added optimizer Optimizer rules and removed optimizer Optimizer rules labels Apr 28, 2025
@xudong963 xudong963 added the api change Changes the API exposed to users of the crate label Apr 28, 2025
@xudong963 xudong963 force-pushed the support_partition_statistics branch from b1f701f to 2144cf2 Compare April 28, 2025 07:51
@xudong963
Member Author

xudong963 commented Apr 28, 2025

GitHub is down, so my recent update is delayed.

@xudong963 xudong963 force-pushed the support_partition_statistics branch 2 times, most recently from 9d72f0e to e3a14ee Compare April 28, 2025 12:39
@alamb alamb mentioned this pull request Apr 28, 2025
26 tasks
Contributor

@alamb alamb left a comment

Thanks @xudong963 and @berkaysynnada

I also think you should add a note to the upgrade guide for this change (basically, people will need to implement partition_statistics if they have implemented statistics -- and I think the compiler may not flag the newly introduced method).

Something that might be a better UX for downstream users might be to NOT provide a default implementation of partition_statistics and force them to implement it:

    fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics>;

But then they would need some way to avoid the boilerplate code 🤔
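
To make the trade-off concrete, here is a minimal sketch using stand-in types. `Statistics`, `PlanWithDefault`, `PlanRequired`, `MyScan` and the `String` error type are all hypothetical simplifications for illustration, not DataFusion's real `ExecutionPlan`, `Statistics` or `Result`. With a default body, an existing implementor keeps compiling and silently reports unknown statistics; without one, it does not compile until the method is written.

```rust
// Stand-in for DataFusion's `Statistics` (assumption: the real struct has more fields).
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<usize>,
}

// Hypothetical trait WITH a default body: existing implementors compile
// unchanged and fall back to "unknown" statistics.
trait PlanWithDefault {
    fn partition_statistics(&self, _partition: Option<usize>) -> Result<Statistics, String> {
        Ok(Statistics { num_rows: None })
    }
}

// Hypothetical trait WITHOUT a default body: every implementor must
// provide the method, otherwise compilation fails.
trait PlanRequired {
    fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics, String>;
}

// Hypothetical plan node that knows its row count per partition.
struct MyScan {
    rows_per_partition: Vec<usize>,
}

// Nothing to write here: the default is picked up silently.
impl PlanWithDefault for MyScan {}

impl PlanRequired for MyScan {
    fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics, String> {
        let num_rows: usize = match partition {
            // Statistics for a single partition.
            Some(i) => *self
                .rows_per_partition
                .get(i)
                .ok_or_else(|| format!("invalid partition {i}"))?,
            // `None` means statistics for the whole plan.
            None => self.rows_per_partition.iter().sum(),
        };
        Ok(Statistics { num_rows: Some(num_rows) })
    }
}
```

Calling `PlanWithDefault::partition_statistics(&scan, Some(1))` on a `MyScan` returns unknown row counts even though the node has the information, which is the silent-default risk; the `PlanRequired` version returns the real count, at the cost of each implementor writing the method.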

@alamb alamb changed the title Feat: introduce partition statistics API Feat: introduce ExecutionPlan::partition_statistics API Apr 29, 2025
@xudong963 xudong963 force-pushed the support_partition_statistics branch from e3a14ee to 71777d4 Compare April 29, 2025 08:09
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2025
@xudong963
Member Author

> I also think you should add a note to the upgrade guide for this change

Added to #15771

> and I think the compiler may not flag the newly introduced method.

Because we marked the old statistics method as deprecated, the compiler will warn users and point them to the new method.
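
As a rough illustration of the deprecation route, here is a sketch with stand-in types. `Plan`, `Statistics`, `LegacyScan`, `old_stats`, `new_stats` and the note text are hypothetical, not the actual DataFusion definitions. Rust's `#[deprecated]` attribute makes the compiler emit a warning, including the note, wherever the old method is called.

```rust
// Stand-in for DataFusion's `Statistics` (assumption: simplified to one field).
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<usize>,
}

// Hypothetical simplified trait: the old method is deprecated and the
// new method has a default body that reports unknown statistics.
trait Plan {
    #[deprecated(note = "use `partition_statistics` instead")]
    fn statistics(&self) -> Result<Statistics, String> {
        Ok(Statistics { num_rows: None })
    }

    fn partition_statistics(&self, _partition: Option<usize>) -> Result<Statistics, String> {
        Ok(Statistics { num_rows: None })
    }
}

// A downstream implementor written before the new method existed:
// it overrides only the old method.
struct LegacyScan;

impl Plan for LegacyScan {
    fn statistics(&self) -> Result<Statistics, String> {
        Ok(Statistics { num_rows: Some(42) })
    }
}

// Calling the old method triggers the deprecation warning unless it is
// explicitly allowed, as done here.
#[allow(deprecated)]
fn old_stats(plan: &impl Plan) -> Result<Statistics, String> {
    plan.statistics()
}

// Callers that migrate to the new method get whatever the implementor
// (or the default body) provides.
fn new_stats(plan: &impl Plan) -> Result<Statistics, String> {
    plan.partition_statistics(None)
}
```

In this sketch `old_stats(&LegacyScan)` still reports 42 rows, while `new_stats(&LegacyScan)` falls back to the default and reports unknown, so an implementor that only overrides the old method keeps compiling but loses its statistics on the new path. That is the case the upgrade guide note is meant to cover.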

> Something that might be a better UX for downstream users might be to NOT provide a default implementation of partition_statistics and force them to implement it:

Yes, I ran into this while upgrading to DF45: that version introduced a new method for LexOrdering, IIRC, and gave it a default implementation. The default wasn't suitable for our case and I didn't notice the new method, so it took me a long time to debug. QAQ

> But then they would need some way to avoid the boilerplate code 🤔

Agreed, it's a trade-off. Given that the compiler will flag the deprecation, as I said above, and that we'll also mention the change in the upgrade doc, life will be easier.

Or, after merging the PR, we can implement the new method everywhere and then remove the default implementation.

@alamb
Contributor

alamb commented Apr 29, 2025

Let's get this one in -- it has been outstanding for too long. Thanks again @xudong963 and @berkaysynnada

@alamb alamb merged commit 324be53 into apache:main Apr 29, 2025
27 checks passed
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* save

* save

* save

* functional way

* fix sort

* adding test

* add tests

* save

* update

* add PartitionedStatistics structure

* use Arc

* refine tests

* save

* resolve conflicts

* use PartitionedStatistics

* impl index and len for PartitionedStatistics

* add test for cross join

* fix clippy

* Check the statistics_by_partition with real results

* rebase main and fix cross join test

* resolve conflicts

* Feat: introduce partition statistics API

* address comments

* deprecated statistics API

* rebase main and fix tests

* fix
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate datasource Changes to the datasource crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add statistics_by_partition API to ExecutionPlan
3 participants