-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: metadata handling for aggregates and window functions #15911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: metadata handling for aggregates and window functions #15911
Conversation
7bfcb7b
to
8f0dc24
Compare
FYI @paleolimbot @crystalxyz since this impacts both of your work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this! (Particularly for the heroic updates to all the tests 😬 )
I took a look through and I didn't spot anything out of place (although I'm new to datafusion!).
I'm not sure if the tests are intended to be examples, but it may be worth a slightly more intuitive one (e.g., a sum()
that propagates a metadata key unit
from the source to the destination).
I also still find it strange to use a Field
for these since the name is ignored. Using a dedicated type decoupled from arrow-rs might give you more future flexibility if you need to modify this approach in any way (e.g., to reduce the amount of JSON parsing required when invoking a function on an extension type with parameters). Obviously an optional suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @timsaucer - this is epic.
I think this is good and we should proceed. Despite the potential API disruption I don't think there is any way to support user defined / extension types otherwise. I took the liberty of merging up from main to resolve some conflicts and pushing a few commits to tweak docs and try to avoid quite as many string copies.
The only thing I am worried about with this change is the number of Field
s that are passed around -- each Field
contains an owned String
so each time we copy a Field
it will force a copy of the String
and will potentially slow down planning, especially for large and complex queries.
Do you think it is feasible to update the scalar, aggregate, and window function APIs to use FieldRef
instead of Field? That way we can avoid most string copies.
🤖 |
🤖: Benchmark completed Details
|
Do you think it's reasonable to merge this as is and do switch to |
Absolutely -- it makes sense to me -- it would also be great to see if we can update the scalar function API while we are at it |
Thanks. I'll re-try CI after pulling in main. This passes every time on my local machine. I'm not sure why it's failing here. |
I just merged a fix for CI on main, and remerged this PR. Hopefully it will now be good to go |
Which issue does this PR close?
Rationale for this change
This change is a follow on to #15646. With this change we can now handle metadata for both window and aggregate functions. It enables the use of extension types via this metadata handling.
What changes are included in this PR?
Instead of passing
DataType
to aggregates and windows we now passField
in their arguments.return_type
has been replaced withreturn_field
so we can get metadata out of these functions as well.Are these changes tested?
All existing unit tests pass. New unit tests are added.
Are there any user-facing changes?
Yes, the migration guide contains information on the updates that the user will need to make for their user defined functions.
TODO before moving to ready to review
fn return_type()
that returns aField
)