-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Support User-Defined Sorting #14828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So i dug a little deeper on how we could implement this functionality. arrow-rsFirstly, I think we require two "flavors" of sorting - one for AFAIK, in I think this extension could happen in one of two ways:
Maybe one way to go forward is implement 1. and then implement 2. if we think this is sensible. I am a bit unsure on how we can directly use the arrow extension types for this use case. If anyone has a better idea here, I'd love to hear your take on that (@mbrobbel maybe you have an opinion if this is within the scope of arrow extension types). DataFusionHere I think we should go with user defined types and attach the sorting information to that type. This should be possible as we can have a "central registry" (e.g., I don't have more details on how we could implement this. I think trying to implement 1. in arrow-rs is the first step, and then we should check on how we can connect that to user defined types in DataFusion. If this sounds good, I can start to work on a draft. Maybe lmk if you think that extending |
I think this type of registration is what will be needed for implementing user defined types In terms of arrow-rs I am not sure we should add anything there yet -- I think we should start the implementation in DataFusion and then port stuff uptream to arrow-rs when it makes sense / we have sorted out the interface Specifically, I think we could avoid using the arrow-row format when it is not supported for some particular type, that wanted to override comparison behavior, for example |
Sounds like a good plan. I'll see what I can come up with. |
Here is a start for a discussing a possible implementation: 15106 (more infos in the PR) |
Is your feature request related to a problem or challenge?
In our system we are working heavily with tagged unions. Basically every column in our results are a
DataType::Union
. However, comparing unions is not natively supported by DF/arrow-rs. This makes sense in a general setting; however, we only have a single union type that we know how to compare.This issue should address the following two problems in this context:
I think 2. is a consequence of 1. as somewhere sorting is required during the execution of the query.
We have built a workaround for 1. by project to a "sortable" value that DF can support natively. While we may can do some similar workaround with
Distinct::On
for 2., we hope to find a better solution to our problem.An extension of this issue could also allow users to override the default sorting behavior for certain types.
Describe the solution you'd like
I have read this blog post on sorting in
arrow-rs
. I think it would be nice if we can extend this mechanism in DF/arrow-rs
. Maybe something like providing a byte encoding for a particularDataType
? However, I am not really experienced in this area.Looking forward to your opinions!
Describe alternatives you've considered
I think one good way to achieve this would be to integrate this work with logical and extension types (#12622, #12644). However, we would love to be able to use these capabilities before getting full support for logical/extension types.
Another way is also to work with the workarounds (basically we could materialize the comparison byte arrays in a UDF). However, using this projection technique is cumbersome as every internal call to sort (see problems with distinct) can cause crashes as the actual column is not sortable.
Additional context
Maybe we also need a "sister-issue" in arrow-rs for tracking this issue. I'd also be interested in helping out to implement this feature. However, I'd need some more experienced input before tackling it.
Here are some crude tests that raise the following error.
Test Case (Distinct)
Here is a test case that can create this problem:
Test Case for Sort
This test fails with the same error. Note that I think this test should also fail in the future. However, by extending the row converter procedure, we hope to get this test running.
The text was updated successfully, but these errors were encountered: