Skip to content

Support User-Defined Sorting #14828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
tobixdev opened this issue Feb 22, 2025 · 5 comments
Open

Support User-Defined Sorting #14828

tobixdev opened this issue Feb 22, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@tobixdev
Copy link
Contributor

tobixdev commented Feb 22, 2025

Is your feature request related to a problem or challenge?

In our system we are working heavily with tagged unions. Basically every column in our results are a DataType::Union. However, comparing unions is not natively supported by DF/arrow-rs. This makes sense in a general setting; however, we only have a single union type that we know how to compare.

This issue should address the following two problems in this context:

  1. Sorting by a column
  2. Support distinct on a column

I think 2. is a consequence of 1. as somewhere sorting is required during the execution of the query.

We have built a workaround for 1. by project to a "sortable" value that DF can support natively. While we may can do some similar workaround with Distinct::On for 2., we hope to find a better solution to our problem.

An extension of this issue could also allow users to override the default sorting behavior for certain types.

Describe the solution you'd like

I have read this blog post on sorting in arrow-rs. I think it would be nice if we can extend this mechanism in DF/arrow-rs. Maybe something like providing a byte encoding for a particular DataType? However, I am not really experienced in this area.

Looking forward to your opinions!

Describe alternatives you've considered

I think one good way to achieve this would be to integrate this work with logical and extension types (#12622, #12644). However, we would love to be able to use these capabilities before getting full support for logical/extension types.

Another way is also to work with the workarounds (basically we could materialize the comparison byte arrays in a UDF). However, using this projection technique is cumbersome as every internal call to sort (see problems with distinct) can cause crashes as the actual column is not sortable.

Additional context

Maybe we also need a "sister-issue" in arrow-rs for tracking this issue. I'd also be interested in helping out to implement this feature. However, I'd need some more experienced input before tackling it.

Here are some crude tests that raise the following error.

Error: ArrowError(NotYetImplemented("Row format support not yet implemented for: [SortField { options: SortOptions { descending: false, nulls_first: true }, data_type: Union([(0, Field { name: \"A\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), (1, Field { name: \"B\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })], Dense) }]"), None)

Test Case (Distinct)

Here is a test case that can create this problem:

#[tokio::test]
async fn test_distinct_on_union() -> Result<()> {
    let fields = [
        (0, Arc::new(Field::new("A", DataType::Int32, false))),
        (1, Arc::new(Field::new("B", DataType::Float64, false))),
    ]
    .into_iter()
    .collect();
    let schema = Schema::new(vec![Field::new(
        "my_union",
        DataType::Union(fields, UnionMode::Dense),
        false,
    )]);

    let mut builder = UnionBuilder::new_dense();
    builder.append::<Int32Type>("A", 1)?;
    builder.append::<Float64Type>("B", 3.0)?;
    builder.append::<Int32Type>("A", 1)?;
    builder.append::<Float64Type>("B", 3.0)?;
    let union = builder.build()?;

    let ctx = SessionContext::new();
    ctx.register_table(
        "test_table",
        Arc::new(MemTable::try_new(
            Arc::new(schema.clone()),
            vec![vec![RecordBatch::try_new(
                Arc::new(schema),
                vec![Arc::new(union)],
            )?]],
        )?),
    )?;
    let result = ctx.table("test_table").await?.distinct()?.count().await?;
    assert_eq!(result, 1);

    Ok(())
}

Test Case for Sort

This test fails with the same error. Note that I think this test should also fail in the future. However, by extending the row converter procedure, we hope to get this test running.

#[tokio::test]
async fn test_sort_on_union() -> Result<()> {
    let fields = [
        (0, Arc::new(Field::new("A", DataType::Int32, false))),
        (1, Arc::new(Field::new("B", DataType::Float64, false))),
    ]
    .into_iter()
    .collect();
    let schema = Schema::new(vec![Field::new(
        "my_union",
        DataType::Union(fields, UnionMode::Dense),
        false,
    )]);

    let mut builder = UnionBuilder::new_dense();
    builder.append::<Int32Type>("A", 1)?;
    builder.append::<Float64Type>("B", 3.0)?;
    builder.append::<Int32Type>("A", 1)?;
    builder.append::<Float64Type>("B", 3.0)?;
    let union = builder.build()?;

    let ctx = SessionContext::new();
    ctx.register_table(
        "test_table",
        Arc::new(MemTable::try_new(
            Arc::new(schema.clone()),
            vec![vec![RecordBatch::try_new(
                Arc::new(schema),
                vec![Arc::new(union)],
            )?]],
        )?),
    )?;
    ctx.table("test_table")
        .await?
        .sort_by(vec![Expr::from(datafusion::common::Column::from(
            "my_union",
        ))])?
        .execute_stream()
        .await?
        .next()
        .await
        .unwrap()?;

    Ok(())
}
@tobixdev tobixdev added the enhancement New feature or request label Feb 22, 2025
@alamb
Copy link
Contributor

alamb commented Feb 23, 2025

I think one good way to achieve this would be to integrate this work with logical and extension types (#12622, #12644). However, we would love to be able to use these capabilities before getting full support for logical/extension types.

I agree this is a good usecase for "user defined types"

@tobixdev
Copy link
Contributor Author

So i dug a little deeper on how we could implement this functionality.

arrow-rs

Firstly, I think we require two "flavors" of sorting - one for arrow-ord and one for arrow-row as DataFusion uses both of these APIs.

AFAIK, in arrow-rs we don't have a "user defined type" registry that we could leverage. So when a user calls, for example, lexsort, there is also no way to "lookup" whether we have a user defined type that can be applied to a particular column. Therefore, we must pass in the information on how to sort a column to the sorting procedure. One possible way for getting this information into the called functions is to extend SortField (arrow-row) and SortColumn (arrow-ord).

I think this extension could happen in one of two ways:

  1. Provide an Implementation: In this approach users directly provide an implementation. In arrow-row this could be a custom byte encoder, in arrow-ord this could be a function that maps to a Ordering.
  2. Provide a User Defined Type: In this approach users provide a user defined type. Somehow we would then need the ability to deduce the implementations from 1. based on the given type.

Maybe one way to go forward is implement 1. and then implement 2. if we think this is sensible. I am a bit unsure on how we can directly use the arrow extension types for this use case. If anyone has a better idea here, I'd love to hear your take on that (@mbrobbel maybe you have an opinion if this is within the scope of arrow extension types).

DataFusion

Here I think we should go with user defined types and attach the sorting information to that type. This should be possible as we can have a "central registry" (e.g., SessionContext) that holds all available user defined types and we can look up whether this particular column has a user defined type with a user defined ordering. If we have a use case, we could also think of adding the ability to override this ordering behavior.

I don't have more details on how we could implement this. I think trying to implement 1. in arrow-rs is the first step, and then we should check on how we can connect that to user defined types in DataFusion.

If this sounds good, I can start to work on a draft. Maybe lmk if you think that extending SortField and SortColumn is a reasonable approach here.

@alamb
Copy link
Contributor

alamb commented Feb 25, 2025

Here I think we should go with user defined types and attach the sorting information to that type. This should be possible as we can have a "central registry" (e.g., SessionContext) that holds all available user defined types and we can look up whether this particular column has a user defined type with a user defined ordering. If we have a use case, we could also think of adding the ability to override this ordering behavior.

I think this type of registration is what will be needed for implementing user defined types

In terms of arrow-rs I am not sure we should add anything there yet -- I think we should start the implementation in DataFusion and then port stuff uptream to arrow-rs when it makes sense / we have sorted out the interface

Specifically, I think we could avoid using the arrow-row format when it is not supported for some particular type, that wanted to override comparison behavior, for example

@tobixdev
Copy link
Contributor Author

In terms of arrow-rs I am not sure we should add anything there yet -- I think we should start the implementation in DataFusion and then port stuff uptream to arrow-rs when it makes sense / we have sorted out the interface

Specifically, I think we could avoid using the arrow-row format when it is not supported for some particular type, that wanted to override comparison behavior, for example

Sounds like a good plan. I'll see what I can come up with.

@tobixdev
Copy link
Contributor Author

tobixdev commented Mar 9, 2025

Here is a start for a discussing a possible implementation: 15106 (more infos in the PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants