Skip to content

Add union_tag scalar function #14687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 1, 2025
Merged

Add union_tag scalar function #14687

merged 6 commits into from
May 1, 2025

Conversation

gstvg
Copy link
Contributor

@gstvg gstvg commented Feb 16, 2025

Which issue does this PR close?

Rationale for this change

Retrieve the name of the currently selected field on a union, as there's no way to do it today

What changes are included in this PR?

union_tag scalar function implementation

Are these changes tested?

Yes, with sqllogictests when possible, and with unit tests for union scalars, which are not supported in SQL yet

Are there any user-facing changes?

A new scalar function union_tag

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 16, 2025
@alamb alamb mentioned this pull request Feb 17, 2025
// Union fields type IDs only constraints are being unique and in the 0..128 range:
// They may not start at 0, be sequential, or even contiguous.
// Therefore, we allocate a values vector with a length equal to the highest type ID plus one,
// ensuring that each field's name can be placed at the index corresponding to its type ID.
Copy link
Contributor Author

@gstvg gstvg Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The union column used on the sqllogictests contains a single field with type id 3, so this is put to the test

fn register_union_table(ctx: &SessionContext) {
let union = UnionArray::try_new(
UnionFields::new(vec![3], vec![Field::new("int", DataType::Int32, false)]),
ScalarBuffer::from(vec![3, 3]),
None,
vec![Arc::new(Int32Array::from(vec![1, 2]))],
)
.unwrap();
let schema = Schema::new(vec![Field::new(
"union_column",
union.data_type().clone(),
false,
)]);
let batch =
RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(union)]).unwrap();
ctx.register_batch("union_table", batch).unwrap();
}

"union_function.slt" => {
info!("Registering table with union column");
register_union_table(test_ctx.session_ctx())
}

@Omega359
Copy link
Contributor

@alamb - here is another function coming in (xxhash, regexp_extract (both versions of it), array_min/array_max functions) where it is not clear what should be accepted and what shouldn't be. Since we've already accepted union_extract this may be a case of fleshing that series of functions out.

I strongly think we need 'official' documentation as to what will be and won't be accepted and an recommended repository/sub project where additional functions can be located. Hopefully under the apache umbrella such that they hopefully can maintained by the community and work across multiple DF versions.

@alamb
Copy link
Contributor

alamb commented Feb 19, 2025

@alamb - here is another function coming in (xxhash, regexp_extract (both versions of it), array_min/array_max functions) where it is not clear what should be accepted and what shouldn't be. Since we've already accepted union_extract this may be a case of fleshing that series of functions out.

I strongly think we need 'official' documentation as to what will be and won't be accepted and an recommended repository/sub project where additional functions can be located. Hopefully under the apache umbrella such that they hopefully can maintained by the community and work across multiple DF versions.

Yeah I agree. I think we should file a "discussion" type ticket to have this discussion. I can file one at some point later (I am low on time this week) or if you can that would be sweet.

we have some generic guidance here: https://datafusion.apache.org/contributor-guide/index.html#what-contributions-are-good-fits

@Omega359
Copy link
Contributor

Omega359 commented Feb 19, 2025

Yeah I agree. I think we should file a "discussion" type ticket to have this discussion. I can file one at some point later (I am low on time this week) or if you can that would be sweet.

we have some generic guidance here: https://datafusion.apache.org/contributor-guide/index.html#what-contributions-are-good-fits

on it. #14777

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 22, 2025
@Omega359
Copy link
Contributor

@alamb, thoughts on this?

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Apr 24, 2025
@alamb alamb changed the title Add union_tag scalar function Add union_tag scalar function Apr 29, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think we should merge this @Omega359 I am fine with doing so too.

I have merged up from main to get a clean CI run, and hopefully we'll be good.

I think this will nicely round out the UnionArray related functions

@Omega359
Copy link
Contributor

I'll review it today @alamb

@Omega359
Copy link
Contributor

LGTM. I think I'd like to see a test with multiple columns but the logic looks solid to me. I believe the use of unsafe is indeed ok given the conditions outlined.

@@ -23,7 +26,8 @@ query ?I
select union_column, union_extract(union_column, 'int') from union_table;
----
{int=1} 1
{int=2} 2
{string=bar} NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new row to the table so we could test union_tag with a field that did not exist, per @Omega359 's suggestion

select union_column, union_tag(union_column) from union_table;
----
{int=1} int
{string=bar} string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new test as suggested

@alamb
Copy link
Contributor

alamb commented Apr 29, 2025

LGTM. I think I'd like to see a test with multiple columns but the logic looks solid to me. I believe the use of unsafe is indeed ok given the conditions outlined.

Thanks again for the review. I added the requested test. I think the CI is going to fail because of #15149 (comment)

Once that is fixed I'll refresh the PR

@alamb alamb merged commit a726e5a into apache:main May 1, 2025
29 checks passed
@alamb
Copy link
Contributor

alamb commented May 1, 2025

Thank you @Omega359 and @gstvg for your patience

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* feat: add union_tag scalar function

* update for new api

* Add test for second field type

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add union_tag function
3 participants