Skip to content

feat: consume and produce Substrait type extensions #11510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jul 19, 2024

Conversation

Blizzara
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

#10646 introduced Substrait UDTs for interval types. However as that PR didn't yet write proper Substrait extensions for those types, it's hard to produce matching UDTs from other Substrait producers. Until substrait-io/substrait#665 gets merged and passed through to Java & DataFusion, I'd need a UDT to represent Spark's CalendarInterval / DF's IntervalMonthDayNano.

What changes are included in this PR?

Produce and consume SimpleExtensionDeclarations for types in addition to functions. This makes user-defined types interoperable across different consumers/producers, assuming they utilize matching names for the same UDT.

Also adds most of the support for doing it for type variations, but doesn't use that yet.

Similarly as for the existing support for functions, this doesn't read any actual extension files, so the extension URL references are bogus (max int). However this makes it easier to add that support in future.

Are these changes tested?

Existing unit tests, especially the round trip tests for IntervalMonthDayNano type and literal

Are there any user-facing changes?

@Blizzara Blizzara marked this pull request as ready for review July 17, 2024 11:04
@github-actions github-actions bot added the substrait Changes to the substrait crate label Jul 17, 2024
/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`].
///
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval
/// [`IntervalUnit::MonthDayNano`]: datafusion::arrow::datatypes::IntervalUnit::MonthDayNano
pub const INTERVAL_MONTH_DAY_NANO_TYPE_URL: &str = "interval-month-day-nano";
pub const INTERVAL_MONTH_DAY_NANO_TYPE_NAME: &str = "interval-month-day-nano";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substrait uses "url" for referring to the yaml file containing the declaration (which we don't have/use in DF yet), "name" is better for referring to the name of the type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good -- a url to another file sounds a lot like XML .xsd files from back in the day 😱

@Blizzara
Copy link
Contributor Author

fyi @alamb @waynexia

@Blizzara Blizzara force-pushed the avo/substrait-type-extensions branch from 3f6416d to 4b0be1d Compare July 18, 2024 13:33
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @Blizzara -- I don't entirely follow the uri vs Identifier thing but the tests are passing and I think this is a good step forward

One thing I didn't see in this PR was any tests for the new functionaly, though I see you say "Existing unit tests, especially the round trip tests for IntervalMonthDayNano type and literal"

I think we need to add some tests prior to merging this PR to demonstrate the new functionality to register extension types, unless there is some reason we can't (or maybe they are coming in a follow on PR or something)

I also think it would be good to add some comments to Extensions

use substrait::proto::extensions::SimpleExtensionDeclaration;

#[derive(Default)]
pub struct Extensions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please document this struct (I think you have a lot of context on the PR description you could just copy here)

Specifically I think it would help to document what the u32 was / represented (is it a substrait id or something?

Likewise I think we should document the various functions below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docs in 21654ec


for (t_anchor, t_name) in val.types {
let type_extension = ExtensionType {
extension_uri_reference: u32::MAX, // We don't register proper extension URIs yet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe file a ticket to track adding URIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None => not_impl_err!("Cannot parse empty extension"),
})
.collect::<Result<HashMap<_, _>>>()?;
let extensions = Extensions::try_from(&plan.extensions)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 👨‍🍳 👌

/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`].
///
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval
/// [`IntervalUnit::MonthDayNano`]: datafusion::arrow::datatypes::IntervalUnit::MonthDayNano
pub const INTERVAL_MONTH_DAY_NANO_TYPE_URL: &str = "interval-month-day-nano";
pub const INTERVAL_MONTH_DAY_NANO_TYPE_NAME: &str = "interval-month-day-nano";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good -- a url to another file sounds a lot like XML .xsd files from back in the day 😱

@@ -55,6 +55,7 @@ pub const DECIMAL_256_TYPE_VARIATION_REF: u32 = 1;
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval
/// [`IntervalUnit::YearMonth`]: datafusion::arrow::datatypes::IntervalUnit::YearMonth
/// [`ScalarValue::IntervalYearMonth`]: datafusion::common::ScalarValue::IntervalYearMonth
#[deprecated]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can add a note here that they were deprecated in version 41 and leave a suggestion of what to do with them instead?

Something like

# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @alamb - I added docs in 21654ec and tests in 1042941, and filed two followup issues #11544 and #11545

use substrait::proto::extensions::SimpleExtensionDeclaration;

#[derive(Default)]
pub struct Extensions {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docs in 21654ec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you very much @Blizzara

@alamb alamb merged commit a4c9bb4 into apache:main Jul 19, 2024
23 checks passed
@Blizzara Blizzara deleted the avo/substrait-type-extensions branch July 19, 2024 21:19
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* support reading type extensions in consumer

* read extension for UDTs

* support also type extensions in producer

* produce extensions for MonthDayNano UDT

* unify extensions between consumer and producer

* fixes

* add doc comments

* add extension tests

* fix

* fix docs

* fix test

* fix clipppy
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* support reading type extensions in consumer

* read extension for UDTs

* support also type extensions in producer

* produce extensions for MonthDayNano UDT

* unify extensions between consumer and producer

* fixes

* add doc comments

* add extension tests

* fix

* fix docs

* fix test

* fix clipppy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants