-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: consume and produce Substrait type extensions #11510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`]. | ||
/// | ||
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval | ||
/// [`IntervalUnit::MonthDayNano`]: datafusion::arrow::datatypes::IntervalUnit::MonthDayNano | ||
pub const INTERVAL_MONTH_DAY_NANO_TYPE_URL: &str = "interval-month-day-nano"; | ||
pub const INTERVAL_MONTH_DAY_NANO_TYPE_NAME: &str = "interval-month-day-nano"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
substrait uses "url" for referring to the yaml file containing the declaration (which we don't have/use in DF yet), "name" is better for referring to the name of the type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good -- a url to another file sounds a lot like XML .xsd
files from back in the day 😱
3f6416d
to
4b0be1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @Blizzara -- I don't entirely follow the uri
vs Identifier thing but the tests are passing and I think this is a good step forward
One thing I didn't see in this PR was any tests for the new functionaly, though I see you say "Existing unit tests, especially the round trip tests for IntervalMonthDayNano type and literal"
I think we need to add some tests prior to merging this PR to demonstrate the new functionality to register extension types, unless there is some reason we can't (or maybe they are coming in a follow on PR or something)
I also think it would be good to add some comments to Extensions
use substrait::proto::extensions::SimpleExtensionDeclaration; | ||
|
||
#[derive(Default)] | ||
pub struct Extensions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please document this struct (I think you have a lot of context on the PR description you could just copy here)
Specifically I think it would help to document what the u32
was / represented (is it a substrait id or something?
Likewise I think we should document the various functions below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added docs in 21654ec
|
||
for (t_anchor, t_name) in val.types { | ||
let type_extension = ExtensionType { | ||
extension_uri_reference: u32::MAX, // We don't register proper extension URIs yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe file a ticket to track adding URIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None => not_impl_err!("Cannot parse empty extension"), | ||
}) | ||
.collect::<Result<HashMap<_, _>>>()?; | ||
let extensions = Extensions::try_from(&plan.extensions)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice 👨🍳 👌
/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`]. | ||
/// | ||
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval | ||
/// [`IntervalUnit::MonthDayNano`]: datafusion::arrow::datatypes::IntervalUnit::MonthDayNano | ||
pub const INTERVAL_MONTH_DAY_NANO_TYPE_URL: &str = "interval-month-day-nano"; | ||
pub const INTERVAL_MONTH_DAY_NANO_TYPE_NAME: &str = "interval-month-day-nano"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good -- a url to another file sounds a lot like XML .xsd
files from back in the day 😱
@@ -55,6 +55,7 @@ pub const DECIMAL_256_TYPE_VARIATION_REF: u32 = 1; | |||
/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval | |||
/// [`IntervalUnit::YearMonth`]: datafusion::arrow::datatypes::IntervalUnit::YearMonth | |||
/// [`ScalarValue::IntervalYearMonth`]: datafusion::common::ScalarValue::IntervalYearMonth | |||
#[deprecated] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can add a note here that they were deprecated in version 41 and leave a suggestion of what to do with them instead?
Something like
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use substrait::proto::extensions::SimpleExtensionDeclaration; | ||
|
||
#[derive(Default)] | ||
pub struct Extensions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added docs in 21654ec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you very much @Blizzara
* support reading type extensions in consumer * read extension for UDTs * support also type extensions in producer * produce extensions for MonthDayNano UDT * unify extensions between consumer and producer * fixes * add doc comments * add extension tests * fix * fix docs * fix test * fix clipppy
* support reading type extensions in consumer * read extension for UDTs * support also type extensions in producer * produce extensions for MonthDayNano UDT * unify extensions between consumer and producer * fixes * add doc comments * add extension tests * fix * fix docs * fix test * fix clipppy
Which issue does this PR close?
Closes #.
Rationale for this change
#10646 introduced Substrait UDTs for interval types. However as that PR didn't yet write proper Substrait extensions for those types, it's hard to produce matching UDTs from other Substrait producers. Until substrait-io/substrait#665 gets merged and passed through to Java & DataFusion, I'd need a UDT to represent Spark's CalendarInterval / DF's IntervalMonthDayNano.
What changes are included in this PR?
Produce and consume SimpleExtensionDeclarations for types in addition to functions. This makes user-defined types interoperable across different consumers/producers, assuming they utilize matching names for the same UDT.
Also adds most of the support for doing it for type variations, but doesn't use that yet.
Similarly as for the existing support for functions, this doesn't read any actual extension files, so the extension URL references are bogus (max int). However this makes it easier to add that support in future.
Are these changes tested?
Existing unit tests, especially the round trip tests for IntervalMonthDayNano type and literal
Are there any user-facing changes?