-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Enable non-uniform field type for structs created in DataFusion #8463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
waynexia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me.
BTW, I think we also need a slt test for struct() function in the original issue
datafusion/expr/src/signature.rs
Outdated
| /// One or more arguments with arbitrary types | ||
| VariadicAny, | ||
| /// arbitrary number of arguments out of a possibly different but limited set of valid types | ||
| VariadicLimited(Vec<DataType>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have Variadic, VariadicLimited and VariadicEqual, VariadicAny. Their behaviors are similar, and I'm wondering how we can better reflect their difference in naming.
The tail Limited also applies to Variadic. So Variadic is in actual "variadic" + "equal" + "limited", and the new one is "variadic" + "any" + "limited". How about changing Variadic to VariadicLimitedEqual and naming the new one VariadicLimitedAny? But they are a bit verbose, is there any other options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had similar thoughts about the ambiguity of the names and went with VariadicLimited because I didn't want to change too much.
I am pro-verbosity: tab-completion makes verbosity less painful but there is no similar thing for ambiguity.
I wasn't sure where such a test would go or how it would look. Can you point me to an example? |
|
To resolve these conflicts should I upmerge main? (I started from |
|
I believe this will also close the underlying issue reported in #7012 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @dlovell . Very much appreciated.
I think any change in DataFusion needs to have some end to end tests in sqllogictests -- see https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest -- perhaps you can add in the test from #8118
In this case I was thinking about it, and I don't see any reason we need to restrict the types of arguments to the struct function at all. Arrow imposes no restrictions of the types of fields in a StructArray so I don't see why DataFusion is imposing any restrictions either
When I took out all the checks for argument type like this:
diff --git a/datafusion/expr/src/built_in_function.rs b/datafusion/expr/src/built_in_function.rs
index d48e9e7a6..1a9822d76 100644
--- a/datafusion/expr/src/built_in_function.rs
+++ b/datafusion/expr/src/built_in_function.rs
@@ -959,8 +959,7 @@ impl BuiltinScalarFunction {
],
self.volatility(),
),
- BuiltinScalarFunction::Struct => Signature::variadic(
- struct_expressions::SUPPORTED_STRUCT_TYPES.to_vec(),
+ BuiltinScalarFunction::Struct => Signature::variadic_any(
self.volatility(),
),
BuiltinScalarFunction::Concat
diff --git a/datafusion/physical-expr/src/struct_expressions.rs b/datafusion/physical-expr/src/struct_expressions.rs
index 0eed1d16f..1d30702bb 100644
--- a/datafusion/physical-expr/src/struct_expressions.rs
+++ b/datafusion/physical-expr/src/struct_expressions.rs
@@ -34,31 +34,14 @@ fn array_struct(args: &[ArrayRef]) -> Result<ArrayRef> {
.enumerate()
.map(|(i, arg)| {
let field_name = format!("c{i}");
- match arg.data_type() {
- DataType::Utf8
- | DataType::LargeUtf8
- | DataType::Boolean
- | DataType::Float32
- | DataType::Float64
- | DataType::Int8
- | DataType::Int16
- | DataType::Int32
- | DataType::Int64
- | DataType::UInt8
- | DataType::UInt16
- | DataType::UInt32
- | DataType::UInt64 => Ok((
+ Ok((
Arc::new(Field::new(
field_name.as_str(),
arg.data_type().clone(),
true,
)),
arg.clone(),
- )),
- data_type => {
- not_impl_err!("Struct is not implemented for type '{data_type:?}'.")
- }
- }
+ ))
})
.collect::<Result<Vec<_>>>()?;The test case from @yukkit seems to create the field types unchanged (as verified using arrow_typeof):
DataFusion CLI v33.0.0
❯ CREATE TABLE values(
c0 INT,
c1 String,
c2 String
) AS VALUES
(1, 'a', 'a'),
(2, 'b', 'b'),
(3, 'c', 'c');
0 rows in set. Query took 0.015 seconds.
❯ select arrow_typeof(struct(c0, c1, c2)) from VALUES;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(struct(values.c0,values.c1,values.c2)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set. Query took 0.005 seconds.Maybe @Ted-Jiang who seems to have added struct in #2389 has some contet about why the types were restricted
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this changelog file was modified in this PR. Perhaps it was a mistake 🤔
It seems to be some part of a change added in #8144
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working from branch-33 so I could build datafusion-python 33.0.0 against the changes.
Should I rebase onto main?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working from
branch-33so I could builddatafusion-python33.0.0against the changes.Should I rebase onto
main?
Yes please (when we are ready to merge this)
| Signature::one_of(vec![VariadicAny, Any(0)], self.volatility()) | ||
| } | ||
| BuiltinScalarFunction::Struct => Signature::variadic( | ||
| BuiltinScalarFunction::Struct => Signature::variadic_limited( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using Signature::variadic_any? That I think would pass along the argument types directly (not try to coerce them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the simplest solution. If I make that change, I presume I should remove TypeSignature::VariadicLimited?
Ah, I wasn't aware of |
|
@alamb Looking into it, its not clear to me how @yukkit 's test is different from struct scalar function with columns #1, which already exists but also passes. |
|
@alamb I think I get it: I should test the explain output (which does change with the changes to struct), not the value that is written to the slt file by |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dlovell -- this looks great!
| Signature::one_of(vec![VariadicAny, Any(0)], self.volatility()) | ||
| } | ||
| BuiltinScalarFunction::Struct => Signature::variadic( | ||
| struct_expressions::SUPPORTED_STRUCT_TYPES.to_vec(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can also remove SUPPORTED_STRUCT_TYPES entirely (I don't think it is used anywhere else)
We can do so as a follow on PR too.
| {c0: 2, c1: 2.2, c2: b} | ||
| {c0: 3, c1: 3.3, c2: c} | ||
|
|
||
| # explain struct scalar function with columns #1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I double checked and before this PR this query results in casts
❯ explain select struct(a, b, c) from values;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: struct(CAST(values.a AS Utf8), CAST(values.b AS Utf8), values.c) |
| | TableScan: values projection=[a, b, c] |
| physical_plan | ProjectionExec: expr=[struct(CAST(a@0 AS Utf8), CAST(b@1 AS Utf8), c@2) as struct(values.a,values.b,values.c)] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.
|
@dlovell -- this PR has conflicts that need to be resolved before I can merge into main (basically we need to rebase this PR against main) Is that something you would like to do? |
2536cda to
4474b07
Compare
4474b07 to
323ceb0
Compare
|
@alamb I did the rebase, let me know if there's anything that needs to be changed. |
|
Thanks again @dlovell -- BTW I love bug fixes that involve deleting code ❤️ |
|
🤔 looks like we need to run |
|
This looks great! |
|
@alamb I ran |
1 similar comment
|
@alamb I ran |
|
@alamb apologies, just found out about |
|
No worries -- thanks for sticking with it @dlovell |
…he#8463) * feat: struct: implement variadic_any solution, enable all struct field types * fix: run cargo-fmt * cln: remove unused imports
…he#8463) * feat: struct: implement variadic_any solution, enable all struct field types * fix: run cargo-fmt * cln: remove unused imports
…he#8463) * feat: struct: implement variadic_any solution, enable all struct field types * fix: run cargo-fmt * cln: remove unused imports
Which issue does this PR close?
Closes #8118
Rationale for this change
Datafusion can read in data with / have structs with fields of different types, but currently can't create a struct column with fields of different types: struct creation casts of all the struct fields to the same type.
What changes are included in this PR?
This PR defines
TypeSignature::VariadicLimitedand assigns it as the return type ofBuiltinScalarFunction::Struct.TypeSignature::VariadicLimitedis like aTypeSignature::VariadicAnywith a limited definition ofAnyand is distinct fromTypeSignature::Variadicin that it does not cast any fields.Are these changes tested?
There is a test for a successful and an unsuccessful invocation of
get_valid_typesforTypeSignature::VariadicLimitedAre there any user-facing changes?
The type of the generated struct column will be different: users who are expecting a cast will be surprised.