-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix: Map functions crash on out of bounds cases #16203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Detect if array elements contain structs with null-typed fields and emit a proper null array instead of trying to extract malformed data. This prevents Arrow's NullArray buffer validation from panicking.
1. Reproduce the state in the issue and test the fix has worked. 2. Case where the array has mixed null fields. 3. Case with out-of-bounds array access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR resolves a crash occurring during array indexing when map functions encounter structs with Null-typed fields. Key changes include adding detection of structs containing Null fields in extract.rs and returning a properly typed null array, along with comprehensive tests for the crash scenario, mixed null/valid fields, and out-of-bounds access.
| let mut null_array_data = Vec::with_capacity(array.len()); | ||
| for _ in 0..array.len() { | ||
| null_array_data.push(None::<i32>); | ||
| } | ||
|
|
Copilot
AI
May 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable 'null_array_data' is initialized and populated but never used, which may lead to confusion. Consider removing this unused code to simplify the function.
| let mut null_array_data = Vec::with_capacity(array.len()); | |
| for _ in 0..array.len() { | |
| null_array_data.push(None::<i32>); | |
| } |
| } | ||
|
|
||
| if let DataType::Struct(fields) = values.data_type() { | ||
| let has_null_fields = fields.iter().any(|field| field.data_type() == &Null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help me to understand why we check by structure field types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, now I realize the problem happens only if the struct has with with null as the value
| let has_null_fields = fields.iter().any(|field| field.data_type() == &Null); | ||
| if has_null_fields { | ||
| // Instead of trying to extract from malformed struct data and return appropriate nulls | ||
| let mut null_array_data = Vec::with_capacity(array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to return I think we can return NullArray like here?
if values.data_type().is_null() {
return Ok(Arc::new(NullArray::new(array.len())));
}
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @krishvishal
Please add SQL tests as well to array.slt file
what comes to my mind is
select [named_struct('a', 1, 'b', null)][0];
select [named_struct('a', 1, 'b', null)][1];
select [named_struct('a', 1, 'b', null)][-1];
select [named_struct('a', 1, 'b', null)][2]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[0]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[1]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[-1]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[2]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[0]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[1]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[-1]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[2]
;
|
@krishvishal are you still planning to wrap this PR up? |
returning all nulls
Improves upon the previous fix that detected null-typed fields in
structs but incorrectly returned all null results. Now properly
extracts valid struct elements while handling null-typed fields
gracefully.
Changes:
- Replace blanket null return with handle_struct_with_null_fields()
- Process each struct field individually using MutableArrayData for valid fields
- Create proper NullArrays only for null-typed fields
- Maintain correct null buffer for the overall result
This allows queries like [named_struct('a', 1, 'b', null)][1] to correctly
return {a: 1, b: NULL} instead of NULL.
Fixes: apache#16187
acess of struct arrays with null fields
|
@comphead I've changed the implementation a bit to handle nulls properly. Previous fix just outputs I've also added tests to I think its ready. Apologies for the delay. |
|
Thanks @krishvishal the latest version becomes much more complicated compared to prev one. This can be a subject to check the performance. What is the reason for adding the specific handler, some tests not passing? |
|
@comphead I've had to add the handler because the previous fix caused wrong behavior. For example the following query currently returns: > select [named_struct('a', 1, 'b', null)][1];
+-----------------------------------------------------------------------+
| make_array(named_struct(Utf8("a"),Int64(1),Utf8("b"),NULL))[Int64(1)] |
+-----------------------------------------------------------------------+
| {a: 1, b: NULL} |
+-----------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.This is correct. But my previous fix (commit: d9a699f) returned |
|
|
||
| // array is null | ||
| if len == O::usize_as(0) { | ||
| if array.is_null(row_index) || len == O::usize_as(0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems these 2 if statements can be merged into 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding
is_null() checks if an element at an index is NULL.
2nd condition checks if the element is an empty list.
They are checking different things right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right I had a feeling we can merge them like
if array.is_null(row_index) || len == O::usize_as(0) || indexes.is_null(row_index) {
mutable.extend_nulls(1);
continue;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I understand what you mean. Yes they can be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| i64: TryInto<O>, | ||
| { | ||
| let index: O = index.try_into().map_err(|_| { | ||
| DataFusionError::Execution(format!("array_element got invalid index: {index}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| DataFusionError::Execution(format!("array_element got invalid index: {index}")) | |
| exec_err!("array_element got invalid index: {index}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
- merge conditionals - use `exec_err!` - Fix warning raised by clippy
|
@comphead I've pushed additional fixes you've asked. I think the PR is ready. Let me know if it is okay. |
|
Thanks @krishvishal let me quickly double check if we can do it with less data massaging. Since this will happen on execution layer it would be called for every batch of data possibly hitting the performance |
|
@comphead, can you please tell if there is something I can do to move this forward? Are there any relevant benches I could either run or adapt for this case? |
Sorry I didn't have a chance to look into that, I'm planning to do it this week |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
|
Closing in favor of #16802 |
Which issue does this PR close?
nullvalue #16187Rationale for this change
This PR fixes a panic crash in array indexing operations when accessing elements from
map_values()results that contain structs withNull-typed fields.What changes are included in this PR?
datafusion/functions-nested/src/extract.rs: Added null field detection and safe handlingAre these changes tested?
Yes - unit tests are added covering:
Null-typed fieldsAre there any user-facing changes?
No, only behavior change.