Describe the bug
When slicing a string array, we advance the pointer of the offset buffer and keep the data buffer pointer unchanged. When exporting the slice through FFI, we simply export the advanced pointer of the offset buffer.
When importing the array, we calculate the length of the data buffer as the difference between the last offset and the first offset in the (sliced) offset buffer. This calculated length is not correct.
For example, suppose the original string array's data buffer is 346536 bytes, so its last offset is 346536. We take a slice of 8192 strings from it, and the sliced offsets are [147456, ..., 294912]. The calculated length is 294912 - 147456 = 147456, but the actual length of the data buffer is 346536. So the data buffer of the imported array has an incorrect length.
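The arithmetic in this example can be checked with a small standalone sketch. It does not use the arrow crates; `derived_data_len` and `required_data_len` are hypothetical helper names introduced only for illustration, and only the first and last offsets from the example are used because the intermediate ones are elided.

```rust
/// Length the importer derives from the sliced offsets: last - first.
fn derived_data_len(offsets: &[i32]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(first), Some(last)) => (last - first) as usize,
        _ => 0,
    }
}

/// Minimum length the data buffer needs so that every offset is in range,
/// given that the data pointer still points at the start of the original buffer.
fn required_data_len(offsets: &[i32]) -> usize {
    offsets.last().map_or(0, |last| *last as usize)
}

fn main() {
    // First and last offsets of the slice from the example above.
    let sliced_offsets = [147456_i32, 294912];

    let derived = derived_data_len(&sliced_offsets);
    let required = required_data_len(&sliced_offsets);

    println!("derived = {derived}, required >= {required}");
    // The derived length is too short to cover the last offset.
    assert!(derived < required);
}
```

Because the data pointer is not moved, the offsets stay relative to the start of the original buffer, so any length smaller than the last offset leaves part of the slice out of range.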
This has not caused issues so far because we mostly access the imported data buffer through pointers (and we don't actually check the range). But in cases where we access the data as a slice (i.e., with []), for example when we extend the array, it causes a runtime panic like:
---- ffi::tests_from_ffi::test_extend_imported_string_slice stdout ----
thread 'ffi::tests_from_ffi::test_extend_imported_string_slice' panicked at arrow-data/src/transform/variable_size.rs:38:29:
range end index 10890 out of range for slice of length 5500
Note that test_extend_imported_string_slice is a new test I added in apache/arrow#5895.
It also means the exported array possibly cannot be imported and used by other Arrow implementations such as Java Arrow (apache/arrow-java#74), because Java Arrow's buffer implementation, ArrowBuf, may check index ranges.
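The difference between unchecked pointer access and range-checked access can be illustrated with a minimal sketch. It uses plain slices rather than the arrow crates or ArrowBuf; `value_bytes` and the three-string buffer are made up for illustration and are not part of any Arrow API.

```rust
/// Returns the bytes of string `i`, or `None` if its offsets fall outside `data`.
/// `data` stands in for the imported data buffer, `offsets` for the sliced offsets.
fn value_bytes<'a>(data: &'a [u8], offsets: &[i32], i: usize) -> Option<&'a [u8]> {
    let start = *offsets.get(i)? as usize;
    let end = *offsets.get(i + 1)? as usize;
    // Checked access; `&data[start..end]` would panic on an out-of-range end.
    data.get(start..end)
}

fn main() {
    // Original data buffer holding three strings: "apple", "banana", "cherry".
    let original = b"applebananacherry";
    // Offsets of a slice that keeps only the last two strings.
    let sliced_offsets = [5_i32, 11, 17];

    // Length as derived on import: last - first = 12, shorter than the real 17.
    let derived_len = (sliced_offsets[2] - sliced_offsets[0]) as usize;
    let imported = &original[..derived_len];

    assert_eq!(value_bytes(imported, &sliced_offsets, 0), Some(&b"banana"[..]));
    // The second string ends at byte 17, past the derived length of 12.
    assert_eq!(value_bytes(imported, &sliced_offsets, 1), None);
    // With the true buffer length, both strings are readable.
    assert_eq!(value_bytes(original, &sliced_offsets, 1), Some(&b"cherry"[..]));
}
```

Any consumer that checks ranges against the recorded buffer length, whether by slice indexing or by an explicit bounds check, rejects the second string here, while raw pointer reads would not notice.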
To Reproduce
Expected behavior
Additional context
The bug was found during debugging apache/datafusion-comet#540.