[WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays#53822
[WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays#53822Yicong-Huang wants to merge 4 commits intoapache:masterfrom
Conversation
JIRA Issue Information=== Bug SPARK-55056 === This comment was automatically generated by GitHub Actions |
1f8c172 to
33ef752
Compare
| // SPARK-55056: Arrow format requires ListArray offset buffer to have N+1 entries. | ||
| // Even when N=0, the buffer must contain [0]. Initialize offset buffer at construction | ||
| // to ensure it exists even if no elements are written. | ||
| valueVector.getOffsetBuffer.setInt(0, 0) |
There was a problem hiding this comment.
Might the offset buffer be empty? Should we check if it is allocated with required size?
There was a problem hiding this comment.
It should never be empty. according to arrow's columnar documentation and its example, the List offset is always N+1. As a list contains at least 0 elements, its offset is at least 1.
There was a problem hiding this comment.
I think we don't need to check the allocated size.
The offset buffer is guaranteed to be allocated at this point. In ArrowWriter.create(), we call vector.allocateNew() before createFieldWriter():
def create(root: VectorSchemaRoot): ArrowWriter = {
val children = root.getFieldVectors().asScala.map { vector =>
vector.allocateNew() // allocates all buffers including nested children
createFieldWriter(vector) }
...
}
Arrow's ListVector.allocateNew() recursively allocates buffers for all nested child vectors, so when the ArrayWriter constructor runs, the offset buffer already exists with sufficient capacity.
There was a problem hiding this comment.
Hmm, when allocating offset buffer for ListVector, it should already set zero to the index 0.
|
We fixed it in the upstream arrow-java apache/arrow-java#967 |
What changes were proposed in this pull request?
Initialize the Arrow ListArray offset buffer at
ArrayWriterconstruction and after reset to ensure it contains[0]even when no elements are written.Why are the changes needed?
Arrow format requires ListArray offset buffer to have N+1 entries. Even when N=0, the buffer must contain
[0].When an outer array is empty, nested
ArrayWriters are never invoked, so their count stays 0. Arrow Java'sgetBufferSizeFor(0)returns 0, causing the offset buffer to be omitted in IPC serialization — violating Arrow spec. This causes SIGSEGV when PyArrow tries to read the malformed Arrow data.The fix directly initializes the offset buffer with
valueVector.getOffsetBuffer.setInt(0, 0)at construction and after reset, ensuring the buffer exists regardless of whether any elements are written.Does this PR introduce any user-facing change?
Yes.
toPandas()on triple-nested empty arrays no longer crashes.How was this patch tested?
ArrowWriterSuitetest_arrow.pyWas this patch authored or co-authored using generative AI tooling?
No