[WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays by Yicong-Huang · Pull Request #53822 · apache/spark

Yicong-Huang · 2026-01-15T21:55:04Z

What changes were proposed in this pull request?

Initialize the Arrow ListArray offset buffer at ArrayWriter construction and after reset to ensure it contains [0] even when no elements are written.

Why are the changes needed?

The Arrow columnar format requires a ListArray's offset buffer to hold N+1 entries for N lists. Even when N=0, the buffer must contain a single entry, [0].
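As a rough illustration of the N+1 rule, here is a minimal pure-Python sketch. The helper `list_offsets` is hypothetical and does not call Arrow; it only models the logical offsets of a list array without nulls.

```python
from typing import List, Sequence


def list_offsets(rows: Sequence[Sequence[object]]) -> List[int]:
    """Return the logical offsets for a list array holding `rows`.

    Hypothetical helper: it models the layout only and ignores nulls.
    """
    offsets = [0]  # the first entry is always present, even for zero rows
    for row in rows:
        # each list ends where the previous one ended plus its own length
        offsets.append(offsets[-1] + len(row))
    return offsets


print(list_offsets([[1, 2], [], [3]]))  # three lists -> [0, 2, 2, 3]
print(list_offsets([]))                 # zero lists  -> [0]
```

The second call is the case this PR is about: with no lists at all, the offsets still hold one entry.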

When an outer array is empty, the nested ArrayWriters are never invoked, so their count stays 0. Arrow Java's getBufferSizeFor(0) returns 0, so the offset buffer is omitted during IPC serialization, which violates the Arrow spec. PyArrow then crashes with SIGSEGV when it tries to read the malformed Arrow data.
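To make the nested case concrete, the following pure-Python sketch computes the logical offsets at every nesting level. The function `nested_offsets` is a hypothetical model, not Spark or Arrow code, and it assumes no nulls.

```python
from typing import Any, List


def nested_offsets(rows: List[Any], depth: int) -> List[List[int]]:
    """Return one offsets list per nesting level, outermost first.

    Hypothetical helper: `rows` holds `depth`-times nested lists.
    """
    levels = []
    current = rows
    for _ in range(depth):
        offsets = [0]
        children: List[Any] = []
        for item in current:
            children.extend(item)
            offsets.append(len(children))
        levels.append(offsets)
        current = children  # the flattened children feed the next level
    return levels


# One row whose outermost array is empty, for a triple-nested array type.
print(nested_offsets([[]], 3))  # [[0, 0], [0], [0]]
```

The two inner levels contain zero lists, yet each still needs its single [0] entry. Those are the levels whose writers are never invoked in the scenario described above.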

The fix directly initializes the offset buffer with valueVector.getOffsetBuffer.setInt(0, 0) at construction and after reset, ensuring the buffer exists regardless of whether any elements are written.
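The shape of the fix can be sketched with a toy writer in Python. `ToyListWriter` is a hypothetical stand-in, not Spark's ArrayWriter; it only shows the idea of establishing offset 0 both at construction and after reset, so the offsets are never empty.

```python
from typing import List


class ToyListWriter:
    """Toy stand-in for a list writer; not Spark's ArrayWriter."""

    def __init__(self) -> None:
        self.offsets: List[int] = []
        self.count = 0
        self._init_offsets()

    def _init_offsets(self) -> None:
        # Mirrors the idea of setInt(0, 0): offset 0 exists before any write.
        self.offsets = [0]

    def write(self, values: List[int]) -> None:
        # Append the end position of the newly written list.
        self.offsets.append(self.offsets[-1] + len(values))
        self.count += 1

    def reset(self) -> None:
        self.count = 0
        self._init_offsets()  # re-establish [0] after clearing state


writer = ToyListWriter()
print(writer.offsets)  # [0], although nothing was written
writer.write([1, 2, 3])
writer.reset()
print(writer.offsets)  # [0] again after reset
```

Initializing in both places matters because a writer that is reused across batches would otherwise only be correct for its first batch.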

Does this PR introduce any user-facing change?

Yes. toPandas() on triple-nested empty arrays no longer crashes.

How was this patch tested?

2 Scala unit tests in ArrowWriterSuite
4 Python integration tests in test_arrow.py

Was this patch authored or co-authored using generative AI tooling?

No

github-actions · 2026-01-15T21:55:15Z

JIRA Issue Information

=== Bug SPARK-55056 ===
Summary: toPandas() crashes with SIGSEGV on nested empty arrays
Assignee: None
Status: Open
Affected: ["4.2.0"]

This comment was automatically generated by GitHub Actions

viirya · 2026-01-16T02:27:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

+  // SPARK-55056: Arrow format requires ListArray offset buffer to have N+1 entries.
+  // Even when N=0, the buffer must contain [0]. Initialize offset buffer at construction
+  // to ensure it exists even if no elements are written.
+  valueVector.getOffsetBuffer.setInt(0, 0)


Might the offset buffer be empty? Should we check that it is allocated with the required size?

It should never be empty. According to Arrow's columnar format documentation and its example, a List offset buffer always has N+1 entries. Since a list array contains at least 0 elements, its offset buffer has at least 1 entry.

I think we don't need to check the allocated size.

The offset buffer is guaranteed to be allocated at this point. In ArrowWriter.create(), we call vector.allocateNew() before createFieldWriter():

def create(root: VectorSchemaRoot): ArrowWriter = {
  val children = root.getFieldVectors().asScala.map { vector =>
    vector.allocateNew()  // allocates all buffers including nested children
    createFieldWriter(vector)
  }
  ...
}

Arrow's ListVector.allocateNew() recursively allocates buffers for all nested child vectors, so when the ArrayWriter constructor runs, the offset buffer already exists with sufficient capacity.

Hmm, when the offset buffer for a ListVector is allocated, index 0 should already be set to zero.

Yicong-Huang · 2026-01-27T01:01:40Z

We fixed this upstream in arrow-java: apache/arrow-java#967

Yicong-Huang added 2 commits January 15, 2026 13:51

fix: empty array offset should be 0

50963c3

fix: format

b4b7a4b

github-actions bot added SQL PYTHON labels Jan 15, 2026

Yicong-Huang added 2 commits January 15, 2026 16:41

fix: change constructor instead

9c926d3

fix: add the same for reset

33ef752

Yicong-Huang force-pushed the SPARK-55056/fix/nested-empty-array-sigsegv branch from 1f8c172 to 33ef752 Compare January 16, 2026 01:02

zhengruifeng requested review from HyukjinKwon and viirya January 16, 2026 02:13

viirya reviewed Jan 16, 2026

Yicong-Huang mentioned this pull request Jan 16, 2026

[WIP][SPARK-55059][PYTHON] Remove empty table workaround in toPandas #53824

Draft

Yicong-Huang marked this pull request as draft January 21, 2026 22:38

Yicong-Huang changed the title ~~[SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays~~ [WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays Jan 21, 2026

Yicong-Huang wants to merge 4 commits into apache:master from Yicong-Huang:SPARK-55056/fix/nested-empty-array-sigsegv

2 participants
