[WIP][SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in `df.toPandas` #52680

zhengruifeng · 2025-10-21T10:31:30Z

What changes were proposed in this pull request?

Avoid creating an intermediate pandas DataFrame in `df.toPandas`.

Before: batches -> table -> intermediate pdf -> result pdf (based on `pa.Table.to_pandas`)

After: batches -> table -> result pdf (based on `pa.ChunkedArray.to_pandas`)

Why are the changes needed?

The intermediate pandas DataFrame can be skipped, which removes one conversion step.

A simple benchmark on my local machine:


spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

import time
from pyspark.sql import functions as sf


df = spark.range(1000000).select(
    (sf.col("id") % 2).alias("key"), sf.col("id").alias("v")
)
cols = {f"col_{i}": sf.lit(f"c{i}") for i in range(100)}
df = df.withColumns(cols)
df.cache()
df.count()  # materialize the cache

pdf = df.toPandas()  # warm up

# Time 100 toPandas calls.
start_arrow = time.perf_counter()
for _ in range(100):
    pdf = df.toPandas()

time.perf_counter() - start_arrow

Total time for the 100 timed `toPandas` calls:

master: 304.49954012501985 secs
this PR: 285.2997682078276 secs
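For reference, the relative improvement implied by these two totals can be worked out as below. This is plain arithmetic on the reported numbers, not an additional measurement; the variable names are illustrative.

```python
# Totals reported above, each covering 100 toPandas calls.
master_secs = 304.49954012501985
pr_secs = 285.2997682078276
runs = 100

saved_secs = master_secs - pr_secs            # total time saved over the loop
speedup_pct = saved_secs / master_secs * 100  # reduction relative to master

print(f"per call on master: {master_secs / runs:.3f} s")
print(f"per call on this PR: {pr_secs / runs:.3f} s")
print(f"time saved: {saved_secs:.2f} s ({speedup_pct:.1f}%)")
```

This is a single local run, so the difference should be read as indicative rather than precise.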

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng · 2025-10-23T02:15:34Z

cc @Yicong-Huang

zhengruifeng added 3 commits October 21, 2025 18:27

test

90859f0

test

dea1fad

test

adc17b5

github-actions bot added SQL PYTHON labels Oct 21, 2025

zhengruifeng changed the title ~~[WIP][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas~~ [SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas Oct 21, 2025

zhengruifeng requested review from HyukjinKwon and ueshin and removed request for HyukjinKwon October 23, 2025 02:15

HyukjinKwon approved these changes Oct 23, 2025

zhengruifeng marked this pull request as draft October 24, 2025 01:45

zhengruifeng changed the title ~~[SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas~~ [WIP][SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas Oct 24, 2025
