@zhengruifeng (Contributor) commented Oct 21, 2025

What changes were proposed in this pull request?

Avoid intermediate pandas dataframe creation in df.toPandas

before: batches -> table -> intermediate pdf -> result pdf (based on pa.Table.to_pandas)

after: batches -> table -> result pdf (based on pa.ChunkedArray.to_pandas)

Why are the changes needed?

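The intermediate pandas DataFrame is unnecessary: the result DataFrame can be built directly from the columns of the Arrow table, so one conversion step is skipped.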

A simple benchmark on my local machine:


spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

import time
from pyspark.sql import functions as sf


df = spark.range(1000000).select(
    (sf.col("id") % 2).alias("key"), sf.col("id").alias("v")
)
cols = {f"col_{i}": sf.lit(f"c{i}") for i in range(100)}
df = df.withColumns(cols)
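# Cache the DataFrame and materialize it so the benchmark measures toPandas only.
df.cache()
df.count()

pdf = df.toPandas()  # warm up

# Time 100 consecutive toPandas() calls on the cached DataFrame.
start_arrow = time.perf_counter()
for _ in range(100):
    pdf = df.toPandas()

print(time.perf_counter() - start_arrow)  # total elapsed seconds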

Total time for the 100 timed calls:

master: 304.49954012501985 secs
this PR: 285.2997682078276 secs
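To put the two totals in perspective, the arithmetic below derives the per-call time and the relative improvement. It uses only the two numbers reported above and the 100 iterations of the timed loop; the variable names are illustrative.

```python
# Totals reported above, in seconds, for 100 toPandas() calls each.
master_total = 304.49954012501985
pr_total = 285.2997682078276
iterations = 100

# Average time per toPandas() call.
master_per_call = master_total / iterations
pr_per_call = pr_total / iterations

# Relative reduction in total time compared with master.
improvement_pct = (master_total - pr_total) / master_total * 100

print(f"master:  {master_per_call:.3f} s per call")
print(f"this PR: {pr_per_call:.3f} s per call")
print(f"improvement: {improvement_pct:.1f}%")
```

This works out to roughly 3.045 s versus 2.853 s per call, about a 6.3% reduction on this single local run.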

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng changed the title from "[WIP][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas" to "[SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas" Oct 21, 2025
@zhengruifeng zhengruifeng requested review from HyukjinKwon and ueshin and removed request for HyukjinKwon October 23, 2025 02:15
@zhengruifeng (Contributor, Author)

cc @Yicong-Huang

@zhengruifeng zhengruifeng marked this pull request as draft October 24, 2025 01:45
@zhengruifeng changed the title from "[SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas" to "[WIP][SPARK-53967][PYTHON] Avoid intermediate pandas dataframe creation in df.toPandas" Oct 24, 2025