[SPARK-52811][PYTHON] Optimize ArrowTableToRowsConversion.convert to improve its performance
### What changes were proposed in this pull request?
Optimizes `ArrowTableToRowsConversion.convert` to improve its performance, similar to apache#51482.
- Calculate `fields` in advance
- Move conversions to `columnar_data` creation
- Build `rows` with a list comprehension to avoid expensive `list.append` calls
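The general idea behind the last two bullets can be illustrated with a minimal, standalone sketch. This is not the actual `ArrowTableToRowsConversion.convert` code: the function names, the `converters` list, and the use of plain Python lists as columns are all assumptions made for illustration. It contrasts converting values inside a per-row loop with `list.append` against converting each column once and then assembling rows with a comprehension.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical per-column converter; None means "no conversion needed".
Converter = Optional[Callable[[Any], Any]]


def rows_with_append(
    columns: List[List[Any]], converters: List[Converter]
) -> List[Tuple[Any, ...]]:
    """Row-at-a-time: convert each value inside the row loop and append."""
    rows = []
    num_rows = len(columns[0]) if columns else 0
    for i in range(num_rows):
        values = []
        for col, conv in zip(columns, converters):
            v = col[i]
            values.append(v if conv is None else conv(v))
        rows.append(tuple(values))
    return rows


def rows_with_comprehension(
    columns: List[List[Any]], converters: List[Converter]
) -> List[Tuple[Any, ...]]:
    """Column-at-a-time: convert each column once, then zip into rows."""
    # Conversions happen while building the columnar data, so columns
    # that need no conversion are passed through untouched.
    columnar_data = [
        col if conv is None else [conv(v) for v in col]
        for col, conv in zip(columns, converters)
    ]
    # Rows are built in a single comprehension, with no list.append calls.
    return [tuple(values) for values in zip(*columnar_data)]


columns = [[1, None, 3], ["a", "b", "c"]]
converters = [None, str.upper]
assert rows_with_append(columns, converters) == rows_with_comprehension(
    columns, converters
)
```

Both functions produce the same rows. The second one resolves "does this column need a converter?" once per column rather than once per value, which is the kind of per-value overhead the bullets above aim to remove.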
### Why are the changes needed?
`ArrowTableToRowsConversion.convert` has several sources of performance overhead.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The existing tests, and manual benchmarks.
```py
def profile(f, *args, _n=10, **kwargs):
    import cProfile
    import pstats
    import gc

    st = None
    for _ in range(5):
        f(*args, **kwargs)
    for _ in range(_n):
        gc.collect()
        with cProfile.Profile() as pr:
            ret = f(*args, **kwargs)
        if st is None:
            st = pstats.Stats(pr)
        else:
            st.add(pstats.Stats(pr))
    st.sort_stats("time", "cumulative").print_stats()
    return ret


from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
from pyspark.sql.types import *

data = [
    (i if i % 1000 else None, str(i), i)
    for i in range(1000000)
]
schema = (
    StructType()
    .add("i", IntegerType(), nullable=True)
    .add("s", StringType(), nullable=True)
    .add("ii", IntegerType(), nullable=False)
)


def to_arrow():
    return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)


def from_arrow(tbl):
    return ArrowTableToRowsConversion.convert(tbl, schema)


tbl = to_arrow()
profile(from_arrow, tbl)
```
- before
```
100983380 function calls in 24.509 seconds
```
- after
```
70655910 function calls in 16.947 seconds
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#51508 from ueshin/issues/SPARK-52811/convert.
Authored-by: Takuya Ueshin <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>