-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-48096 [Python][Parquet] Expose new WriterProperties::max_rows_per_page to Python bindings #48101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-48096 [Python][Parquet] Expose new WriterProperties::max_rows_per_page to Python bindings #48101
Conversation
|
As was mentioned above, PyArrow doesn't store any metadata about the parquet pages. To ensure that the argument works, I've create this sample parquet file: import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({"x": [1,2,3,4,5,6,7]})
pq.write_table(table, "/tmp/max_rows_per_page.parquet", max_rows_per_page=1)Then, I've used > parquet-cli pages /tmp/max_rows_per_page.parquet
Column: x
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict S _ 7 8.00 B 56 B
0-1 data S R 1 9.00 B 9 B 0 "1" / "1"
0-2 data S R 1 9.00 B 9 B 0 "2" / "2"
0-3 data S R 1 9.00 B 9 B 0 "3" / "3"
0-4 data S R 1 9.00 B 9 B 0 "4" / "4"
0-5 data S R 1 9.00 B 9 B 0 "5" / "5"
0-6 data S R 1 9.00 B 9 B 0 "6" / "6"
0-7 data S R 1 9.00 B 9 B 0 "7" / "7"So the Python binding does indeed produce one page per row in the case above. |
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! The docs job is failing, can you add it to the docstrings?
pyarrow.parquet.core.write_table
PR01: Parameters {'max_rows_per_page'} not documented
pyarrow.parquet.core.ParquetWriter
PR01: Parameters {'max_rows_per_page'} not documented
Total number of docstring violations: 2
Thanks for pointing that out, added the argument to the docstrings. |
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a minor nit about the docs. @pitrou page information is not exposed to pyarrow, any idea if we could do anything more meaningful on the test? apart from testing that the property can be used.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good with @raulcd 's suggestion. Thank you!
A crude test would be that a smaller |
I went ahead and implemented this test as you suggested. The size difference for 1000000 rows table was abysmal, albeit noticeable. |
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM, I am fine both with the test and without. I'll let others take a look before merging. I'll merge next week if no more comments.
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 277faa9. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
See #48096, exposes
parquet.WriterProperties max_rows_per_pageargument to Python's API.What changes are included in this PR?
Added the argument
Are these changes tested?
Yes, since the metadata doesn't have any info about the number of pages, a naive end-to-end test was used to ensure the implementation correctness.
Are there any user-facing changes?
The ability to set the
max_rows_per_pagedirectly from PyArrow.