
Conversation

@rmnskb (Contributor) commented Nov 12, 2025

Rationale for this change

See #48096. This exposes the parquet.WriterProperties max_rows_per_page option in the Python API.

What changes are included in this PR?

Added the max_rows_per_page argument to write_table and ParquetWriter.

Are these changes tested?

Yes. Since the metadata exposed by PyArrow doesn't contain any information about the number of pages, a naive end-to-end test was used to verify the implementation's correctness.

Are there any user-facing changes?

Users can now set max_rows_per_page directly from PyArrow.

@rmnskb (Contributor, Author) commented Nov 12, 2025

As mentioned above, PyArrow doesn't store any metadata about Parquet pages. To verify that the argument works, I've created this sample parquet file:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3, 4, 5, 6, 7]})
pq.write_table(table, "/tmp/max_rows_per_page.parquet", max_rows_per_page=1)

Then I used parquet-cli from Parquet Java to inspect the page info. Here's the output:

> parquet-cli pages /tmp/max_rows_per_page.parquet

Column: x
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  S _  7       8.00 B     56 B
  0-1    data  S R  1       9.00 B     9 B                 0       "1" / "1"
  0-2    data  S R  1       9.00 B     9 B                 0       "2" / "2"
  0-3    data  S R  1       9.00 B     9 B                 0       "3" / "3"
  0-4    data  S R  1       9.00 B     9 B                 0       "4" / "4"
  0-5    data  S R  1       9.00 B     9 B                 0       "5" / "5"
  0-6    data  S R  1       9.00 B     9 B                 0       "6" / "6"
  0-7    data  S R  1       9.00 B     9 B                 0       "7" / "7"

So the Python binding does indeed produce one page per row in the case above.

@raulcd (Member) left a comment

Thanks for the PR! The docs job is failing, can you add it to the docstrings?

pyarrow.parquet.core.write_table
PR01: Parameters {'max_rows_per_page'} not documented

pyarrow.parquet.core.ParquetWriter
PR01: Parameters {'max_rows_per_page'} not documented

Total number of docstring violations: 2

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Nov 12, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 12, 2025
@rmnskb (Contributor, Author) commented Nov 12, 2025

Thanks for the PR! The docs job is failing, can you add it to the docstrings?

Thanks for pointing that out, added the argument to the docstrings.

@rmnskb rmnskb requested a review from raulcd November 12, 2025 21:00
@raulcd (Member) left a comment

LGTM, just a minor nit about the docs. @pitrou, page information is not exposed to PyArrow; any idea whether we could do anything more meaningful in the test, apart from checking that the property can be used?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Nov 13, 2025
@pitrou (Member) left a comment

I think this looks good with @raulcd 's suggestion. Thank you!

@pitrou (Member) commented Nov 13, 2025

@pitrou page information is not exposed to pyarrow, any idea if we could do anything more meaningful on the test? apart from testing that the property can be used

A crude test would be to check that a smaller max_rows_per_page produces a larger file. But for most Parquet properties we don't do any such testing anyway.

@rmnskb (Contributor, Author) commented Nov 14, 2025

A crude test would be to check that a smaller max_rows_per_page produces a larger file. But for most Parquet properties we don't do any such testing anyway.

I went ahead and implemented the test as you suggested. The size difference for a 1,000,000-row table was small, albeit noticeable.

@raulcd (Member) left a comment

Thanks! LGTM, I'm fine both with and without the test. I'll let others take a look before merging; if there are no more comments, I'll merge next week.

@raulcd raulcd merged commit 277faa9 into apache:main Nov 17, 2025
14 checks passed
@raulcd raulcd removed the awaiting merge Awaiting merge label Nov 17, 2025
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 277faa9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
