Skip to content

refactor(python)!: Change Partition API to base_path and file_path #21888

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 30, 2025

Conversation

coastalwhite
Copy link
Collaborator

@coastalwhite coastalwhite commented Mar 21, 2025

This PR makes a breaking change to the currently unstable API for PartitionMaxSize, PartitionByKey and PartitionParted.

Tip

TLDR: You now provide a base_path an optional file_path callback. This allows a type-checked, more flexible and pythonic API.

This essentially exchanges the previous adhoc implementation with a more permanent one. This allows for way more flexibility and possibilities around creating files.

# Before: 
lf.sink_parquet(PartitionMaxSize('./{part}.parquet'))

# Now:
lf.sink_parquet(PartitionMaxSize('./'))

You now modify the path per file with a callback.

lf.sink_parquet(PartitionByKey(
    './',
    file_path=lambda ctx: f"{ctx.keys[0].str_value}.parquet",
    by="year",
))

Fixes #21886.
Fixes #21868.
Fixes #21848.

@coastalwhite coastalwhite changed the title refactor!: Change Partition API to base_path and file_path. refactor! Change Partition API to base_path and file_path Mar 21, 2025
@coastalwhite coastalwhite changed the title refactor! Change Partition API to base_path and file_path refactor(python)!: Change Partition API to base_path and file_path Mar 21, 2025
@github-actions github-actions bot added breaking Change that breaks backwards compatibility internal An internal refactor or improvement python Related to Python Polars and removed title needs formatting labels Mar 21, 2025
@ion-elgreco
Copy link
Contributor

How come PartitionByKey and PartitionMaxSize are actually separate things? If you do hive partitioning, in many cases you still want to split by row-size, filesize etc.

Wouldn't it be simpler to move these things into a kind of WriterProperties class e.g. https://delta-io.github.io/delta-rs/api/delta_writer/#deltalake.WriterProperties

@coastalwhite
Copy link
Collaborator Author

How come PartitionByKey and PartitionMaxSize are actually separate things? If you do hive partitioning, in many cases you still want to split by row-size, filesize etc.

I agree, I am not against adding options for max row size on PartitionByKey / PartitionParted. The implementations just does not support that at the moment (would definitely be possible though). I am quite against throwing it all on only pile of options and hoping we have implemented all option combinations correctly.

The original (and still current) idea was to have basic options like partition_by, partition_max_size etc. on the sink method which would translate to these partition classes automatically. But only for the options that are available on all partitioned writers. Then, if you needed specific features, you can use the classes for more control.

@coastalwhite
Copy link
Collaborator Author

Wouldn't it be simpler to move these things into a kind of WriterProperties class e.g. https://delta-io.github.io/delta-rs/api/delta_writer/#deltalake.WriterProperties

I seem to remember you having opened an issue on this, but I cannot find it now so I will just comment on it here.

TLDR: I think it is a good idea to do, but we not for the partitions and we need to make sure that options are writer agnostic and non-conflicting.

In line with my other comment, I completely agree. I did the same recently with the optimisation flags and SinkOptions and it makes implementations a lot simpler and less error-prone.

One thing I do find important is that these options apply to all sinks/writers and that to a reasonable level all options are compatible with each other. I don't want to have to have to deal in the future with the fact that it becomes very difficult to ship things because we have to figure out to get existing options to fit into new sinks/writers.

I think we have been making good and steady progress into unifying the reader and writer interfaces, minimising the amount of differences between them and minimising the work needed the work needed to add new sans-IO features. There is still a way to go, but we are getting there.

@ion-elgreco
Copy link
Contributor

@coastalwhite this was the issue for reference: #21777

I do believe from UX perspective having a single Partitioning class would be easier, I already got confused there were 3 classes now 😛

Copy link

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 71.07438% with 105 lines in your changes missing coverage. Please review.

Project coverage is 80.45%. Comparing base (3d6752b) to head (903d772).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/dsl/options.rs 65.88% 29 Missing ⚠️
...lars-stream/src/nodes/io_sinks/partition/parted.rs 0.00% 23 Missing ⚠️
py-polars/polars/io/partition.py 60.00% 21 Missing and 1 partial ⚠️
...lars-stream/src/nodes/io_sinks/partition/by_key.rs 61.90% 16 Missing ⚠️
crates/polars-python/src/lazyframe/sink.rs 68.96% 9 Missing ⚠️
crates/polars-utils/src/python_function.rs 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21888      +/-   ##
==========================================
- Coverage   80.48%   80.45%   -0.03%     
==========================================
  Files        1636     1636              
  Lines      236591   236842     +251     
  Branches     2693     2696       +3     
==========================================
+ Hits       190416   190556     +140     
- Misses      45541    45652     +111     
  Partials      634      634              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

This PR makes a breaking change to the currently unstable API for
`PartitionMaxSize`, `PartitionByKey` and `PartitionParted`.

This essentially exchanges the previous adhoc implementation with a hopefully
more permanent one. This allows for way more flexibility and possibilities
around creating files.

Fixes pola-rs#21886.
Fixes pola-rs#21868.
@coastalwhite coastalwhite force-pushed the refactor/partition-callback branch from 857da85 to 04c2df0 Compare March 28, 2025 11:56
@coastalwhite coastalwhite marked this pull request as ready for review March 28, 2025 12:23
@coastalwhite coastalwhite merged commit 5bb225b into pola-rs:main Mar 30, 2025
29 checks passed
@coastalwhite coastalwhite deleted the refactor/partition-callback branch March 30, 2025 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Change that breaks backwards compatibility internal An internal refactor or improvement python Related to Python Polars
Projects
None yet
2 participants