refactor(python)!: Change Partition API to `base_path` and `file_path` #21888
How come PartitionByKey and PartitionMaxSize are actually separate things? If you do hive partitioning, in many cases you still want to split by row size, file size, etc. Wouldn't it be simpler to move these things into a kind of WriterProperties class, e.g. https://delta-io.github.io/delta-rs/api/delta_writer/#deltalake.WriterProperties
I agree, I am not against adding options for max row size on PartitionByKey / PartitionParted. The implementation just does not support that at the moment (it would definitely be possible though). I am quite against throwing it all on one pile of options and hoping we have implemented all option combinations correctly. The original (and still current) idea was to have basic options like
I seem to remember you having opened an issue on this, but I cannot find it now, so I will just comment on it here. TLDR: I think it is a good idea, but not for the partitions, and we need to make sure that options are writer-agnostic and non-conflicting. In line with my other comment, I completely agree. I did the same recently with the optimisation flags and SinkOptions, and it makes implementations a lot simpler and less error-prone. One thing I do find important is that these options apply to all sinks/writers and that, to a reasonable level, all options are compatible with each other. I don't want to have to deal in the future with it becoming very difficult to ship things because we have to figure out how to get existing options to fit into new sinks/writers. I think we have been making good and steady progress in unifying the reader and writer interfaces, minimising the differences between them and minimising the work needed to add new sans-IO features. There is still a way to go, but we are getting there.
@coastalwhite this was the issue for reference: #21777. I do believe that from a UX perspective having a single Partitioning class would be easier; I already got confused that there are 3 classes now 😛
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #21888 +/- ##
==========================================
- Coverage 80.48% 80.45% -0.03%
==========================================
Files 1636 1636
Lines 236591 236842 +251
Branches 2693 2696 +3
==========================================
+ Hits 190416 190556 +140
- Misses 45541 45652 +111
Partials 634 634
This PR makes a breaking change to the currently unstable API for `PartitionMaxSize`, `PartitionByKey` and `PartitionParted`. It replaces the previous ad hoc implementation with a hopefully more permanent one, which allows far more flexibility in how files are created. Fixes pola-rs#21886. Fixes pola-rs#21868.
857da85 to 04c2df0
This PR makes a breaking change to the currently unstable API for `PartitionMaxSize`, `PartitionByKey` and `PartitionParted`.

Tip (TLDR): You now provide a `base_path` and an optional `file_path` callback. This allows a type-checked, more flexible and pythonic API.

This essentially exchanges the previous ad hoc implementation with a more permanent one. This allows for far more flexibility and possibilities around creating files.

You now modify the path per file with a callback.
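
To illustrate the idea of a `base_path` combined with an optional per-file `file_path` callback, here is a minimal, self-contained sketch. It is not the Polars implementation and does not call the Polars API: `FileContext`, its fields and `resolve_output_path` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class FileContext:
    """Hypothetical per-file information handed to the callback."""

    file_idx: int  # index of the file being written
    default_name: str  # name used when no callback is given


def resolve_output_path(
    base_path: Union[str, Path],
    ctx: FileContext,
    file_path: Optional[Callable[[FileContext], str]] = None,
) -> Path:
    """Join base_path with the callback result, or with the default name."""
    relative = file_path(ctx) if file_path is not None else ctx.default_name
    return Path(base_path) / relative


# Without a callback, the default file name is placed under base_path.
default = resolve_output_path("out", FileContext(0, "0.parquet"))

# With a callback, the caller decides the file name for each file.
custom = resolve_output_path(
    "out",
    FileContext(3, "3.parquet"),
    file_path=lambda ctx: f"part-{ctx.file_idx:04d}.parquet",
)
```

Under these assumptions, the callback only decides the path relative to `base_path`, so the writer stays in control of where the output tree is rooted.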
Fixes #21886.
Fixes #21868.
Fixes #21848.