Discussion: adopt datafusion-python crate #3156
Replies: 8 comments 1 reply
-
@timsaucer thoughts on this?
-
My general recommendation is to keep the datafusion Rust repo as your primary Rust dependency and, if needed, to use datafusion-python as a Python dependency. But I suspect I don't fully understand the use case here. Can you give an example of what you're currently trying to do that is hampered by not using datafusion-python?
-
Some folks want to be able to configure the datafusion SessionContext or register UDFs. Neither is possible at the moment. I also don't want to build this integration from the ground up, so depending on the datafusion-python crate would solve that.
-
My gut feeling is that our best bet is to expose record batch iterators, ideally for both log and data. Most modern engines and dataframe libraries should be able to consume these. The main challenge is predicate pushdown, since the query would ideally only be issued at the "frontend" library. For some engines native support is also on the way, and hopefully that happens more :). When it comes to UDFs, I guess these only need to be available in the engine that does the downstream processing? But maybe we can expose some sort of Python table provider (factory) that integrates with datafusion, hoping that this can be done without coupling the datafusion version with datafusion-python. I'd be curious to learn what the most requested config on the session is. Do we have some insights on this?
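To make the predicate pushdown concern concrete, here is a minimal pure-Python sketch of a batch iterator that accepts an optional predicate from the consuming library and applies it at the source. Everything in it is an assumption for illustration: `scan_batches`, the `Batch` alias, and the row-wise predicate signature are hypothetical and not part of delta-rs, datafusion, or Arrow. A real implementation would presumably yield Arrow record batches rather than plain dicts.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical stand-in for an Arrow record batch: column name -> column values.
Batch = Dict[str, List[int]]


def scan_batches(
    files: List[Batch],
    predicate: Optional[Callable[[Batch, int], bool]] = None,
) -> Iterator[Batch]:
    """Yield one batch per file, keeping only rows that pass the predicate.

    `predicate(batch, row_index)` is a made-up signature used here to show
    where a filter pushed down from the "frontend" library would be applied.
    """
    for batch in files:
        # Number of rows is the length of any column (0 for an empty batch).
        num_rows = len(next(iter(batch.values()), []))
        if predicate is None:
            keep = list(range(num_rows))
        else:
            keep = [i for i in range(num_rows) if predicate(batch, i)]
        # Skip batches in which no row survives the filter.
        if keep:
            yield {name: [col[i] for i in keep] for name, col in batch.items()}


# Two in-memory "files" standing in for data files of a table.
files = [
    {"id": [1, 2, 3], "value": [10, 20, 30]},
    {"id": [4, 5], "value": [40, 50]},
]

# The consumer pushes `value >= 30` down into the scan.
filtered = list(scan_batches(files, predicate=lambda b, i: b["value"][i] >= 30))
unfiltered = list(scan_batches(files))
```

The open question from the comment remains in this sketch: the consumer has to know the predicate before it starts iterating, which is exactly what is hard when the query is only formulated inside the downstream engine.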
-
Yeah, I wasn't looking at the reader side, but rather at operations going from Python to Rust. But I agree, datafusion-python (the Python library, not the crate) already has native support. Polars also has somewhat native support; it's not optimal, but it's already much faster than the PyArrow dataset.
Yeah, UDFs in this case are only to be used during our Delta operations that use DataFusion. Then each operation in Python can set the session context as well.
-
If some users want to modify the session context or register functions, why don't those users get at this via the datafusion Python package? In general, pulling in
-
@timsaucer They were asking for the functions to be executed within the delta-rs operations (MERGE, write, etc.).
-
I have been following this discussion and thinking about it a fair bit. @timsaucer I am curious to hear your perspective on what it would look like if we actually broke up the Python code from a monolithic I wonder if we made The idea is still a little fuzzy on how it might work in my head, and I have inflicted the most overhead on myself with the meta-/sub-crates architecture for 🦀 but I think our Python code is getting to the point where being more decomposable might be really helpful.
-
Description
Use Case
We can adopt the datafusion-python crate for deltalake-python. A couple of benefits and drawbacks would be:
Pros:
Cons:
Related Issue(s)