DataFusion Ray can be extended to leverage datasource integrations that are not built into DataFusion itself, but could be brought in as features, such as:
However, adding a custom table provider to a distributed DataFusion engine requires two things:
- registering the corresponding
TableProviderFactory with the DataFusion SessionContext
- for integrations that define custom
ExecutionPlan nodes, registering the corresponding PhysicalExtensionCodecs
The current code does not allow for such extensions, because:
- the datafusion
SessionContext (which is wrapped in a PySessionContext) is created outside the datafusion-ray library and can only be used via its python interface (i.e. invoking named methods on PyAny flavours of session context and execution plan)
- the only extension codec used for plan serialization is the
ShuffleCodec, which only handles the shuffle read/write nodes of datafusion-ray itself
The solution that we came up with for addressing the above limitations in our fork involves the following changes:
- Add (back) the rust dependency on
datafusion-python and provide a python function that creates a PySessionContext from within datafusion_ray itself (which can be customized with the enabled table factories and whatnot and can also be downcast so we can use the rust reference directly).
The "external" python datafusion.SessionContext() will continue to be supported, it will just have no support for any "extensions" that datafusion-ray was compiled with (since we don't want to attempt unsafe downcasts).
- Add an
Extension trait (not necessarily limited to table providers) that gets implemented by each such "extension" in order to:
a. customize the SessionContext before it gets returned to python (e.g. registering table provider factories, catalog providers etc.)
b. provide a list of physical extension codecs required for serializing its custom physical plan nodes, if any
- create a composite
Extensions singleton that prepares session contexts created by the new datafusion_ray.extended_session_context() python function and also maintains a composite PhysicalExtensionCodec to serialize both the built-in shuffle nodes as well as any additional codecs provided by the enabled extensions.
I'd be glad to open a PR for contributing this, unless the feature request is out of scope or there are plans to address it differently.
DataFusion Ray can be extended to leverage datasource integrations that are not built into DataFusion itself, but could be brought in as features, such as:
However, adding a custom table provider to a distributed DataFusion engine requires two things:
TableProviderFactorywith the DataFusionSessionContextExecutionPlannodes, registering the correspondingPhysicalExtensionCodecsThe current code does not allow for such extensions, because:
SessionContext(which is wrapped in aPySessionContext) is created outside the datafusion-ray library and can only be used via its python interface (i.e. invoking named methods onPyAnyflavours of session context and execution plan)ShuffleCodec, which only handles the shuffle read/write nodes of datafusion-ray itselfThe solution that we came up with for addressing the above limitations in our fork involves the following changes:
datafusion-pythonand provide a python function that creates aPySessionContextfrom withindatafusion_rayitself (which can be customized with the enabled table factories and whatnot and can also be downcast so we can use the rust reference directly).The "external" python
datafusion.SessionContext()will continue to be supported, it will just have no support for any "extensions" that datafusion-ray was compiled with (since we don't want to attempt unsafe downcasts).Extensiontrait (not necessarily limited to table providers) that gets implemented by each such "extension" in order to:a. customize the
SessionContextbefore it gets returned to python (e.g. registering table provider factories, catalog providers etc.)b. provide a list of physical extension codecs required for serializing its custom physical plan nodes, if any
Extensionssingleton that prepares session contexts created by the newdatafusion_ray.extended_session_context()python function and also maintains a compositePhysicalExtensionCodecto serialize both the built-in shuffle nodes as well as any additional codecs provided by the enabled extensions.I'd be glad to open a PR for contributing this, unless the feature request is out of scope or there are plans to address it differently.