Related
apache/datafusion-ballista#479
#3885
TLDR Recommendations
This is a complicated issue and I don't have a magic answer. However, I have some concrete suggested steps:
- Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization #4382
- We should consolidate ExecutionProps and TaskProperties also.
- Allow configuring parquet filter pushdown dynamically #3821
- Consolidate SessionConfig and ConfigOptions #3887
- Make `ConfigOptions` easier to work with #3886
- Make ConfigOption names into an Enum #4517
I think consolidating SessionConfig and ConfigOptions is likely to be the most controversial step and to cause the most churn, but it will provide immense benefits (like runtime visibility into the current settings).
Then we can further improve from there.
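To illustrate the enum suggestion in the list above, here is a minimal sketch of option names modeled as an enum that round-trips through its string key. The `OptionKey` type is hypothetical and the two key strings are used only for illustration; none of this mirrors DataFusion's actual definitions.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical enum of option names; not DataFusion's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum OptionKey {
    BatchSize,
    CoalesceBatches,
}

impl OptionKey {
    /// String form of the key, as a user would type it or as it would be serialized.
    pub fn as_str(&self) -> &'static str {
        match self {
            OptionKey::BatchSize => "datafusion.execution.batch_size",
            OptionKey::CoalesceBatches => "datafusion.execution.coalesce_batches",
        }
    }
}

impl fmt::Display for OptionKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for OptionKey {
    type Err = String;

    /// Reject unknown names up front instead of silently storing them.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "datafusion.execution.batch_size" => Ok(OptionKey::BatchSize),
            "datafusion.execution.coalesce_batches" => Ok(OptionKey::CoalesceBatches),
            other => Err(format!("unknown config option: {other}")),
        }
    }
}
```

With something like this, a typo in an option name becomes a parse error rather than a silently ignored setting, and the compiler checks every programmatic use of a key.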
Introduction
"Configuration" in DataFusion has a few use cases:
- set a value from a string provided by the user (`set XX = YY` in datafusion-cli)
- set a value from environment variables -- See `ConfigOptions::from_env`
- display as a string (e.g. `SHOW` in datafusion-cli)
- Documented in the datafusion docs: https://arrow.apache.org/datafusion/user-guide/configs.html
- serialize / deserialize over the network (e.g. Ballista)
- settable programmatically (e.g. via the dataframe API)
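The use cases above could, in principle, be served by a single string-keyed store. The sketch below is only an illustration under that assumption: the `Options` type and its methods are hypothetical and are not DataFusion's `ConfigOptions` API, and the environment variable naming (upper-case the key and replace `.` with `_`) is an assumed convention.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical string-keyed option store; a sketch, not DataFusion's `ConfigOptions`.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Options {
    values: BTreeMap<String, String>,
}

impl Options {
    /// Set a value programmatically.
    pub fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    /// Look up the current value of a key, if any.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.values.get(key).map(String::as_str)
    }

    /// Parse a user-supplied `key = value` string.
    pub fn set_from_str(&mut self, assignment: &str) -> Result<(), String> {
        let (key, value) = assignment
            .split_once('=')
            .ok_or_else(|| format!("expected `key = value`, got `{assignment}`"))?;
        let key = key.trim();
        if key.is_empty() {
            return Err("empty key".to_string());
        }
        self.set(key, value.trim());
        Ok(())
    }

    /// Apply environment-style pairs (e.g. from `std::env::vars()`) for the given keys.
    /// Assumed convention: upper-case the key and replace `.` with `_`.
    pub fn apply_env<I>(&mut self, known_keys: &[&str], vars: I)
    where
        I: IntoIterator<Item = (String, String)>,
    {
        let vars: BTreeMap<String, String> = vars.into_iter().collect();
        for key in known_keys {
            let env_name = key.to_uppercase().replace('.', "_");
            if let Some(value) = vars.get(&env_name) {
                self.set(key, value);
            }
        }
    }
}

/// One `key = value` line per option, in key order.
impl fmt::Display for Options {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (key, value) in &self.values {
            writeln!(f, "{key} = {value}")?;
        }
        Ok(())
    }
}
```

Because the `Display` output uses the same `key = value` shape that `set_from_str` accepts, the same text form could also serve as a simple wire format, one line per option.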
There are also two overlapping "levels" of configuration that are needed:
- Session level (i.e. settings that can be reused from one query execution to the next)
- Statement/Task level (i.e. settings that are needed to plan a query and don't change for the duration of a statement, such as "the value of now()", target batch sizes, etc). Statement level configuration is typically a superset of the session level configuration.
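One way to picture the "superset" relationship between the two levels is a statement level struct that carries a snapshot of the session level settings plus the values fixed for that statement. This is a sketch only; `SessionSettings`, `StatementSettings`, and `for_statement` are hypothetical names, not existing DataFusion types.

```rust
use std::time::SystemTime;

/// Hypothetical session level settings, reused from one query to the next.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionSettings {
    pub batch_size: usize,
}

/// Hypothetical statement level settings: a snapshot of the session
/// settings plus values that stay fixed for one statement.
#[derive(Debug, Clone, PartialEq)]
pub struct StatementSettings {
    pub session: SessionSettings,
    /// The value `now()` would report for this statement.
    pub query_start: SystemTime,
}

impl SessionSettings {
    /// Snapshot the session settings for a single statement.
    pub fn for_statement(&self, query_start: SystemTime) -> StatementSettings {
        StatementSettings {
            session: self.clone(),
            query_start,
        }
    }
}
```

Taking a copy means later changes to the session do not affect a statement that is already planned or running.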
Current state of configuration in DataFusion
The current state is .... inconsistent to put it mildly.
The core structure is SessionContext which is the final glue and entry point to interacting with datafusion (e.g. tables provided, etc).
Within the SessionContext there is some combination of SessionState, SessionConfig, and ConfigOptions. Part of the hierarchy looks like this:
SessionContext
-- Has a SessionState
-- SessionStartTime
-- SessionConfig
-- ConfigOptions
SessionConfig is effectively the Session level configuration I describe above.
TaskContext is the statement level (aka per task / per query) context. If you look hard you can see it has a copy of the SessionConfig (buried in TaskProperties), or it may instead be backed by KVPairs.
Desire
I would like to have a clear configuration system that cleanly separates the statement level config from the task level config, allows configuration values to be set in a uniform manner, and makes them easy to view programmatically.