@tustvold tustvold commented Dec 22, 2022

Which issue does this PR close?

Closes #3886
Closes #3909
Relates to #4349
Relates to #4617

Rationale for this change

Having shared mutable state makes reasoning about mutation difficult (#4617), and the locking is verbose and potentially error-prone (#3886).
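
For readers unfamiliar with the pattern, the contrast can be sketched roughly as follows. This is only an illustration under stated assumptions: `ConfigOptions` here is a hypothetical one-field stand-in rather than the real DataFusion type, and `SharedConfig` / `OwnedConfig` are made-up names that do not appear in the PR.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for `ConfigOptions`; the real type holds many more settings.
#[derive(Debug, Clone, PartialEq)]
pub struct ConfigOptions {
    pub batch_size: usize,
}

/// Before (illustrative): options sit behind a shared lock, so every read
/// takes the lock and any holder of the `Arc` can change them at any time.
pub struct SharedConfig {
    pub config_options: Arc<RwLock<ConfigOptions>>,
}

impl SharedConfig {
    pub fn batch_size(&self) -> usize {
        // Panics if the lock is poisoned; each call site has to repeat this dance.
        self.config_options.read().unwrap().batch_size
    }
}

/// After (illustrative): options are plain owned data, so changing them
/// requires ownership or `&mut`, and clones are independent copies.
#[derive(Debug, Clone)]
pub struct OwnedConfig {
    config_options: ConfigOptions,
}

impl OwnedConfig {
    pub fn new(config_options: ConfigOptions) -> Self {
        Self { config_options }
    }

    /// Read access is an ordinary borrow, no locking involved.
    pub fn config_options(&self) -> &ConfigOptions {
        &self.config_options
    }

    /// Builder-style setter that consumes and returns the config.
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.config_options.batch_size = batch_size;
        self
    }
}
```

With the shared form, a write through any clone of the `Arc` is visible to every other holder; with the owned form, `owned.clone().with_batch_size(1)` leaves `owned` untouched.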

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 22, 2022
})
})?;

let config_options = ctx.session_config().config_options();
Contributor Author

We need to fetch this at execution time so that datafusion-proto can still deserialize ParquetExec without a SessionState. Longer term, as we strip out the overrides, this will make more sense anyway, so 🤷

Contributor

I think it is reasonable to look at the session configuration while executing 🤷

It certainly seems better than the current state of master where the config options (attached to session state) are read via interior mutability
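
A rough sketch of the idea discussed in this thread, for illustration only: the plan node stores no session options and looks them up from the task context when it runs, so it can be constructed (for example by a deserializer) without any session. All types below are simplified, hypothetical stand-ins that only borrow names from the diff; the `pushdown_filters` field and the `String` return value are assumptions made for the example.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the session-level options.
#[derive(Debug, Clone, Default)]
pub struct ConfigOptions {
    pub pushdown_filters: bool,
}

/// Simplified session configuration that owns its options.
#[derive(Debug, Clone, Default)]
pub struct SessionConfig {
    config_options: ConfigOptions,
}

impl SessionConfig {
    pub fn config_options(&self) -> &ConfigOptions {
        &self.config_options
    }

    pub fn with_pushdown_filters(mut self, enabled: bool) -> Self {
        self.config_options.pushdown_filters = enabled;
        self
    }
}

/// Simplified per-execution context handed to every operator.
pub struct TaskContext {
    session_config: SessionConfig,
}

impl TaskContext {
    pub fn new(session_config: SessionConfig) -> Self {
        Self { session_config }
    }

    pub fn session_config(&self) -> &SessionConfig {
        &self.session_config
    }
}

/// The plan node carries only what describes the scan itself.
#[derive(Debug, Default)]
pub struct ParquetExec {
    pub path: String,
}

impl ParquetExec {
    /// Options are resolved here, at execution time, not at construction time.
    pub fn execute(&self, ctx: Arc<TaskContext>) -> String {
        let config_options = ctx.session_config().config_options();
        format!(
            "scan {} (pushdown_filters={})",
            self.path, config_options.pushdown_filters
        )
    }
}
```

The same `ParquetExec` value can then be executed under different sessions and picks up whichever options the supplied context carries.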

message CsvFormat {
bool has_header = 1;
string delimiter = 2;
}
Contributor Author

I recommend viewing this with whitespace disabled

self
))
})? {
&FileFormatType::Parquet(protobuf::ParquetFormat {
Contributor Author

The plumbing for this override was actually incorrect: it would convert false -> None. The other overrides aren't present, and we plan to remove this override mechanism as part of #4349, so I just opted to remove it.
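
To make the "false -> None" problem concrete, here is one hypothetical way such a bug can arise. This is not the removed code; it only assumes that an `Option<bool>` override was carried through a plain boolean field, which cannot tell "unset" apart from "explicitly false". `WireFormat`, `encode` and `decode` are names invented for the example.

```rust
/// Hypothetical wire representation: a plain `bool` with no notion of "unset".
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WireFormat {
    pub enable_pruning: bool,
}

/// Encoding flattens the optional override into a plain boolean.
pub fn encode(override_value: Option<bool>) -> WireFormat {
    WireFormat {
        enable_pruning: override_value.unwrap_or(false),
    }
}

/// Decoding treats `false` as "no override", so an explicit
/// `Some(false)` does not survive the round trip.
pub fn decode(wire: WireFormat) -> Option<bool> {
    // `then_some` returns `None` whenever the flag is `false`.
    wire.enable_pruning.then_some(true)
}
```

In this sketch `decode(encode(Some(false)))` yields `None`, so a user who explicitly disabled the option would silently fall back to the session default after a serialization round trip.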

Contributor

I agree serializing the same config options multiple times (once in the main session context and then once again as part of the file format) is undesirable for many reasons

@tustvold tustvold force-pushed the no-shared-config-options branch 2 times, most recently from b650b86 to 3327d11 Compare December 22, 2022 12:31
@tustvold tustvold force-pushed the no-shared-config-options branch from 3327d11 to 00a9b28 Compare December 22, 2022 12:43
@tustvold tustvold added the api change Changes the API exposed to users of the crate label Dec 22, 2022
impl ParquetScanOptions {
/// Returns a [`SessionConfig`] with the given options
pub fn config(&self) -> SessionConfig {
SessionConfig::new()
Contributor Author

I debated simply removing ParquetScanOptions in favour of SessionConfig but figured this PR was large enough as it was

Contributor

Thank you. I agree this PR is already large. I also think the ParquetScanOptions predated the config options.

I think removing the ParquetScanOptions as a follow-on PR is a good idea 👍

@alamb alamb left a comment

👨‍🍳 👌

This looks really good @tustvold -- thank you for helping sort out the configuration situation

Pin<Box<dyn Stream<Item = Result<ActionType, Status>> + Send + Sync + 'static>>;
type DoExchangeStream =
Pin<Box<dyn Stream<Item = Result<FlightData, Status>> + Send + Sync + 'static>>;
type HandshakeStream = BoxStream<'static, Result<HandshakeResponse, Status>>;
Contributor

I forgot this was here -- I need to give this example some love after my work to make arrow-flight easier to use


/// Return true if pruning is enabled
- pub fn enable_pruning(&self) -> bool {
+ pub fn enable_pruning(&self, config_options: &ConfigOptions) -> bool {
Contributor

👍
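
As a sketch of what the new `enable_pruning` signature implies: the caller passes the session options in, and a per-format override, if any, takes precedence. The struct, the `parquet_enable_pruning` field name and the fallback logic below are assumptions for illustration, not the code from this PR.

```rust
/// Hypothetical stand-in for the session-level options.
#[derive(Debug, Clone)]
pub struct ConfigOptions {
    pub parquet_enable_pruning: bool,
}

/// Hypothetical format object with an optional per-format override.
#[derive(Debug, Default)]
pub struct ParquetFormat {
    enable_pruning: Option<bool>,
}

impl ParquetFormat {
    /// Set (or clear) the per-format override.
    pub fn with_enable_pruning(mut self, enable: Option<bool>) -> Self {
        self.enable_pruning = enable;
        self
    }

    /// Return true if pruning is enabled: the override wins when present,
    /// otherwise the value comes from the options supplied by the caller.
    pub fn enable_pruning(&self, config_options: &ConfigOptions) -> bool {
        self.enable_pruning
            .unwrap_or(config_options.parquet_enable_pruning)
    }
}
```

Passing the options as an argument means the format no longer needs to hold a shared, lockable handle to the session state.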

default_schema: String,
/// Configuration options
- pub config_options: Arc<RwLock<ConfigOptions>>,
+ config_options: ConfigOptions,
Contributor

❤️

CurrentDate=70;
CurrentTime=71;
Uuid=72;
Abs = 0;
Contributor

whitespace!


message FileScanExecConf {
// Was repeated ConfigOption options = 10;
reserved 10;
Contributor

👍

@tustvold tustvold merged commit 07f4980 into apache:master Dec 23, 2022
ursabot commented Dec 23, 2022

Benchmark runs are scheduled for baseline = afb1ae2 and contender = 07f4980. 07f4980 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Development

Successfully merging this pull request may close these issues.

Make ConfigOptions easier to work with

3 participants