-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: only use stats for required cols #3210
Conversation
Signed-off-by: Ion Koutsouris <[email protected]>
97a19ba
to
c64c431
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3210 +/- ##
==========================================
- Coverage 72.18% 72.12% -0.07%
==========================================
Files 138 138
Lines 45292 45320 +28
Branches 45292 45320 +28
==========================================
- Hits 32694 32686 -8
- Misses 10535 10563 +28
- Partials 2063 2071 +8 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Ion Koutsouris <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my suggestion about bounding stats collection should be bounded, but otherwise LGTM
|
||
let inclusion_stats_cols = if let Some(stats_cols) = stats_cols { | ||
stats_cols | ||
} else if num_index_cols == -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if no columns are set collect stats for every column? This might be expensive and users might not realize this, might be better to bound it to like 32 (like databricks does) and let the num index cols go higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no columns are set, we use num_index_cols, where the default is indeed 32. But if it's -1 then it's use all column stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delta-rs/crates/core/src/table/config.rs
Lines 43 to 49 in cf5f38a
/// The number of columns for Delta Lake to collect statistics about for data skipping. | |
/// A value of -1 means to collect statistics for all columns. Updating this property does | |
/// not automatically collect statistics again; instead, it redefines the statistics schema | |
/// of the Delta table. Specifically, it changes the behavior of future statistics collection | |
/// (such as during appends and optimizations) as well as data skipping (such as ignoring column | |
/// statistics beyond this number, even when such statistics exist). | |
DataSkippingNumIndexedCols, |
Sorry @ion-elgreco I removed this from the merge queue just to get your opinion about the change I proposed above. |
Description
The description of the main changes of your pull request
Related Issue(s)