fix: only use stats for required cols #3210

Merged
merged 2 commits into from
Feb 12, 2025

ion-elgreco
Collaborator

@github-actions github-actions bot added the binding/python Issues for the Python package label Feb 11, 2025

codecov bot commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 0% with 63 lines in your changes missing coverage. Please review.

Project coverage is 72.12%. Comparing base (cf5f38a) to head (4494d7c).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
python/src/lib.rs | 0.00% | 63 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3210      +/-   ##
==========================================
- Coverage   72.18%   72.12%   -0.07%     
==========================================
  Files         138      138              
  Lines       45292    45320      +28     
  Branches    45292    45320      +28     
==========================================
- Hits        32694    32686       -8     
- Misses      10535    10563      +28     
- Partials     2063     2071       +8     

Signed-off-by: Ion Koutsouris <[email protected]>
Collaborator

@hntd187 hntd187 left a comment

I think stats collection should be bounded, as I suggest below, but otherwise LGTM.


let inclusion_stats_cols = if let Some(stats_cols) = stats_cols {
    stats_cols
} else if num_index_cols == -1 {
Collaborator

So if no columns are set, we collect stats for every column? That could be expensive, and users might not realize it is happening. It might be better to bound it to something like 32 (as Databricks does) and let the number of indexed columns be raised from there.

Collaborator Author

If no columns are set, we use num_index_cols, whose default is indeed 32. Only when it is set to -1 do we use the stats for all columns.

Collaborator Author

/// The number of columns for Delta Lake to collect statistics about for data skipping.
/// A value of -1 means to collect statistics for all columns. Updating this property does
/// not automatically collect statistics again; instead, it redefines the statistics schema
/// of the Delta table. Specifically, it changes the behavior of future statistics collection
/// (such as during appends and optimizations) as well as data skipping (such as ignoring column
/// statistics beyond this number, even when such statistics exist).
DataSkippingNumIndexedCols,
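
For readers following the thread, the selection order described in the two replies above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `resolve_stats_columns`, the constant `DEFAULT_NUM_INDEX_COLS`, and the flat list of column names are all assumptions made for the example, and the real implementation works against the table schema and configuration.

```rust
/// Hypothetical default, mirroring the value of 32 mentioned in the discussion.
const DEFAULT_NUM_INDEX_COLS: i32 = 32;

/// Decide which columns' statistics should be used (illustrative only).
///
/// * `schema_cols`    - column names in schema order (assumed flat here)
/// * `stats_cols`     - explicit list of stats columns, if one was configured
/// * `num_index_cols` - number of indexed columns, if configured; -1 means all
fn resolve_stats_columns(
    schema_cols: &[&str],
    stats_cols: Option<Vec<String>>,
    num_index_cols: Option<i32>,
) -> Vec<String> {
    // 1. An explicit column list takes precedence over everything else.
    if let Some(cols) = stats_cols {
        return cols;
    }

    // 2. Otherwise fall back to the column count, defaulting to 32.
    let n = num_index_cols.unwrap_or(DEFAULT_NUM_INDEX_COLS);

    if n == -1 {
        // 3a. -1 opts in to statistics for every column.
        schema_cols.iter().map(|c| c.to_string()).collect()
    } else {
        // 3b. Use only the first `n` columns. Treating other negative
        //     values as 0 is an assumption of this sketch.
        let n = n.max(0) as usize;
        schema_cols.iter().take(n).map(|c| c.to_string()).collect()
    }
}
```

With a three-column schema and nothing configured, this sketch returns all three columns because the default bound of 32 exceeds the schema width; statistics for every column of a wide table are only used when -1 is set explicitly.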

@ion-elgreco ion-elgreco added this pull request to the merge queue Feb 12, 2025
@hntd187 hntd187 removed this pull request from the merge queue due to a manual request Feb 12, 2025
Collaborator

hntd187 commented Feb 12, 2025

Sorry @ion-elgreco, I removed this from the merge queue just to get your opinion on the change I proposed above.

@hntd187 hntd187 added this pull request to the merge queue Feb 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@ion-elgreco ion-elgreco added this pull request to the merge queue Feb 12, 2025
Merged via the queue into delta-io:main with commit b3efdfc Feb 12, 2025
26 checks passed
@ion-elgreco ion-elgreco deleted the fix/stats-pushdown branch February 12, 2025 18:47
Labels
binding/python Issues for the Python package
2 participants