-
Notifications
You must be signed in to change notification settings - Fork 389
fix checking physical type for Decimal type in StatsAggregator #2515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
## Changes Made Added changes to conditionally handle breaking changes in pyiceberg 0.10.0 for how to initialize `DataFile` and `Record` and constraints on `name` for `Schema` fields and `PartitionField`. I did not update dev dependencies at this point, because there are actually still issues with Decimal. I have a [PR](apache/iceberg-python#2515) open that hopefully addresses this. To test, I had to temporarily change dev dependencies to 0.10.0, and run tests without decimal type. I need to update to 0.10.0 because support for anonymous was added (needed on my side). Note sure what the best path is here considering that pyiceberg releases on a much slower cadence. ## Related Issues Closes #5223 ## Checklist - [ ] Documented in API Docs (if applicable) - [ ] Documented in User Guide (if applicable) - [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation - [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
|
Hey @gmweaver thanks for adding this. Could you also add a test for this? This ensures that we don't break it in the future. |
|
@Fokko test added and ensured I ran |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iceberg-python/pyiceberg/io/pyarrow.py
Lines 2507 to 2518 in e5e7453
| if isinstance(stats_col.iceberg_type, DecimalType) and statistics.physical_type != "FIXED_LEN_BYTE_ARRAY": | |
| scale = stats_col.iceberg_type.scale | |
| col_aggs[field_id].update_min( | |
| unscaled_to_decimal(statistics.min_raw, scale) | |
| ) if statistics.min_raw is not None else None | |
| col_aggs[field_id].update_max( | |
| unscaled_to_decimal(statistics.max_raw, scale) | |
| ) if statistics.max_raw is not None else None | |
| else: | |
| col_aggs[field_id].update_min(statistics.min) | |
| col_aggs[field_id].update_max(statistics.max) | |
can we simplify this logic too? can be a follow up PR
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks for fixing this issue.
Rationale for this change
Fix for #2057. It looks like the initial fix #1839 might have missed updating here to handle. I could use feedback on if this is the best fix, it is at least simple.
Are these changes tested?
Are there any user-facing changes?
No