-
Notifications
You must be signed in to change notification settings - Fork 1.5k
ListingTable statistics improperly merges statistics when files have different schemas #15689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
take |
@alamb Thank you for writing the issue in detail |
Hi @friendlymatthew , friendly ping --how's it going? |
Fyi, I'm cooking a PR to fix the issue |
Hey @xudong963 -- I'm currently switching jobs so time's been a bit constrained at the moment. I've only had time to look over this last weekend, but didn't make much progress. Feel free to take the issue, I'm eager to read your PR! |
oh @xudong963 if there's any follow up issues, feel free to ping me. I'm off next week and want to dive deeper into statistics |
Sure, I think there will be |
Uh oh!
There was an error while loading. Please reload this page.
Describe the bug
As @xudong963 mentions in
And also brought up again in
When table_schema is different from file_schema then the current statistics merging code will incorrectly merge statistics
Specifically, it merges column statistics based on their ordinal position (order in the file)
Currently this isn't a huge problem as the statistics are only used in a limited way for some optimizations, but as we start to rely on statistics for correctness, such as #6672 it is more important
To Reproduce
if we have two files
(a int32, b int32)
(b int32, a int32)
I think the code on main will combine statistics for columns a in File 1 and column
b
in File 2 together.Expected behavior
I expect that only statistics from the same logical column are merged together.
Additional context
After #15661 is merged, I suggest:
ColumnStatistics::new_unnown
) before combining themMaybe we can simply reuse the existing
SchemaMapper
/ factory 🤔 so we are sure the statistics merging is consistent with runtimeThe text was updated successfully, but these errors were encountered: