Skip to content

ListingTable statistics improperly merges statistics when files have different schemas #15689

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #6672
alamb opened this issue Apr 11, 2025 · 7 comments · Fixed by #15865
Closed
Tracked by #6672

ListingTable statistics improperly merges statistics when files have different schemas #15689

alamb opened this issue Apr 11, 2025 · 7 comments · Fixed by #15865
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Apr 11, 2025

Describe the bug

As @xudong963 mentions in

And also brought up again in

When table_schema is different from file_schema then the current statistics merging code will incorrectly merge statistics

Specifically, it merges column statistics based on their ordinal position (order in the file)

Currently this isn't a huge problem as the statistics are only used in a limited way for some optimizations, but as we start to rely on statistics for correctness, such as #6672 it is more important

To Reproduce

if we have two files

  • File 1: (a int32, b int32)
  • File 2: (b int32, a int32)

I think the code on main will combine statistics for columns a in File 1 and column b in File 2 together.

Expected behavior

I expect that only statistics from the same logical column are merged together.

Additional context

After #15661 is merged, I suggest:

  1. adding some function that knows how to map columns from a file schema --> table schema (filling in any missing columns with ColumnStatistics::new_unnown) before combining them
  2. Adding testst

Maybe we can simply reuse the existing SchemaMapper / factory 🤔 so we are sure the statistics merging is consistent with runtime

@friendlymatthew
Copy link
Contributor

take

@xudong963
Copy link
Member

@alamb Thank you for writing the issue in detail
@friendlymatthew Thank you for taking it

@xudong963
Copy link
Member

Hi @friendlymatthew , friendly ping --how's it going?

@xudong963
Copy link
Member

Fyi, I'm cooking a PR to fix the issue

@friendlymatthew
Copy link
Contributor

Hey @xudong963 -- I'm currently switching jobs so time's been a bit constrained at the moment. I've only had time to look over this last weekend, but didn't make much progress.

Feel free to take the issue, I'm eager to read your PR!

@friendlymatthew
Copy link
Contributor

oh @xudong963 if there's any follow up issues, feel free to ping me. I'm off next week and want to dive deeper into statistics

@xudong963
Copy link
Member

oh @xudong963 if there's any follow up issues, feel free to ping me. I'm off next week and want to dive deeper into statistics

Sure, I think there will be

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants