Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of the transition to the faster "row format" (#1861), @yjshen implemented a row-based hash aggregate in #2375 ❤️
However, that implementation currently supports only a subset of the data types that DataFusion supports. This made the code significantly faster for some cases, but it has some downsides:
- Not all data types benefit from the row format performance
- There are two parallel, similar but not identical implementations of hash aggregate: row_hash.rs and hash.rs
You can already see the potential challenge in PRs like #2716, where test coverage may accidentally miss one of the hash aggregate implementations.
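To make the coverage risk concrete, here is a minimal sketch of how a type-based choice between two implementations can leave one path untested. Everything in it is hypothetical: `DataType`, `AggregatePath`, `row_supported`, and `choose_path` are simplified stand-ins rather than DataFusion's actual types or functions, and the set of "supported" types is arbitrary and chosen only for illustration.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

// Which of the two parallel implementations would handle a query.
#[derive(Debug, PartialEq)]
enum AggregatePath {
    RowHash, // stands in for row_hash.rs
    Hash,    // stands in for hash.rs
}

// Assumption for illustration only: pretend the row format
// handles just these two types.
fn row_supported(dt: DataType) -> bool {
    matches!(dt, DataType::Int64 | DataType::Float64)
}

// The row-based path is used only when every column type is supported;
// a single unsupported column sends the whole query down the other path.
fn choose_path(schema: &[DataType]) -> AggregatePath {
    if schema.iter().all(|dt| row_supported(*dt)) {
        AggregatePath::RowHash
    } else {
        AggregatePath::Hash
    }
}
```

Under these assumptions, a test that only uses `Int64` and `Float64` columns exercises just the `RowHash` path, while adding one `Utf8` column silently switches it to `Hash`. A test suite can therefore cover one implementation and miss the other without anyone noticing.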
Describe the solution you'd like
I would like to consolidate the hash aggregate implementations. Success means deleting hash.rs by adding support for the remaining data types to row_hash.rs
I think this would be a nice project for someone new to DataFusion: the pattern is already defined, the outcome will be better performance, and they will get good experience with the code.
It will also increase the type support for the row format and make it easier to roll out through the rest of the codebase.
Describe alternatives you've considered
N/A
Additional context
More context about the ongoing row format conversion is in #1861