Consolidate GroupByHash implementations row_hash.rs and hash.rs (remove duplication) #2723

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of the transition to the faster "row format" (#1861), @yjshen implemented a row-based hash aggregate in #2375 ❤️

However, the row-based implementation currently supports only a subset of the data types that DataFusion supports. This made the code significantly faster for some cases, but it has some downsides:

  1. Data types that the row format does not yet support get none of its performance benefit
  2. There are two parallel implementations of hash aggregate that are similar but not identical: row_hash.rs and hash.rs

You can already see the potential challenge in PRs like #2716, where test coverage may accidentally miss one of the hash aggregate implementations.
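To make the duplication concrete, here is a minimal sketch of the kind of type-based dispatch that having two implementations implies. All names here (ColumnType, HashAggImpl, row_format_supports, choose_impl) are hypothetical and are not DataFusion's actual API. The set of types marked as supported is also an assumption chosen for illustration.

```rust
/// Hypothetical stand-in for a column data type (not DataFusion's `DataType`).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int64,
    Float64,
    Boolean,
    Utf8,
    Decimal,
}

/// Which of the two hash aggregate code paths would run (hypothetical).
#[derive(Debug, PartialEq)]
enum HashAggImpl {
    /// Row-format path, as in row_hash.rs.
    Row,
    /// Original path, as in hash.rs.
    Column,
}

/// Assumption for illustration only: the row format handles just these types.
fn row_format_supports(t: ColumnType) -> bool {
    matches!(
        t,
        ColumnType::Int64 | ColumnType::Float64 | ColumnType::Boolean
    )
}

/// Use the row path only when every type involved is supported;
/// otherwise fall back to the original implementation.
fn choose_impl(types: &[ColumnType]) -> HashAggImpl {
    if types.iter().all(|t| row_format_supports(*t)) {
        HashAggImpl::Row
    } else {
        HashAggImpl::Column
    }
}
```

With a split like this, a test that only aggregates over Int64 columns would exercise just the row path, while one that includes a Utf8 column would exercise just the fallback, so a change can easily be covered on one side only.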

Describe the solution you'd like
I would like to consolidate the hash aggregate implementations. Success means deleting hash.rs after adding support for the remaining types to row_hash.rs.

I think this would be a nice project for someone new to DataFusion: the pattern is already defined, the outcome will be better performance, and they will get good experience with the code.

It will also broaden the type support of the row format and make it easier to roll out through the rest of the codebase.

Describe alternatives you've considered
N/A

Additional context
More context about the ongoing row format conversion is in #1861.
