Improved performance for streaming grouping with single string columns #9195

Open
@alamb

Description

Is your feature request related to a problem or challenge?

Follow on to #7064

The GroupValues implementations for aggregates need to handle "emitTo" for streaming groups so that they can flush groups that have already been built but will never be seen again.

The initial implementation of the specialized accumulator for Utf8/LargeUtf8 in #8827 is inefficient in that it copies and rehashes any strings remaining in the set after emission.

This is likely not a large performance overhead in practice, as most groups should be emitted, so only a few groups will need to be rehashed. However, if it turns out to be a problem, we can implement something more optimized.
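To make the cost concrete, here is a minimal sketch of the "copy / rehash the remainder" behavior described above. It is not the code from #8827: `SimpleStringSet`, `intern`, and this `emit_first_n` are hypothetical names, and the set is modeled with a plain `Vec<String>` plus a `std::collections::HashMap` rather than Arrow buffers.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a string group set.
/// Group index `i` corresponds to `values[i]`.
struct SimpleStringSet {
    values: Vec<String>,
    index: HashMap<String, usize>,
}

impl SimpleStringSet {
    fn new() -> Self {
        Self { values: Vec::new(), index: HashMap::new() }
    }

    /// Returns the group index for `s`, inserting it if it has not been seen.
    fn intern(&mut self, s: &str) -> usize {
        if let Some(&i) = self.index.get(s) {
            return i;
        }
        let i = self.values.len();
        self.values.push(s.to_string());
        self.index.insert(s.to_string(), i);
        i
    }

    /// Emits the first `n` groups, then rebuilds the set from the remainder.
    /// Every remaining string is copied and hashed again, which is the
    /// overhead this issue is about.
    fn emit_first_n(&mut self, n: usize) -> Vec<String> {
        let n = n.min(self.values.len());
        let remaining = self.values.split_off(n); // groups n.. stay
        let emitted = std::mem::take(&mut self.values); // groups 0..n leave
        self.index.clear();
        for s in remaining {
            self.intern(&s); // copy + rehash; indices restart at 0
        }
        emitted
    }
}
```

In this sketch the rebuild cost is proportional to the number (and total length) of the strings that remain, so it stays small when almost all groups are emitted.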

Describe the solution you'd like

Optimize emitTo for binary groups

#9188
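One possible shape for such an optimization, shown as a safe sketch and not necessarily what #9188 does: drop the emitted entries from the hash table in place and shift the stored group indices, so that no remaining string is copied or hashed again. `ShiftingStringSet` and its methods are hypothetical, and the model again uses `Vec<String>` and `std::collections::HashMap` instead of Arrow buffers.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified string group set (not a DataFusion type).
/// Group index `i` corresponds to `values[i]`.
struct ShiftingStringSet {
    values: Vec<String>,
    index: HashMap<String, usize>,
}

impl ShiftingStringSet {
    fn new() -> Self {
        Self { values: Vec::new(), index: HashMap::new() }
    }

    /// Returns the group index for `s`, inserting it if it has not been seen.
    fn intern(&mut self, s: &str) -> usize {
        if let Some(&i) = self.index.get(s) {
            return i;
        }
        let i = self.values.len();
        self.values.push(s.to_string());
        self.index.insert(s.to_string(), i);
        i
    }

    /// Emits the first `n` groups without rebuilding the table:
    /// emitted entries are removed and the rest are renumbered in place.
    fn emit_first_n(&mut self, n: usize) -> Vec<String> {
        let n = n.min(self.values.len());
        let emitted: Vec<String> = self.values.drain(..n).collect();
        self.index.retain(|_, i| {
            if *i < n {
                false // this group was emitted
            } else {
                *i -= n; // shift the remaining group down by n
                true
            }
        });
        emitted
    }
}
```

The trade-off in this sketch is a full pass over the table on every emission, in exchange for never touching the string bytes of the groups that remain.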

Describe alternatives you've considered

I have one proposal in #9188 (look at ArrowStringSet::emit_first_n) -- it works and passes tests, but I think it is very complicated and it is hard to convince oneself that the unsafe usage is sound.

Additional context

No response

Labels: enhancement (New feature or request)