Description
Is your feature request related to a problem or challenge?
Follow on to #7064
The GroupsValues for aggregates need to handle "emitTo" for streaming groups so that they can flush groups that have already been built but will never be seen again.
The initial implementation of the specialized accumulator for Utf8/LargeUtf8 in #8827 is inefficient in that it copies and rehashes any strings remaining in the set after emission.
This is likely not a large performance overhead in practice, as most groups should be emitted, so only a few groups will need to be rehashed. However, if it turns out to be a problem, we can implement something more optimized.
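To make the cost concrete, here is a minimal sketch of the copy-and-rehash pattern described above. It uses only the standard library, and `SimpleStringGroups`, `intern`, and this `emit_first_n` are hypothetical names for illustration; they are not the actual code from #8827 or #9188.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a string group store (not DataFusion's type).
#[derive(Default)]
struct SimpleStringGroups {
    /// Group values in first-seen order; the position is the group id.
    values: Vec<String>,
    /// Lookup from value to group id.
    index: HashMap<String, usize>,
}

impl SimpleStringGroups {
    /// Returns the group id for `value`, inserting it if it is new.
    fn intern(&mut self, value: &str) -> usize {
        if let Some(&id) = self.index.get(value) {
            return id;
        }
        let id = self.values.len();
        self.values.push(value.to_string());
        self.index.insert(value.to_string(), id);
        id
    }

    /// Emits the first `n` groups and keeps the rest, renumbered from 0.
    fn emit_first_n(&mut self, n: usize) -> Vec<String> {
        let n = n.min(self.values.len());
        // `self.values` keeps the first `n` entries; the tail is split off.
        let remaining = self.values.split_off(n);
        let emitted = std::mem::take(&mut self.values);
        // The costly part: every surviving string is copied and rehashed.
        self.index.clear();
        for value in &remaining {
            self.intern(value);
        }
        emitted
    }
}
```

The work done after emission is proportional to the number of surviving groups, which is why the overhead stays small when most groups are emitted. An optimized version would need to shift the remaining entries in place without re-inserting them.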
Describe the solution you'd like
Optimize emitTo for binary groups
Describe alternatives you've considered
I have one proposal in #9188 (look at ArrowStringSet::emit_first_n). It works and passes tests, but I think it is very complicated and it is hard to convince oneself that the unsafe usage is sound.
Additional context
No response