Is your feature request related to a problem or challenge?
Currently, RepartitionExec: partitioning=Hash is added for aggregates in FinalPartitioned and SinglePartitioned mode.
The benefit is increased parallelism, but it comes at the cost of copying the entire table (and not very efficiently).
We should consider lowering the cost of repartitioning by not having to copy the input.
Dependencies
Describe the solution you'd like
Instead of repartitioning the input in RepartitionExec, support repartitioning the inputs based on a selection vector.
Instead of taking rows out of the RecordBatch for each output partition, we can consider doing the following:
- Add a (boolean) selection vector as an output column for each output partition, i.e. true means the row is selected for the partition.
- The rest of the RecordBatch remains unchanged (i.e. no copy).
- CoalesceBatchesExec is no longer needed for the output (avoiding another copy).
- In the hash aggregate code, handle the selection vector.
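To make the idea concrete, here is a minimal sketch of the steps above using only the Rust standard library. It is not DataFusion or Arrow code: `Batch`, `selection_vectors`, and `aggregate_selected` are hypothetical names, the batch is modeled as two plain vectors, and the hash function (`DefaultHasher`) is just a stand-in for whatever RepartitionExec actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a RecordBatch: one group-key column, one value column.
pub struct Batch {
    pub keys: Vec<u64>,
    pub values: Vec<i64>,
}

/// Builds one boolean selection vector per output partition.
/// `true` at index `row` means that row belongs to that partition.
/// The batch itself is only read, never copied or reordered.
pub fn selection_vectors(batch: &Batch, num_partitions: usize) -> Vec<Vec<bool>> {
    assert!(num_partitions > 0, "need at least one output partition");
    let mut masks = vec![vec![false; batch.keys.len()]; num_partitions];
    for (row, key) in batch.keys.iter().enumerate() {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let partition = (hasher.finish() % num_partitions as u64) as usize;
        masks[partition][row] = true;
    }
    masks
}

/// Sketch of an aggregate that honors a selection vector:
/// sums `values` per key, skipping rows where `mask` is false.
pub fn aggregate_selected(batch: &Batch, mask: &[bool]) -> HashMap<u64, i64> {
    let mut sums = HashMap::new();
    for ((key, value), selected) in batch.keys.iter().zip(&batch.values).zip(mask) {
        if *selected {
            *sums.entry(*key).or_insert(0) += *value;
        }
    }
    sums
}
```

Each partition's aggregate would receive the same shared batch plus its own mask, e.g. `aggregate_selected(&batch, &masks[p])`. Because rows with equal keys hash to the same partition, the per-partition results are disjoint and can be combined without a further merge step. A real implementation would likely use a packed bitmap rather than `Vec<bool>`.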
Describe alternatives you've considered
The partitioning could be done inside the hash aggregate (at the cost of more complexity inside it).
Additional context
No response