Commit 345117b

Support vectorized append and compare for multi group by (#12996)
* simple support vectorized append.
* fix tests.
* some logs.
* add `append_n` in `MaybeNullBufferBuilder`.
* impl basic append_batch
* fix equal to.
* define `GroupIndexContext`.
* define the structs useful in vectorizing.
* re-define some structs for vectorized operations.
* impl some vectorized logics.
* impl checking hashmap stage.
* fix compile.
* tmp
* define and impl `vectorized_compare`.
* fix compile.
* impl `vectorized_equal_to`.
* impl `vectorized_append`.
* finish the basic vectorized ops logic.
* impl `take_n`.
* fix `renaming clear` and `groups fill`.
* fix death loop due to rehashing.
* fix vectorized append.
* add counter.
* use extend rather than resize.
* remove dbg!.
* remove reserve.
* refactor the codes to make simpler and more performant.
* clear `scalarized_indices` in `intern` to avoid some corner case.
* fix `scalarized_equal_to`.
* fallback to total scalarized `GroupValuesColumn` in streaming aggregation.
* add unit test for `VectorizedGroupValuesColumn`.
* add unit test for emitting first n in `VectorizedGroupValuesColumn`.
* sort out test code for group columns and add vectorized tests for primitives.
* add vectorized test for byte builder.
* add vectorized test for byte view builder.
* add test for the all nulls or not nulls branches in vectorized.
* fix clippy.
* fix fmt.
* fix compile in rust 1.79.
* improve comments.
* fix doc.
* add more comments to explain the really complex vectorized intern process.
* add comments to explain why we still need origin `GroupValuesColumn`.
* remove some stale comments.
* fix clippy.
* add comments for `vectorized_equal_to` and `vectorized_append`.
* fix clippy.
* use zip to simplify codes.
* use izip to simplify codes.
* Update datafusion/physical-plan/src/aggregates/group_values/group_column.rs
  Co-authored-by: Jay Zhan <[email protected]>
* first_n attempt
  Signed-off-by: jayzhan211 <[email protected]>
* add test
  Signed-off-by: jayzhan211 <[email protected]>
* improve hashtable modifying in emit first n test.
* add `emit_group_index_list_buffer` to avoid allocating new `Vec` to store the remaining group indices.
* make comments in VectorizedGroupValuesColumn::intern simpler and clearer.
* define `VectorizedOperationBuffers` to hold buffers used in vectorized operations to make code clearer.
* unify `VectorizedGroupValuesColumn` and `GroupValuesColumn`.
* fix fmt.
* fix comments.
* fix clippy.

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Jay Zhan <[email protected]>
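The message names `vectorized_equal_to` and `vectorized_append` but this excerpt does not show their signatures. As a rough illustration of the idea only, the sketch below uses a hypothetical `PrimitiveColumnSketch` over `i64` values with no null handling; the struct, the argument layout, and the `demo` helper are assumptions for illustration, not DataFusion's actual `GroupColumn` API.

```rust
/// Hypothetical stand-in for one group-by column holding one value per group.
/// Not DataFusion's real type; nulls are ignored to keep the sketch short.
struct PrimitiveColumnSketch {
    group_values: Vec<i64>,
}

impl PrimitiveColumnSketch {
    /// Compare many (group index, input row) pairs in a single call.
    /// A result that is already `false` (e.g. from an earlier column) stays `false`.
    fn vectorized_equal_to(
        &self,
        lhs_groups: &[usize],
        input: &[i64],
        rhs_rows: &[usize],
        equal_to_results: &mut [bool],
    ) {
        for ((&group, &row), result) in lhs_groups
            .iter()
            .zip(rhs_rows.iter())
            .zip(equal_to_results.iter_mut())
        {
            if *result {
                *result = self.group_values[group] == input[row];
            }
        }
    }

    /// Append the selected input rows as new groups in a single call.
    fn vectorized_append(&mut self, input: &[i64], rows: &[usize]) {
        self.group_values.extend(rows.iter().map(|&row| input[row]));
    }
}

/// Runs one compare-then-append round and returns the comparison results
/// together with the stored group values.
fn demo() -> (Vec<bool>, Vec<i64>) {
    let mut column = PrimitiveColumnSketch { group_values: vec![10, 20] };
    let input = [20, 30, 10];

    // Candidate pairs: (group 1, row 0), (group 0, row 2), (group 0, row 1).
    let mut results = vec![true; 3];
    column.vectorized_equal_to(&[1, 0, 0], &input, &[0, 2, 1], &mut results);

    // Row 1 matched no existing group, so it becomes a new group.
    column.vectorized_append(&input, &[1]);
    (results, column.group_values)
}

fn main() {
    let (results, groups) = demo();
    println!("{results:?} {groups:?}");
}
```

Batching the comparisons and appends this way replaces one call per row with one call per column per batch, which is the general motivation the commit title points at.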
1 parent c3a9847 commit 345117b

9 files changed: +2296 -227 lines changed
datafusion/common/src/utils/memory.rs

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ pub fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> Result
 
 #[cfg(test)]
 mod tests {
-    use std::collections::HashSet;
+    use std::{collections::HashSet, mem::size_of};
 
     use super::estimate_memory_size;
 
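The explicit `mem::size_of` import probably relates to the "fix compile in rust 1.79" item in the commit message: `size_of` and `size_of_val` were added to the standard prelude in Rust 1.80, so older toolchains need them brought into scope by hand. A minimal sketch, where `fixed_bytes` is a hypothetical helper and not part of DataFusion:

```rust
// Explicit import: required on Rust 1.79 and earlier, still accepted on newer toolchains.
use std::mem::{size_of, size_of_val};

/// Hypothetical helper: bytes occupied by `n` inline values of `T`,
/// not counting any heap data those values may own.
fn fixed_bytes<T>(n: usize) -> usize {
    n * size_of::<T>()
}

fn main() {
    let values = [1u64, 2, 3];
    println!("{}", fixed_bytes::<u64>(values.len()));
    println!("{}", size_of_val(&values));
}
```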

datafusion/core/tests/user_defined/user_defined_aggregates.rs

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 //! user defined aggregate functions
 
 use std::hash::{DefaultHasher, Hash, Hasher};
+use std::mem::{size_of, size_of_val};
 use std::sync::{
     atomic::{AtomicBool, Ordering},
     Arc,
