feat: multiple columns in count distinct#20460
Conversation
|
|
||
| #[derive(Debug)] | ||
| struct MultiColumnDistinctCountAccumulator { | ||
| values: HashSet<Vec<ScalarValue>, RandomState>, |
There was a problem hiding this comment.
Probably https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html or some custom hashing/equality would be much faster than Vec<ScalarValue> as key.
| Ok(()) | ||
| } | ||
|
|
||
| #[tokio::test] |
There was a problem hiding this comment.
please have tests in .slt
There was a problem hiding this comment.
Thanks @Mark1626 for driving this 💪
Before going to code review lets expand tests a little bit to support possible cases, specifically:
- mixed nulls in values
- different column datatypes
- 3+ cols
- different col order
- duplicates like
select count(distinct a, a), select count(distinct a, a, b, b)`
Once we have tests passed, we most likely got the code is stable and ready for review
|
@comphead Sure I'll expand the tests, should all these new one be in @Dandandan I'll try using |
There was a problem hiding this comment.
Does sliding accumulator support distinct on multi column? We should add a test for it and block if it doesn't work. (ex. count(distinct a, b) over ...)
| .iter() | ||
| .map(|field| { | ||
| Arc::new(Field::new( | ||
| format_state_name(args.name, "count distinct"), |
There was a problem hiding this comment.
same column names will look identical here. we should include original field name or col index to differentiate
| merged.merge_batch(&state_arr1)?; | ||
| merged.merge_batch(&state_arr2)?; | ||
|
|
||
| // Expected (1, a), (1, b), (1, a), (2, b), (3, b) |
There was a problem hiding this comment.
this is not a correct comment, we only expect 4
| } | ||
|
|
||
| fn fixed_size(&self) -> usize { | ||
| std::mem::size_of_val(self) |
There was a problem hiding this comment.
Lets import use std::mem::size_of_val
Which issue does this PR close?
Closes #5619
What changes are included in this PR?
MultiColumnDistinctCountAccumulatorAre these changes tested?