feat: multiple columns in count distinct #20460

Open
Mark1626 wants to merge 3 commits into apache:main from Mark1626:feat/count-distinct-multi

@Mark1626
Contributor

Which issue does this PR close?

Closes #5619

What changes are included in this PR?

  1. Introduce a separate accumulator for multi-column distinct count, MultiColumnDistinctCountAccumulator
  2. I used parts of "Count distinct support multiple expressions" (#5939) for reference; however, it was old, so I had to reimplement it

Are these changes tested?

  1. Unit tests have been added
  2. I've tested this with a couple of queries in the CLI, for example:
with data AS (
  select * from (values
    ('a', 1, 'x'),
    ('a', 2, 'x'),
    ('b', 2, 'y'),
    ('b', 2, 'z'),
    ('c', 3, 'z')
  ) AS t(col1, col2, col3)
)
select count(distinct (col1, col2)) FROM data;
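For reference, the count this query should return can be modelled outside DataFusion with a plain hash set. The sketch below is only an illustration in standard-library Rust; the function names are hypothetical and NULL handling is not modelled.

```rust
use std::collections::HashSet;

/// Plain-Rust model of `count(distinct (col1, col2))`:
/// collect the (col1, col2) pairs into a set and count them.
fn count_distinct_pairs(rows: &[(&str, i64, &str)]) -> usize {
    let seen: HashSet<(&str, i64)> = rows.iter().map(|(c1, c2, _)| (*c1, *c2)).collect();
    seen.len()
}

/// The five rows from the CTE in the PR description.
fn sample_rows() -> Vec<(&'static str, i64, &'static str)> {
    vec![
        ("a", 1, "x"),
        ("a", 2, "x"),
        ("b", 2, "y"),
        ("b", 2, "z"),
        ("c", 3, "z"),
    ]
}
```

`count_distinct_pairs(&sample_rows())` gives 4, because ('b', 2) appears twice.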

@github-actions bot added the core (Core DataFusion crate) and functions (Changes to functions implementation) labels on Feb 21, 2026
@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Feb 21, 2026

#[derive(Debug)]
struct MultiColumnDistinctCountAccumulator {
values: HashSet<Vec<ScalarValue>, RandomState>,
Contributor

Probably arrow_row's Row (https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html) or some custom hashing/equality would be much faster than Vec<ScalarValue> as the key.
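To illustrate the idea behind this suggestion, here is a minimal standard-library sketch that flattens each row into one byte key, so the set hashes and compares a single buffer instead of a vector of enum values. This is not the arrow_row API (there the encoding would come from its RowConverter); `Cell`, `encode_cell` and `count_distinct_rows` are hypothetical names, and the sketch treats NULL as an ordinary value that is equal to itself.

```rust
use std::collections::HashSet;

/// A nullable cell; stands in for one value of one argument column.
#[derive(Debug)]
enum Cell {
    Null,
    Int(i64),
    Text(String),
}

/// Appends an unambiguous encoding of `cell` to `key`:
/// a one-byte tag, then a fixed-width or length-prefixed payload.
fn encode_cell(cell: &Cell, key: &mut Vec<u8>) {
    match cell {
        Cell::Null => key.push(0),
        Cell::Int(v) => {
            key.push(1);
            key.extend_from_slice(&v.to_le_bytes());
        }
        Cell::Text(s) => {
            key.push(2);
            // The length prefix keeps ("ab", "c") distinct from ("a", "bc").
            key.extend_from_slice(&(s.len() as u64).to_le_bytes());
            key.extend_from_slice(s.as_bytes());
        }
    }
}

/// Counts distinct rows by hashing one flat byte key per row.
fn count_distinct_rows(rows: &[Vec<Cell>]) -> usize {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut key = Vec::new(); // scratch buffer, reused across rows
    for row in rows {
        key.clear();
        for cell in row {
            encode_cell(cell, &mut key);
        }
        // Only allocate a new key when the row has not been seen before.
        if !seen.contains(&key) {
            seen.insert(key.clone());
        }
    }
    seen.len()
}
```

Whether this is actually faster for the PR would need a benchmark; the sketch only shows the shape of the key.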

Ok(())
}

#[tokio::test]
Contributor

Please put the tests in .slt files.

Contributor

@comphead left a comment

Thanks @Mark1626 for driving this 💪

Before going to code review, let's expand the tests a little to cover the likely cases, specifically:

  • mixed nulls in values
  • different column datatypes
  • 3+ cols
  • different col order
  • duplicates like select count(distinct a, a), select count(distinct a, a, b, b)

Once the tests pass, the code is most likely stable and ready for review.
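Several of the cases listed above can be pinned down with a small reference model before writing the .slt files. The sketch below is standard-library Rust with hypothetical names. It covers nulls, 3+ columns, column order and duplicated arguments over integer columns only; mixed datatypes are not modelled. Whether a tuple containing a NULL should be counted is not stated in this thread, so the model takes it as a flag rather than assuming one answer.

```rust
use std::collections::HashSet;

/// Reference model for `count(distinct c_i, c_j, ...)` over nullable integer columns.
/// `cols` picks the argument columns by index, in order; repeats are allowed.
/// ASSUMPTION: `skip_null_rows` selects whether tuples containing any NULL are ignored.
fn count_distinct(rows: &[Vec<Option<i64>>], cols: &[usize], skip_null_rows: bool) -> usize {
    let mut seen: HashSet<Vec<Option<i64>>> = HashSet::new();
    for row in rows {
        let tuple: Vec<Option<i64>> = cols.iter().map(|&i| row[i]).collect();
        if skip_null_rows && tuple.iter().any(|v| v.is_none()) {
            continue;
        }
        seen.insert(tuple);
    }
    seen.len()
}

/// Illustrative data with three columns and mixed nulls.
fn sample() -> Vec<Vec<Option<i64>>> {
    vec![
        vec![Some(1), Some(10), Some(7)],
        vec![Some(1), Some(10), Some(8)],
        vec![Some(1), None, Some(7)],
        vec![Some(2), Some(10), Some(7)],
        vec![None, None, Some(7)],
    ]
}
```

Two properties hold regardless of the null policy: reordering the argument columns does not change the count, and repeating an argument (`a, a` or `a, a, b, b`) gives the same count as listing it once.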

@Mark1626
Contributor Author

@comphead Sure, I'll expand the tests. Should all of the new ones be in .slt?

@Dandandan I'll try using struct.Row; I was wondering how I could improve performance.

Contributor

Does the sliding accumulator support distinct on multiple columns? We should add a test for it, and block it if it doesn't work (e.g. count(distinct a, b) over ...).
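The reason this matters: a sliding window has to remove rows as the frame moves, and a plain set of seen tuples cannot do that, because it does not know how many copies of a tuple are still inside the frame. A minimal standard-library sketch of a retractable distinct counter follows; `SlidingDistinct` is a hypothetical name, and in DataFusion this logic would sit behind the Accumulator trait's `retract_batch`.

```rust
use std::collections::HashMap;

/// Distinct counter that can also remove rows, as a sliding window frame requires.
/// Each distinct tuple maps to the number of copies currently in the frame.
#[derive(Default)]
struct SlidingDistinct {
    counts: HashMap<Vec<i64>, usize>,
}

impl SlidingDistinct {
    /// A row enters the frame.
    fn update(&mut self, row: &[i64]) {
        *self.counts.entry(row.to_vec()).or_insert(0) += 1;
    }

    /// A row leaves the frame; the tuple is dropped once its last copy is gone.
    fn retract(&mut self, row: &[i64]) {
        let now_empty = match self.counts.get_mut(row) {
            Some(n) => {
                *n -= 1;
                *n == 0
            }
            None => false, // retracting an unseen row is ignored in this sketch
        };
        if now_empty {
            self.counts.remove(row);
        }
    }

    /// Number of distinct tuples currently in the frame.
    fn distinct(&self) -> usize {
        self.counts.len()
    }
}
```

With two copies of (1, 1) in the frame, retracting one of them must leave the distinct count at 1, which is exactly the case a set-only accumulator gets wrong.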

.iter()
.map(|field| {
Arc::new(Field::new(
format_state_name(args.name, "count distinct"),
Contributor

The state fields will all have identical names here. We should include the original field name or the column index to differentiate them.
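One way to do this is to add the argument position to the state name. The sketch below is only an illustration: `format_state_name` here is a local stand-in (assumed to produce a `name[state]` shape, like the helper used in the snippet above), and `state_field_names` is a hypothetical function.

```rust
/// Local stand-in for the `format_state_name` helper used in the PR.
/// ASSUMPTION: the real helper produces a similar "name[state]" string.
fn format_state_name(name: &str, state_name: &str) -> String {
    format!("{name}[{state_name}]")
}

/// Hypothetical: one state-field name per argument, disambiguated by position.
fn state_field_names(agg_name: &str, num_args: usize) -> Vec<String> {
    (0..num_args)
        .map(|i| format_state_name(agg_name, &format!("count distinct {i}")))
        .collect()
}
```

Using the index rather than the original field name keeps the names unique even for `count(distinct a, a)`, where the field names themselves repeat.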

merged.merge_batch(&state_arr1)?;
merged.merge_batch(&state_arr2)?;

// Expected (1, a), (1, b), (1, a), (2, b), (3, b)
Contributor

This comment is not correct; we only expect 4.
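The count of 4 follows from merging being a set union: (1, a) appears twice among the five tuples in the code comment and is kept only once. A minimal standard-library sketch, where the split of the tuples across the two partial states is an assumption made for illustration:

```rust
use std::collections::HashSet;

type Key = (i64, &'static str);

/// Merges partial distinct states the way a merge step conceptually does: set union.
fn merge_states(states: &[Vec<Key>]) -> HashSet<Key> {
    states.iter().flatten().copied().collect()
}

/// ASSUMPTION: this split of the five tuples into two states is illustrative only.
fn partial_states() -> Vec<Vec<Key>> {
    vec![
        vec![(1, "a"), (1, "b")],
        vec![(1, "a"), (2, "b"), (3, "b")],
    ]
}
```

`merge_states(&partial_states()).len()` is 4: (1, a), (1, b), (2, b), (3, b).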

}

fn fixed_size(&self) -> usize {
std::mem::size_of_val(self)
Contributor

Let's import it: use std::mem::size_of_val
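A small sketch of what the suggested form looks like, using a hypothetical stand-in struct rather than the PR's accumulator:

```rust
use std::mem::size_of_val;

/// Hypothetical stand-in for the accumulator under review.
struct Acc {
    values: Vec<u64>,
}

impl Acc {
    /// Size of the struct itself; heap data owned by `values` is not included.
    fn fixed_size(&self) -> usize {
        size_of_val(self)
    }
}
```

Note that `size_of_val(self)` only measures the inline fields, so the heap contents of the set still need to be accounted for separately when reporting total memory use.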

Development

Successfully merging this pull request may close these issues.

Count() and Count(Distinct )should accept multiple exprs

4 participants