feat: multiple columns in count distinct by Mark1626 · Pull Request #20460 · apache/datafusion

Mark1626 · 2026-02-21T11:38:49Z

Which issue does this PR close?

What changes are included in this PR?

Introduces a separate accumulator, MultiColumnDistinctCountAccumulator, for multi-column distinct count.
I used parts of "Count distinct support multiple expressions" #5939 for reference; however, it was old, so I had to reimplement it.

Are these changes tested?

Unit tests have been added.
I've also tested this with a couple of queries in the CLI:

with data AS (
  select * from (values
    ('a', 1, 'x'),
    ('a', 2, 'x'),
    ('b', 2, 'y'),
    ('b', 2, 'z'),
    ('c', 3, 'z')
  ) AS t(col1, col2, col3)
)
select count(distinct (col1, col2)) FROM data;
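To make the intent of the query above concrete, here is a minimal std-only Rust sketch of what counting distinct (col1, col2) pairs over these five rows amounts to. The function names are hypothetical and this is not the PR's accumulator; it only models plain set semantics on the sample data, which contains no NULLs.

```rust
use std::collections::HashSet;

/// Counts distinct (col1, col2) pairs, ignoring col3.
/// Hypothetical helper for illustration; not part of DataFusion.
fn count_distinct_pairs(rows: &[(&str, i64, &str)]) -> usize {
    let mut seen: HashSet<(&str, i64)> = HashSet::new();
    for &(col1, col2, _col3) in rows {
        seen.insert((col1, col2));
    }
    seen.len()
}

/// The five rows from the VALUES clause in the query above.
fn example_rows() -> Vec<(&'static str, i64, &'static str)> {
    vec![
        ("a", 1, "x"),
        ("a", 2, "x"),
        ("b", 2, "y"),
        ("b", 2, "z"),
        ("c", 3, "z"),
    ]
}
```

Under these set semantics the two ('b', 2, ...) rows collapse into one pair, so the sketch counts 4 distinct pairs.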

Dandandan · 2026-02-21T17:26:26Z

datafusion/functions-aggregate/src/count.rs


+#[derive(Debug)]
+struct MultiColumnDistinctCountAccumulator {
+    values: HashSet<Vec<ScalarValue>, RandomState>,


Probably https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html or some custom hashing/equality would be much faster than Vec<ScalarValue> as key.
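The idea behind this suggestion can be sketched without any Arrow dependency: encode each row into one contiguous byte key and hash that, instead of hashing a vector of heap-allocated scalar values. The sketch below is std-only and does not use the arrow_row API; the `Cell` enum, `encode_row`, and `count_distinct_rows` are hypothetical names, and the encoding is a simple self-delimiting one chosen for illustration. Here NULL is encoded as an ordinary value, which is an assumption rather than the PR's stated behaviour.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a scalar cell; not DataFusion's `ScalarValue`.
#[derive(Clone, Debug)]
enum Cell {
    Null,
    Int(i64),
    Text(String),
}

/// Appends a self-delimiting byte encoding of `row` to `buf`:
/// a type tag per cell, fixed-width integers, length-prefixed text.
fn encode_row(row: &[Cell], buf: &mut Vec<u8>) {
    for cell in row {
        match cell {
            Cell::Null => buf.push(0),
            Cell::Int(v) => {
                buf.push(1);
                buf.extend_from_slice(&v.to_le_bytes());
            }
            Cell::Text(s) => {
                buf.push(2);
                buf.extend_from_slice(&(s.len() as u64).to_le_bytes());
                buf.extend_from_slice(s.as_bytes());
            }
        }
    }
}

/// Counts distinct rows by hashing the encoded bytes of each row.
fn count_distinct_rows(rows: &[Vec<Cell>]) -> usize {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut buf = Vec::new(); // reused scratch buffer
    for row in rows {
        buf.clear();
        encode_row(row, &mut buf);
        // Only allocate a new key when the row has not been seen before.
        if !seen.contains(buf.as_slice()) {
            seen.insert(buf.clone());
        }
    }
    seen.len()
}
```

The length prefix on text keeps ("ab", "c") and ("a", "bc") from encoding to the same bytes, which is the property any row format used as a hash key needs.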

comphead · 2026-02-21T18:08:14Z

datafusion/core/tests/sql/aggregates/basic.rs

    Ok(())
 }
+
+#[tokio::test]


please have tests in .slt

comphead

Thanks @Mark1626 for driving this 💪

Before going to code review, let's expand the tests a bit to cover the possible cases, specifically:

mixed nulls in values
different column datatypes
3+ cols
different col order
duplicates like select count(distinct a, a), select count(distinct a, a, b, b)

Once these tests pass, the code is most likely stable and ready for review.
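Several of the cases listed above can be modelled in a few lines. The following std-only Rust sketch assumes that a row is skipped when any selected column is NULL, which is how some engines define multi-expression COUNT(DISTINCT); the thread does not say what this PR does, so treat that as an assumption. `count_distinct` and `sample_rows` are hypothetical names, and only integer columns are covered.

```rust
use std::collections::HashSet;

/// Counts distinct tuples built from the columns at indices `cols`.
/// Assumption: a row is skipped if any selected column is NULL (None).
fn count_distinct(rows: &[Vec<Option<i64>>], cols: &[usize]) -> usize {
    let mut seen: HashSet<Vec<i64>> = HashSet::new();
    for row in rows {
        // Collecting into Option yields None as soon as one column is NULL.
        let key: Option<Vec<i64>> = cols.iter().map(|&c| row[c]).collect();
        if let Some(key) = key {
            seen.insert(key);
        }
    }
    seen.len()
}

/// Hypothetical three-column sample with NULLs mixed into the values.
fn sample_rows() -> Vec<Vec<Option<i64>>> {
    vec![
        vec![Some(1), Some(10), Some(100)],
        vec![Some(1), Some(10), Some(200)],
        vec![Some(1), None, Some(100)],
        vec![None, Some(10), Some(100)],
        vec![Some(2), Some(10), Some(100)],
    ]
}
```

Under this assumption, column order does not change the count, and repeating a column (for example indices `[0, 0]`) gives the same result as listing it once.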

Mark1626 · 2026-02-23T06:02:06Z

@comphead Sure, I'll expand the tests. Should all of these new ones be in .slt?

@Dandandan I'll try using Row; I had been wondering how I could improve performance.

jonathanc-n · 2026-02-23T08:00:20Z

datafusion/functions-aggregate/src/count.rs

Does the sliding accumulator support distinct on multiple columns? We should add a test for it, and block it if it doesn't work (e.g. count(distinct a, b) over ...).
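The reason the window case needs separate attention is that a sliding window must be able to retract rows as they leave the frame, and a plain set cannot tell whether a retracted tuple still has other copies inside the window. A common way to handle this is to keep a per-tuple occurrence count. The sketch below is std-only, with hypothetical names, and is not DataFusion's accumulator trait or the PR's code.

```rust
use std::collections::HashMap;

/// Hypothetical sliding distinct counter keyed by a two-column tuple.
#[derive(Default)]
struct SlidingDistinctCount {
    // How many times each tuple currently appears in the window.
    counts: HashMap<(i64, i64), usize>,
}

impl SlidingDistinctCount {
    /// Rows entering the window.
    fn update(&mut self, rows: &[(i64, i64)]) {
        for &row in rows {
            *self.counts.entry(row).or_insert(0) += 1;
        }
    }

    /// Rows leaving the window; a tuple is dropped only when its
    /// occurrence count reaches zero.
    fn retract(&mut self, rows: &[(i64, i64)]) {
        for row in rows {
            let now_empty = match self.counts.get_mut(row) {
                Some(n) => {
                    *n -= 1;
                    *n == 0
                }
                None => false,
            };
            if now_empty {
                self.counts.remove(row);
            }
        }
    }

    /// Number of distinct tuples currently in the window.
    fn evaluate(&self) -> usize {
        self.counts.len()
    }
}
```

If the multi-column accumulator only stores a set, it cannot implement retraction correctly, which is why blocking the window case until it is tested seems a reasonable precaution.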

jonathanc-n · 2026-02-23T08:00:26Z

datafusion/functions-aggregate/src/count.rs

+                    .iter()
+                    .map(|field| {
+                        Arc::new(Field::new(
+                            format_state_name(args.name, "count distinct"),


The column names will all look identical here. We should include the original field name or the column index to differentiate them.
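One possible shape for this, sketched with a hypothetical std-only helper: fold both the column index and the original field name into each state field name. The `state_field_names` function and the exact bracketed layout are illustrative assumptions, not the output of DataFusion's `format_state_name`.

```rust
/// Hypothetical helper: builds one state field name per input column,
/// embedding the column index and the original field name so that
/// repeated inputs such as (a, a) still get distinct names.
fn state_field_names(agg_name: &str, input_fields: &[&str]) -> Vec<String> {
    input_fields
        .iter()
        .enumerate()
        .map(|(i, field)| format!("{agg_name}[count distinct {i} {field}]"))
        .collect()
}
```

Including the index matters for cases like count(distinct a, a), where the field name alone would still collide.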

jonathanc-n · 2026-02-23T08:00:42Z

datafusion/functions-aggregate/src/count.rs

+        merged.merge_batch(&state_arr1)?;
+        merged.merge_batch(&state_arr2)?;
+
+        // Expected (1, a), (1, b), (1, a), (2, b), (3, b)


This comment is not correct; we only expect 4.
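The count of 4 can be checked by hand: of the five tuples in the comment, (1, a) appears twice. A small std-only sketch of merging two partial states makes this explicit. How the tuples were split across the two states in the actual test is not shown here, so the split used in the usage note below is an assumption, and `merged_distinct_count` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Merges several partial distinct states (each a list of tuples)
/// into one set and returns the distinct count.
fn merged_distinct_count(states: &[&[(i64, &str)]]) -> usize {
    let mut merged: HashSet<(i64, &str)> = HashSet::new();
    for state in states {
        merged.extend(state.iter().copied());
    }
    merged.len()
}
```

For example, with an assumed first state of (1, a), (1, b) and a second state of (1, a), (2, b), (3, b), the merged set is {(1, a), (1, b), (2, b), (3, b)}, so the count is 4.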

jonathanc-n · 2026-02-23T08:02:12Z

datafusion/functions-aggregate/src/count.rs

+    }
+
+    fn fixed_size(&self) -> usize {
+        std::mem::size_of_val(self)


Let's import it: use std::mem::size_of_val

feat: multiple columns in count distinct

ac48a2b

github-actions bot added the core (Core DataFusion crate) and functions (Changes to functions implementation) labels Feb 21, 2026

fix: clippy and slt expected result

183f2fc

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label Feb 21, 2026

fix: slt typo

ec37334

comphead mentioned this pull request Feb 21, 2026

Add support for COUNT(DISTINCT expr, expr1, ...) apache/datafusion-comet#2292

Open

Mark1626 wants to merge 3 commits into apache:main from Mark1626:feat/count-distinct-multi

4 participants
