Optimize count_distinct.size #5377
Conversation
@alamb @crepererum please check this PR whenever you have time
@@ -216,23 +216,19 @@ impl Accumulator for DistinctCountAccumulator {
    }

    fn size(&self) -> usize {
        // temporarily calculating the size approximately, taking first batch size * number of batches
        // such approach has some inaccuracy for variable length values, like strings.
This basically removes proper memory accounting for this operation, and for strings it will likely be very wrong. I would rather see proper cached size accounting here.
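For illustration, a minimal sketch of what cached size accounting could look like (entirely hypothetical, not code from this PR): keep a running byte total that is updated on every insert, so querying the size becomes O(1) without giving up accuracy for strings.

```rust
use std::collections::HashSet;

use datafusion_common::ScalarValue;

// Hypothetical wrapper that maintains its byte count incrementally.
struct CachedSizeSet {
    values: HashSet<ScalarValue>,
    cached_bytes: usize,
}

impl CachedSizeSet {
    fn insert(&mut self, v: ScalarValue) {
        // `ScalarValue::size` accounts for heap payloads (e.g. string data)
        let bytes = ScalarValue::size(&v);
        // only grow the counter when the value is actually new
        if self.values.insert(v) {
            self.cached_bytes += bytes;
        }
    }

    // O(1) and accurate, even for variable length values
    fn size(&self) -> usize {
        self.cached_bytes
    }
}
```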
Right, as we agreed in #5325 (comment). We need to fix the benchmark first, and later @alamb can think about how to deal with variable length data.
This is clearly a conflict of interests: you want to fix the benchmark, I want proper memory accounting. We can likely have both in the long run, but given limited development resources, we cannot have the ideal solution right now.
I'll leave it to the project managers (e.g. @alamb) to decide what's more pressing.
> Instead add some code that checks "if is ScalarType::Int8, UInt8, etc then size = size[0]*vec.len()"

Are we missing this conditional check in this PR? That way we would still have an accurate size (slow for now) for variable length data and an accurate size (fast) for fixed length data.
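A rough sketch of the conditional being proposed (standalone and hypothetical, not the PR's actual code): check for a fixed width type up front, then either extrapolate from one value or fall back to summing every value.

```rust
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

// Hypothetical helper: fast multiply for fixed width types, accurate
// per-value sum for everything else (e.g. strings).
fn estimate_size(values: &[ScalarValue], data_type: &DataType) -> usize {
    let fixed_width = matches!(
        data_type,
        DataType::Int8
            | DataType::Int16
            | DataType::Int32
            | DataType::Int64
            | DataType::UInt8
            | DataType::UInt16
            | DataType::UInt32
            | DataType::UInt64
            | DataType::Float32
            | DataType::Float64
    );
    if fixed_width {
        // size[0] * vec.len(): every value has the same footprint
        values.first().map(ScalarValue::size).unwrap_or(0) * values.len()
    } else {
        // slow but accurate
        values.iter().map(ScalarValue::size).sum()
    }
}
```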
> Are we missing this conditional check in this PR? That way we would still have an accurate size (slow for now) for variable length data and an accurate size (fast) for fixed length data.

This is the middle path I would suggest: keep the slow but accurate accounting for variable length data (aka strings) and add a fast path for fixed length sizes (which is what the benchmark exercises).
I believe the additional overhead of accurate size accounting for string values is a relatively small share of the overall time compared to fixed size types. Making count distinct with a large number of string values fast will likely take a more sophisticated approach to this query in general.
@alamb @crepererum Let me remind you of another possible temporary solution: the fastest and most painless way to fix the benchmark is to increase the batch size, so that the .size function is called less often.
The overhead of an accurate answer is still enormous for variable length types and gets worse with increasing table sizes: size gets slower as the number of distinct values in the aggregation grows, and it is called on every update, so it becomes an O(n^2) operation.
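To make the quadratic concrete: if n distinct values are accumulated one update at a time and every update triggers a size pass over the whole set, the total accounting work is roughly 1 + 2 + ... + n = n(n+1)/2 per-value size computations, so doubling the number of distinct values roughly quadruples the accounting cost.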
@alamb @crepererum @jychen7 @Dandandan Amended to use approx_size for primitives only
    // calculating the size approximately, taking first batch size * number of batches
    // approx_size has some inaccuracy for variable length values, like strings.
    fn approx_size(&self) -> usize {
This shouldn't be approximate now, as we only do it for fixed types
So we could change the name to e.g. fixed_size
Done
This is looking good @comphead -- I had one more suggestion about how to test which types were primitive but I also think this code is an improvement over what is on master so it could be merged as well.
Thank you so much
+ self
    .values
    .iter()
    .next()
👍 that is nice
| DataType::Time32(_)
| DataType::Time64(_)
| DataType::Null
| DataType::Timestamp(_, _)
I think this is missing some types like Interval and Duration. Perhaps you could check instead whether DataType::primitive_width returns Some(_) 🤔
https://docs.rs/arrow/34.0.0/arrow/datatypes/enum.DataType.html#method.primitive_width
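For reference, a small sketch of the suggested check against the linked arrow-rs API (assertion values taken from the arrow 34 docs): primitive_width returns Some(width) exactly for the fixed width types, which picks up Interval and Duration without enumerating every variant.

```rust
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // fixed width types report their byte width ...
    assert_eq!(DataType::UInt8.primitive_width(), Some(1));
    assert_eq!(
        DataType::Duration(TimeUnit::Millisecond).primitive_width(),
        Some(8)
    );
    // ... while variable length types do not
    assert_eq!(DataType::Utf8.primitive_width(), None);
}
```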
Done. It seems that in the new arrow-rs @tustvold has done some of the work for us and introduced DataType::is_primitive.
Thanks @comphead -- this looks great
.sum::<usize>()
+ self.count_data_type.size()
- std::mem::size_of_val(&self.count_data_type)
if self.count_data_type.is_primitive() {
👍
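Putting the pieces together, a simplified standalone sketch of the merged shape (field names taken from the diff above; the rest is approximate, not the exact PR code):

```rust
use std::collections::HashSet;

use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

struct DistinctCountAccumulator {
    values: HashSet<ScalarValue>,
    count_data_type: DataType,
}

impl DistinctCountAccumulator {
    fn size(&self) -> usize {
        if self.count_data_type.is_primitive() {
            // fast path: fixed width values, so measure one and multiply
            self.values
                .iter()
                .next()
                .map(ScalarValue::size)
                .unwrap_or(0)
                * self.values.len()
        } else {
            // slow but accurate path: variable length values (strings)
            // must be summed one by one
            self.values.iter().map(ScalarValue::size).sum()
        }
    }
}
```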
Benchmark runs are scheduled for baseline = c477fc0 and contender = 20d08ab. 20d08ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?

Closes #5325.

Rationale for this change

Fixing a performance drop in the .size function. The drop was found during performance benchmarks: during the regression, a query with count(distinct ) took 100s in an optimized build; now it takes 5s.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No