
Optimize count_distinct.size #5377


Merged: 6 commits into apache:main on Feb 28, 2023

Conversation

comphead (Contributor Author):

Which issue does this PR close?

Closes #5325.

Rationale for this change

This fixes a performance drop in the .size function that was found during performance benchmarks. During the regression, a query with count(distinct ...) took 100s in an optimized build; it now takes 5s.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Feb 23, 2023
comphead (Contributor Author):

@alamb @crepererum please check this PR whenever you have time

@@ -216,23 +216,19 @@ impl Accumulator for DistinctCountAccumulator {
}

fn size(&self) -> usize {
// temporarily calculating the size approximately, taking first batch size * number of batches
// such approach has some inaccuracy for variable length values, like strings.
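For context, the approximation in the snippet above can be sketched roughly as follows (a hypothetical, simplified stand-in, not the PR's code; the real accumulator stores distinct ScalarValues, and the names here are illustrative):

```rust
use std::collections::HashSet;
use std::mem::size_of_val;

// Hypothetical stand-in for the distinct values an accumulator holds.
#[derive(Hash, PartialEq, Eq)]
enum Value {
    Int64(i64),
    Utf8(String),
}

// Estimate the set's memory as (shallow size of one value) * (count).
// Exact for fixed-width variants like Int64; undercounts Utf8, since
// size_of_val does not include the heap bytes behind a String.
fn approx_size(values: &HashSet<Value>) -> usize {
    values
        .iter()
        .next()
        .map(|first| size_of_val(first) * values.len())
        .unwrap_or(0)
}
```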
Contributor:
This basically removes proper memory accounting for this operation, and for strings it will likely be very wrong. I would rather see proper cached size accounting here.

comphead (Contributor Author):

Right, as we agreed in #5325 (comment): we need to fix the benchmark first, and later @alamb can think about how to deal with variable-length data.

Contributor:

This is clearly a conflict of interests: you want to fix the benchmark, I want proper memory accounting. We can likely have both in the long run, but due to limited development resources we cannot have the ideal solution right now.

I'll leave it to the project managers (e.g. @alamb) to decide what's more pressing.

jychen7 (Contributor) commented Feb 25, 2023:

Instead add some code that checks "if is ScalarType::Int8, UInt8, etc then size = size[0]*vec.len()"

#5325 (comment)

Are we missing this conditional check in this PR? That way we would still have accurate size (slow for now) for variable data and accurate size (fast) for fixed-length data.

Contributor:

Are we missing this conditional check in this PR? That way we would still have accurate size (slow for now) for variable data and accurate size (fast) for fixed-length data.

This is the middle path I would suggest: keep the slow but accurate accounting for variable-length data (aka strings) and add a fast path for fixed-length types (which is what the benchmark exercises).

I believe the additional overhead of accurate size accounting for string values is a relatively smaller share of the overall time than it is for fixed-size types. Making count distinct with a large number of string values fast will likely require a more sophisticated approach to this query in general.
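A rough sketch of that middle path (hypothetical names, not the PR's code; it assumes arrow-rs's DataType::primitive_width, which comes up later in this thread):

```rust
use arrow::datatypes::DataType;

// Hypothetical sketch: exact-and-fast for fixed-width types,
// exact-but-slow for variable-length types such as Utf8.
fn estimate_size(
    data_type: &DataType,
    one_value_size: usize,                  // shallow size of a single stored value
    num_values: usize,                      // number of distinct values in the set
    accurate_scan: impl FnOnce() -> usize,  // walks every value, including heap data
) -> usize {
    match data_type.primitive_width() {
        // Fixed width: one value's size times the count is already exact.
        Some(_) => one_value_size * num_values,
        // Variable length: pay for the accurate walk.
        None => accurate_scan(),
    }
}
```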

comphead (Contributor Author):

@alamb @crepererum let me remind you of another possible temporary solution: the fastest and most painless way to fix the benchmark is to increase the batch size, so that the .size function is called less often.

Contributor:

The overhead for an accurate answer is still enormous for variable-length types and gets worse with increasing table sizes: size gets slower as the aggregation accumulates more distinct values, and it is called for every update (so an O(n^2) operation).
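In other words: if size walks all distinct values seen so far and is invoked once per update, then after n updates the total work is roughly 1 + 2 + ... + n = n(n + 1)/2, i.e. O(n^2).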

comphead (Contributor Author):

@alamb @crepererum @jychen7 @Dandandan amended to compute approx_size for primitives only:


// calculating the size approximately, taking first batch size * number of batches
// approx_size has some inaccuracy for variable length values, like strings.
fn approx_size(&self) -> usize {
Contributor:

This shouldn't be approximate now, as we only do it for fixed types

Contributor:

So we could change the name to e.g. fixed_size

comphead (Contributor Author):

Done

@comphead comphead requested review from Dandandan and removed request for tustvold February 27, 2023 18:24
alamb (Contributor) left a comment:

This is looking good @comphead -- I had one more suggestion about how to test which types are primitive, but I also think this code is an improvement over what is on master, so it could be merged as is.

Thank you so much

+ self
.values
.iter()
.next()
Contributor:

👍 that is nice

| DataType::Time32(_)
| DataType::Time64(_)
| DataType::Null
| DataType::Timestamp(_, _)
Contributor:

I think this is missing some types like Interval and Duration.

Perhaps you could check instead whether DataType::primitive_width returns Some(_) 🤔

https://docs.rs/arrow/34.0.0/arrow/datatypes/enum.DataType.html#method.primitive_width
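For illustration, how that check would behave (a minimal sketch assuming arrow 34, per the linked docs):

```rust
use arrow::datatypes::DataType;

fn main() {
    // Fixed-width types report their byte width; variable-length types do not.
    assert_eq!(DataType::Int8.primitive_width(), Some(1));
    assert_eq!(DataType::Utf8.primitive_width(), None);
}
```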

comphead (Contributor Author):

Done. It seems that in the new arrow-rs @tustvold did some work for us and introduced DataType::is_primitive.
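A quick illustration of that predicate (assuming an arrow-rs version where DataType::is_primitive is available):

```rust
use arrow::datatypes::DataType;

fn main() {
    // is_primitive is true for fixed-width numeric/temporal types
    // and false for variable-length types such as strings.
    assert!(DataType::Int64.is_primitive());
    assert!(!DataType::Utf8.is_primitive());
}
```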

alamb (Contributor) left a comment:

Thanks @comphead -- this looks great

.sum::<usize>()
+ self.count_data_type.size()
- std::mem::size_of_val(&self.count_data_type)
if self.count_data_type.is_primitive() {
Contributor:

👍
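Putting the thread together, a minimal sketch of the resulting shape of size (hypothetical and simplified; field and helper names follow the snippets above rather than the exact merged code):

```rust
use arrow::datatypes::DataType;
use std::collections::HashSet;
use std::mem::size_of_val;

// Hypothetical, simplified accumulator mirroring the discussion above.
struct DistinctCountAccumulator {
    values: HashSet<u64>, // stand-in for the set of distinct ScalarValues
    count_data_type: DataType,
}

impl DistinctCountAccumulator {
    // Fast path: for fixed-width types, one stored value's shallow size
    // times the number of values is already exact.
    fn fixed_size(&self) -> usize {
        size_of_val(self)
            + self
                .values
                .iter()
                .next()
                .map(|first| size_of_val(first) * self.values.len())
                .unwrap_or(0)
    }

    // Slow path: walk every value; needed for variable-length data.
    fn full_size(&self) -> usize {
        size_of_val(self) + self.values.iter().map(size_of_val).sum::<usize>()
    }

    fn size(&self) -> usize {
        if self.count_data_type.is_primitive() {
            self.fixed_size()
        } else {
            self.full_size()
        }
    }
}
```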

@alamb alamb merged commit 20d08ab into apache:main Feb 28, 2023
ursabot commented Feb 28, 2023:

Benchmark runs are scheduled for baseline = c477fc0 and contender = 20d08ab. 20d08ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
physical-expr Changes to the physical-expr crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize Accumulator size function performance (fix regression on clickbench)
6 participants