Optimize count_distinct.size #5377
Conversation
@alamb @crepererum please check this PR whenever you have time
@@ -216,23 +216,19 @@ impl Accumulator for DistinctCountAccumulator {
    }

    fn size(&self) -> usize {
        // temporarily calculating the size approximately, taking first batch size * number of batches
        // such approach has some inaccuracy for variable length values, like strings.
This basically removes proper memory accounting for this operation, and for strings it will likely be very wrong. I would rather see proper cached size accounting here.
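For illustration, a minimal sketch of what cached size accounting could look like (entirely hypothetical, not code from this PR): keep a running byte total that is updated on every insert, so querying the size becomes O(1) without giving up accuracy for strings.

```rust
use std::collections::HashSet;

use datafusion_common::ScalarValue;

// Hypothetical wrapper that maintains its byte count incrementally.
struct CachedSizeSet {
    values: HashSet<ScalarValue>,
    cached_bytes: usize,
}

impl CachedSizeSet {
    fn insert(&mut self, v: ScalarValue) {
        // `ScalarValue::size` accounts for heap payloads (e.g. string data)
        let bytes = ScalarValue::size(&v);
        // only grow the counter when the value is actually new
        if self.values.insert(v) {
            self.cached_bytes += bytes;
        }
    }

    // O(1) and accurate, even for variable length values
    fn size(&self) -> usize {
        self.cached_bytes
    }
}
```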
Right, as we agreed in #5325 (comment). We need to fix the benchmark first, and later @alamb can think about how to deal with variable length data.
This is clearly a conflict of interests: you want to fix the benchmark, I want proper memory accounting. We can likely have both in the long run, but given limited development resources, we cannot have the ideal solution right now.
I'll leave it to the project managers (e.g. @alamb) to decide what's more pressing.
> Instead add some code that checks "if is ScalarType::Int8, UInt8, etc then size = size[0]*vec.len()"

Are we missing this conditional check in this PR? That way we would still have an accurate size (slow for now) for variable length data and an accurate size (fast) for fixed length data.
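A rough sketch of the conditional being proposed (standalone and hypothetical, not the PR's actual code): check for a fixed width type up front, then either extrapolate from one value or fall back to summing every value.

```rust
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

// Hypothetical helper: fast multiply for fixed width types, accurate
// per-value sum for everything else (e.g. strings).
fn estimate_size(values: &[ScalarValue], data_type: &DataType) -> usize {
    let fixed_width = matches!(
        data_type,
        DataType::Int8
            | DataType::Int16
            | DataType::Int32
            | DataType::Int64
            | DataType::UInt8
            | DataType::UInt16
            | DataType::UInt32
            | DataType::UInt64
            | DataType::Float32
            | DataType::Float64
    );
    if fixed_width {
        // size[0] * vec.len(): every value has the same footprint
        values.first().map(ScalarValue::size).unwrap_or(0) * values.len()
    } else {
        // slow but accurate
        values.iter().map(ScalarValue::size).sum()
    }
}
```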
> Are we missing this conditional check in this PR? That way we would still have an accurate size (slow for now) for variable length data and an accurate size (fast) for fixed length data.

This is the middle path I would suggest: keep the slow but accurate accounting for variable length data (aka strings) and add a fast path for fixed length sizes (which is what the benchmark exercises).
I believe the additional overhead of accurate size accounting for string values is a relatively small share of the overall time compared to fixed size types. Making count distinct with a large number of string values fast will likely take a more sophisticated approach to this query in general.
@alamb @crepererum Let me remind you of another possible temporary solution: the fastest and most painless way to fix the benchmark is to increase the batch size, so that the .size function is called less often.
The overhead of an accurate answer is still enormous for variable length types and gets worse with increasing table sizes: size gets slower as the number of distinct values in the aggregation grows, and it is called on every update, so it becomes an O(n^2) operation.
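To make the quadratic concrete: if n distinct values are accumulated one update at a time and every update triggers a size pass over the whole set, the total accounting work is roughly 1 + 2 + ... + n = n(n+1)/2 per-value size computations, so doubling the number of distinct values roughly quadruples the accounting cost.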
@alamb @crepererum @jychen7 @Dandandan Amended to use approx_size for primitives only
    // calculating the size approximately, taking first batch size * number of batches
    // approx_size has some inaccuracy for variable length values, like strings.
    fn approx_size(&self) -> usize {
This shouldn't be approximate now, as we only do it for fixed types
So we could change the name to e.g. fixed_size
Done
This is looking good @comphead -- I had one more suggestion about how to test which types were primitive but I also think this code is an improvement over what is on master so it could be merged as well.
Thank you so much
+ self
    .values
    .iter()
    .next()
👍 that is nice
| DataType::Time32(_)
| DataType::Time64(_)
| DataType::Null
| DataType::Timestamp(_, _)
I think this is missing some types like Interval and Duration. Perhaps you could check instead whether DataType::primitive_width returns Some(_) 🤔
https://docs.rs/arrow/34.0.0/arrow/datatypes/enum.DataType.html#method.primitive_width
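For reference, a small sketch of the suggested check against the linked arrow-rs API (assertion values taken from the arrow 34 docs): primitive_width returns Some(width) exactly for the fixed width types, which picks up Interval and Duration without enumerating every variant.

```rust
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // fixed width types report their byte width ...
    assert_eq!(DataType::UInt8.primitive_width(), Some(1));
    assert_eq!(
        DataType::Duration(TimeUnit::Millisecond).primitive_width(),
        Some(8)
    );
    // ... while variable length types do not
    assert_eq!(DataType::Utf8.primitive_width(), None);
}
```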
Done. It seems that in the new arrow-rs @tustvold has done some of the work for us and introduced DataType::is_primitive.
Thanks @comphead -- this looks great
.sum::<usize>()
+ self.count_data_type.size()
- std::mem::size_of_val(&self.count_data_type)
if self.count_data_type.is_primitive() {
👍
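Putting the pieces together, a simplified standalone sketch of the merged shape (field names taken from the diff above; the rest is approximate, not the exact PR code):

```rust
use std::collections::HashSet;

use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

struct DistinctCountAccumulator {
    values: HashSet<ScalarValue>,
    count_data_type: DataType,
}

impl DistinctCountAccumulator {
    fn size(&self) -> usize {
        if self.count_data_type.is_primitive() {
            // fast path: fixed width values, so measure one and multiply
            self.values
                .iter()
                .next()
                .map(ScalarValue::size)
                .unwrap_or(0)
                * self.values.len()
        } else {
            // slow but accurate path: variable length values (strings)
            // must be summed one by one
            self.values.iter().map(ScalarValue::size).sum()
        }
    }
}
```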
Benchmark runs are scheduled for baseline = c477fc0 and contender = 20d08ab. 20d08ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?

Closes #5325.

Rationale for this change

Fixing a performance drop in the .size function. The drop was found during performance benchmarks: during the regression, a query with count(distinct ) took 100s in an optimized build; now it takes 5s.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No