# Speed up unary not kernel by 50%, add BooleanBuffer::from_bitwise_unary (#8996)
Conversation
arrow-buffer/src/buffer/boolean.rs
```rust
}
if left_chunks.remainder_len() > 0 {
    debug_assert!(result.capacity() >= result.len() + 8); // should not reallocate
    result.push(op(left_chunks.remainder_bits()));
```
This could use `push_unchecked` as well (for consistency)?
Good idea -- done in 469f2ad
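For readers outside the arrow-rs codebase: `MutableBuffer::push_unchecked` is arrow-buffer's API, and std's `Vec` has no unchecked push. The hypothetical sketch below (not arrow-rs code; `not_words` is a made-up name) shows the same pattern the review is discussing: reserve capacity up front, then push the bulk words and the remainder word with a `debug_assert!` confirming no reallocation can occur.

```rust
// Hypothetical stand-in for the reserve-then-push pattern discussed above.
fn not_words(words: &[u64], remainder: Option<u64>) -> Vec<u64> {
    let cap = words.len() + remainder.is_some() as usize;
    let mut out: Vec<u64> = Vec::with_capacity(cap);
    for &w in words {
        out.push(!w); // never reallocates: capacity was reserved above
    }
    if let Some(r) = remainder {
        // mirrors the capacity check guarding push_unchecked in the PR
        debug_assert!(out.capacity() >= out.len() + 1);
        out.push(!r);
    }
    out
}
```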
```rust
/// Like [`Self::from_bitwise_unary_op`] but optimized for the case where the
/// input is aligned to byte boundaries
fn try_from_aligned_bitwise_unary_op<F>(
```
BTW I wrote a version of this code that handles byte-aligned input, but it actually seems to have made performance worse, so I am going to update the comments and leave it this way.
**What I tried**

```rust
/// Like [`Self::from_bitwise_unary_op`] but optimized for the case where the
/// input is aligned to byte boundaries
fn try_from_aligned_bitwise_unary_op<F>(
    left: &[u8],
    len_in_bits: usize,
    op: &mut F,
) -> Option<Self>
where
    F: FnMut(u64) -> u64,
{
    // safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    // if there is no prefix or suffix, the buffer is aligned and we can do
    // the operation directly on u64s
    if left_prefix.is_empty() && left_suffix.is_empty() {
        let result_u64s: Vec<u64> = left_u64s.iter().map(|l| op(*l)).collect();
        let buffer = Buffer::from(result_u64s);
        return Some(BooleanBuffer::new(buffer, 0, len_in_bits));
    }

    let mut result = MutableBuffer::with_capacity(
        left_prefix.len() + left_u64s.len() * 8 + left_suffix.len(),
    );
    let prefix_u64 = op(Self::byte_slice_to_u64(left_prefix));
    result.extend_from_slice(&prefix_u64.to_be_bytes()[0..left_prefix.len()]);

    assert!(result.capacity() >= result.len() + left_u64s.len() * 8);
    for &left in left_u64s.iter() {
        // SAFETY: we asserted there is enough capacity above
        unsafe {
            result.push_unchecked(op(left));
        }
    }

    let suffix_u64 = op(Self::byte_slice_to_u64(left_suffix));
    result.extend_from_slice(&suffix_u64.to_be_bytes()[0..left_suffix.len()]);

    Some(BooleanBuffer::new(result.into(), 0, len_in_bits))
}
```

```diff
diff --git a/arrow-buffer/src/buffer/boolean.rs b/arrow-buffer/src/buffer/boolean.rs
index 97674c18843..285888b3a7c 100644
--- a/arrow-buffer/src/buffer/boolean.rs
+++ b/arrow-buffer/src/buffer/boolean.rs
@@ -18,10 +18,11 @@
 use crate::bit_chunk_iterator::BitChunks;
 use crate::bit_iterator::{BitIndexIterator, BitIndexU32Iterator, BitIterator, BitSliceIterator};
 use crate::{
-    BooleanBufferBuilder, Buffer, MutableBuffer, bit_util, buffer_bin_and, buffer_bin_or,
-    buffer_bin_xor, buffer_unary_not,
+    BooleanBufferBuilder, Buffer, MutableBuffer, ToByteSlice, bit_util, buffer_bin_and,
+    buffer_bin_or, buffer_bin_xor, buffer_unary_not,
 };
+use crate::bit_util::get_remainder_bits;
 use std::ops::{BitAnd, BitOr, BitXor, Not};

 /// A slice-able [`Buffer`] containing bit-packed booleans
@@ -200,14 +201,37 @@ impl BooleanBuffer {
         // the operation directly on u64s
         if left_prefix.is_empty() && left_suffix.is_empty() {
             let result_u64s: Vec<u64> = left_u64s.iter().map(|l| op(*l)).collect();
-            Some(BooleanBuffer::new(
-                Buffer::from(result_u64s),
-                0,
-                len_in_bits,
-            ))
-        } else {
-            None
+            let buffer = Buffer::from(result_u64s);
+            return Some(BooleanBuffer::new(buffer, 0, len_in_bits));
         }
+
+        let mut result = MutableBuffer::with_capacity(
+            left_prefix.len() + left_u64s.len() * 8 + left_suffix.len(),
+        );
+        let prefix_u64 = op(Self::byte_slice_to_u64(left_prefix));
+
+        result.extend_from_slice(&prefix_u64.to_be_bytes()[0..left_prefix.len()]);
+
+        assert!(result.capacity() >= result.len() + left_u64s.len() * 8);
+        for &left in left_u64s.iter() {
+            // SAFETY: we asserted there is enough capacity above
+            unsafe {
+                result.push_unchecked(op(left));
+            }
+        }
+
+        let suffix_u64 = op(Self::byte_slice_to_u64(left_suffix));
+        result.extend_from_slice(&suffix_u64.to_be_bytes()[0..left_suffix.len()]);
+
+        Some(BooleanBuffer::new(result.into(), 0, len_in_bits))
+    }
+
+    /// convert the bytes into a u64 suitable for opeartion
+    fn byte_slice_to_u64(src: &[u8]) -> u64 {
+        let num_bytes = src.len();
+        let mut bytes = [0u8; 8];
+        bytes[0..num_bytes].copy_from_slice(src);
+        u64::from_be_bytes(bytes)
+    }

     /// Returns the number of set bits in this buffer
diff --git a/arrow-buffer/src/util/bit_util.rs b/arrow-buffer/src/util/bit_util.rs
```

This PR:

```
not_slice_24            time:   [81.729 ns 82.091 ns 82.587 ns]
```

When I tried fancier code for byte alignment:

```
not_slice_24            time:   [121.13 ns 122.69 ns 124.52 ns]
```
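The attempted fast path hinges on `slice::align_to::<u64>()`, which splits a byte slice into an unaligned prefix, a `u64`-aligned middle, and an unaligned suffix; handling the prefix and suffix is exactly the extra work that appears to have outweighed the benefit at these input sizes. A minimal, self-contained illustration of the split (not arrow-rs code; `split_u64_aligned` is a made-up name):

```rust
// Split a byte slice the way the attempted fast path does. The prefix and
// suffix lengths depend on the runtime address of the allocation, but the
// three pieces always account for every input byte.
fn split_u64_aligned(bytes: &[u8]) -> (usize, usize, usize) {
    // SAFETY: any initialized bytes are valid when reinterpreted as u64s
    let (prefix, mid, suffix) = unsafe { bytes.align_to::<u64>() };
    (prefix.len(), mid.len(), suffix.len())
}
```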
# Which issue does this PR close?

- Part of #8806
- Part of #8996

# Rationale for this change

As part of #8996 I would like to add special-case code for byte-aligned boolean buffers, and to do so I would like to have benchmarks that cover this.

# What changes are included in this PR?

1. Add a benchmark for an offset of 24 bits (in addition to 1)

# Are these changes tested?

I ran it manually.

# Are there any user-facing changes?

No
not kernel / BooleanBuffer::from_bitwise_unarynot kernel by 50% / BooleanBuffer::from_bitwise_unary
not kernel by 50% / BooleanBuffer::from_bitwise_unarynot kernel by 50%, add BooleanBuffer::from_bitwise_unary
run benchmark boolean_kernels

🤖: Benchmark completed
Co-authored-by: Martin Hilton <[email protected]>
Dandandan left a comment:
Very nice!
# Which issue does this PR close?

- Related to #8806
- Related to #8996

# Rationale for this change

When working on improving the boolean kernels, I have seen significant and unexplained noise from run to run. For example, just adding a fast path for `u64`-aligned data resulted in a reported 30% regression in the speed of `slice24` (code that is not affected by the change at all). For example, from #9022:

```
and             1.00   208.0±5.91ns   ? ?/sec   1.34   278.8±10.07ns  ? ?/sec
and_sliced_1    1.00   1100.2±6.53ns  ? ?/sec   1.12   1226.9±6.11ns  ? ?/sec
and_sliced_24   1.40   340.9±2.49ns   ? ?/sec   1.00   243.7±2.13ns   ? ?/sec
```

I also can't reproduce this effect locally or when I run the benchmarks individually. Given the above, and the tiny amount of time spent in the benchmark (hundreds of nanoseconds), I believe what is happening is that by changing the allocation pattern during the benchmark runs (each kernel allocates output), data for subsequent iterations is allocated subtly differently (e.g. the exact alignment or some other factor differs). This results in different performance characteristics even when the code has not changed.

# What changes are included in this PR?

To reduce this noise, change the benchmarks to pre-allocate the input.

# Are these changes tested?

I ran them manually.

# Are there any user-facing changes?

No, this is just a benchmark change.
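A minimal sketch of the pre-allocation idea using only the standard library (the actual benchmarks use criterion, and `not_bytes` here is a hypothetical stand-in for a kernel that allocates its own output):

```rust
use std::time::Instant;

// Hypothetical kernel: allocates a fresh output buffer each call, like the
// boolean kernels do.
fn not_bytes(input: &[u8]) -> Vec<u8> {
    input.iter().map(|b| !b).collect()
}

fn main() {
    // Allocate the input ONCE, outside the timed loop, so the input's address
    // and alignment stay fixed across iterations and across benchmark runs.
    let input: Vec<u8> = vec![0b1010_1010; 1 << 16];

    let start = Instant::now();
    let mut last = Vec::new();
    for _ in 0..1_000 {
        last = not_bytes(&input); // only the kernel's own allocation varies
    }
    println!("1000 iters in {:?}, first byte = {:#04x}", start.elapsed(), last[0]);
}
```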
This PR introduced a very subtle bug, see:
# Which issue does this PR close?

- Closes #9085

# Rationale for this change

Fix a regression introduced in #8996.

# What changes are included in this PR?

1. Add test coverage for the nullif kernel
2. Undeprecate `bitwise_unary_op_helper`
3. Document subtle differences
4. Restore the nullif kernel from #8996

# Are these changes tested?

Yes

# Are there any user-facing changes?

Fix a (not yet released) bug.
# Which issue does this PR close?

- `Buffer::from_bitwise_unary` and `Buffer::from_bitwise_binary` me… #8854

# Rationale for this change
The current implementation of the unary `not` kernel performs an unnecessary extra allocation when operating on sliced data.

Also, we can generate more optimal code by processing u64 words at a time when the buffer is already u64-aligned (see #8807).

Also, it is hard to find the code to create new Buffers by copying bits.
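To illustrate the word-at-a-time idea from the rationale, the sketch below applies a unary op to bit-packed bytes in `u64` chunks, handling the trailing bytes via a zero-padded remainder word. It is a self-contained approximation, not the actual `BooleanBuffer::from_bitwise_unary` implementation (no bit-offset handling, and the name `bitwise_unary` is made up here):

```rust
// Apply `op` to bit-packed data one u64 word at a time. Bits past
// `len_in_bits` in the last byte are left unspecified, as is conventional
// for bit-packed buffers.
fn bitwise_unary<F: FnMut(u64) -> u64>(src: &[u8], len_in_bits: usize, mut op: F) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut chunks = src.chunks_exact(8);
    for chunk in &mut chunks {
        // full 8-byte chunk: process as one u64 word
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        out.extend_from_slice(&op(word).to_le_bytes());
    }
    let rem = chunks.remainder();
    if !rem.is_empty() {
        // zero-pad the trailing bytes into one last word
        let mut buf = [0u8; 8];
        buf[..rem.len()].copy_from_slice(rem);
        let word = op(u64::from_le_bytes(buf));
        out.extend_from_slice(&word.to_le_bytes()[..rem.len()]);
    }
    out.truncate((len_in_bits + 7) / 8); // keep only the bytes covering len_in_bits
    out
}
```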
# What changes are included in this PR?

- Add `BooleanBuffer::from_bitwise_unary` and `BooleanBuffer::from_bits`
- Deprecate `bitwise_unary_op_helper`

# Are these changes tested?
Yes, with new tests and benchmarks.
# Are there any user-facing changes?

Yes, new public API.