# Speed up unary not kernel by 50%, add BooleanBuffer::from_bitwise_unary (#8996)
Conversation
arrow-buffer/src/buffer/boolean.rs
```rust
}
if left_chunks.remainder_len() > 0 {
    debug_assert!(result.capacity() >= result.len() + 8); // should not reallocate
    result.push(op(left_chunks.remainder_bits()));
```
This could use `push_unchecked` as well (for consistency)?
Good idea -- done in 469f2ad
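For readers outside the arrow-rs codebase: `MutableBuffer::push_unchecked` is arrow-buffer's API, and std's `Vec` has no unchecked push. The hypothetical sketch below (not arrow-rs code; `not_words` is a made-up name) shows the same pattern the review is discussing: reserve capacity up front, then push the bulk words and the remainder word with a `debug_assert!` confirming no reallocation can occur.

```rust
// Hypothetical stand-in for the reserve-then-push pattern discussed above.
fn not_words(words: &[u64], remainder: Option<u64>) -> Vec<u64> {
    let cap = words.len() + remainder.is_some() as usize;
    let mut out: Vec<u64> = Vec::with_capacity(cap);
    for &w in words {
        out.push(!w); // never reallocates: capacity was reserved above
    }
    if let Some(r) = remainder {
        // mirrors the capacity check guarding push_unchecked in the PR
        debug_assert!(out.capacity() >= out.len() + 1);
        out.push(!r);
    }
    out
}
```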
```rust
/// Like [`Self::from_bitwise_unary_op`] but optimized for the case where the
/// input is aligned to byte boundaries
fn try_from_aligned_bitwise_unary_op<F>(
```
BTW I wrote a version of this code that handles byte-aligned input, but it actually seems to have made performance worse, so I am going to update the comments and leave it this way.
**What I tried**

```rust
/// Like [`Self::from_bitwise_unary_op`] but optimized for the case where the
/// input is aligned to byte boundaries
fn try_from_aligned_bitwise_unary_op<F>(
    left: &[u8],
    len_in_bits: usize,
    op: &mut F,
) -> Option<Self>
where
    F: FnMut(u64) -> u64,
{
    // safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    // if there is no prefix or suffix, the buffer is aligned and we can do
    // the operation directly on u64s
    if left_prefix.is_empty() && left_suffix.is_empty() {
        let result_u64s: Vec<u64> = left_u64s.iter().map(|l| op(*l)).collect();
        let buffer = Buffer::from(result_u64s);
        return Some(BooleanBuffer::new(buffer, 0, len_in_bits));
    }

    let mut result = MutableBuffer::with_capacity(
        left_prefix.len() + left_u64s.len() * 8 + left_suffix.len(),
    );
    let prefix_u64 = op(Self::byte_slice_to_u64(left_prefix));
    result.extend_from_slice(&prefix_u64.to_be_bytes()[0..left_prefix.len()]);

    assert!(result.capacity() >= result.len() + left_u64s.len() * 8);
    for &left in left_u64s.iter() {
        // SAFETY: we asserted there is enough capacity above
        unsafe {
            result.push_unchecked(op(left));
        }
    }

    let suffix_u64 = op(Self::byte_slice_to_u64(left_suffix));
    result.extend_from_slice(&suffix_u64.to_be_bytes()[0..left_suffix.len()]);

    Some(BooleanBuffer::new(result.into(), 0, len_in_bits))
}
```

```diff
diff --git a/arrow-buffer/src/buffer/boolean.rs b/arrow-buffer/src/buffer/boolean.rs
index 97674c18843..285888b3a7c 100644
--- a/arrow-buffer/src/buffer/boolean.rs
+++ b/arrow-buffer/src/buffer/boolean.rs
@@ -18,10 +18,11 @@
 use crate::bit_chunk_iterator::BitChunks;
 use crate::bit_iterator::{BitIndexIterator, BitIndexU32Iterator, BitIterator, BitSliceIterator};
 use crate::{
-    BooleanBufferBuilder, Buffer, MutableBuffer, bit_util, buffer_bin_and, buffer_bin_or,
-    buffer_bin_xor, buffer_unary_not,
+    BooleanBufferBuilder, Buffer, MutableBuffer, ToByteSlice, bit_util, buffer_bin_and,
+    buffer_bin_or, buffer_bin_xor, buffer_unary_not,
 };
+use crate::bit_util::get_remainder_bits;
 use std::ops::{BitAnd, BitOr, BitXor, Not};

 /// A slice-able [`Buffer`] containing bit-packed booleans
@@ -200,14 +201,37 @@ impl BooleanBuffer {
         // the operation directly on u64s
         if left_prefix.is_empty() && left_suffix.is_empty() {
             let result_u64s: Vec<u64> = left_u64s.iter().map(|l| op(*l)).collect();
-            Some(BooleanBuffer::new(
-                Buffer::from(result_u64s),
-                0,
-                len_in_bits,
-            ))
-        } else {
-            None
+            let buffer = Buffer::from(result_u64s);
+            return Some(BooleanBuffer::new(buffer, 0, len_in_bits));
         }
+
+        let mut result = MutableBuffer::with_capacity(
+            left_prefix.len() + left_u64s.len() * 8 + left_suffix.len(),
+        );
+        let prefix_u64 = op(Self::byte_slice_to_u64(left_prefix));
+
+        result.extend_from_slice(&prefix_u64.to_be_bytes()[0..left_prefix.len()]);
+
+        assert!(result.capacity() >= result.len() + left_u64s.len() * 8);
+        for &left in left_u64s.iter() {
+            // SAFETY: we asserted there is enough capacity above
+            unsafe {
+                result.push_unchecked(op(left));
+            }
+        }
+
+        let suffix_u64 = op(Self::byte_slice_to_u64(left_suffix));
+        result.extend_from_slice(&suffix_u64.to_be_bytes()[0..left_suffix.len()]);
+
+        Some(BooleanBuffer::new(result.into(), 0, len_in_bits))
+    }
+
+    /// convert the bytes into a u64 suitable for opeartion
+    fn byte_slice_to_u64(src: &[u8]) -> u64 {
+        let num_bytes = src.len();
+        let mut bytes = [0u8; 8];
+        bytes[0..num_bytes].copy_from_slice(src);
+        u64::from_be_bytes(bytes)
+    }

     /// Returns the number of set bits in this buffer
diff --git a/arrow-buffer/src/util/bit_util.rs b/arrow-buffer/src/util/bit_util.rs
```

This PR:

```
not_slice_24            time:   [81.729 ns 82.091 ns 82.587 ns]
```

When I tried fancier code for byte alignment:

```
not_slice_24            time:   [121.13 ns 122.69 ns 124.52 ns]
```
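The attempted fast path hinges on `slice::align_to::<u64>()`, which splits a byte slice into an unaligned prefix, a `u64`-aligned middle, and an unaligned suffix; handling the prefix and suffix is exactly the extra work that appears to have outweighed the benefit at these input sizes. A minimal, self-contained illustration of the split (not arrow-rs code; `split_u64_aligned` is a made-up name):

```rust
// Split a byte slice the way the attempted fast path does. The prefix and
// suffix lengths depend on the runtime address of the allocation, but the
// three pieces always account for every input byte.
fn split_u64_aligned(bytes: &[u8]) -> (usize, usize, usize) {
    // SAFETY: any initialized bytes are valid when reinterpreted as u64s
    let (prefix, mid, suffix) = unsafe { bytes.align_to::<u64>() };
    (prefix.len(), mid.len(), suffix.len())
}
```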
# Which issue does this PR close?

- Part of #8806
- Part of #8996

# Rationale for this change

As part of #8996 I would like to add special-case code for byte-aligned boolean buffers, and to do so I would like to have benchmarks that cover this.

# What changes are included in this PR?

1. Add a benchmark for an offset of 24 bits (in addition to 1)

# Are these changes tested?

I ran it manually.

# Are there any user-facing changes?

No
not kernel / BooleanBuffer::from_bitwise_unarynot kernel by 50% / BooleanBuffer::from_bitwise_unary
not kernel by 50% / BooleanBuffer::from_bitwise_unarynot kernel by 50%, add BooleanBuffer::from_bitwise_unary
run benchmark boolean_kernels

🤖: Benchmark completed
Co-authored-by: Martin Hilton <[email protected]>
Dandandan left a comment:
Very nice!
# Which issue does this PR close?

- Related to #8806
- Related to #8996

# Rationale for this change

When working on improving the boolean kernels, I have seen significant and unexplained noise from run to run. For example, just adding a fast path for `u64`-aligned data resulted in a reported 30% regression in the speed of `slice24` (code that is not affected by the change at all). For example, from #9022:

```
and             1.00   208.0±5.91ns   ? ?/sec   1.34   278.8±10.07ns  ? ?/sec
and_sliced_1    1.00   1100.2±6.53ns  ? ?/sec   1.12   1226.9±6.11ns  ? ?/sec
and_sliced_24   1.40   340.9±2.49ns   ? ?/sec   1.00   243.7±2.13ns   ? ?/sec
```

I also can't reproduce this effect locally or when I run the benchmarks individually. Given the above, and the tiny amount of time spent in the benchmark (hundreds of nanoseconds), I believe what is happening is that by changing the allocation pattern during the benchmark runs (each kernel allocates output), data for subsequent iterations is allocated subtly differently (e.g. the exact alignment or some other factor differs). This results in different performance characteristics even when the code has not changed.

# What changes are included in this PR?

To reduce this noise, change the benchmarks to pre-allocate the input.

# Are these changes tested?

I ran them manually.

# Are there any user-facing changes?

No, this is just a benchmark change.
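A minimal sketch of the pre-allocation idea using only the standard library (the actual benchmarks use criterion, and `not_bytes` here is a hypothetical stand-in for a kernel that allocates its own output):

```rust
use std::time::Instant;

// Hypothetical kernel: allocates a fresh output buffer each call, like the
// boolean kernels do.
fn not_bytes(input: &[u8]) -> Vec<u8> {
    input.iter().map(|b| !b).collect()
}

fn main() {
    // Allocate the input ONCE, outside the timed loop, so the input's address
    // and alignment stay fixed across iterations and across benchmark runs.
    let input: Vec<u8> = vec![0b1010_1010; 1 << 16];

    let start = Instant::now();
    let mut last = Vec::new();
    for _ in 0..1_000 {
        last = not_bytes(&input); // only the kernel's own allocation varies
    }
    println!("1000 iters in {:?}, first byte = {:#04x}", start.elapsed(), last[0]);
}
```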
This PR introduced a very subtle bug, see:
# Which issue does this PR close?

- Closes #9085

# Rationale for this change

Fix a regression introduced in #8996.

# What changes are included in this PR?

1. Add test coverage for the nullif kernel
2. Undeprecate `bitwise_unary_op_helper`
3. Document subtle differences
4. Restore the nullif kernel from #8996

# Are these changes tested?

Yes

# Are there any user-facing changes?

Fix a (not yet released) bug.
# Which issue does this PR close?

- `Buffer::from_bitwise_unary` and `Buffer::from_bitwise_binary` me… #8854

# Rationale for this change
The current implementation of the unary `not` kernel performs an unnecessary extra allocation when operating on sliced data.

Also, we can generate more optimal code by processing u64 words at a time when the buffer is already u64-aligned (see #8807).

Also, it is hard to find the code to create new Buffers by copying bits.
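To illustrate the word-at-a-time idea from the rationale, the sketch below applies a unary op to bit-packed bytes in `u64` chunks, handling the trailing bytes via a zero-padded remainder word. It is a self-contained approximation, not the actual `BooleanBuffer::from_bitwise_unary` implementation (no bit-offset handling, and the name `bitwise_unary` is made up here):

```rust
// Apply `op` to bit-packed data one u64 word at a time. Bits past
// `len_in_bits` in the last byte are left unspecified, as is conventional
// for bit-packed buffers.
fn bitwise_unary<F: FnMut(u64) -> u64>(src: &[u8], len_in_bits: usize, mut op: F) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut chunks = src.chunks_exact(8);
    for chunk in &mut chunks {
        // full 8-byte chunk: process as one u64 word
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        out.extend_from_slice(&op(word).to_le_bytes());
    }
    let rem = chunks.remainder();
    if !rem.is_empty() {
        // zero-pad the trailing bytes into one last word
        let mut buf = [0u8; 8];
        buf[..rem.len()].copy_from_slice(rem);
        let word = op(u64::from_le_bytes(buf));
        out.extend_from_slice(&word.to_le_bytes()[..rem.len()]);
    }
    out.truncate((len_in_bits + 7) / 8); // keep only the bytes covering len_in_bits
    out
}
```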
# What changes are included in this PR?

- Add `BooleanBuffer::from_bitwise_unary` and `BooleanBuffer::from_bits`
- Deprecate `bitwise_unary_op_helper`

# Are these changes tested?
Yes, with new tests and benchmarks.
# Are there any user-facing changes?

Yes, new public API.