
Conversation

@alamb (Contributor) commented Dec 15, 2025

Which issue does this PR close?

Rationale for this change

The current implementation of the unary `not` kernel performs an unnecessary extra allocation when operating on sliced data.

Also, we can generate better code by processing u64 words at a time when the buffer is already u64-aligned (see #8807).

Also, it is hard to find the code that creates new `Buffer`s by copying bits.

What changes are included in this PR?

  1. Introduce `BooleanBuffer::from_bitwise_unary` and `BooleanBuffer::from_bits` (see the sketch below)
  2. Deprecate `bitwise_unary_op_helper`
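For illustration, a minimal self-contained sketch of the word-at-a-time technique behind these APIs (illustrative only; not the actual arrow-buffer implementation or its exact signatures):

```rust
/// Apply a bitwise NOT to `len_in_bits` packed bits, processing whole u64
/// words where possible. Bits past `len_in_bits` in the last byte are
/// don't-care padding.
fn bitwise_unary_not(src: &[u8], len_in_bits: usize) -> Vec<u8> {
    assert!(len_in_bits <= src.len() * 8);
    let len_in_bytes = len_in_bits.div_ceil(8);
    let mut out = Vec::with_capacity(len_in_bytes);
    let mut chunks = src[..len_in_bytes].chunks_exact(8);
    for chunk in &mut chunks {
        // Full 8-byte word: operate on it as a single u64
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        out.extend_from_slice(&(!word).to_le_bytes());
    }
    // Partial trailing word: zero-pad to 8 bytes, operate, keep only the
    // valid bytes
    let rem = chunks.remainder();
    if !rem.is_empty() {
        let mut bytes = [0u8; 8];
        bytes[..rem.len()].copy_from_slice(rem);
        let word = !u64::from_le_bytes(bytes);
        out.extend_from_slice(&word.to_le_bytes()[..rem.len()]);
    }
    out
}

fn main() {
    // 12 bits packed into 2 bytes
    let flipped = bitwise_unary_not(&[0b1010_1010, 0b0000_1111], 12);
    assert_eq!(flipped, vec![0b0101_0101, 0b1111_0000]);
}
```

The real implementation additionally has to handle inputs whose bit offset is not byte-aligned, which is what the `BitChunks`-based path in the diff below handles.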

Are these changes tested?

Yes, with new tests and benchmarks.

Are there any user-facing changes?

New API.

@alamb changed the title to "Speed up unary not kernel / BooleanBuffer::from_bitwise_unary" Dec 15, 2025
@github-actions bot added the arrow (Changes to the arrow crate) label Dec 15, 2025

```rust
}
if left_chunks.remainder_len() > 0 {
    debug_assert!(result.capacity() >= result.len() + 8); // should not reallocate
    result.push(op(left_chunks.remainder_bits()));
```
Contributor:

This could use `push_unchecked` as well (for consistency)?

alamb (Contributor Author):

Good idea -- done in 469f2ad


```rust
/// Like [`Self::from_bitwise_unary_op`] but optimized for the case where the
/// input is aligned to byte boundaries
fn try_from_aligned_bitwise_unary_op<F>(
```
alamb (Contributor Author):

BTW I wrote a version of this code that handles byte-aligned input, but it actually seems to have made performance worse, so I am going to update the comments and leave it this way.

What I tried:

```rust
    /// Like [`Self::from_bitwise_unary_op`] but optimized for the case where the
    /// input is aligned to byte boundaries
    fn try_from_aligned_bitwise_unary_op<F>(
        left: &[u8],
        len_in_bits: usize,
        op: &mut F,
    ) -> Option<Self>
    where
        F: FnMut(u64) -> u64,
    {
        // safety: all valid bytes are valid u64s
        let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
        // if there is no prefix or suffix, the buffer is aligned and we can do
        // the operation directly on u64s
        if left_prefix.is_empty() && left_suffix.is_empty() {
            let result_u64s: Vec<u64> = left_u64s.iter().map(|l| op(*l)).collect();
            let buffer = Buffer::from(result_u64s);
            return Some(BooleanBuffer::new(buffer, 0, len_in_bits));
        }

        let mut result = MutableBuffer::with_capacity(
            left_prefix.len() + left_u64s.len() * 8 + left_suffix.len(),
        );
        let prefix_u64 = op(Self::byte_slice_to_u64(left_prefix));

        result.extend_from_slice(&prefix_u64.to_be_bytes()[0..left_prefix.len()]);

        assert!(result.capacity() >= result.len() + left_u64s.len() * 8);
        for &left in left_u64s.iter() {
            // SAFETY: we asserted there is enough capacity above
            unsafe {
                result.push_unchecked(op(left));
            }
        }

        let suffix_u64 = op(Self::byte_slice_to_u64(left_suffix));
        result.extend_from_slice(&suffix_u64.to_be_bytes()[0..left_suffix.len()]);

        Some(BooleanBuffer::new(result.into(), 0, len_in_bits))
    }
```
```diff
diff --git a/arrow-buffer/src/buffer/boolean.rs b/arrow-buffer/src/buffer/boolean.rs
index 97674c18843..285888b3a7c 100644
--- a/arrow-buffer/src/buffer/boolean.rs
+++ b/arrow-buffer/src/buffer/boolean.rs
@@ -18,10 +18,11 @@
 use crate::bit_chunk_iterator::BitChunks;
 use crate::bit_iterator::{BitIndexIterator, BitIndexU32Iterator, BitIterator, BitSliceIterator};
 use crate::{
-    BooleanBufferBuilder, Buffer, MutableBuffer, bit_util, buffer_bin_and, buffer_bin_or,
-    buffer_bin_xor, buffer_unary_not,
+    BooleanBufferBuilder, Buffer, MutableBuffer, ToByteSlice, bit_util, buffer_bin_and,
+    buffer_bin_or, buffer_bin_xor, buffer_unary_not,
 };

+use crate::bit_util::get_remainder_bits;
 use std::ops::{BitAnd, BitOr, BitXor, Not};

 /// A slice-able [`Buffer`] containing bit-packed booleans
@@ -200,14 +201,37 @@ impl BooleanBuffer {
         // the operation directly on u64s
         if left_prefix.is_empty() && left_suffix.is_empty() {
             let result_u64s: Vec<u64> = left_u64s.iter().map(|l| op(*l)).collect();
-            Some(BooleanBuffer::new(
-                Buffer::from(result_u64s),
-                0,
-                len_in_bits,
-            ))
-        } else {
-            None
+            let buffer = Buffer::from(result_u64s);
+            return Some(BooleanBuffer::new(buffer, 0, len_in_bits));
         }
+
+        let mut result = MutableBuffer::with_capacity(
+            left_prefix.len() + left_u64s.len() * 8 + left_suffix.len(),
+        );
+        let prefix_u64 = op(Self::byte_slice_to_u64(left_prefix));
+
+        result.extend_from_slice(&prefix_u64.to_be_bytes()[0..left_prefix.len()]);
+
+        assert!(result.capacity() >= result.len() + left_u64s.len() * 8);
+        for &left in left_u64s.iter() {
+            // SAFETY: we asserted there is enough capacity above
+            unsafe {
+                result.push_unchecked(op(left));
+            }
+        }
+
+        let suffix_u64 = op(Self::byte_slice_to_u64(left_suffix));
+        result.extend_from_slice(&suffix_u64.to_be_bytes()[0..left_suffix.len()]);
+
+        Some(BooleanBuffer::new(result.into(), 0, len_in_bits))
+    }
+
+    /// convert the bytes into a u64 suitable for operation
+    fn byte_slice_to_u64(src: &[u8]) -> u64 {
+        let num_bytes = src.len();
+        let mut bytes = [0u8; 8];
+        bytes[0..num_bytes].copy_from_slice(src);
+        u64::from_be_bytes(bytes)
     }

     /// Returns the number of set bits in this buffer
diff --git a/arrow-buffer/src/util/bit_util.rs b/arrow-buffer/src/util/bit_util.rs
```
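A subtlety in the `byte_slice_to_u64` helper above: the partial prefix/suffix bytes are zero-padded into a `u64`, and the big-endian round-trip writes back only the original byte positions, so the padding never leaks into the output. A small standalone check of that property (illustrative; not code from the PR):

```rust
fn main() {
    let src = [0xAAu8, 0xBB, 0xCC]; // a 3-byte prefix
    let mut bytes = [0u8; 8];
    bytes[..src.len()].copy_from_slice(&src);
    // The zero padding flips to 0xFF under NOT...
    let word = !u64::from_be_bytes(bytes);
    // ...but only the first 3 bytes are ever copied back out
    let out = &word.to_be_bytes()[..src.len()];
    assert_eq!(out, &[0x55u8, 0x44, 0x33]); // == [!0xAA, !0xBB, !0xCC]
}
```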

This PR:

```
not_slice_24            time:   [81.729 ns 82.091 ns 82.587 ns]
```

When I tried fancier code for byte alignment:

```
not_slice_24            time:   [121.13 ns 122.69 ns 124.52 ns]
```

alamb added a commit that referenced this pull request Dec 15, 2025
# Which issue does this PR close?

- Part of #8806
- Part of #8996

# Rationale for this change

As part of #8996 I would like to add special-case code for byte-aligned boolean buffers, and to do so I would like to have benchmarks that cover this.

# What changes are included in this PR?

1. Add benchmark for offset of 24 bits (in addition to 1)

# Are these changes tested?
I ran it manually

# Are there any user-facing changes?

No
@alamb changed the title from "Speed up unary not kernel / BooleanBuffer::from_bitwise_unary" to "Speed up unary not kernel by 50% / BooleanBuffer::from_bitwise_unary" Dec 15, 2025
@alamb changed the title from "Speed up unary not kernel by 50% / BooleanBuffer::from_bitwise_unary" to "Speed up unary not kernel by 50%, add BooleanBuffer::from_bitwise_unary" Dec 15, 2025
@alamb (Contributor Author) commented Dec 15, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running

```
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/unary_op (1bd1321) to 20cd096 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_unary_op
```

Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

```
group            alamb_unary_op                         main
-----            --------------                         ----
and              1.00    273.4±2.54ns        ? ?/sec    1.01    274.9±8.53ns        ? ?/sec
and_sliced_1     1.00   1229.7±9.23ns        ? ?/sec    1.00   1227.9±4.24ns        ? ?/sec
and_sliced_24    1.00    241.7±1.63ns        ? ?/sec    1.01    243.7±7.94ns        ? ?/sec
not              1.00    145.1±4.05ns        ? ?/sec    1.50    217.7±1.84ns        ? ?/sec
not_slice_24     1.00    192.5±1.60ns        ? ?/sec    1.16    224.1±2.67ns        ? ?/sec
not_sliced_1     1.00    618.5±3.31ns        ? ?/sec    1.14   704.6±15.29ns        ? ?/sec
or               1.00    251.0±2.93ns        ? ?/sec    1.00    250.3±2.81ns        ? ?/sec
or_sliced_1      1.00  1095.1±11.06ns        ? ?/sec    1.00   1097.0±5.30ns        ? ?/sec
or_sliced_24     1.00    241.4±1.85ns        ? ?/sec    1.00    241.6±3.08ns        ? ?/sec
```

@alamb alamb marked this pull request as ready for review December 15, 2025 17:10
@Dandandan (Contributor) left a comment:

Very nice!

@Dandandan merged commit 6b290d1 into apache:main Dec 17, 2025
26 checks passed
Dandandan pushed a commit that referenced this pull request Dec 25, 2025
# Which issue does this PR close?

- Related to #8806
- Related to #8996

# Rationale for this change

When working on improving the boolean kernels, I have seen significant and unexplained noise from run to run. For example, just adding a fast path for `u64`-aligned data resulted in a reported 30% regression in the speed of slice24 (code that is not affected by the change at all).

From #9022, for instance:

```
and              1.00    208.0±5.91ns        ? ?/sec    1.34   278.8±10.07ns        ? ?/sec
and_sliced_1     1.00   1100.2±6.53ns        ? ?/sec    1.12   1226.9±6.11ns        ? ?/sec
and_sliced_24    1.40    340.9±2.49ns        ? ?/sec    1.00    243.7±2.13ns        ? ?/sec
```

I also can't reproduce this effect locally or when I run the benchmarks individually.

Given the above, and the tiny amount of time spent in the benchmark (hundreds of nanoseconds), I believe what is happening is that as the allocation pattern changes during the benchmark runs (each kernel allocates output), data for subsequent iterations ends up allocated subtly differently (e.g. the exact alignment or some other factor is different). This results in different performance characteristics even when the code has not changed.

# What changes are included in this PR?

To reduce this noise, I want to change the benchmarks to pre-allocate
the input.

# Are these changes tested?

I ran them manually

# Are there any user-facing changes?

No, this is just a benchmark change
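For context, a minimal sketch of the pre-allocation pattern that commit describes, assuming a Criterion benchmark (the names here are hypothetical, not the PR's actual benchmark code):

```rust
use arrow::array::BooleanArray;
use arrow::compute::not;
use criterion::{Criterion, criterion_group, criterion_main};

fn bench_not(c: &mut Criterion) {
    // Build the input once, outside the measured loop, so allocator state
    // does not drift between iterations; only the kernel runs inside iter()
    let input: BooleanArray = (0..4096).map(|i| Some(i % 3 == 0)).collect();
    c.bench_function("not", |b| b.iter(|| not(&input).unwrap()));
}

criterion_group!(benches, bench_not);
criterion_main!(benches);
```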
@alamb (Contributor Author) commented Jan 2, 2026

This PR introduced a very subtle bug; see #9085 and the fix commit below.

alamb added a commit that referenced this pull request Jan 5, 2026
# Which issue does this PR close?

- Closes #9085

# Rationale for this change

Fix a regression introduced in
#8996

# What changes are included in this PR?

1. Add test coverage for the `nullif` kernel
2. Undeprecate `bitwise_unary_op_helper`
3. Document subtle differences
4. Restore the `nullif` kernel from #8996

# Are these changes tested?
Yes

# Are there any user-facing changes?

Fixes a (not yet released) bug.

Labels: arrow (Changes to the arrow crate), performance