-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: optimize bit_length for string arrays #19598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: optimize bit_length for string arrays #19598
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the bit_length function for string arrays by adding specialized implementations for StringArray and LargeStringArray that avoid the overhead of the generic Arrow kernel. The optimization aims to improve performance when executing bit_length via Apache Comet.
Key Changes:
- Added specialized bit_length implementations for StringArray and LargeStringArray using manual iteration
- Preserved fallback to Arrow's bit_length kernel for other array types (Utf8View, Dictionary, Binary)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } else if let Some(arr) = v.as_any().downcast_ref::<LargeStringArray>() { | ||
| let mut builder = Int32Builder::with_capacity(arr.len()); | ||
| for i in 0..arr.len() { | ||
| if arr.is_null(i) { | ||
| builder.append_null(); | ||
| } else { | ||
| let byte_len = arr.value(i).as_bytes().len(); | ||
| builder.append_value((byte_len * 8) as i32); | ||
| } | ||
| } | ||
| Ok(ColumnarValue::Array(Arc::new(builder.finish()))) |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type for LargeStringArray should be Int64, not Int32. The return_type method returns Int64 for LargeUtf8 (via utf8_to_int_type), and the scalar handling for LargeUtf8 also returns Int64 (line 130-132). This implementation should use Int64Builder to maintain consistency with the declared return type and scalar handling.
| if let Some(arr) = v.as_any().downcast_ref::<StringArray>() { | ||
| let mut builder = Int32Builder::with_capacity(arr.len()); | ||
| for i in 0..arr.len() { | ||
| if arr.is_null(i) { | ||
| builder.append_null(); | ||
| } else { | ||
| let byte_len = arr.value(i).as_bytes().len(); | ||
| builder.append_value((byte_len * 8) as i32); | ||
| } | ||
| } | ||
| Ok(ColumnarValue::Array(Arc::new(builder.finish()))) | ||
| } else if let Some(arr) = v.as_any().downcast_ref::<LargeStringArray>() { | ||
| let mut builder = Int32Builder::with_capacity(arr.len()); | ||
| for i in 0..arr.len() { | ||
| if arr.is_null(i) { | ||
| builder.append_null(); | ||
| } else { | ||
| let byte_len = arr.value(i).as_bytes().len(); | ||
| builder.append_value((byte_len * 8) as i32); | ||
| } | ||
| } | ||
| Ok(ColumnarValue::Array(Arc::new(builder.finish()))) |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is significant code duplication between the StringArray and LargeStringArray implementations. Consider extracting the common logic into a generic helper function or using a macro to reduce duplication and improve maintainability. The only differences are the array types being downcast and the builder types used.
| let byte_len = arr.value(i).as_bytes().len(); | ||
| builder.append_value((byte_len * 8) as i32); |
Copilot
AI
Jan 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cast (byte_len * 8) as i32 could overflow if the string is longer than i32::MAX / 8 bytes (approximately 268 MB). Consider adding overflow checking or using checked_mul and returning an error for excessively large strings to prevent silent overflow.
I'm a bit curious about this claim, would be interested to see some benchmarks here 🤔 |
sure, i will add benchmarks for this also |
| Array, StringArray, LargeStringArray, Int32Builder, | ||
| }; | ||
| use std::sync::Arc; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if arr.is_null(i) { | ||
| builder.append_null(); | ||
| } else { | ||
| let byte_len = arr.value(i).as_bytes().len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is .as_bytes() needed here ? &str::len() returns the same
Which issue does this PR close?
This PR contributes to the performance epic tracking expressions that are
slower with Comet enabled:
Rationale for this change
Benchmarks in the Comet performance epic show that the
bit_lengthstringexpression is slower when executed via Comet. The existing implementation
relies on the generic Arrow
bit_lengthkernel, which introduces additionaloverhead for common string array types.
This change specializes the implementation for
StringArrayandLargeStringArrayto reduce per-row overhead while preserving existingbehavior for all other array types.
What changes are included in this PR?
bit_lengthforStringArrayandLargeStringArray(e.g. Utf8View, Dictionary, Binary)
Are these changes tested?
Yes. Existing unit tests and SQL logic tests already cover
bit_lengthbehavior across string types, including UTF-8, LargeUtf8, Utf8View, and
dictionary-encoded strings. All existing tests pass without modification.
Are there any user-facing changes?
No. This change is an internal performance optimization and does not affect
user-facing behavior or semantics.