Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 39 additions & 7 deletions datafusion/functions/src/string/bit_length.rs
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,11 @@ use datafusion_expr::{
TypeSignatureClass, Volatility,
};
use datafusion_macros::user_doc;
use arrow::array::{
Array, StringArray, LargeStringArray, Int32Builder,
};
use std::sync::Arc;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change


#[user_doc(
doc_section(label = "String Functions"),
Expand Down Expand Up @@ -90,17 +95,44 @@ impl ScalarUDFImpl for BitLengthFunc {
let [array] = take_function_args(self.name(), &args.args)?;

match array {
ColumnarValue::Array(v) => Ok(ColumnarValue::Array(bit_length(v.as_ref())?)),
ColumnarValue::Array(v) => {
if let Some(arr) = v.as_any().downcast_ref::<StringArray>() {
let mut builder = Int32Builder::with_capacity(arr.len());
for i in 0..arr.len() {
if arr.is_null(i) {
builder.append_null();
} else {
let byte_len = arr.value(i).as_bytes().len();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is .as_bytes() needed here ? &str::len() returns the same

builder.append_value((byte_len * 8) as i32);
Comment on lines +105 to +106
Copy link

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast (byte_len * 8) as i32 could overflow if the string is longer than i32::MAX / 8 bytes (approximately 268 MB). Consider adding overflow checking or using checked_mul and returning an error for excessively large strings to prevent silent overflow.

Copilot uses AI. Check for mistakes.
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
} else if let Some(arr) = v.as_any().downcast_ref::<LargeStringArray>() {
let mut builder = Int32Builder::with_capacity(arr.len());
for i in 0..arr.len() {
if arr.is_null(i) {
builder.append_null();
} else {
let byte_len = arr.value(i).as_bytes().len();
builder.append_value((byte_len * 8) as i32);
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
Comment on lines +110 to +120
Copy link

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type for LargeStringArray should be Int64, not Int32. The return_type method returns Int64 for LargeUtf8 (via utf8_to_int_type), and the scalar handling for LargeUtf8 also returns Int64 (line 130-132). This implementation should use Int64Builder to maintain consistency with the declared return type and scalar handling.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +120
Copy link

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is significant code duplication between the StringArray and LargeStringArray implementations. Consider extracting the common logic into a generic helper function or using a macro to reduce duplication and improve maintainability. The only differences are the array types being downcast and the builder types used.

Copilot uses AI. Check for mistakes.
} else {
// fallback for Utf8View, Dictionary, Binary, etc.
Ok(ColumnarValue::Array(bit_length(v.as_ref())?))
}
}
ColumnarValue::Scalar(v) => match v {
ScalarValue::Utf8(v) => Ok(ColumnarValue::Scalar(ScalarValue::Int32(
v.as_ref().map(|x| (x.len() * 8) as i32),
))),
ScalarValue::LargeUtf8(v) => Ok(ColumnarValue::Scalar(
ScalarValue::Int64(v.as_ref().map(|x| (x.len() * 8) as i64)),
)),
ScalarValue::Utf8View(v) => Ok(ColumnarValue::Scalar(
ScalarValue::Int32(v.as_ref().map(|x| (x.len() * 8) as i32)),
)),
ScalarValue::LargeUtf8(v) => Ok(ColumnarValue::Scalar(ScalarValue::Int64(
v.as_ref().map(|x| (x.len() * 8) as i64),
))),
ScalarValue::Utf8View(v) => Ok(ColumnarValue::Scalar(ScalarValue::Int32(
v.as_ref().map(|x| (x.len() * 8) as i32),
))),
_ => unreachable!("bit length"),
},
}
Expand Down