feat: add rounding logic and scale zero fix parse_decimal to match parse_string_to_decimal_native behavior #7179

Draft
wants to merge 5 commits into
base: main
53 changes: 53 additions & 0 deletions arrow-cast/src/cast/decimal.rs
@@ -590,6 +590,7 @@ where
#[cfg(test)]
mod tests {
use super::*;
use crate::parse::parse_decimal;

#[test]
fn test_parse_string_to_decimal_native() -> Result<(), ArrowError> {
@@ -598,7 +599,20 @@ mod tests {
0_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("0", 38, 0)?,
parse_string_to_decimal_native::<Decimal128Type>("0", 0)?,
Contributor

Having the same behavior in these two functions seems like a reasonable change to me.

Contributor Author
@himadripal, Mar 18, 2025

Once we can switch casting over to parse_decimal and deprecate parse_string_to_decimal_native, these tests will be changed to assert against the literal value that currently sits in the message argument of the assert.
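The two test shapes described above can be sketched as follows. This is only an illustration: `new_parser` and `old_parser` are hypothetical stand-ins for parse_decimal and parse_string_to_decimal_native, stubbed to return a fixed value so the sketch compiles on its own. The point it shows is that arguments after the second one in `assert_eq!` are only formatted into the failure message and are not compared.

```rust
// Hypothetical stand-ins for parse_decimal and parse_string_to_decimal_native.
// They are stubbed with a constant; the real functions live in arrow-cast.
fn new_parser(_s: &str) -> i128 {
    12_300_000
}

fn old_parser(_s: &str) -> i128 {
    12_300_000
}

fn parity_then_literal() {
    // Current shape: the two parser results are compared with each other.
    // The trailing literal is only printed if the assertion fails.
    assert_eq!(
        new_parser("123"),
        old_parser("123"),
        "value is {}",
        12_300_000_i128
    );

    // Planned shape once the old function is deprecated: compare the new
    // parser directly against the literal.
    assert_eq!(new_parser("123"), 12_300_000_i128);
}
```

In the current shape a wrong literal in the message would go unnoticed as long as both parsers agree, which is why moving the literal into the comparison is the stronger check.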

"value is {}",
0_i128
);

assert_eq!(
parse_string_to_decimal_native::<Decimal128Type>("0", 5)?,
0_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("0", 38, 5)?,
parse_string_to_decimal_native::<Decimal128Type>("0", 5)?,
"value is {}",
0_i128
);

@@ -607,7 +621,20 @@ mod tests {
123_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("123", 38, 0)?,
parse_string_to_decimal_native::<Decimal128Type>("123", 0)?,
"value is {}",
123_i128
);

assert_eq!(
parse_string_to_decimal_native::<Decimal128Type>("123", 5)?,
12300000_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("123", 38, 5)?,
parse_string_to_decimal_native::<Decimal128Type>("123", 5)?,
"value is {}",
12300000_i128
);

@@ -616,7 +643,20 @@ mod tests {
123_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("123.45", 38, 0)?,
parse_string_to_decimal_native::<Decimal128Type>("123.45", 0)?,
"value is {}",
123_i128
);

assert_eq!(
parse_string_to_decimal_native::<Decimal128Type>("123.45", 5)?,
12345000_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("123.45", 38, 5)?,
parse_string_to_decimal_native::<Decimal128Type>("123.45", 5)?,
"value is {}",
12345000_i128
);

@@ -625,7 +665,20 @@ mod tests {
123_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("123.4567891", 38, 0)?,
parse_string_to_decimal_native::<Decimal128Type>("123.4567891", 0)?,
"value is {}",
123_i128
);

assert_eq!(
parse_string_to_decimal_native::<Decimal128Type>("123.4567891", 5)?,
12345679_i128
);
assert_eq!(
parse_decimal::<Decimal128Type>("123.4567891", 38, 5)?,
parse_string_to_decimal_native::<Decimal128Type>("123.4567891", 5)?,
"value is {}",
12345679_i128
);
Ok(())
124 changes: 111 additions & 13 deletions arrow-cast/src/parse.rs
@@ -850,7 +850,18 @@ fn parse_e_notation<T: DecimalType>(
}

if exp < 0 {
result = result.div_wrapping(base.pow_wrapping(-exp as _));
let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _));
Contributor

Does this change the behavior of parsing e-notation? If so, I didn't see any tests for it.

Contributor Author
@himadripal, Mar 18, 2025

Yes. I missed porting the tests while splitting the large PR. It now rounds, whereas the current behavior is to truncate. I'll add the tests.

let result_with_one_scale_up =
result.div_wrapping(base.pow_wrapping(-exp.add_wrapping(1) as _));
Contributor

I assume this logic is correct, but I'd like to understand it. For example, for 12345e-5, would exp be -5? Why does this add 1?

Contributor Author
@himadripal, Apr 1, 2025

exp in the parse_e_notation method is overridden a couple of times, depending on which direction the decimal point needs to shift and whether the original string has fractional digits (e.g. 1.23e-2 has 2 fractional digits).

Before this check, exp represents the number of digits to be removed (negative) or added (positive). In this case, exp = -3, so:

result_with_scale = 12
result_with_one_scale_up = 123

To round up or down, we need the digit immediately after the last digit kept in the result, in this case 3. We get it as:

rounding_digit = result_with_one_scale_up - result_with_scale * 10
rounding_digit = 123 - 12 * 10 = 3

If rounding_digit >= 5, we add 1 to the result; otherwise the result stays as it is.

[Screenshot: Image 3-31-25 at 5 30 PM]
I added a debugging screenshot to help explain this.
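The derivation above can be sketched on plain i128 values. This is a minimal illustration rather than the PR's code: `round_drop_digits` is a hypothetical helper, and it assumes a non-negative value with at least one digit to drop. The local names mirror result_with_scale and result_with_one_scale_up from the diff.

```rust
/// Hypothetical helper (not part of arrow-cast): drop `drop` trailing decimal
/// digits from a non-negative `value`, rounding half up.
fn round_drop_digits(value: i128, drop: u32) -> i128 {
    assert!(value >= 0 && drop >= 1);
    // Corresponds to result_with_scale: all unwanted digits removed.
    let truncated = value / 10_i128.pow(drop);
    // Corresponds to result_with_one_scale_up: one extra digit kept.
    let one_more = value / 10_i128.pow(drop - 1);
    // The extra digit is the first digit that was dropped.
    let rounding_digit = one_more - truncated * 10;
    if rounding_digit >= 5 {
        truncated + 1
    } else {
        truncated
    }
}
```

For 12345 with three digits dropped (the exp = -3 case), the intermediate values are 12 and 123, the rounding digit is 3, and the result is 12. For 1265 with one digit dropped, the rounding digit is 5 and the result is 127.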

let rounding_digit =
result_with_one_scale_up.sub_wrapping(result_with_scale.mul_wrapping(base));
// The rounding digit is the first digit dropped past the target scale; a value >= 5 rounds the result up.
// E.g. with scale 1, 247e-2 has rounding digit 7 and becomes 2.5, whereas 244e-2 has rounding digit 4 and stays 2.4.
if rounding_digit >= T::Native::usize_as(5) {
result = result_with_scale.add_wrapping(T::Native::usize_as(1));
} else {
result = result_with_scale;
}
} else {
result = result.mul_wrapping(base.pow_wrapping(exp as _));
}
@@ -866,8 +877,9 @@ pub fn parse_decimal<T: DecimalType>(
scale: i8,
) -> Result<T::Native, ArrowError> {
let mut result = T::Native::usize_as(0);
let mut fractionals: i8 = 0;
let mut digits: u8 = 0;
let mut fractionals: i16 = 0;
let mut digits: u16 = 0;
let mut rounding_digit = -1; // to store digit after the scale for rounding
let base = T::Native::usize_as(10);

let bs = s.as_bytes();
@@ -897,6 +909,13 @@ pub fn parse_decimal<T: DecimalType>(
// Ignore leading zeros.
continue;
}
if fractionals == scale as i16 && scale != 0 {
// Capture the rounding digit once
if rounding_digit < 0 {
rounding_digit = (b - b'0') as i8;
}
continue;
}
digits += 1;
result = result.mul_wrapping(base);
result = result.add_wrapping(T::Native::usize_as((b - b'0') as usize));
@@ -909,8 +928,8 @@ pub fn parse_decimal<T: DecimalType>(
if *b == b'e' || *b == b'E' {
result = parse_e_notation::<T>(
s,
digits as u16,
fractionals as i16,
digits,
fractionals,
result,
point_index,
precision as u16,
@@ -925,11 +944,17 @@ pub fn parse_decimal<T: DecimalType>(
"can't parse the string value {s} to decimal"
)));
}
if fractionals == scale && scale != 0 {
if fractionals == scale as i16 {
// Capture the rounding digit once
if rounding_digit < 0 {
rounding_digit = (b - b'0') as i8;
}
// We have processed all the digits that we need. All that
// is left is to validate that the rest of the string contains
// valid digits.
continue;
if scale != 0 {
continue;
}
}
fractionals += 1;
digits += 1;
@@ -951,8 +976,8 @@ pub fn parse_decimal<T: DecimalType>(
b'e' | b'E' => {
result = parse_e_notation::<T>(
s,
digits as u16,
fractionals as i16,
digits,
fractionals,
result,
index,
precision as u16,
@@ -972,20 +997,28 @@ pub fn parse_decimal<T: DecimalType>(
}

if !is_e_notation {
if fractionals < scale {
let exp = scale - fractionals;
if exp as u8 + digits > precision {
if fractionals < scale as i16 {
let exp = scale as i16 - fractionals;
if exp + digits as i16 > precision as i16 {
return Err(ArrowError::ParseError(format!(
"parse decimal overflow ({s})"
)));
}
let mul = base.pow_wrapping(exp as _);
result = result.mul_wrapping(mul);
} else if digits > precision {
} else if digits > precision as u16 {
return Err(ArrowError::ParseError(format!(
"parse decimal overflow ({s})"
)));
}
if scale == 0 {
result = result.div_wrapping(base.pow_wrapping(fractionals as u32))
}
// The rounding digit is the first digit past the target scale; a value >= 5 rounds the result up.
// E.g. with scale 1, 2.47 has rounding digit 7 and becomes 2.5, whereas 2.44 has rounding digit 4 and stays 2.4.
if rounding_digit >= 5 {
Contributor

Wondering where >= 5 came from?

Contributor Author
@himadripal, Apr 1, 2025

First we determine the rounding_digit: the digit immediately after the last digit kept in the final result (before any rounding is applied). If rounding_digit is >= 5, we add 1 to round the result up; otherwise it stays the same.

"1265E-4" with scale 3 -> 0.127
At scale 3 the unrounded number is 0.126 and the rounding digit is 5; since the rounding digit is >= 5, the result becomes 0.127.

"1264E-4" with scale 3 -> 0.126
Here rounding_digit is 4, which is less than 5, so there is no need to add 1.

Contributor Author
@himadripal, Apr 1, 2025

>= 5 is used to round half up to the nearest value at the target scale.

With scale 1:
2.47 -> 2.5
2.44 -> 2.4
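The rule described in the two replies above can be sketched as a single pass over the digits. This is a simplified, hypothetical stand-in for the rounding path, not the actual parse_decimal: `parse_scaled_round` is an illustrative name, and it deliberately ignores signs, e-notation, and precision checks.

```rust
/// Hypothetical, simplified sketch: parse an unsigned plain decimal string
/// (digits with an optional '.') into an integer scaled by 10^scale, rounding
/// half up on the first digit past `scale`.
fn parse_scaled_round(s: &str, scale: u32) -> Option<i128> {
    let mut result: i128 = 0;
    let mut fractionals: u32 = 0;
    let mut seen_point = false;
    let mut rounding_digit: Option<u8> = None;

    for b in s.bytes() {
        match b {
            b'.' if !seen_point => seen_point = true,
            b'0'..=b'9' => {
                let d = b - b'0';
                if seen_point && fractionals == scale {
                    // Past the target scale: remember only the first such digit.
                    rounding_digit.get_or_insert(d);
                    continue;
                }
                if seen_point {
                    fractionals += 1;
                }
                result = result.checked_mul(10)?.checked_add(d as i128)?;
            }
            // Anything else (including a second '.') is rejected.
            _ => return None,
        }
    }

    // Pad with zeros when the input has fewer fractional digits than `scale`.
    result = result.checked_mul(10_i128.checked_pow(scale - fractionals)?)?;

    // Round half up based on the first dropped digit.
    if matches!(rounding_digit, Some(d) if d >= 5) {
        result = result.checked_add(1)?;
    }
    Some(result)
}
```

Capturing only the first digit past the scale is enough for round-half-up, because any later digits cannot change whether that digit is at least 5.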

Contributor

Ah ok, makes sense

Contributor

Perhaps we could add a comment in the code to explain this point to future readers who may have the same question.

result = result.add_wrapping(T::Native::usize_as(1));
}
}

Ok(if negative {
@@ -2564,6 +2597,18 @@ mod tests {
assert_eq!(i256::from_i128(i), result_256.unwrap());
}

let tests_with_varying_scale = [
("123.4567891", 12345679_i128, 5),
("123.4567891", 123_i128, 0),
("123.45", 12345000_i128, 5),
("-2.5", -3_i128, 0),
("-2.49", -2_i128, 0),
];
for (str, e, scale) in tests_with_varying_scale {
let result_128_a = parse_decimal::<Decimal128Type>(str, 20, scale);
assert_eq!(result_128_a.unwrap(), e);
}

let e_notation_tests = [
("1.23e3", "1230.0", 2),
("5.6714e+2", "567.14", 4),
@@ -2599,6 +2644,9 @@ mod tests {
("000001.1034567002e0", "000001.1034567002", 3),
("1.234e16", "12340000000000000", 0),
("123.4e16", "1234000000000000000", 0),
("4e+5", "400000", 4),
("4e7", "40000000", 2),
("1265E-4", ".1265", 3),
];
for (e, d, scale) in e_notation_tests {
let result_128_e = parse_decimal::<Decimal128Type>(e, 20, scale);
@@ -2608,6 +2656,7 @@ mod tests {
let result_256_d = parse_decimal::<Decimal256Type>(d, 20, scale);
assert_eq!(result_256_e.unwrap(), result_256_d.unwrap());
}

let can_not_parse_tests = [
"123,123",
".",
@@ -2780,6 +2829,55 @@ mod tests {
}
}

#[test]
fn test_parse_decimal_rounding() {
let test_rounding_for_e_notation_varying_scale = [
("1.2345e4", "12345", 2),
("12345e-5", "0.12", 2),
("12345E-5", "0.123", 3),
("12345e-5", "0.1235", 4),
("1265E-4", ".127", 3),
("12.345e3", "12345.000", 3),
("1.2345e4", "12345", 0),
("1.2345e3", "1235", 0),
("1.23e-3", "0", 0),
("123e-2", "1", 0),
("-1e-15", "-0.0000000000", 10),
("1e-15", "0.0000000000", 10),
("1e15", "1000000000000000", 2),
];
Contributor

Let's also add a couple more tests for negative values:

        // Edge cases
        assert_eq!(round_to_places(-2.5, 0), -3.0); // Negative half rounding up
        assert_eq!(round_to_places(-2.49, 0), -2.0); // Just below rounding negative
        // Whole numbers (should remain unchanged)
        assert_eq!(round_to_places(-100.0, 2), -100.0);

Contributor Author

There are already existing tests for such scenarios. Nevertheless, I'll add these as well.
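The negative cases suggested above follow from rounding the magnitude and applying the sign afterwards, which is the order the diff uses (the rounding step runs before the final `if negative` branch). A minimal sketch on scaled integers, where `round_half_away_from_zero` is a hypothetical helper and not part of arrow-cast:

```rust
/// Hypothetical helper: `value` holds `drop` more decimal digits than wanted.
/// Round the magnitude half up, then re-apply the sign, so that halves move
/// away from zero.
fn round_half_away_from_zero(value: i128, drop: u32) -> i128 {
    let pow = 10_i128.pow(drop);
    let magnitude = value.abs();
    let truncated = magnitude / pow;
    let remainder = magnitude % pow;
    // For drop >= 1, comparing the doubled remainder with `pow` is the same
    // as checking whether the first dropped digit is >= 5.
    let rounded = if remainder * 2 >= pow {
        truncated + 1
    } else {
        truncated
    };
    if value < 0 {
        -rounded
    } else {
        rounded
    }
}
```

With this ordering, -2.5 at scale 0 (the scaled integer -25 with one digit dropped) becomes -3, -2.49 becomes -2, and a whole number such as -100.00 stays -100, matching the expectations in the suggested tests.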


for (e, d, scale) in test_rounding_for_e_notation_varying_scale {
let result_128_e = parse_decimal::<Decimal128Type>(e, 38, scale);
let result_128_d = parse_decimal::<Decimal128Type>(d, 38, scale);
assert_eq!(result_128_e.unwrap(), result_128_d.unwrap());
let result_256_e = parse_decimal::<Decimal256Type>(e, 38, scale);
let result_256_d = parse_decimal::<Decimal256Type>(d, 38, scale);
assert_eq!(result_256_e.unwrap(), result_256_d.unwrap());
}

let edge_tests_256_error = [
(&f64::INFINITY.to_string(), 0),
(&f64::NEG_INFINITY.to_string(), 0),
];
for (s, scale) in edge_tests_256_error {
let result = parse_decimal::<Decimal256Type>(s, 76, scale);
assert_eq!(
format!("Parser error: can't parse the string value {s} to decimal"),
result.unwrap_err().to_string()
);
}

let edge_tests_256_overflow = [(&f64::MIN.to_string(), 0), (&f64::MAX.to_string(), 0)];
for (s, scale) in edge_tests_256_overflow {
let result = parse_decimal::<Decimal256Type>(s, 76, scale);
assert_eq!(
format!("Parser error: parse decimal overflow ({s})"),
result.unwrap_err().to_string()
);
}
}

#[test]
fn test_parse_empty() {
assert_eq!(Int32Type::parse(""), None);
2 changes: 1 addition & 1 deletion arrow-csv/src/reader/mod.rs
@@ -1286,7 +1286,7 @@ mod tests {
assert_eq!("53.002666", lat.value_as_string(1));
assert_eq!("52.412811", lat.value_as_string(2));
assert_eq!("51.481583", lat.value_as_string(3));
assert_eq!("12.123456", lat.value_as_string(4));
assert_eq!("12.123457", lat.value_as_string(4));
Contributor Author
@himadripal, Mar 18, 2025

@alamb, you can see the behavior change in this arrow-csv reader test, which uses parse_decimal.

assert_eq!("50.760000", lat.value_as_string(5));
assert_eq!("0.123000", lat.value_as_string(6));
assert_eq!("123.000000", lat.value_as_string(7));
4 changes: 2 additions & 2 deletions arrow-json/src/reader/mod.rs
@@ -1181,7 +1181,7 @@ mod tests {
assert!(col1.is_null(5));
assert_eq!(
col1.values(),
&[100, 200, 204, 1103420, 0, 0].map(T::Native::usize_as)
&[100, 200, 205, 1103420, 0, 0].map(T::Native::usize_as)
);

let col2 = batches[0].column(1).as_primitive::<T>();
@@ -1201,7 +1201,7 @@ mod tests {
assert!(col3.is_null(5));
assert_eq!(
col3.values(),
&[3830, 12345, 0, 0, 0, 0].map(T::Native::usize_as)
&[3830, 12346, 0, 0, 0, 0].map(T::Native::usize_as)
);
}
