-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: improve performance of spark hex function
#19738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let hex_string = hex_encode(bytes, lowercase); | ||
| Ok(hex_string) | ||
| /// Generic hex encoding for int64 type | ||
| fn hex_encode_int64<I>(iter: I, len: usize) -> Result<ColumnarValue, DataFusionError> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this doesnt need to be generic since its only used for int64 arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both Int64Array and Dictionary<Int64> paths need hex encoding and yield different iterator types. Perhaps using a generic iterator here could help share the same encoding logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I didn't see that dictionary used it like that. However that made me realise this edge case:
#[test]
fn test_dict_values_null() {
let keys = Int32Array::from(vec![Some(0), None, Some(1)]);
let vals = Int64Array::from(vec![Some(32), None]);
// [32, null, null]
let dict = DictionaryArray::new(keys, Arc::new(vals));
let columnar_value = ColumnarValue::Array(Arc::new(dict));
let result = super::spark_hex(&[columnar_value]).unwrap();
let result = match result {
ColumnarValue::Array(array) => array,
_ => panic!("Expected array"),
};
let result = as_string_array(&result);
let expected = StringArray::from(vec![Some("20"), None, None]);
assert_eq!(&expected, result);
}- When both key & value arrays have nulls
On main this succeeds because the 2nd element is null from the key (aka from dictionary) but 3rd element is null from the underlying values array.
On this PR it seems it will ignore any nulls in the values and treat it as non-null.
See doc on dictionaries: https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, I've refactored the dictionary handling to explicitly check the validity bitmap of the values array using is_valid.
And I've also incorporated your test case to ensure this edge case is covered. Thanks for the detailed explanation and the doc link!
| for v in iter { | ||
| if let Some(num) = v { | ||
| buffer.clear(); | ||
| hex_int64(num, &mut buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a little interesting how we have this buffer across iterations here, but then inside hex_int64 we have another buffer, which is used to copy into this outer buffer. Can we unify them somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have refactored the code to eliminate one redundant memory copy by unifying the buffer logic. Benchmarks show a significant performance boost of approximately 14-37%. Thanks for catching that.
group before after
----- ------- -------
hex_int64/size=1024 1.37 10.6±1.49µs ? ?/sec 1.00 7.8±0.32µs ? ?/sec
hex_int64/size=4096 1.19 46.1±1.78µs ? ?/sec 1.00 38.8±1.00µs ? ?/sec
hex_int64/size=8192 1.14 97.9±3.75µs ? ?/sec 1.00 85.8±3.83µs ? ?/sec
hex_int64_dict/size=1024 1.34 11.0±0.54µs ? ?/sec 1.00 8.2±0.29µs ? ?/sec
hex_int64_dict/size=4096 1.24 51.1±6.34µs ? ?/sec 1.00 41.2±2.05µs ? ?/sec
hex_int64_dict/size=8192 1.18 104.2±3.37µs ? ?/sec 1.00 88.4±2.19µs ? ?/sec
| let int_values = as_int64_array(values)?; | ||
| hex_encode_int64( | ||
| keys.iter().map(|k| { | ||
| k.and_then(|idx| { | ||
| if int_values.is_valid(idx as usize) { | ||
| Some(int_values.value(idx as usize)) | ||
| } else { | ||
| None | ||
| } | ||
| }) | ||
| }), | ||
| dict.len(), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More compact way of iterating over dictionaries using a TypedDictionaryArray:
let arr =
dict.downcast_dict::<arrow::array::Int64Array>().unwrap();
hex_encode_int64(arr.into_iter(), dict.len())(Alternatively could look into preserving the dictionary type as an output; so we can apply the hex only on the values and return a dict with the same keys, instead of expanding this dict into a regular string array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I’ve updated the implementation to use TypedDictionaryArray first. Preserving the dictionary output seems to require some refactoring and test updates, so I’ll try that next and follow up with another PR if there’s an improvement. Does that sound good?
|
Thanks @lyne7-sc |
…ansion (#19832) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> Follow up to #19738 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> The current hex implementation expands `DictionaryArray` inputs into a regular array, which causes loss of dictionary encoding and redundant hex computation for repeated values. ## What changes are included in this PR? - Apply hex encoding only to dictionary values - Avoid expanding dictionary arrays during execution ### Benchmark | Size | Before | After | Speedup | | ---- | ------ | ----- | ------- | | 1024 | 8.3 µs | 7.2 µs | 1.15× | | 4096 | 42.9 µs | 34.5 µs | 1.24× | | 8192 | 91.6 µs | 71.7 µs | 1.28× | <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes. Existing unit tests and `sqllogictest` tests pass. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
Rationale for this change
The current
hexfunction implementation usesformat!macro andStringArray::fromiterator pattern, which causes:Stringviaformat!What changes are included in this PR?
This PR optimizes the
hexencoding by:format!("{num:X}")with a fast lookup table approachStringBuilderBenchmark Results
Are these changes tested?
Yes. Existing units and sqllogictest tests pass. New benchmarks added.
Are there any user-facing changes?
No.