perf: improve performance of `spark hex` function #19738

lyne7-sc · 2026-01-11T04:24:14Z

Which issue does this PR close?

part of [EPIC] Optimize performance for slow expressions datafusion-comet#2986

Rationale for this change

The current hex function implementation uses format! macro and StringArray::from iterator pattern, which causes:

Per-element String allocations: Each value allocates a new String via format!
Unnecessary conversions: Multiple intermediate conversions between types
Inefficient dictionary type handling: Collects all values into vectors before building the result

What changes are included in this PR?

This PR optimizes the hex encoding by:

Replacing format!("{num:X}") with a fast lookup table approach
Building results directly using StringBuilder
Reusing a single vec buffer per iteration to avoid re-allocation
Optimizing dictionary array handling by building results iteratively

Benchmark Results

Group	Size	Before	After	Speedup
hex_binary	1024	89.9 µs	51.9 µs	1.73x
hex_binary	4096	385.2 µs	218.7 µs	1.76x
hex_binary	8192	741.6 µs	451.6 µs	1.64x
hex_int64	1024	32.0 µs	12.4 µs	2.57x
hex_int64	4096	132.4 µs	59.7 µs	2.22x
hex_int64	8192	258.5 µs	120.6 µs	2.14x
hex_int64_dict	1024	75.2 µs	12.4 µs	6.04x
hex_int64_dict	4096	313.2 µs	60.5 µs	5.18x
hex_int64_dict	8192	614.7 µs	129.0 µs	4.76x
hex_utf8	1024	88.5 µs	53.5 µs	1.66x
hex_utf8	4096	357.6 µs	211.1 µs	1.69x
hex_utf8	8192	698.7 µs	424.8 µs	1.64x

Are these changes tested?

Yes. Existing units and sqllogictest tests pass. New benchmarks added.

Are there any user-facing changes?

No.

Jefffrey · 2026-01-11T12:14:20Z

datafusion/spark/src/function/math/hex.rs

-    let hex_string = hex_encode(bytes, lowercase);
-    Ok(hex_string)
+/// Generic hex encoding for int64 type
+fn hex_encode_int64<I>(iter: I, len: usize) -> Result<ColumnarValue, DataFusionError>


nit: this doesnt need to be generic since its only used for int64 arrays

Both Int64Array and Dictionary<Int64> paths need hex encoding and yield different iterator types. Perhaps using a generic iterator here could help share the same encoding logic?

Ah I didn't see that dictionary used it like that. However that made me realise this edge case:

#[test] fn test_dict_values_null() { let keys = Int32Array::from(vec![Some(0), None, Some(1)]); let vals = Int64Array::from(vec![Some(32), None]); // [32, null, null] let dict = DictionaryArray::new(keys, Arc::new(vals)); let columnar_value = ColumnarValue::Array(Arc::new(dict)); let result = super::spark_hex(&[columnar_value]).unwrap(); let result = match result { ColumnarValue::Array(array) => array, _ => panic!("Expected array"), }; let result = as_string_array(&result); let expected = StringArray::from(vec![Some("20"), None, None]); assert_eq!(&expected, result); }

When both key & value arrays have nulls

On main this succeeds because the 2nd element is null from the key (aka from dictionary) but 3rd element is null from the underlying values array.

On this PR it seems it will ignore any nulls in the values and treat it as non-null.

See doc on dictionaries: https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout

Understood, I've refactored the dictionary handling to explicitly check the validity bitmap of the values array using is_valid.
And I've also incorporated your test case to ensure this edge case is covered. Thanks for the detailed explanation and the doc link!

Jefffrey · 2026-01-11T12:16:32Z

datafusion/spark/src/function/math/hex.rs

+    for v in iter {
+        if let Some(num) = v {
+            buffer.clear();
+            hex_int64(num, &mut buffer);


I find it a little interesting how we have this buffer across iterations here, but then inside hex_int64 we have another buffer, which is used to copy into this outer buffer. Can we unify them somehow?

Done. I have refactored the code to eliminate one redundant memory copy by unifying the buffer logic. Benchmarks show a significant performance boost of approximately 14-37%. Thanks for catching that.

group before after ----- ------- ------- hex_int64/size=1024 1.37 10.6±1.49µs ? ?/sec 1.00 7.8±0.32µs ? ?/sec hex_int64/size=4096 1.19 46.1±1.78µs ? ?/sec 1.00 38.8±1.00µs ? ?/sec hex_int64/size=8192 1.14 97.9±3.75µs ? ?/sec 1.00 85.8±3.83µs ? ?/sec hex_int64_dict/size=1024 1.34 11.0±0.54µs ? ?/sec 1.00 8.2±0.29µs ? ?/sec hex_int64_dict/size=4096 1.24 51.1±6.34µs ? ?/sec 1.00 41.2±2.05µs ? ?/sec hex_int64_dict/size=8192 1.18 104.2±3.37µs ? ?/sec 1.00 88.4±2.19µs ? ?/sec

datafusion/spark/src/function/math/hex.rs

Jefffrey · 2026-01-13T02:41:32Z

datafusion/spark/src/function/math/hex.rs

+                        let int_values = as_int64_array(values)?;
+                        hex_encode_int64(
+                            keys.iter().map(|k| {
+                                k.and_then(|idx| {
+                                    if int_values.is_valid(idx as usize) {
+                                        Some(int_values.value(idx as usize))
+                                    } else {
+                                        None
+                                    }
+                                })
+                            }),
+                            dict.len(),
+                        )


More compact way of iterating over dictionaries using a TypedDictionaryArray:

let arr = dict.downcast_dict::<arrow::array::Int64Array>().unwrap(); hex_encode_int64(arr.into_iter(), dict.len())

(Alternatively could look into preserving the dictionary type as an output; so we can apply the hex only on the values and return a dict with the same keys, instead of expanding this dict into a regular string array)

Thanks for the suggestion! I’ve updated the implementation to use TypedDictionaryArray first. Preserving the dictionary output seems to require some refactoring and test updates, so I’ll try that next and follow up with another PR if there’s an improvement. Does that sound good?

Jefffrey · 2026-01-14T01:31:48Z

Thanks @lyne7-sc

…ansion (#19832) ## Which issue does this PR close?  Follow up to #19738 ## Rationale for this change  The current hex implementation expands `DictionaryArray` inputs into a regular array, which causes loss of dictionary encoding and redundant hex computation for repeated values. ## What changes are included in this PR? - Apply hex encoding only to dictionary values - Avoid expanding dictionary arrays during execution ### Benchmark | Size | Before | After | Speedup | | ---- | ------ | ----- | ------- | | 1024 | 8.3 µs | 7.2 µs | 1.15× | | 4096 | 42.9 µs | 34.5 µs | 1.24× | | 8192 | 91.6 µs | 71.7 µs | 1.28× |  ## Are these changes tested? Yes. Existing unit tests and `sqllogictest` tests pass.  ## Are there any user-facing changes? No.

lyne7-sc and others added 9 commits January 9, 2026 22:47

add spark_hex bench

3102dd6

optimize hex

268a9f1

cargo fmt

db49fff

Merge branch 'apache:main' into perf/spark_hex

c1848cb

optimize hex

8b20f93

Merge remote-tracking branch 'origin/perf/spark_hex' into perf/spark_hex

3be11f7

Merge branch 'apache:main' into perf/spark_hex

142eb65

Merge remote-tracking branch 'origin/perf/spark_hex' into perf/spark_hex

c459982

optimize hex_int64

ec42662

github-actions bot added the spark label Jan 11, 2026

Jefffrey reviewed Jan 11, 2026

View reviewed changes

lyne7-sc and others added 3 commits January 12, 2026 22:17

Merge branch 'apache:main' into perf/spark_hex

6efd327

optimize hex_int64

45daf86

fix dictionary null values

f79aed2

Jefffrey approved these changes Jan 13, 2026

View reviewed changes

lyne7-sc and others added 2 commits January 13, 2026 21:45

Merge branch 'apache:main' into perf/spark_hex

248b825

simplify dictionary handling

604cc44

Jefffrey added this pull request to the merge queue Jan 14, 2026

Merged via the queue into apache:main with commit 1d19c52 Jan 14, 2026
31 checks passed

lyne7-sc mentioned this pull request Jan 15, 2026

perf: optimize spark_hex dictionary path by avoiding dictionary expansion #19832

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: improve performance of `spark hex` function #19738

perf: improve performance of `spark hex` function #19738

lyne7-sc commented Jan 11, 2026

Uh oh!

Jefffrey Jan 11, 2026

Uh oh!

lyne7-sc Jan 12, 2026

Uh oh!

Jefffrey Jan 12, 2026

Uh oh!

lyne7-sc Jan 12, 2026

Uh oh!

Jefffrey Jan 11, 2026

Uh oh!

lyne7-sc Jan 12, 2026

Uh oh!

Uh oh!

Uh oh!

Jefffrey Jan 13, 2026

Uh oh!

lyne7-sc Jan 13, 2026

Uh oh!

Uh oh!

Jefffrey commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

perf: improve performance of spark hex function #19738

perf: improve performance of spark hex function #19738

Conversation

lyne7-sc commented Jan 11, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Benchmark Results

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Jefffrey commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

perf: improve performance of `spark hex` function #19738

perf: improve performance of `spark hex` function #19738