Member

@Weijun-H Weijun-H commented Jan 2, 2026

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

Improve JSON binary decoding performance by avoiding per-value allocations and enabling direct hex decoding into builders.

What changes are included in this PR?

Optimized binary hex decoding paths to reduce allocations and improve throughput.

decode_binary_hex_json  time:   [3.6780 ms 3.6953 ms 3.7150 ms]
                        change: [−61.051% −60.818% −60.565%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

decode_fixed_binary_hex_json
                        time:   [4.0404 ms 4.1400 ms 4.2901 ms]
                        change: [−56.149% −55.040% −53.330%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe

decode_binary_view_hex_json
                        time:   [4.3731 ms 4.4242 ms 4.4767 ms]
                        change: [−53.305% −52.771% −52.239%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
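
As a rough illustration of the idea in the rationale (decode hex straight into a shared buffer instead of allocating a fresh `Vec` per value), here is a minimal std-only sketch. It is not the code from this PR: `hex_val`, `decode_hex_into`, and `decode_column` are hypothetical names, and the `(data, offsets)` pair only stands in for a variable-width binary builder.

```rust
/// Hypothetical helper: map one ASCII hex digit to its 4-bit value.
fn hex_val(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Decode the hex text in `input` and append the bytes to `out`.
/// No intermediate Vec is created; on error `out` is rolled back
/// to the length it had on entry.
fn decode_hex_into(input: &str, out: &mut Vec<u8>) -> Result<(), String> {
    let bytes = input.as_bytes();
    if bytes.len() % 2 != 0 {
        return Err(format!("odd-length hex string: {}", bytes.len()));
    }
    let start = out.len();
    out.reserve(bytes.len() / 2);
    for pair in bytes.chunks_exact(2) {
        match (hex_val(pair[0]), hex_val(pair[1])) {
            (Some(hi), Some(lo)) => out.push((hi << 4) | lo),
            _ => {
                out.truncate(start);
                return Err(format!("invalid hex digit in {:?}", input));
            }
        }
    }
    Ok(())
}

/// Stand-in for a variable-width binary builder (an assumption for this
/// sketch): one contiguous values buffer plus end offsets per value.
fn decode_column(values: &[&str]) -> Result<(Vec<u8>, Vec<usize>), String> {
    let mut data = Vec::new();
    let mut offsets = vec![0];
    for v in values {
        decode_hex_into(v, &mut data)?;
        offsets.push(data.len());
    }
    Ok((data, offsets))
}
```

With this shape, decoding `["0aff", "", "DEad"]` appends all bytes to one buffer and records only offsets, so the number of heap allocations depends on buffer growth rather than on the number of values.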

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow (Changes to the arrow crate) label Jan 2, 2026
@Weijun-H Weijun-H changed the title from "perf: optimize hex decoding in json" to "perf: optimize hex decoding in json (1.8x faster in binary-heavy)" Jan 2, 2026
@Weijun-H Weijun-H marked this pull request as ready for review January 2, 2026 16:44
@Weijun-H Weijun-H force-pushed the optimize-json-binary-parse branch 2 times, most recently from 30f160e to 4a6b5d4 Compare January 5, 2026 08:41
Contributor

@alamb alamb left a comment

Thanks @Weijun-H -- this looks pretty close to me

impl ArrayDecoder for FixedSizeBinaryArrayDecoder {
    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> {
        let mut builder = FixedSizeBinaryBuilder::with_capacity(pos.len(), self.len);
        let mut scratch = Vec::with_capacity(self.len as usize);
Contributor

Why self.len? Isn't that the length of the input? Also, the scratch used below is different (no initial capacity).

Member Author

@Weijun-H Weijun-H Jan 8, 2026

Good catch. Here self.len comes from FixedSizeBinary and represents the decoded byte width, not the input hex string length. The input string length is expected to be 2 * len. For variable-width types we use Vec::new() and reserve per value instead.
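
To make the width relationship concrete, here is a small std-only sketch, not the PR's implementation. `decode_fixed_hex` and `decode_fixed_column` are hypothetical names, and `width` plays the role of `self.len` (the decoded byte width). The scratch buffer is allocated once with capacity `width`, and every input string must be exactly `2 * width` hex characters.

```rust
/// Hypothetical helper: decode exactly `width` bytes of hex into `scratch`.
/// `scratch` is cleared and reused, so its allocation is paid only once.
/// On error, `scratch` may hold a partial result and should be ignored.
fn decode_fixed_hex(input: &str, width: usize, scratch: &mut Vec<u8>) -> Result<(), String> {
    // A decoded width of `width` bytes requires `2 * width` hex characters.
    if input.len() != 2 * width {
        return Err(format!(
            "expected {} hex characters, got {}",
            2 * width,
            input.len()
        ));
    }
    scratch.clear();
    for pair in input.as_bytes().chunks_exact(2) {
        let hi = (pair[0] as char).to_digit(16);
        let lo = (pair[1] as char).to_digit(16);
        match (hi, lo) {
            (Some(hi), Some(lo)) => scratch.push(((hi << 4) | lo) as u8),
            _ => return Err(format!("invalid hex digit in {:?}", input)),
        }
    }
    Ok(())
}

/// Stand-in for a fixed-size binary builder (an assumption for this
/// sketch): all values are stored back to back, `width` bytes each.
fn decode_fixed_column(values: &[&str], width: usize) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(values.len() * width);
    // Allocated once up front because every value has the same width.
    let mut scratch = Vec::with_capacity(width);
    for v in values {
        decode_fixed_hex(v, width, &mut scratch)?;
        out.extend_from_slice(&scratch);
    }
    Ok(out)
}
```

For variable-width types the decoded size is not known up front, which is why starting from `Vec::new()` and reserving per value is the natural counterpart there.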

@alamb
Contributor

alamb commented Jan 10, 2026

run benchmark json-reader

@apache apache deleted a comment from alamb-ghbot Jan 10, 2026
@alamb
Contributor

alamb commented Jan 10, 2026

(I fixed a bug in my benchmark runner that didn't allow `-` in benchmark names)

Contributor

@alamb alamb left a comment

Thanks @Weijun-H -- this looks good to me

Nice work

Contributor

@Jefffrey Jefffrey left a comment

🚀

@Weijun-H Weijun-H force-pushed the optimize-json-binary-parse branch from c8ed14d to 2571386 Compare January 10, 2026 15:00
@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-json-binary-parse (2571386) to 298d3aa diff
BENCH_NAME=json-reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench json-reader
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize-json-binary-parse
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                        main                                   optimize-json-binary-parse
-----                                        ----                                   --------------------------
decode_binary_hex_json                       4.44     92.9±0.64ms        ? ?/sec    1.00     20.9±0.19ms        ? ?/sec
decode_binary_view_hex_json                  4.14     94.3±0.83ms        ? ?/sec    1.00     22.8±0.39ms        ? ?/sec
decode_fixed_binary_hex_json                 4.21     92.6±0.27ms        ? ?/sec    1.00     22.0±0.47ms        ? ?/sec
decode_wide_object_i64_json                  1.03  1499.6±27.81ms        ? ?/sec    1.00  1453.6±31.44ms        ? ?/sec
decode_wide_object_i64_serialize             1.02  1273.2±14.13ms        ? ?/sec    1.00  1249.0±12.82ms        ? ?/sec
decode_wide_projection_full_json/131072      1.01       3.1±0.03s    57.0 MB/sec    1.00       3.0±0.02s    57.6 MB/sec
decode_wide_projection_narrow_json/131072    1.01   786.3±11.63ms   221.3 MB/sec    1.00    778.2±9.68ms   223.6 MB/sec

Labels

arrow (Changes to the arrow crate), performance

4 participants