
Conversation

@klion26
Member

@klion26 klion26 commented Jul 31, 2025

Which issue does this PR close?

This PR optimizes the logic of ObjectBuilder::finish.

Rationale for this change

The object header generation in ObjectBuilder::finish can be optimized; this PR improves that logic.

What changes are included in this PR?

This PR optimizes ObjectBuilder::finish with a PackedU32Iterator.

Are these changes tested?

These changes are covered by existing tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 31, 2025
@klion26
Member Author

klion26 commented Jul 31, 2025

@alamb @scovich @viirya Please help review this when you're free. Thanks.

I've tried to benchmark three implementations

  1. the current implementation with PackedU32Iterator
  2. the implementation with append_packed_u32 into a tmp buf, then splicing it into the parent's buffer, like we did for ListBuilder::finish
append_packed_u32 into a tmp buf
    let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
    let header_size = 1 + // header byte (i.e., `object_header`)
        num_fields_size + // num_fields_size
        (num_fields * id_size as usize) + // field IDs
        ((num_fields + 1) * offset_size as usize); // field offsets + data_size
    let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
    // Write header byte
    let header = object_header(is_large, id_size, offset_size);
    bytes_to_splice.push(header);
    append_packed_u32(&mut bytes_to_splice, num_fields as u32, num_fields_size);
    for field_id in self.fields.keys() {
        append_packed_u32(&mut bytes_to_splice, *field_id, id_size as usize);
    }
    for offset in self.fields.values() {
        append_packed_u32(&mut bytes_to_splice, *offset as u32, offset_size as usize);
    }
    append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);
    let starting_offset = self.parent_value_offset_base;
    // Shift existing data to make room for the header
    let buffer = parent_buffer.inner_mut();
    buffer.splice(starting_offset..starting_offset, bytes_to_splice);
  3. the implementation using PackedU32Iterator and extend, then splicing the resulting iterator into the parent's buffer
PackedU32Iterator with extend
    let num_fields_bytes = num_fields.to_le_bytes();
    let num_elements_bytes = num_fields_bytes.iter().take(num_fields_size).copied();
    let fields = PackedU32Iterator::new(
        id_size as usize,
        self.fields.keys().map(|field_id| field_id.to_le_bytes()),
    );
    let offsets = PackedU32Iterator::new(
        offset_size as usize,
        self.fields
            .values()
            .map(|offset| (*offset as u32).to_le_bytes()),
    );
    let data_size_bytes = (data_size as u32).to_le_bytes();
    let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();
    let header = object_header(is_large, id_size, offset_size);
    let mut bytes_to_splice = vec![header];
    bytes_to_splice.extend(num_elements_bytes);
    bytes_to_splice.extend(fields);
    bytes_to_splice.extend(offsets);
    bytes_to_splice.extend(data_size_bytes_iter);
    let starting_offset = self.parent_value_offset_base;
    // Shift existing data to make room for the header
    let buffer = parent_buffer.inner_mut();
    buffer.splice(starting_offset..starting_offset, bytes_to_splice);

The results show that the first one wins (but not by much).

The results were generated by:
1. Create three branches with the code
2. Execute the benchmark command one by one on the three branches and on main
3. Execute critcmp to generate the results

PackedU32Iterator

group                                                                7978_optimize_header_generatation_with_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.4±1.05ms        ? ?/sec                1.03     29.1±0.97ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    324.6±8.35ms        ? ?/sec                1.01    327.6±8.18ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.0±0.40ms        ? ?/sec                1.01     11.2±0.34ms        ? ?/sec
variant_get_primitive                                                1.00  1105.8±33.81µs        ? ?/sec                1.01  1121.0±43.28µs        ? ?/sec

append_packed_u32 into a tmp buf

group                                                                7978_optimize_header_generation_logic_object_builder    main
-----                                                                ----------------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.5±1.09ms        ? ?/sec                     1.02     29.1±0.97ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    328.9±8.20ms        ? ?/sec                     1.00    327.6±8.18ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.3±0.31ms        ? ?/sec                     1.00     11.2±0.34ms        ? ?/sec
variant_get_primitive                                                1.00  1106.9±40.65µs        ? ?/sec                     1.01  1121.0±43.28µs        ? ?/sec

PackedU32Iterator with extend

group                                                                7978_optimizer_header_generation_with_extend    main
-----                                                                --------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.04     30.2±1.82ms        ? ?/sec             1.00     29.1±0.97ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.06    345.8±9.52ms        ? ?/sec             1.00    327.6±8.18ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.12     12.5±0.77ms        ? ?/sec             1.00     11.2±0.34ms        ? ?/sec
variant_get_primitive                                                1.00  1123.7±50.11µs        ? ?/sec             1.00  1121.0±43.28µs        ? ?/sec

From the conversations in previous PRs [1][2], I think we prefer to

  1. use an iterator (or tmp buf) with splice, so that we won't miscalculate the size we splice into the parent's buffer
  2. avoid tmp buf creation

For this and the benchmarks, I have some questions

  1. for the miscalculated-header problem, can we use unit tests to cover the header size calculation logic?
  2. for the tmp buf creation problem, the PR "Remove redundant is_err checks in Variant tests" #7897 suggested it would perform better if we use a tmp buf with append_packed_u32; is it that the tmp buffer creation cost may or may not be the bottleneck, depending on the workload we're on? If so, do we need to add some more benchmarks? (also, do we need to bench different header metadata, such as is_large being true?)
  3. the variant_kernels benchmark will run batch_json_string_to_variant random_json({} bytes per document) twice; is the first run for warmup?

Contributor

@alamb alamb left a comment

Thanks @klion26 -- I agree this appears somewhat complicated for what it is doing and doesn't really make a huge difference

I left some thoughts on specialization

Let me know what you think

cc @scovich

(if is_large { 4 } else { 1 }) + // num_fields
(num_fields * id_size as usize) + // field IDs
((num_fields + 1) * offset_size as usize); // field offsets + data_size
let num_fileds_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
Contributor

This variable name seems to have a typo:

Suggested change
let num_fileds_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.

Contributor

also: I personally find the comment unhelpful -- it just restates the code.

/// An iterator that yields the bytes of a packed u32 iterator.
/// Will yield the first `packed_bytes` bytes of each item in the iterator.
struct PackedU32Iterator<T: Iterator<Item = [u8; 4]>> {
packed_bytes: usize,
Contributor

I think to get this really fast we will need to use generics based on the packed_bytes size -- so we would end up with different versions of the code for 1,2 and 4 byte offsets

We could try to make this particular iterator generic, but it might get a bit messy.

I was thinking maybe we can somehow structure the code so there is a function like this that writes the header for a certain offset size:

fn write_header<const SIZE: usize>(dst: &mut Vec<u8>, ...) {
  ...
}

Then we would basically have a switch like this to instantiate the appropriate versions (can probably avoid the panic)

match  int_size(max_id as usize) {
  1 => write_header::<1>(dst, ...),
  2 => write_header::<2>(dst, ...),
  4 => write_header::<4>(dst, ...),
  _ => panic!("unsupported size")
}

Contributor

Even just specializing the PackedU32Iterator with a const N: usize parameter would probably give most of the benefit, for this iterator-based approach?

However -- the temp buf append+truncate is fundamentally faster on a per-entry basis, because of how much simpler it is at machine code level. And it will run at exactly the same speed for all packing sizes, because there are no branches or even conditional moves based on the packing size (**). The only downside is the up-front cost of allocating said temp buffer first, which has to amortize across all entries.

(**) If the compiler were a bit smarter, it could even eliminate the bound checks in the loop's push calls (based on the result of the with_capacity that preceded it), and also eliminate the bounds checks in the loop's truncate calls (based on the fact that the arg passed to truncate is never larger than the vec size).
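
For reference, a minimal sketch of the append+truncate pattern being discussed, assuming a helper that takes a runtime byte width like the tmp-buf snippet above (the crate's actual helper may differ):

// Appends the low `nbytes` little-endian bytes of `value` to `dest`.
// Always extends with all four bytes, then truncates the extras -- a
// branchless write per entry, at the cost of up to 3 temporary bytes.
fn append_packed_u32(dest: &mut Vec<u8>, value: u32, nbytes: usize) {
    let keep = dest.len() + nbytes;
    dest.extend_from_slice(&value.to_le_bytes()); // always writes 4 bytes
    dest.truncate(keep); // drop the bytes we don't keep
}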

Contributor

I haven't looked into the variant implementation or this PR in detail, but does this iterator need a custom implementation or can it be an anonymous impl Iterator<Item=u8>? It looks like the implementation is basically iter.flat_map(u32::to_le_bytes). The benefit of that is that if the underlying iterator has a trusted length then the flat-mapped iterator is still trusted, and extending a slice from a TrustedLen iterator is much more efficient.
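
For illustration, a sketch of that flat_map idea, assuming a fixed 4-byte item (as the follow-up below notes, it does not directly handle a dynamic packed_bytes width):

// Flattens each u32 into its four little-endian bytes; works when every item
// contributes a constant number of bytes, unlike the packed_bytes case.
fn offsets_as_bytes(offsets: impl Iterator<Item = u32>) -> impl Iterator<Item = u8> {
    offsets.flat_map(u32::to_le_bytes)
}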

Member Author

@jhorstmann IIUC, the benefit of this iterator is size_hint: Rust can use the size_hint to do a better job, as it knows the size of the iterator.

Contributor

However -- the temp buf append+truncate is fundamentally faster on a per-entry basis, because of how much simpler it is at machine code level. And it will run at exactly the same speed for all packing sizes, because there are no branches or even conditional moves based on the packing size (**). The only downside is the up-front cost of allocating said temp buffer first, which has to amortize across all entries.

Yes I agree with this analysis.

This is why my opinion still remains that the way to avoid the allocation is to calculate the header size and shift the bytes first so we write directly to the target. I realize that is trickier code, but as @klion26 has said somewhere else I think we could write some good assertions and tests to avoid problems
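
For illustration, a minimal sketch of that shift-first idea, assuming a Vec<u8> buffer and a precomputed header_size (not the PR's actual code):

// Reserve header_size bytes at start_offset by shifting the existing data right,
// then write the header directly into the reserved region -- no temp buffer.
fn write_header_in_place(buffer: &mut Vec<u8>, start_offset: usize, header_size: usize, header_byte: u8) {
    buffer.splice(start_offset..start_offset, std::iter::repeat_n(0u8, header_size));
    buffer[start_offset] = header_byte;
    // field ids, offsets, and data_size would then be written at computed
    // positions within buffer[start_offset + 1..start_offset + header_size]
}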

When I next get a chance to work on Variant related things with time, I want to focus on unsticking the shredding implementation. Once that is done, I may return here to try and restructure things

One thought I had was to encapsulate the logic / templating in a

struct ListHeaderWriter<'a> {
  offsets: &'a [usize],
}

impl ListHeaderWriter<'_> {
  fn header_size<const OFFSET_SIZE: usize>(&self) -> usize {
    // put the header size calculation here
  }
  // write the header into dst starting at start_offset
  fn write<const OFFSET_SIZE: usize>(&self, dst: &mut Vec<u8>, start_offset: usize) {
    ...
  }
}

Contributor

It seems you are right: FlatMap would only work for constant-length arrays, not for dynamic slices whose length depends on packed_bytes. And the TrustedLen optimization also cannot kick in, since the base iterator comes from a crate outside the standard library.

@scovich
Contributor

scovich commented Jul 31, 2025

I've tried to benchmark three implementations

Thanks for taking the time to do this!

The results show that the first one wins (but not by much).

Technically, the first two are equivalent because each result's mean is comfortably inside the other's confidence window. While it may be suggestive that the temp buf approach has a higher mean in all four results, any real difference is dwarfed by noise:

group                     PackedU32Iterator    append into tmp buf
json_list                 28.4±1.05ms          28.5±1.09ms
random_json               324.6±8.35ms         328.9±8.20ms
repeated_struct           11.0±0.40ms          11.3±0.31ms
variant_get_primitive     1105.8±33.81µs       1106.9±40.65µs

(aside: why would variant_get_primitive case (= reading) even be relevant for variant building?)

Meanwhile, it's unsurprising in retrospect that the complex iterator approach would be slower than the append+truncate approach, given how simple (branchless!) the latter's machine code is.

for the miscalculated-header problem, can we use unit tests to cover the header size calculation logic?

I mean, yes we can. But given a choice to eliminate a bug surface entirely from the code base, vs. try to validate it with unit tests that could be incomplete and/or go out of sync? I almost always favor eliminating the possibility of bugs over testing for them -- unless the perf and/or complexity difference is large enough to make me question that default choice.

do we need to bench for different header metadata such as is_large is true?

If we care about the performance (and correctness) of that case, it seems like we need to have benchmarks and unit tests to cover it?

Contributor

@scovich scovich left a comment

this appears somewhat complicated for what it is doing and doesn't really make a huge difference

If we want to measure the cost of safety, it might be instructive to modify append_offset_array_start_from_buf_pos to take the same approach as append_packed_u32 in writing more bytes than we keep -- something like:

        let mut current_pos = start_pos;
        for relative_offset in offsets {
-           write_offset_at_pos(buf, current_pos, relative_offset, nbytes);
+           write_u32_at_pos(buf, current_pos, relative_offset as u32);
            current_pos += nbytes as usize;
        }

        // Write data_size
        if let Some(data_size) = data_size {
            // Write data_size at the end of the offsets
            write_offset_at_pos(buf, current_pos, data_size, nbytes);
            current_pos += nbytes as usize;
        }

where

// always writes the four LE bytes of a value, relying on the caller to only advance
// the buffer's position by the number of bytes that should be kept (1..=4). 
//
// WARNING: Caller must ensure that it's safe to produce as many as 3 extra bytes
// without corrupting the buffer. Usually by ensuring that something else overwrites
// the space immediately after.
fn write_u32_at_pos(buf: &mut [u8], start_pos: usize, value: u32) {
    let bytes = value.to_le_bytes();
    buf[start_pos..start_pos + bytes.len()].copy_from_slice(&bytes);
}

Highly dangerous -- it relies on the fact that the field array is followed by the value array, and the value array is followed by the data size (which is written in the safer/slower way) -- but we could at least see whether it gives enough of a performance boost to consider keeping it.

IMO, if we want to keep the current unsafe approach that uses offset arithmetic in a pre-allocated buffer... we may as well go all the way and maximize the performance benefit we gain by giving up safety.


let data_size_bytes = (data_size as u32).to_le_bytes();
let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();
let header = object_header(is_large, id_size, offset_size);
let bytess_to_splice = std::iter::once(header)
Contributor

Suggested change
let bytess_to_splice = std::iter::once(header)
let bytes_to_splice = std::iter::once(header)

@klion26
Member Author

klion26 commented Aug 1, 2025

@alamb I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant; 2) I originally hoped to keep the same implementation for ListBuilder/ObjectBuilder, but it seems that changes to ObjectBuilder might result in performance regressions.

@scovich

I mean, yes we can. But given a choice to eliminate a bug surface entirely from the code base, vs. try to validate it with unit tests that could be incomplete and/or go out of sync? I almost always favor eliminating the possibility of bugs over testing for them -- unless the perf and/or complexity difference is large enough to make me question that default choice

Thanks for your detailed response. I agree with your point. It's more reliable to fundamentally address the potential for errors. This way, even if there's no test to guarantee this logic for any reason in the future, it will continue to run correctly.

I'm a little confused about the benchmark. Why is PackedU32Iterator + extend the slowest? Is it because it calls extend multiple times, which results in data movement? (Using PackedU32Iterator allows extend to take advantage of size_hint.)

For the unsafe code with write_u32_at_pos, I tried it, but it requires further changes; otherwise, it will panic in write_u32_at_pos because there is no more room. If we need to benchmark this, I'll try to make this change and benchmark it against the current solution.

@alamb @scovich Finally, thank you for your detailed review and guidance on these PRs. I've learned a lot from them. Thank you again.

@alamb
Contributor

alamb commented Aug 1, 2025

@alamb I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant;

Yes I agree with this plan for now

@alamb @scovich Finally, thank you for your detailed review and guidance on these PRs. I've learned a lot from them. Thank you again.

Thank you again for all the help

@scovich
Contributor

scovich commented Aug 1, 2025

Why is PackedU32Iterator + extend the slowest? Is it because it calls extend multiple times, which results in data movement? (Using PackedU32Iterator allows extend to take advantage of size_hint.)

I suspect it's because the compiler can't "see through" the packed iterator like it can with the append+truncate approach. So the former produces 1-4 individual byte appends, each with its own bounds check and length change, while the latter produces a single 32-bit append followed by a truncate.

For the unsafe code with write_u32_at_pos, I tried it, but it requires further changes; otherwise, it will panic in write_u32_at_pos because there is no more room.

If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the Some(data_size) entry, after the loop exits. Or did you hit a problem in the loop itself?

I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant.

The same argument likely applies to the object builder. Should we consider reverting that back to match the list builder, for simplicity and maintainability? I'd have to go back and check the other PR, but was the perf improvement actually significant there?

If so => why did it not work for the list builder? Do we not have apples-to-apples benchmarking?

If not => probably good to revert.

@klion26
Member Author

klion26 commented Aug 2, 2025

@scovich

If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the Some(data_size) entry, after the loop exits. Or did you hit a problem in the loop itself?

It would panic in the following test when creating a variant with VariantBuilder::new.

let (m1, v1) = make_nested_object();
let variant = Variant::new(&m1, &v1);

// because we can guarantee metadata is validated through the builder
let mut builder = VariantBuilder::new().with_metadata(VariantMetadata::new(&m1));
// panics here because we have a buffer with size 6 and will try to copy
// 4 bytes into the buffer starting from offset 3
builder.append_value(variant.clone());

let (metadata, value) = builder.finish();
let result_variant = Variant::new(&metadata, &value);

assert_eq!(variant, result_variant);

The same argument likely applies to the object builder. Should we consider reverting that back to match the list builder, for simplicity and maintainability? I'd have to go back and check the other PR, but was the perf improvement actually significant there?

Sorry, I've rechecked the current ObjectBuilder::finish logic (I misremembered the implementation in the main branch). I think either the current approach or aligning it with ListBuilder::finish is fine with me. The current approach has some possibility of introducing bugs, as @scovich said in the previous comment, but for now we have unit tests which ensure correctness. Aligning the logic with ListBuilder::finish would fundamentally ensure correctness (and the benchmark difference is minimal). @alamb @scovich, I can modify the code if necessary.

header_pos = self
.parent_state
.buffer()
.append_offset_array_start_from_buf_pos(
Member

Not related to this PR, but append_offset_array_start_from_buf_pos also writes field ids instead of just offsets.

Member Author

Will change the function name after this PR is finalized.

}

impl<T: Iterator<Item = [u8; 4]>> PackedU32Iterator<T> {
fn new(packed_bytes: usize, iterator: T) -> Self {
Member

It's better to add an assert on packed_bytes.

Member Author

Thanks for pointing it out; adding an assert is indeed better behavior.

buffer.splice(
starting_offset..starting_offset,
std::iter::repeat_n(0u8, header_size),
let fields = PackedU32Iterator::new(
Member

Honestly, the current approach looks easier to understand at first glance. This PackedU32Iterator approach is a bit of an over-abstraction to me. For appending the bytes into the buffer, I'd prefer to keep it simple instead of introducing a new abstraction if there isn't much benefit.

@scovich
Contributor

scovich commented Aug 3, 2025

If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the Some(data_size) entry, after the loop exits. Or did you hit a problem in the loop itself?

It would panic in the following test when creating a variant with VariantBuilder::new.

let (m1, v1) = make_nested_object();
let variant = Variant::new(&m1, &v1);

// because we can guarantee metadata is validated through the builder
let mut builder = VariantBuilder::new().with_metadata(VariantMetadata::new(&m1));
// panics here because we have a buffer with size 6 and will try to copy
// 4 bytes into the buffer starting from offset 3
builder.append_value(variant.clone());

Ah... I didn't think about that. With one-byte offsets, Some(data_size) (= one byte) isn't enough buffer to avoid overflow when writing 4 bytes. Drat.

@alamb alamb marked this pull request as draft August 5, 2025 19:28
@alamb
Contributor

alamb commented Aug 5, 2025

Converting to a draft as I think we are leaning towards not merging this one and I want to clean up the PR queue

@klion26
Member Author

klion26 commented Dec 29, 2025

As this is included in #8480, which aims to finalize Variant type support in Parquet, I want to push this forward to a final state.

Conclusion

After playing with some benchmarks, I have some conclusions:

  • We can add a HeaderWriter for the object/list builders (the header writer is generic over const parameters); ListBuilder and ObjectBuilder use the same header writer, and it improves the new benchmark (described below) by 3%~4%.
  • We need some new benchmarks to better measure the object/list builders' work.

CC @alamb @scovich

Details

  • I re-ran the existing benchmarks for multiple solutions (including the const-generic one) and found there wasn't much difference in the results.
  • I used the sample and flamegraph tools to generate a flamegraph, which shows that ObjectBuilder::finish consumes only ~2% of the time, JSON-related work consumes ~54%, ObjectBuilder::insert ~14%, and ObjectBuilder::append_value ~15% (flamegraph attached below).
  • I added a new benchmark (described below) that excludes the JSON deserialization work for the ObjectBuilder; it shows that ObjectBuilder::finish consumes ~27% and ObjectBuilder::append_value ~62%.
  • I ran a benchmark for the single-depth object builder with 100/200 fields, which shows that the new solution wins with a 3%~4% improvement.
  • I did not add a benchmark for ObjectBuilder::finish alone because append_value and finish are a single unit for the caller.

Benchmark

The benchmark results

  • optimize-with-header-writer-const-generic is the optimized HeaderWriter with const generics (branch goes here, sample code attached below)
  • main_with_object_bench is the main branch with the new object builder benchmark (the generator code here, the benchmark code here, sample code below)
  • optimize-with-packedu32iterator is based on main-with-object-bench and uses the PackedU32Iterator (code here -- PackedU32Iterator, and the caller logic, and sample code below)
  • The results show that optimize-with-header-writer-const-generic and optimize-with-packedu32iterator are both better than main (optimize-with-packedu32iterator is better), but as optimize-with-header-writer-const-generic is easier to understand (and maybe maintain), I lean towards using optimize-with-header-writer-const-generic
group                                                                                           optimize-with-header-writer-const-generic     main-with-object-bench            optimize-with-packedu32iterator
-----                                                                                           --------------------------------------------------------    ---------------------------            ----------------------------------------------------
batch_json_string_to_variant json_list 8k string                                                1.00     27.5±0.89ms        ? ?/sec                         1.09     29.9±0.81ms        ? ?/sec    1.03     28.3±0.76ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(8000-100) random_json(2436 bytes per document)    1.01    204.1±4.21ms        ? ?/sec                         1.03    208.6±4.17ms        ? ?/sec    1.00    201.8±4.59ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(8000-200) random_json(4871 bytes per document)    1.02    389.8±7.24ms        ? ?/sec                         1.04    399.6±7.10ms        ? ?/sec    1.00    383.8±7.35ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)                               1.01    328.1±5.06ms        ? ?/sec                         1.04    337.8±6.06ms        ? ?/sec    1.00    325.6±5.64ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                          1.06     11.0±0.35ms        ? ?/sec                         1.05     11.0±0.33ms        ? ?/sec    1.00     10.4±0.28ms        ? ?/sec
variant_get_primitive                                                                           1.01  1730.8±49.77ns        ? ?/sec                         1.03  1759.5±53.97ns        ? ?/sec    1.00  1712.1±57.32ns        ? ?/sec
variant_get_shredded_utf8                                                                       1.02  1248.2±43.11ns        ? ?/sec                         1.04  1276.2±26.61ns        ? ?/sec    1.00  1225.9±38.09ns        ? ?/sec

Ref

new benchmark logic
       write!(output_buffer, "{{").unwrap();
        for i in 0..*max_fields {
            let key_length = rng.random_range(1..=20);
            let key: String = (0..key_length)
                .map(|_| rng.sample(Alphanumeric) as char)
                .collect();
            write!(output_buffer, "\"{key}\":").unwrap();

            let total_weight = *null_weight + *string_weight + *number_weight + *boolean_weight;

            // Generate a random number to determine the type
            let mut random_value: usize = rng.random_range(0..total_weight);

            if random_value <= *null_weight {
                write!(output_buffer, "null").unwrap();
            } else {
                random_value -= *null_weight;

                if random_value <= *string_weight {
                    // Generate a random string between 1 and 20 characters
                    let length = rng.random_range(1..=20);
                    let random_string: String = (0..length)
                        .map(|_| rng.sample(Alphanumeric) as char)
                        .collect();
                    write!(output_buffer, "\"{random_string}\"",).unwrap();
                } else {
                    random_value -= *string_weight;

                    if random_value <= *number_weight {
                        // 50% chance of generating an integer or a float
                        if rng.random_bool(0.5) {
                            // Generate a random integer
                            let random_integer: i64 = rng.random_range(-1000..1000);
                            write!(output_buffer, "{random_integer}",).unwrap();
                        } else {
                            // Generate a random float
                            let random_float: f64 = rng.random_range(-1000.0..1000.0);
                            write!(output_buffer, "{random_float}",).unwrap();
                        }
                    } else {
                        random_value -= *number_weight;

                        if random_value <= *boolean_weight {
                            // Generate a random boolean
                            let random_boolean: bool = rng.random();
                            write!(output_buffer, "{random_boolean}",).unwrap();
                        }
                    }
                }
            }
            if i < *max_fields - 1 {
                write!(output_buffer, ",").unwrap();
            }
        }
        write!(output_buffer, "}}").unwrap();
PackedU32Iterator
struct PackedU32Iterator<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> {
    iterator: T,
    current_item: [u8; 4],
    current_byte: usize, // 0..=PACKED_BYTES; equal to PACKED_BYTES when current_item is exhausted
}

impl<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> PackedU32Iterator<PACKED_BYTES, T> {
    fn new(packed_bytes: usize, iterator: T) -> Self {
        // the runtime width must match the const generic (per the review suggestion to assert on packed_bytes)
        assert_eq!(packed_bytes, PACKED_BYTES);
        // eliminate corner cases in `next` by initializing
        // with a fake already-consumed "first" item
        Self {
            iterator,
            current_item: [0; 4],
            current_byte: packed_bytes,
        }
    }
}

impl<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> Iterator
    for PackedU32Iterator<PACKED_BYTES, T>
{
    type Item = u8;

    fn next(&mut self) -> Option<Self::Item> {
        if self.current_byte >= PACKED_BYTES {
            self.current_item = self.iterator.next()?;
            self.current_byte = 0;
        }

        let rval = self.current_item[self.current_byte];
        self.current_byte += 1;
        Some(rval)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let lower = (PACKED_BYTES - self.current_byte) + PACKED_BYTES * self.iterator.size_hint().0;
        (lower, Some(lower))
    }
}
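
A small usage sketch for the iterator above (the offset values are made up for illustration): packing 2-byte offsets and extending a buffer with the result.

let offsets: Vec<u32> = vec![0, 5, 9, 14];
let packed = PackedU32Iterator::<2, _>::new(2, offsets.iter().map(|o| o.to_le_bytes()));
let mut buf: Vec<u8> = Vec::new();
buf.extend(packed); // two little-endian bytes per offset
assert_eq!(buf, vec![0, 0, 5, 0, 9, 0, 14, 0]);
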
header writer with const generic
fn object_header<const LARGE_BIT: u8, const ID_SIZE: u8, const OFFSET_SIZE: u8>() -> u8 {
    (LARGE_BIT << (BASIC_TYPE_BITS + 4))
        | ((ID_SIZE - 1) << (BASIC_TYPE_BITS + 2))
        | ((OFFSET_SIZE - 1) << BASIC_TYPE_BITS)
        | VariantBasicType::Object as u8
}

struct ObjectHeaderWriter<const OFFSET_SIZE: u8, const ID_SIZE: u8>();

impl<const OFFSET_SIZE: u8, const ID_SIZE: u8> ObjectHeaderWriter<OFFSET_SIZE, ID_SIZE> {
    fn write(
        dst: &mut Vec<u8>,
        num_fields: usize,
        field_ids: impl Iterator<Item = u32>,
        offsets: impl Iterator<Item = usize>,
        data_size: usize,
    ) {
        // num_fields takes 4 bytes when it is larger than u8::MAX, otherwise 1 byte
        if num_fields > u8::MAX as usize {
            dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
            append_packed_u32::<4>(dst, num_fields as u32);
        } else {
            dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
            append_packed_u32::<1>(dst, num_fields as u32);
        }

        for id in field_ids {
            append_packed_u32::<ID_SIZE>(dst, id);
        }

        for off in offsets {
            append_packed_u32::<OFFSET_SIZE>(dst, off as u32);
        }

        append_packed_u32::<OFFSET_SIZE>(dst, data_size as u32);
    }
}

fn append_packed_u32<const SIZE: u8>(dest: &mut Vec<u8>, value: u32) {
    let len = dest.len() + SIZE as usize;
    dest.extend(value.to_le_bytes());
    dest.truncate(len);
}
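
For completeness, a sketch of how a builder might dispatch on the runtime widths to these const-generic writers (the review comments below show the PR's actual match; the function name here is illustrative):

fn write_object_header(
    dst: &mut Vec<u8>,
    offset_size: u8,
    id_size: u8,
    num_fields: usize,
    field_ids: impl Iterator<Item = u32>,
    offsets: impl Iterator<Item = usize>,
    data_size: usize,
) {
    // each arm instantiates a monomorphized writer for that (offset_size, id_size) pair
    match (offset_size, id_size) {
        (1, 1) => ObjectHeaderWriter::<1, 1>::write(dst, num_fields, field_ids, offsets, data_size),
        (1, 2) => ObjectHeaderWriter::<1, 2>::write(dst, num_fields, field_ids, offsets, data_size),
        // ... remaining (offset_size, id_size) combinations ...
        (4, 4) => ObjectHeaderWriter::<4, 4>::write(dst, num_fields, field_ids, offsets, data_size),
        _ => panic!("unsupported offset_size/id_size combination"),
    }
}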

The flamegraph was obtained with these steps:

  1. execute cargo bench --no-run to get the executable
  2. run the resulting executable with the benchmark arguments
  3. use sudo sample $pid 60 -file /tmp/variant_kernels.sample to generate the sample file
  4. use flamegraph tools to generate the flamegraph
    (I used these steps because I can't install Xcode for now :( )

The flamegraph with JSON deserialize
variant_kernels-object

The flamegraph without the JSON deserialize
variant_kernels-object-no-json-desr

@alamb alamb marked this pull request as ready for review December 30, 2025 13:01
Contributor

@alamb alamb left a comment

Thanks @klion26 -- your analysis makes sense to me and I think we can merge this PR (we can always make it better as a follow on). I'll run some benchmarks and assuming all looks good I'll plan to merge it

@alamb
Contributor

alamb commented Dec 30, 2025

run benchmark variant_kernels variant_builder

@alamb
Contributor

alamb commented Dec 30, 2025

(BTW this PR has some merge conflicts that need to be resolved)

@apache apache deleted a comment from alamb-ghbot Dec 30, 2025
@alamb-ghbot

Benchmark script failed with exit code 128.

Last 10 lines of output:

Click to expand
From https://github.com/apache/arrow-rs
 * [new ref]               refs/pull/8031/head -> 7978_optimize_header_generatation_with_iterator
Previous HEAD position was 2d6fc518ed Add examples for min and max functions (#9062)
Switched to branch '7978_optimize_header_generatation_with_iterator'
M	parquet-testing
++ git submodule update --init
From https://github.com/apache/parquet-testing
   a3d96a6..92d45b0  master     -> origin/master
fatal: remote error: upload-pack: not our ref b68bea40fed8d1a780a9e09dd2262017e04b19ad
fatal: Fetched in submodule path 'parquet-testing', but it did not contain b68bea40fed8d1a780a9e09dd2262017e04b19ad. Direct fetching of that commit failed.

@alamb
Contributor

alamb commented Dec 30, 2025

(I think if we merge up / fix the conflicts I can get a clean benchmark run)

@klion26 klion26 force-pushed the 7978_optimize_header_generatation_with_iterator branch from 4a50b88 to 963f194 Compare January 4, 2026 12:22
@github-actions github-actions bot added parquet-variant parquet-variant* crates and removed parquet Changes to the parquet crate labels Jan 4, 2026
@klion26 klion26 force-pushed the 7978_optimize_header_generatation_with_iterator branch from 963f194 to 486f379 Compare January 4, 2026 15:31
@klion26
Member Author

klion26 commented Jan 4, 2026

@alamb thanks for the review and happy new year! I've updated the code

  • Commit 25ad804: Refines ObjectBuilder::finish.
  • Commit 486f379: Adds benchmarking for ObjectBuilder.

@klion26
Member Author

klion26 commented Jan 4, 2026

run benchmark variant_kernels variant_builder

@alamb-ghbot

🤖 Hi @klion26, thanks for the request (#8031 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Omega359, adriangb, alamb, comphead, geoffreyclaude, rluvaton, xudong963, zhuqi-lucas.

@klion26
Member Author

klion26 commented Jan 4, 2026

@alamb Please help to trigger a benchmark, thanks

@alamb
Contributor

alamb commented Jan 5, 2026

run benchmark variant_kernels variant_builder

@alamb
Contributor

alamb commented Jan 5, 2026

@alamb Please help to trigger a benchmark, thanks

Thanks @klion26 -- I triggered the benchmark and also added you to the list of people who can run them (see alamb/datafusion-benchmarking@c6d1404)

@alamb
Contributor

alamb commented Jan 5, 2026

I am sorry for the delay -- I have been out for a while.

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (fccab21) to a9d6e92 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                             7978_optimize_header_generatation_with_iterator    main
-----                                                                                             -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                                                  1.01     23.9±0.15ms        ? ?/sec                1.00     23.6±0.76ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(100 fields) random_json(2436 bytes per document)    1.00    200.2±1.09ms        ? ?/sec              
batch_json_string_to_variant object - 1 depth(200 fields) random_json(4871 bytes per document)    1.00    413.9±2.56ms        ? ?/sec              
batch_json_string_to_variant random_json(2633 bytes per document)                                 1.03    315.4±2.54ms        ? ?/sec                1.00    307.0±3.49ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                            1.03      7.3±0.05ms        ? ?/sec                1.00      7.1±0.06ms        ? ?/sec
variant_get_primitive                                                                             1.04   946.7±21.79ns        ? ?/sec                1.00    910.0±5.51ns        ? ?/sec
variant_get_shredded_utf8                                                                         1.01    694.2±5.75ns        ? ?/sec                1.00    688.6±7.63ns        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (fccab21) to a9d6e92 diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                       7978_optimize_header_generatation_with_iterator    main
-----                                       -----------------------------------------------    ----
bench_extend_metadata_builder               1.04     59.4±4.51ms        ? ?/sec                1.00     57.0±2.32ms        ? ?/sec
bench_object_field_names_reverse_order      1.00     18.9±0.30ms        ? ?/sec                1.02     19.4±0.51ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1249.6±22.05µs        ? ?/sec                1.00  1250.7±16.55µs        ? ?/sec
bench_object_list_same_schema               1.03     24.9±0.46ms        ? ?/sec                1.00     24.2±0.23ms        ? ?/sec
bench_object_list_unknown_schema            1.01     13.3±0.11ms        ? ?/sec                1.00     13.3±0.33ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.3±0.05ms        ? ?/sec                1.00      3.3±0.15ms        ? ?/sec
bench_object_same_schema                    1.00     37.6±0.86ms        ? ?/sec                1.00     37.8±0.21ms        ? ?/sec
bench_object_unknown_schema                 1.01     16.2±0.13ms        ? ?/sec                1.00     16.1±0.10ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.6±0.01ms        ? ?/sec                1.04      2.7±0.08ms        ? ?/sec
iteration/validated_iteration               1.03     48.6±0.88µs        ? ?/sec                1.00     47.3±0.67µs        ? ?/sec
validation/unvalidated_construction         1.00      6.5±0.03µs        ? ?/sec                1.00      6.5±0.05µs        ? ?/sec
validation/validated_construction           1.01     61.0±0.88µs        ? ?/sec                1.00     60.4±0.57µs        ? ?/sec
validation/validation_cost                  1.01     54.4±0.59µs        ? ?/sec                1.00     54.1±1.08µs        ? ?/sec

@klion26
Member Author

klion26 commented Jan 7, 2026

There are some regressions in the variant_builder benchmark. I'm looking into it.

@alamb
Contributor

alamb commented Jan 7, 2026

There are some regressions in the variant_builder benchmark. I'm looking into it.

FWIW I don't think the benchmarks are precise enough (the VM I run them on regularly has small ~5% variances) so it would be good to verify / reproduce the results locally before you spend too much time on it

Contributor

@scovich scovich left a comment

Approach LGTM. I think we can still simplify and tighten up the code a bit (readability, not speed) to end up with something that's faster than current main with fewer lines of code. Always a nice combo.

Comment on lines 37 to 38
pub use from_json::JsonToVariant;
pub use from_json::append_json;
Contributor

Suggested change
pub use from_json::JsonToVariant;
pub use from_json::append_json;
pub use from_json::{append_json, JsonToVariant};

?

Comment on lines 44 to 55
let is_large = num_fields > u8::MAX as usize;
match is_large {
true => {
dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
// num_fields will consume 4 bytes when it is larger than u8::MAX
append_packed_u32::<4>(dst, num_fields);
}
false => {
dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<1>(dst, num_fields);
}
};
Contributor

Suggested change
let is_large = num_fields > u8::MAX as usize;
match is_large {
true => {
dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
// num_fields will consume 4 bytes when it is larger than u8::MAX
append_packed_u32::<4>(dst, num_fields);
}
false => {
dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<1>(dst, num_fields);
}
};
// num_fields will consume 4 bytes when it is larger than u8::MAX
if num_fields > u8::MAX as usize {
dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<4>(dst, num_fields);
} else {
dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<1>(dst, num_fields);
}

Comment on lines 290 to 291
// let header = object_header(is_large, id_size, offset_size);
// bytes_to_splice.push(header);
Contributor

?

self.fields.values().copied(),
data_size,
),
_ => panic!("Unsupported offset_size/id_size combination"),
Contributor

This match arm would be unnecessary if int_size returned OffsetSizeBytes (decoder.rs)... except I don't think const generics work with enum variants? We'd have to do e.g.

use OffsetSizeBytes::*;
match (offset_size, id_size) {
      ...
    (Four, Two) => ObjectHeaderWriter::<4, 2>::write(...),
      ...
}

Contributor

I confirmed that enum variant casting does actually work for const generics, so we could also do:

use OffsetSizeBytes::*;
match (offset_size, id_size) {
      ...
    (Four, Two) => ObjectHeaderWriter::<{Four as _}, {Two as _}>::write(...),
      ...
}

Comment on lines 293 to 303
match (offset_size, id_size) {
(1, 1) => ObjectHeaderWriter::<1, 1>::write(
&mut bytes_to_splice,
num_fields,
self.fields.keys().copied(),
self.fields.values().copied(),
data_size,
),
(1, 2) => ObjectHeaderWriter::<1, 2>::write(
&mut bytes_to_splice,
num_fields,
self.fields.keys().copied(),
self.fields.values().copied(),
data_size,
),
Contributor

This is begging for a helper macro to simplify things... but I have no idea how to write a macro that expands a cartesian product :(

Contributor

We might define a local macro that captures local state, just to reduce the boilerplate a bit:

macro_rules! write_object_header {
    ($offset_size:ident, $id_size:ident) => {
        ObjectHeaderWriter::<$offset_size, $id_size>::write(...)
    };
}

match (offset_size, id_size) {
    (1, 1) => write_object_header!(1, 1),
    (1, 2) => write_object_header!(1, 2),
      ..
    (4, 3) => write_object_header!(4, 3),
    (4, 4) => write_object_header!(4, 4),
    _ => { ... }
}

Contributor

Combining that with the other idea to have int_size return OffsetSizeBytes would yield:

macro_rules! write_object_header {
    ($offset_size:ident, $id_size:ident) => {
        ObjectHeaderWriter::<{$offset_size as _}, {$id_size as _}>::write(...)
    };
}

match (offset_size, id_size) {
    (One, One) => write_object_header!(One, One),
    (One, Two) => write_object_header!(One, Two),
      ...
    (Four, Three) => write_object_header!(Four, Three),
    (Four, Four) => write_object_header!(Four, Four),
}

Contributor

@scovich scovich Jan 7, 2026

(I figured out a way to write a cartesian product expansion macro (it needs two macros, actually), but it's hideously complex and way more than sixteen LoC. Not worth it unless we're really worried about typos in the match arms)

@klion26 klion26 force-pushed the 7978_optimize_header_generatation_with_iterator branch from fccab21 to 528f04c Compare January 8, 2026 06:58
@klion26
Member Author

klion26 commented Jan 8, 2026

run benchmark variant_kernels variant_builder

@klion26
Member Author

klion26 commented Jan 8, 2026

FWIW I don't think the benchmarks are precise enough (the VM I run them on regularly has small ~5% variances) so it would be good to verify / reproduce the results locally before you spend too much time on it

In my local testing, the difference was less than 5%. After some minor modifications, the results are now almost identical (the same benchmark occasionally performs better and occasionally worse, by about 2%).

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (df7a0ee) to 9e822e0 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                             7978_optimize_header_generatation_with_iterator    main
-----                                                                                             -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                                                  1.00     22.8±0.19ms        ? ?/sec                1.02     23.3±0.19ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(100 fields) random_json(2436 bytes per document)    1.00    196.9±1.87ms        ? ?/sec              
batch_json_string_to_variant object - 1 depth(200 fields) random_json(4871 bytes per document)    1.00    410.3±2.93ms        ? ?/sec              
batch_json_string_to_variant random_json(2633 bytes per document)                                 1.01    310.2±1.80ms        ? ?/sec                1.00    308.5±1.66ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                            1.00      7.2±0.09ms        ? ?/sec                1.00      7.2±0.13ms        ? ?/sec
variant_get_primitive                                                                             1.00    924.0±4.79ns        ? ?/sec                1.01   928.7±18.64ns        ? ?/sec
variant_get_shredded_utf8                                                                         1.02    700.3±6.59ns        ? ?/sec                1.00    689.4±8.01ns        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (df7a0ee) to 9e822e0 diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                       7978_optimize_header_generatation_with_iterator    main
-----                                       -----------------------------------------------    ----
bench_extend_metadata_builder               1.00     56.7±1.67ms        ? ?/sec                1.03     58.3±2.90ms        ? ?/sec
bench_object_field_names_reverse_order      1.00     19.2±0.36ms        ? ?/sec                1.02     19.7±0.43ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1230.6±16.62µs        ? ?/sec                1.02  1257.7±14.60µs        ? ?/sec
bench_object_list_same_schema               1.00     23.7±0.40ms        ? ?/sec                1.04     24.7±0.46ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.1±0.11ms        ? ?/sec                1.01     13.2±0.08ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.2±0.01ms        ? ?/sec                1.02      3.3±0.09ms        ? ?/sec
bench_object_same_schema                    1.00     37.4±0.15ms        ? ?/sec                1.02     38.1±0.33ms        ? ?/sec
bench_object_unknown_schema                 1.00     16.2±0.22ms        ? ?/sec                1.00     16.2±0.37ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.1±0.01ms        ? ?/sec                1.26      2.7±0.08ms        ? ?/sec
iteration/validated_iteration               1.00     47.3±0.51µs        ? ?/sec                1.07     50.8±2.71µs        ? ?/sec
validation/unvalidated_construction         1.00      6.5±0.31µs        ? ?/sec                1.00      6.5±0.11µs        ? ?/sec
validation/validated_construction           1.02     61.4±0.49µs        ? ?/sec                1.00     60.4±0.50µs        ? ?/sec
validation/validation_cost                  1.01     54.4±1.08µs        ? ?/sec                1.00     53.9±0.44µs        ? ?/sec

@klion26
Member Author

klion26 commented Jan 8, 2026

@alamb @scovich I've updated the code, and the benchmark seems good now. Please take another look when you're free. Thanks.

Contributor

@scovich scovich left a comment

Nice!

Comment on lines +44 to +46
let is_large = num_fields > u8::MAX as usize;
// num_fields will consume 4 bytes when it is larger than u8::MAX
if is_large {
Contributor

nit: I'm not sure is_large as a variable is really helpful -- if the field count is too large to fit in a byte, of course it will need more bytes to store? (but definitely a matter of preference)


Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Optimize the object header generation logic in ObjectBuilder::finish

6 participants