
Conversation

@klion26
Member

@klion26 klion26 commented Jul 31, 2025

Which issue does this PR close?

This PR optimizes the logic of ObjectBuilder::finish.

Rationale for this change

The object header generation in ObjectBuilder::finish can be optimized; this PR improves that logic.

What changes are included in this PR?

This PR optimizes ObjectBuilder::finish with a PackedU32Iterator.

Are these changes tested?

These changes are covered by existing tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 31, 2025
@klion26
Member Author

klion26 commented Jul 31, 2025

@alamb @scovich @viirya Please help review this when you're free. Thanks.

I've tried to benchmark three implementations

  1. the current implementation with PackedU32Iterator
  2. the implementation with append_packed_u32 into a tmp buf, then splicing it into the parent's buffer, like we did for ListBuilder::finish
append_packed_u32 into a tmp buf
    let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
    let header_size = 1 + // header byte (i.e., `object_header`)
        num_fields_size + // num_fields_size
        (num_fields * id_size as usize) + // field IDs
        ((num_fields + 1) * offset_size as usize); // field offsets + data_size
    let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
    // Write header byte
    let header = object_header(is_large, id_size, offset_size);
    bytes_to_splice.push(header);
    append_packed_u32(&mut bytes_to_splice, num_fields as u32, num_fields_size);
    for field_id in self.fields.keys() {
        append_packed_u32(&mut bytes_to_splice, *field_id, id_size as usize);
    }
    for offset in self.fields.values() {
        append_packed_u32(&mut bytes_to_splice, *offset as u32, offset_size as usize);
    }
    append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);
    let starting_offset = self.parent_value_offset_base;
    // Shift existing data to make room for the header
    let buffer = parent_buffer.inner_mut();
    buffer.splice(starting_offset..starting_offset, bytes_to_splice);
  3. the implementation using PackedU32Iterator and extend, then splicing the resulting iterator into the parent's buffer
PackedU32Iterator with extend
    let num_fields_bytes = num_fields.to_le_bytes();
    let num_elements_bytes = num_fields_bytes.iter().take(num_fields_size).copied();
    let fields = PackedU32Iterator::new(
        id_size as usize,
        self.fields.keys().map(|field_id| field_id.to_le_bytes()),
    );
    let offsets = PackedU32Iterator::new(
        offset_size as usize,
        self.fields
            .values()
            .map(|offset| (*offset as u32).to_le_bytes()),
    );
    let data_size_bytes = (data_size as u32).to_le_bytes();
    let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();
    let header = object_header(is_large, id_size, offset_size);
    let mut bytes_to_splice = vec![header];
    bytes_to_splice.extend(num_elements_bytes);
    bytes_to_splice.extend(fields);
    bytes_to_splice.extend(offsets);
    bytes_to_splice.extend(data_size_bytes_iter);
    let starting_offset = self.parent_value_offset_base;
    // Shift existing data to make room for the header
    let buffer = parent_buffer.inner_mut();
    buffer.splice(starting_offset..starting_offset, bytes_to_splice);

The results show that the first one wins (but not by much).

The results were generated by:
1. Create three branches with the code
2. Execute the benchmark command one by one on the three branches and on main
3. Execute critcmp to generate the results

PackedU32Iterator

group                                                                7978_optimize_header_generatation_with_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.4±1.05ms        ? ?/sec                1.03     29.1±0.97ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    324.6±8.35ms        ? ?/sec                1.01    327.6±8.18ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.0±0.40ms        ? ?/sec                1.01     11.2±0.34ms        ? ?/sec
variant_get_primitive                                                1.00  1105.8±33.81µs        ? ?/sec                1.01  1121.0±43.28µs        ? ?/sec

append_packed_u32 into a tmp buf

group                                                                7978_optimize_header_generation_logic_object_builder    main
-----                                                                ----------------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.5±1.09ms        ? ?/sec                     1.02     29.1±0.97ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    328.9±8.20ms        ? ?/sec                     1.00    327.6±8.18ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.3±0.31ms        ? ?/sec                     1.00     11.2±0.34ms        ? ?/sec
variant_get_primitive                                                1.00  1106.9±40.65µs        ? ?/sec                     1.01  1121.0±43.28µs        ? ?/sec

PackedU32Iterator with extend

group                                                                7978_optimizer_header_generation_with_extend    main
-----                                                                --------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.04     30.2±1.82ms        ? ?/sec             1.00     29.1±0.97ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.06    345.8±9.52ms        ? ?/sec             1.00    327.6±8.18ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.12     12.5±0.77ms        ? ?/sec             1.00     11.2±0.34ms        ? ?/sec
variant_get_primitive                                                1.00  1123.7±50.11µs        ? ?/sec             1.00  1121.0±43.28µs        ? ?/sec

From the conversations in previous PRs [1][2], I think we prefer to

  1. use an iterator (or tmp buf) with splice, so that we won't miscalculate the size we splice into the parent's buffer
  2. avoid tmp buf creation

For this and the benchmarks, I have some questions

  1. for the miscalculated-header problem, can we use unit tests to cover the header size calculation logic?
  2. for the tmp buf creation problem, the PR "Remove redundant is_err checks in Variant tests" #7897 suggested it would perform better if we use a tmp buf with append_packed_u32; is it that the tmp buffer creation cost may or may not be the bottleneck, depending on the workload we're on? If so, do we need to add some more benchmarks? (also, do we need to bench different header metadata, such as is_large being true?)
  3. the variant_kernels benchmark will run batch_json_string_to_variant random_json({} bytes per document) twice; is the first run for warmup?

Contributor

@alamb alamb left a comment

Thanks @klion26 -- I agree this appears somewhat complicated for what it is doing and doesn't really make a huge difference

I left some thoughts on specialization

Let me know what you think

cc @scovich

(if is_large { 4 } else { 1 }) + // num_fields
(num_fields * id_size as usize) + // field IDs
((num_fields + 1) * offset_size as usize); // field offsets + data_size
let num_fileds_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
Contributor

This variable name seems to have a typo:

Suggested change
let num_fileds_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.

Contributor

also: I personally find the comment unhelpful -- it just restates the code.

/// An iterator that yields the bytes of a packed u32 iterator.
/// Will yield the first `packed_bytes` bytes of each item in the iterator.
struct PackedU32Iterator<T: Iterator<Item = [u8; 4]>> {
packed_bytes: usize,
Contributor

I think to get this really fast we will need to use generics based on the packed_bytes size -- so we would end up with different versions of the code for 1,2 and 4 byte offsets

We could try to make this particular iterator generic, but it might get a bit messy.

I was thinking maybe we can somehow structure the code so there is a function like this that writes the header for a certain offset size:

fn write_header<const SIZE: usize>(dst: &mut Vec<u8>, ...) {
  ...
}

Then we would basically have a switch like this to instantiate the appropriate versions (can probably avoid the panic)

match  int_size(max_id as usize) {
  1 => write_header::<1>(dst, ...),
  2 => write_header::<2>(dst, ...),
  4 => write_header::<4>(dst, ...),
  _ => panic!("unsupported size")
}

Contributor

Even just specializing the PackedU32Iterator with a const N: usize parameter would probably give most of the benefit, for this iterator-based approach?

However -- the temp buf append+truncate is fundamentally faster on a per-entry basis, because of how much simpler it is at machine code level. And it will run at exactly the same speed for all packing sizes, because there are no branches or even conditional moves based on the packing size (**). The only downside is the up-front cost of allocating said temp buffer first, which has to amortize across all entries.

(**) If the compiler were a bit smarter, it could even eliminate the bound checks in the loop's push calls (based on the result of the with_capacity that preceded it), and also eliminate the bounds checks in the loop's truncate calls (based on the fact that the arg passed to truncate is never larger than the vec size).
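
For reference, a minimal sketch of the append+truncate pattern being discussed, assuming a helper that takes a runtime byte width like the tmp-buf snippet above (the crate's actual helper may differ):

// Appends the low `nbytes` little-endian bytes of `value` to `dest`.
// Always extends with all four bytes, then truncates the extras -- a
// branchless write per entry, at the cost of up to 3 temporary bytes.
fn append_packed_u32(dest: &mut Vec<u8>, value: u32, nbytes: usize) {
    let keep = dest.len() + nbytes;
    dest.extend_from_slice(&value.to_le_bytes()); // always writes 4 bytes
    dest.truncate(keep); // drop the bytes we don't keep
}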

Contributor

I haven't looked into the variant implementation or this PR in detail, but does this iterator need a custom implementation or can it be an anonymous impl Iterator<Item=u8>? It looks like the implementation is basically iter.flat_map(u32::to_le_bytes). The benefit of that is that if the underlying iterator has a trusted length then the flat-mapped iterator is still trusted, and extending a slice from a TrustedLen iterator is much more efficient.
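
For illustration, a sketch of that flat_map idea, assuming a fixed 4-byte item (as the follow-up below notes, it does not directly handle a dynamic packed_bytes width):

// Flattens each u32 into its four little-endian bytes; works when every item
// contributes a constant number of bytes, unlike the packed_bytes case.
fn offsets_as_bytes(offsets: impl Iterator<Item = u32>) -> impl Iterator<Item = u8> {
    offsets.flat_map(u32::to_le_bytes)
}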

Member Author

@jhorstmann IIUC, the benefit of this iterator is size_hint: Rust can use the size_hint to do a better job, as it knows the size of the iterator.

Contributor

However -- the temp buf append+truncate is fundamentally faster on a per-entry basis, because of how much simpler it is at machine code level. And it will run at exactly the same speed for all packing sizes, because there are no branches or even conditional moves based on the packing size (**). The only downside is the up-front cost of allocating said temp buffer first, which has to amortize across all entries.

Yes I agree with this analysis.

This is why my opinion still remains that the way to avoid the allocation is to calculate the header size and shift the bytes first so we write directly to the target. I realize that is trickier code, but as @klion26 has said somewhere else I think we could write some good assertions and tests to avoid problems
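
For illustration, a minimal sketch of that shift-first idea, assuming a Vec<u8> buffer and a precomputed header_size (not the PR's actual code):

// Reserve header_size bytes at start_offset by shifting the existing data right,
// then write the header directly into the reserved region -- no temp buffer.
fn write_header_in_place(buffer: &mut Vec<u8>, start_offset: usize, header_size: usize, header_byte: u8) {
    buffer.splice(start_offset..start_offset, std::iter::repeat_n(0u8, header_size));
    buffer[start_offset] = header_byte;
    // field ids, offsets, and data_size would then be written at computed
    // positions within buffer[start_offset + 1..start_offset + header_size]
}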

When I next get a chance to work on Variant related things with time, I want to focus on unsticking the shredding implementation. Once that is done, I may return here to try and restructure things

One thought I had was to encapsulate the logic / templating in a

struct ListHeaderWriter<'a> {
  offsets: &'a [usize],
}

impl ListHeaderWriter<'_> {
  fn header_size<const OFFSET_SIZE: usize>(&self) -> usize {
    // put the header size calculation here
  }
  // write the header into dst starting at start_offset
  fn write<const OFFSET_SIZE: usize>(&self, dst: &mut Vec<u8>, start_offset: usize) {
    ...
  }
}

Contributor

It seems you are right: FlatMap would only work for constant-length arrays, not for dynamic slices whose length depends on packed_bytes. And the TrustedLen optimization also cannot kick in, since the base iterator comes from a crate outside the standard library.

@scovich
Contributor

scovich commented Jul 31, 2025

I've tried to benchmark three implementations

Thanks for taking the time to do this!

The results show that the first one wins (but not by much).

Technically, the first two are equivalent because each result's mean is comfortably inside the other's confidence window. While it may be suggestive that the temp buf approach has a higher mean in all four results, any real difference is dwarfed by noise:

group                     PackedU32Iterator    append into tmp buf
json_list                 28.4±1.05ms          28.5±1.09ms
random_json               324.6±8.35ms         328.9±8.20ms
repeated_struct           11.0±0.40ms          11.3±0.31ms
variant_get_primitive     1105.8±33.81µs       1106.9±40.65µs

(aside: why would variant_get_primitive case (= reading) even be relevant for variant building?)

Meanwhile, it's unsurprising in retrospect that the complex iterator approach would be slower than the append+truncate approach, given how simple (branchless!) the latter's machine code is.

for the miscalculated-header problem, can we use unit tests to cover the header size calculation logic?

I mean, yes we can. But given a choice to eliminate a bug surface entirely from the code base, vs. try to validate it with unit tests that could be incomplete and/or go out of sync? I almost always favor eliminating the possibility of bugs over testing for them -- unless the perf and/or complexity difference is large enough to make me question that default choice.

do we need to bench for different header metadata such as is_large is true?

If we care about the performance (and correctness) of that case, it seems like we need to have benchmarks and unit tests to cover it?

Contributor

@scovich scovich left a comment

this appears somewhat complicated for what it is doing and doesn't really make a huge difference

If we want to measure the cost of safety, it might be instructive to modify append_offset_array_start_from_buf_pos to take the same approach as append_packed_u32 in writing more bytes than we keep -- something like:

        let mut current_pos = start_pos;
        for relative_offset in offsets {
-           write_offset_at_pos(buf, current_pos, relative_offset, nbytes);
+           write_u32_at_pos(buf, current_pos, relative_offset as u32);
            current_pos += nbytes as usize;
        }

        // Write data_size
        if let Some(data_size) = data_size {
            // Write data_size at the end of the offsets
            write_offset_at_pos(buf, current_pos, data_size, nbytes);
            current_pos += nbytes as usize;
        }

where

// always writes the four LE bytes of a value, relying on the caller to only advance
// the buffer's position by the number of bytes that should be kept (1..=4). 
//
// WARNING: Caller must ensure that it's safe to produce as many as 3 extra bytes
// without corrupting the buffer. Usually by ensuring that something else overwrites
// the space immediately after.
fn write_u32_at_pos(buf: &mut [u8], start_pos: usize, value: u32) {
    let bytes = value.to_le_bytes();
    buf[start_pos..start_pos + bytes.len()].copy_from_slice(&bytes);
}

Highly dangerous -- it relies on the fact that the field array is followed by the value array, and the value array is followed by the data size (which is written in the safer/slower way) -- but we could at least see whether it gives enough of a performance boost to consider keeping it.

IMO, if we want to keep the current unsafe approach that uses offset arithmetic in a pre-allocated buffer... we may as well go all the way and maximize the performance benefit we gain by giving up safety.


let data_size_bytes = (data_size as u32).to_le_bytes();
let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();
let header = object_header(is_large, id_size, offset_size);
let bytess_to_splice = std::iter::once(header)
Contributor

Suggested change
let bytess_to_splice = std::iter::once(header)
let bytes_to_splice = std::iter::once(header)

@klion26
Member Author

klion26 commented Aug 1, 2025

@alamb I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant; 2) I originally hoped to keep the same implementation for ListBuilder/ObjectBuilder, but it seems that changes to ObjectBuilder might result in performance regressions.

@scovich

I mean, yes we can. But given a choice to eliminate a bug surface entirely from the code base, vs. try to validate it with unit tests that could be incomplete and/or go out of sync? I almost always favor eliminating the possibility of bugs over testing for them -- unless the perf and/or complexity difference is large enough to make me question that default choice

Thanks for your detailed response. I agree with your point. It's more reliable to fundamentally address the potential for errors. This way, even if there's no test to guarantee this logic for any reason in the future, it will continue to run correctly.

I'm a little confused about the benchmark. Why is PackedU32Iterator + extend the slowest? Is it because it calls extend multiple times, which results in data movement? (Using PackedU32Iterator allows extend to take advantage of size_hint.)

For the unsafe code with write_u32_at_pos, I tried it, but it requires further changes; otherwise, it will panic in write_u32_at_pos because there is no more room. If we need to benchmark this, I'll try to make this change and benchmark it against the current solution.

@alamb @scovich Finally, thank you for your detailed review and guidance on these PRs. I've learned a lot from them. Thank you again.

@alamb
Contributor

alamb commented Aug 1, 2025

@alamb I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant;

Yes I agree with this plan for now

@alamb @scovich Finally, thank you for your detailed review and guidance on these PRs. I've learned a lot from them. Thank you again.

Thank you again for all the help

@scovich
Contributor

scovich commented Aug 1, 2025

Why is PackedU32Iterator + extend the slowest? Is it because it calls extend multiple times, which results in data movement? (Using PackedU32Iterator allows extend to take advantage of size_hint.)

I suspect it's because the compiler can't "see through" the packed iterator like it can with the append+truncate approach. So the former produces 1-4 individual byte appends, each with its own bounds check and length change, while the latter produces a single 32-bit append followed by a truncate.

For the unsafe code with write_u32_at_pos, I tried it, but it requires further changes; otherwise, it will panic in write_u32_at_pos because there is no more room.

If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the Some(data_size) entry, after the loop exits. Or did you hit a problem in the loop itself?

I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant.

The same argument likely applies to the object builder. Should we consider reverting that back to match the list builder, for simplicity and maintainability? I'd have to go back and check the other PR, but was the perf improvement actually significant there?

If so => why did it not work for the list builder? Do we not have apples-to-apples benchmarking?

If not => probably good to revert.

@klion26
Member Author

klion26 commented Aug 2, 2025

@scovich

If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the Some(data_size) entry, after the loop exits. Or did you hit a problem in the loop itself?

It would panic in the following test when creating a variant with VariantBuilder::new.

let (m1, v1) = make_nested_object();
let variant = Variant::new(&m1, &v1);

// because we can guarantee metadata is validated through the builder
let mut builder = VariantBuilder::new().with_metadata(VariantMetadata::new(&m1));
// panics here because we have a buffer with size 6 and will try to copy
// 4 bytes into the buffer starting from offset 3
builder.append_value(variant.clone());

let (metadata, value) = builder.finish();
let result_variant = Variant::new(&metadata, &value);

assert_eq!(variant, result_variant);

The same argument likely applies to the object builder. Should we consider reverting that back to match the list builder, for simplicity and maintainability? I'd have to go back and check the other PR, but was the perf improvement actually significant there?

Sorry, I've rechecked the current ObjectBuilder::finish logic (I misremembered the implementation in the main branch). I think either the current approach or aligning it with ListBuilder::finish is fine with me. The current approach has some possibility of introducing bugs, as @scovich said in the previous comment, but for now we have unit tests which ensure correctness. Aligning the logic with ListBuilder::finish would fundamentally ensure correctness (and the benchmark difference is minimal). @alamb @scovich, I can modify the code if necessary.

header_pos = self
.parent_state
.buffer()
.append_offset_array_start_from_buf_pos(
Member

Not related to this PR, but append_offset_array_start_from_buf_pos also writes field ids instead of just offsets.

Member Author

Will change the function name after this PR is finalized.

}

impl<T: Iterator<Item = [u8; 4]>> PackedU32Iterator<T> {
fn new(packed_bytes: usize, iterator: T) -> Self {
Member

It's better to add an assert on packed_bytes.

Member Author

Thanks for pointing it out; adding an assert is indeed better behavior.

buffer.splice(
starting_offset..starting_offset,
std::iter::repeat_n(0u8, header_size),
let fields = PackedU32Iterator::new(
Member

Honestly, the current approach looks easier to understand at first glance. This PackedU32Iterator approach is a bit of an over-abstraction to me. For appending the bytes into the buffer, I'd prefer to keep it simple instead of introducing a new abstraction if there isn't much benefit.

@scovich
Contributor

scovich commented Aug 3, 2025

If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the Some(data_size) entry, after the loop exits. Or did you hit a problem in the loop itself?

It would panic in the following test when creating a variant with VariantBuilder::new.

let (m1, v1) = make_nested_object();
let variant = Variant::new(&m1, &v1);

// because we can guarantee metadata is validated through the builder
let mut builder = VariantBuilder::new().with_metadata(VariantMetadata::new(&m1));
// panics here because we have a buffer with size 6 and will try to copy
// 4 bytes into the buffer starting from offset 3
builder.append_value(variant.clone());

Ah... I didn't think about that. With one-byte offsets, Some(data_size) (= one byte) isn't enough buffer to avoid overflow when writing 4 bytes. Drat.

@alamb alamb marked this pull request as draft August 5, 2025 19:28
@alamb
Contributor

alamb commented Aug 5, 2025

Converting to a draft as I think we are leaning towards not merging this one and I want to clean up the PR queue

@klion26
Member Author

klion26 commented Dec 29, 2025

As this is included in #8480, which aims to finalize Variant type support in Parquet, I want to push this forward to a final state.

Conclusion

After playing with some benchmarks, I have some conclusions:

  • We can add a HeaderWriter for the object/list builders (the header writer is generic over const parameters); ListBuilder and ObjectBuilder use the same header writer, and it improves the new benchmark (described below) by 3%~4%.
  • We need some new benchmarks to better measure the object/list builders' work.

CC @alamb @scovich

Details

  • I re-ran the existing benchmarks for multiple solutions (including the const-generic one) and found there wasn't much difference in the results.
  • I used the sample and flamegraph tools to generate a flamegraph, which shows that ObjectBuilder::finish consumes only ~2% of the time, JSON-related work consumes ~54%, ObjectBuilder::insert ~14%, and ObjectBuilder::append_value ~15% (flamegraph attached below).
  • I added a new benchmark (described below) that excludes the JSON deserialization work for the ObjectBuilder; it shows that ObjectBuilder::finish consumes ~27% and ObjectBuilder::append_value ~62%.
  • I ran a benchmark for the single-depth object builder with 100/200 fields, which shows that the new solution wins with a 3%~4% improvement.
  • I did not add a benchmark for ObjectBuilder::finish alone because append_value and finish are a single unit for the caller.

Benchmark

The benchmark results

  • optimize-with-header-writer-const-generic is the optimized HeaderWriter with const generics (branch goes here, sample code attached below)
  • main_with_object_bench is the main branch with the new object builder benchmark (the generator code here, the benchmark code here, sample code below)
  • optimize-with-packedu32iterator is based on main-with-object-bench and uses the PackedU32Iterator (code here -- PackedU32Iterator, and the caller logic, and sample code below)
  • The results show that optimize-with-header-writer-const-generic and optimize-with-packedu32iterator are both better than main (optimize-with-packedu32iterator is better), but as optimize-with-header-writer-const-generic is easier to understand (and maybe maintain), I lean towards using optimize-with-header-writer-const-generic
group                                                                                           optimize-with-header-writer-const-generic     main-with-object-bench            optimize-with-packedu32iterator
-----                                                                                           --------------------------------------------------------    ---------------------------            ----------------------------------------------------
batch_json_string_to_variant json_list 8k string                                                1.00     27.5±0.89ms        ? ?/sec                         1.09     29.9±0.81ms        ? ?/sec    1.03     28.3±0.76ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(8000-100) random_json(2436 bytes per document)    1.01    204.1±4.21ms        ? ?/sec                         1.03    208.6±4.17ms        ? ?/sec    1.00    201.8±4.59ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(8000-200) random_json(4871 bytes per document)    1.02    389.8±7.24ms        ? ?/sec                         1.04    399.6±7.10ms        ? ?/sec    1.00    383.8±7.35ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)                               1.01    328.1±5.06ms        ? ?/sec                         1.04    337.8±6.06ms        ? ?/sec    1.00    325.6±5.64ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                          1.06     11.0±0.35ms        ? ?/sec                         1.05     11.0±0.33ms        ? ?/sec    1.00     10.4±0.28ms        ? ?/sec
variant_get_primitive                                                                           1.01  1730.8±49.77ns        ? ?/sec                         1.03  1759.5±53.97ns        ? ?/sec    1.00  1712.1±57.32ns        ? ?/sec
variant_get_shredded_utf8                                                                       1.02  1248.2±43.11ns        ? ?/sec                         1.04  1276.2±26.61ns        ? ?/sec    1.00  1225.9±38.09ns        ? ?/sec

Ref

new benchmark logic
       write!(output_buffer, "{{").unwrap();
        for i in 0..*max_fields {
            let key_length = rng.random_range(1..=20);
            let key: String = (0..key_length)
                .map(|_| rng.sample(Alphanumeric) as char)
                .collect();
            write!(output_buffer, "\"{key}\":").unwrap();

            let total_weight = *null_weight + *string_weight + *number_weight + *boolean_weight;

            // Generate a random number to determine the type
            let mut random_value: usize = rng.random_range(0..total_weight);

            if random_value <= *null_weight {
                write!(output_buffer, "null").unwrap();
            } else {
                random_value -= *null_weight;

                if random_value <= *string_weight {
                    // Generate a random string between 1 and 20 characters
                    let length = rng.random_range(1..=20);
                    let random_string: String = (0..length)
                        .map(|_| rng.sample(Alphanumeric) as char)
                        .collect();
                    write!(output_buffer, "\"{random_string}\"",).unwrap();
                } else {
                    random_value -= *string_weight;

                    if random_value <= *number_weight {
                        // 50% chance of generating an integer or a float
                        if rng.random_bool(0.5) {
                            // Generate a random integer
                            let random_integer: i64 = rng.random_range(-1000..1000);
                            write!(output_buffer, "{random_integer}",).unwrap();
                        } else {
                            // Generate a random float
                            let random_float: f64 = rng.random_range(-1000.0..1000.0);
                            write!(output_buffer, "{random_float}",).unwrap();
                        }
                    } else {
                        random_value -= *number_weight;

                        if random_value <= *boolean_weight {
                            // Generate a random boolean
                            let random_boolean: bool = rng.random();
                            write!(output_buffer, "{random_boolean}",).unwrap();
                        }
                    }
                }
            }
            if i < *max_fields - 1 {
                write!(output_buffer, ",").unwrap();
            }
        }
        write!(output_buffer, "}}").unwrap();
PackedU32Iterator
struct PackedU32Iterator<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> {
    iterator: T,
    current_item: [u8; 4],
    current_byte: usize, // 0..=PACKED_BYTES; equal to PACKED_BYTES when current_item is exhausted
}

impl<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> PackedU32Iterator<PACKED_BYTES, T> {
    fn new(packed_bytes: usize, iterator: T) -> Self {
        // the runtime width must match the const generic (per the review suggestion to assert on packed_bytes)
        assert_eq!(packed_bytes, PACKED_BYTES);
        // eliminate corner cases in `next` by initializing
        // with a fake already-consumed "first" item
        Self {
            iterator,
            current_item: [0; 4],
            current_byte: packed_bytes,
        }
    }
}

impl<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> Iterator
    for PackedU32Iterator<PACKED_BYTES, T>
{
    type Item = u8;

    fn next(&mut self) -> Option<Self::Item> {
        if self.current_byte >= PACKED_BYTES {
            self.current_item = self.iterator.next()?;
            self.current_byte = 0;
        }

        let rval = self.current_item[self.current_byte];
        self.current_byte += 1;
        Some(rval)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let lower = (PACKED_BYTES - self.current_byte) + PACKED_BYTES * self.iterator.size_hint().0;
        (lower, Some(lower))
    }
}
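
A small usage sketch for the iterator above (the offset values are made up for illustration): packing 2-byte offsets and extending a buffer with the result.

let offsets: Vec<u32> = vec![0, 5, 9, 14];
let packed = PackedU32Iterator::<2, _>::new(2, offsets.iter().map(|o| o.to_le_bytes()));
let mut buf: Vec<u8> = Vec::new();
buf.extend(packed); // two little-endian bytes per offset
assert_eq!(buf, vec![0, 0, 5, 0, 9, 0, 14, 0]);
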
header writer with const generic
fn object_header<const LARGE_BIT: u8, const ID_SIZE: u8, const OFFSET_SIZE: u8>() -> u8 {
    (LARGE_BIT << (BASIC_TYPE_BITS + 4))
        | ((ID_SIZE - 1) << (BASIC_TYPE_BITS + 2))
        | ((OFFSET_SIZE - 1) << BASIC_TYPE_BITS)
        | VariantBasicType::Object as u8
}

struct ObjectHeaderWriter<const OFFSET_SIZE: u8, const ID_SIZE: u8>();

impl<const OFFSET_SIZE: u8, const ID_SIZE: u8> ObjectHeaderWriter<OFFSET_SIZE, ID_SIZE> {
    fn write(
        dst: &mut Vec<u8>,
        num_fields: usize,
        field_ids: impl Iterator<Item = u32>,
        offsets: impl Iterator<Item = usize>,
        data_size: usize,
    ) {
        // num_fields takes 4 bytes when it is larger than u8::MAX, otherwise 1 byte
        if num_fields > u8::MAX as usize {
            dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
            append_packed_u32::<4>(dst, num_fields as u32);
        } else {
            dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
            append_packed_u32::<1>(dst, num_fields as u32);
        }

        for id in field_ids {
            append_packed_u32::<ID_SIZE>(dst, id);
        }

        for off in offsets {
            append_packed_u32::<OFFSET_SIZE>(dst, off as u32);
        }

        append_packed_u32::<OFFSET_SIZE>(dst, data_size as u32);
    }
}

fn append_packed_u32<const SIZE: u8>(dest: &mut Vec<u8>, value: u32) {
    let len = dest.len() + SIZE as usize;
    dest.extend(value.to_le_bytes());
    dest.truncate(len);
}
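
For completeness, a sketch of how a builder might dispatch on the runtime widths to these const-generic writers (the review comments below show the PR's actual match; the function name here is illustrative):

fn write_object_header(
    dst: &mut Vec<u8>,
    offset_size: u8,
    id_size: u8,
    num_fields: usize,
    field_ids: impl Iterator<Item = u32>,
    offsets: impl Iterator<Item = usize>,
    data_size: usize,
) {
    // each arm instantiates a monomorphized writer for that (offset_size, id_size) pair
    match (offset_size, id_size) {
        (1, 1) => ObjectHeaderWriter::<1, 1>::write(dst, num_fields, field_ids, offsets, data_size),
        (1, 2) => ObjectHeaderWriter::<1, 2>::write(dst, num_fields, field_ids, offsets, data_size),
        // ... remaining (offset_size, id_size) combinations ...
        (4, 4) => ObjectHeaderWriter::<4, 4>::write(dst, num_fields, field_ids, offsets, data_size),
        _ => panic!("unsupported offset_size/id_size combination"),
    }
}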

The flamegraph was obtained with these steps:

  1. execute cargo bench --no-run to get the executable
  2. run the resulting executable with the benchmark arguments
  3. use sudo sample $pid 60 -file /tmp/variant_kernels.sample to generate the sample file
  4. use flamegraph tools to generate the flamegraph
    (I used these steps because I can't install Xcode for now :( )

The flamegraph with JSON deserialize
variant_kernels-object

The flamegraph without the JSON deserialize
variant_kernels-object-no-json-desr

@alamb alamb marked this pull request as ready for review December 30, 2025 13:01
Contributor

@alamb alamb left a comment

Thanks @klion26 -- your analysis makes sense to me and I think we can merge this PR (we can always make it better as a follow on). I'll run some benchmarks and assuming all looks good I'll plan to merge it

@alamb
Contributor

alamb commented Dec 30, 2025

run benchmark variant_kernels variant_builder

@alamb
Contributor

alamb commented Dec 30, 2025

(BTW this PR has some merge conflicts that need to be resolved)

@apache apache deleted a comment from alamb-ghbot Dec 30, 2025
@alamb-ghbot

Benchmark script failed with exit code 128.

Last 10 lines of output:

Click to expand
From https://github.com/apache/arrow-rs
 * [new ref]               refs/pull/8031/head -> 7978_optimize_header_generatation_with_iterator
Previous HEAD position was 2d6fc518ed Add examples for min and max functions (#9062)
Switched to branch '7978_optimize_header_generatation_with_iterator'
M	parquet-testing
++ git submodule update --init
From https://github.com/apache/parquet-testing
   a3d96a6..92d45b0  master     -> origin/master
fatal: remote error: upload-pack: not our ref b68bea40fed8d1a780a9e09dd2262017e04b19ad
fatal: Fetched in submodule path 'parquet-testing', but it did not contain b68bea40fed8d1a780a9e09dd2262017e04b19ad. Direct fetching of that commit failed.

@alamb
Contributor

alamb commented Dec 30, 2025

(I think if we merge up / fix the conflicts I can get a clean benchmark run)

@klion26 klion26 force-pushed the 7978_optimize_header_generatation_with_iterator branch from 4a50b88 to 963f194 Compare January 4, 2026 12:22
@github-actions github-actions bot added parquet-variant parquet-variant* crates and removed parquet Changes to the parquet crate labels Jan 4, 2026
@klion26 klion26 force-pushed the 7978_optimize_header_generatation_with_iterator branch from 963f194 to 486f379 Compare January 4, 2026 15:31
@klion26
Member Author

klion26 commented Jan 4, 2026

@alamb thanks for the review and happy new year! I've updated the code

  • Commit 25ad804: Refines ObjectBuilder::finish.
  • Commit 486f379: Adds benchmarking for ObjectBuilder.

@klion26
Member Author

klion26 commented Jan 4, 2026

run benchmark variant_kernels variant_builder

@alamb-ghbot

🤖 Hi @klion26, thanks for the request (#8031 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Omega359, adriangb, alamb, comphead, geoffreyclaude, rluvaton, xudong963, zhuqi-lucas.

@klion26
Member Author

klion26 commented Jan 4, 2026

@alamb Please help to trigger a benchmark, thanks

@alamb
Contributor

alamb commented Jan 5, 2026

run benchmark variant_kernels variant_builder

@alamb
Contributor

alamb commented Jan 5, 2026

@alamb Please help to trigger a benchmark, thanks

Thanks @klion26 -- I triggered the benchmark and also added you to the list of people who can run them (see alamb/datafusion-benchmarking@c6d1404)

@alamb
Contributor

alamb commented Jan 5, 2026

I am sorry for the delay -- I have been out for a while.

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (fccab21) to a9d6e92 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                             7978_optimize_header_generatation_with_iterator    main
-----                                                                                             -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                                                  1.01     23.9±0.15ms        ? ?/sec                1.00     23.6±0.76ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(100 fields) random_json(2436 bytes per document)    1.00    200.2±1.09ms        ? ?/sec              
batch_json_string_to_variant object - 1 depth(200 fields) random_json(4871 bytes per document)    1.00    413.9±2.56ms        ? ?/sec              
batch_json_string_to_variant random_json(2633 bytes per document)                                 1.03    315.4±2.54ms        ? ?/sec                1.00    307.0±3.49ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                            1.03      7.3±0.05ms        ? ?/sec                1.00      7.1±0.06ms        ? ?/sec
variant_get_primitive                                                                             1.04   946.7±21.79ns        ? ?/sec                1.00    910.0±5.51ns        ? ?/sec
variant_get_shredded_utf8                                                                         1.01    694.2±5.75ns        ? ?/sec                1.00    688.6±7.63ns        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (fccab21) to a9d6e92 diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                       7978_optimize_header_generatation_with_iterator    main
-----                                       -----------------------------------------------    ----
bench_extend_metadata_builder               1.04     59.4±4.51ms        ? ?/sec                1.00     57.0±2.32ms        ? ?/sec
bench_object_field_names_reverse_order      1.00     18.9±0.30ms        ? ?/sec                1.02     19.4±0.51ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1249.6±22.05µs        ? ?/sec                1.00  1250.7±16.55µs        ? ?/sec
bench_object_list_same_schema               1.03     24.9±0.46ms        ? ?/sec                1.00     24.2±0.23ms        ? ?/sec
bench_object_list_unknown_schema            1.01     13.3±0.11ms        ? ?/sec                1.00     13.3±0.33ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.3±0.05ms        ? ?/sec                1.00      3.3±0.15ms        ? ?/sec
bench_object_same_schema                    1.00     37.6±0.86ms        ? ?/sec                1.00     37.8±0.21ms        ? ?/sec
bench_object_unknown_schema                 1.01     16.2±0.13ms        ? ?/sec                1.00     16.1±0.10ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.6±0.01ms        ? ?/sec                1.04      2.7±0.08ms        ? ?/sec
iteration/validated_iteration               1.03     48.6±0.88µs        ? ?/sec                1.00     47.3±0.67µs        ? ?/sec
validation/unvalidated_construction         1.00      6.5±0.03µs        ? ?/sec                1.00      6.5±0.05µs        ? ?/sec
validation/validated_construction           1.01     61.0±0.88µs        ? ?/sec                1.00     60.4±0.57µs        ? ?/sec
validation/validation_cost                  1.01     54.4±0.59µs        ? ?/sec                1.00     54.1±1.08µs        ? ?/sec

@klion26
Member Author

klion26 commented Jan 7, 2026

There are some regressions in the variant_builder benchmark. I'm looking into it.

@alamb
Contributor

alamb commented Jan 7, 2026

There are some regressions in the variant_builder benchmark. I'm looking into it.

FWIW I don't think the benchmarks are precise enough (the VM I run them on regularly has small ~5% variances) so it would be good to verify / reproduce the results locally before you spend too much time on it

Contributor

@scovich scovich left a comment

Approach LGTM. I think we can still simplify and tighten up the code a bit (readability, not speed) to end up with something that's faster than current main with fewer lines of code. Always a nice combo.

Comment on lines 37 to 38
pub use from_json::JsonToVariant;
pub use from_json::append_json;
Contributor

Suggested change
pub use from_json::JsonToVariant;
pub use from_json::append_json;
pub use from_json::{append_json, JsonToVariant};

?

Comment on lines 44 to 55
let is_large = num_fields > u8::MAX as usize;
match is_large {
true => {
dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
// num_fields will consume 4 bytes when it is larger than u8::MAX
append_packed_u32::<4>(dst, num_fields);
}
false => {
dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<1>(dst, num_fields);
}
};
Contributor

Suggested change
let is_large = num_fields > u8::MAX as usize;
match is_large {
true => {
dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
// num_fields will consume 4 bytes when it is larger than u8::MAX
append_packed_u32::<4>(dst, num_fields);
}
false => {
dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<1>(dst, num_fields);
}
};
// num_fields will consume 4 bytes when it is larger than u8::MAX
if num_fields > u8::MAX as usize {
dst.push(object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<4>(dst, num_fields);
} else {
dst.push(object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>());
append_packed_u32::<1>(dst, num_fields);
}

Comment on lines 290 to 291
// let header = object_header(is_large, id_size, offset_size);
// bytes_to_splice.push(header);
Contributor

?

self.fields.values().copied(),
data_size,
),
_ => panic!("Unsupported offset_size/id_size combination"),
Contributor

This match arm would be unnecessary if int_size returned OffsetSizeBytes (decoder.rs)... except I don't think const generics work with enum variants? We'd have to do e.g.

use OffsetSizeBytes::*;
match (offset_size, id_size) {
      ...
    (Four, Two) => ObjectHeaderWriter::<4, 2>::write(...),
      ...
}

Contributor

I confirmed that enum variant casting does actually work for const generics, so we could also do:

use OffsetSizeBytes::*;
match (offset_size, id_size) {
      ...
    (Four, Two) => ObjectHeaderWriter::<{Four as _}, {Two as _}>::write(...),
      ...
}

Comment on lines 293 to 303
match (offset_size, id_size) {
(1, 1) => ObjectHeaderWriter::<1, 1>::write(
&mut bytes_to_splice,
num_fields,
self.fields.keys().copied(),
self.fields.values().copied(),
data_size,
),
(1, 2) => ObjectHeaderWriter::<1, 2>::write(
&mut bytes_to_splice,
num_fields,
self.fields.keys().copied(),
self.fields.values().copied(),
data_size,
),
Contributor

This is begging for a helper macro to simplify things... but I have no idea how to write a macro that expands a cartesian product :(

Contributor

We might define a local macro that captures local state, just to reduce the boilerplate a bit:

macro_rules! write_object_header {
    ($offset_size:ident, $id_size:ident) => {
        ObjectHeaderWriter::<$offset_size, $id_size>::write(...)
    };
}

match (offset_size, id_size) {
    (1, 1) => write_object_header!(1, 1),
    (1, 2) => write_object_header!(1, 2),
      ..
    (4, 3) => write_object_header!(4, 3),
    (4, 4) => write_object_header!(4, 4),
    _ => { ... }
}

Contributor

Combining that with the other idea to have int_size return OffsetSizeBytes would yield:

macro_rules! write_object_header {
    ($offset_size:ident, $id_size:ident) => {
        ObjectHeaderWriter::<{$offset_size as _}, {$id_size as _}>::write(...)
    };
}

match (offset_size, id_size) {
    (One, One) => write_object_header!(One, One),
    (One, Two) => write_object_header!(One, Two),
      ...
    (Four, Three) => write_object_header!(Four, Three),
    (Four, Four) => write_object_header!(Four, Four),
}

Contributor

@scovich scovich Jan 7, 2026

(I figured out a way to write a cartesian product expansion macro (it needs two macros, actually), but it's hideously complex and way more than sixteen LoC. Not worth it unless we're really worried about typos in the match arms)

@klion26 klion26 force-pushed the 7978_optimize_header_generatation_with_iterator branch from fccab21 to 528f04c Compare January 8, 2026 06:58
@klion26
Member Author

klion26 commented Jan 8, 2026

run benchmark variant_kernels variant_builder

@klion26
Member Author

klion26 commented Jan 8, 2026

FWIW I don't think the benchmarks are precise enough (the VM I run them on regularly has small ~5% variances) so it would be good to verify / reproduce the results locally before you spend too much time on it

In my local testing, the difference was less than 5%. After some minor modifications, the results are now almost identical (the same benchmark occasionally performs better and occasionally worse, by about 2%).

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (df7a0ee) to 9e822e0 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                             7978_optimize_header_generatation_with_iterator    main
-----                                                                                             -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                                                  1.00     22.8±0.19ms        ? ?/sec                1.02     23.3±0.19ms        ? ?/sec
batch_json_string_to_variant object - 1 depth(100 fields) random_json(2436 bytes per document)    1.00    196.9±1.87ms        ? ?/sec              
batch_json_string_to_variant object - 1 depth(200 fields) random_json(4871 bytes per document)    1.00    410.3±2.93ms        ? ?/sec              
batch_json_string_to_variant random_json(2633 bytes per document)                                 1.01    310.2±1.80ms        ? ?/sec                1.00    308.5±1.66ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                            1.00      7.2±0.09ms        ? ?/sec                1.00      7.2±0.13ms        ? ?/sec
variant_get_primitive                                                                             1.00    924.0±4.79ns        ? ?/sec                1.01   928.7±18.64ns        ? ?/sec
variant_get_shredded_utf8                                                                         1.02    700.3±6.59ns        ? ?/sec                1.00    689.4±8.01ns        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 7978_optimize_header_generatation_with_iterator (df7a0ee) to 9e822e0 diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=7978_optimize_header_generatation_with_iterator
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                       7978_optimize_header_generatation_with_iterator    main
-----                                       -----------------------------------------------    ----
bench_extend_metadata_builder               1.00     56.7±1.67ms        ? ?/sec                1.03     58.3±2.90ms        ? ?/sec
bench_object_field_names_reverse_order      1.00     19.2±0.36ms        ? ?/sec                1.02     19.7±0.43ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1230.6±16.62µs        ? ?/sec                1.02  1257.7±14.60µs        ? ?/sec
bench_object_list_same_schema               1.00     23.7±0.40ms        ? ?/sec                1.04     24.7±0.46ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.1±0.11ms        ? ?/sec                1.01     13.2±0.08ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.2±0.01ms        ? ?/sec                1.02      3.3±0.09ms        ? ?/sec
bench_object_same_schema                    1.00     37.4±0.15ms        ? ?/sec                1.02     38.1±0.33ms        ? ?/sec
bench_object_unknown_schema                 1.00     16.2±0.22ms        ? ?/sec                1.00     16.2±0.37ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.1±0.01ms        ? ?/sec                1.26      2.7±0.08ms        ? ?/sec
iteration/validated_iteration               1.00     47.3±0.51µs        ? ?/sec                1.07     50.8±2.71µs        ? ?/sec
validation/unvalidated_construction         1.00      6.5±0.31µs        ? ?/sec                1.00      6.5±0.11µs        ? ?/sec
validation/validated_construction           1.02     61.4±0.49µs        ? ?/sec                1.00     60.4±0.50µs        ? ?/sec
validation/validation_cost                  1.01     54.4±1.08µs        ? ?/sec                1.00     53.9±0.44µs        ? ?/sec

@klion26
Member Author

klion26 commented Jan 8, 2026

@alamb @scovich I've updated the code, and the benchmark seems good now. Please take another look when you're free. Thanks.

Contributor

@scovich scovich left a comment

Nice!

Comment on lines +44 to +46
let is_large = num_fields > u8::MAX as usize;
// num_fields will consume 4 bytes when it is larger than u8::MAX
if is_large {
Contributor

nit: I'm not sure is_large as a variable is really helpful -- if the field count is too large to fit in a byte, of course it will need more bytes to store? (but definitely a matter of preference)


Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Optimize the object header generation logic in ObjectBuilder::finish

6 participants