Skip to content

New Parquet-Variant encoding#7130

Open
AdamGS wants to merge 13 commits intodevelopfrom
adamg/parquet-variant
Open

New Parquet-Variant encoding#7130
AdamGS wants to merge 13 commits intodevelopfrom
adamg/parquet-variant

Conversation

@AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Mar 23, 2026

Summary

This PR introduces a new encoding to support Arrow's canonical extension type, but does not integrate it with anything else.

It does include basic pieces like slice/take/filter, and it can (or at least should) roundtrip with the equivalent arrow array.

Testing

The encoding includes a bunch of basic tests, both for making sure it roundtrips with arrow and for the various nullability cases.

@AdamGS AdamGS added the changelog/feature A new feature label Mar 23, 2026
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from 62280a5 to f266c90 Compare March 23, 2026 16:21
@codspeed-hq
Copy link

codspeed-hq bot commented Mar 23, 2026

Merging this PR will not alter performance

✅ 1106 untouched benchmarks
⏩ 1522 skipped benchmarks1


Comparing adamg/parquet-variant (338af8a) with develop (2383946)

Open in CodSpeed

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@a10y
Copy link
Contributor

a10y commented Mar 23, 2026

It looks like this supports one typed_value per column. How would you represent something like this example, which shreds out multiple nested columns from an object?

image

Physically I think this differs from shredding out a struct column with two fields.

@AdamGS
Copy link
Contributor Author

AdamGS commented Mar 24, 2026

I think in that case it is a struct, because they are named

@AdamGS
Copy link
Contributor Author

AdamGS commented Mar 24, 2026

typed_value will be a struct (I think that's "object" in variant terms), but not directly of the event_type and event_ts types, but of ParquetVariant arrays that, but roughly:

Struct {
    "event_ts": Variant {
        ..,
        typed_value: Timestamp
    },
    "event_type": Variant {
        ..,
        typed_value: String
    },
}

@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 3 times, most recently from 3d7ecb4 to ad93dcf Compare March 24, 2026 13:28
@AdamGS AdamGS requested review from a10y, connortsui20 and gatesn and removed request for a10y, connortsui20 and gatesn March 24, 2026 15:33
AdamGS added a commit that referenced this pull request Mar 24, 2026
## Summary

Filling up missing parts of making Variant a canonical array. It still
not fully supported but this is a step towards it.

These changes started as part of
#7130, but I figured they are
just noise in that already too big of a PR.

## API Changes

Makes variant officially a canonical array and type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from 6e32d0f to fbf4d4e Compare March 25, 2026 12:36
@AdamGS AdamGS marked this pull request as ready for review March 25, 2026 14:18
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from fbf4d4e to 4848c82 Compare March 25, 2026 14:18
@AdamGS AdamGS requested review from a10y, gatesn and robert3005 March 25, 2026 14:19
AdamGS added 12 commits March 25, 2026 16:04
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from 510a8c8 to e5a799e Compare March 25, 2026 16:06
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from e5a799e to 338af8a Compare March 25, 2026 16:07
@AdamGS AdamGS requested a review from connortsui20 March 26, 2026 14:52
Copy link
Contributor

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I gave it a quick pass but haven't looked at the scalar operations yet

}

/// Creates a Parquet Variant array with explicit outer validity.
pub fn try_new_with_validity(
Copy link
Contributor

@connortsui20 connortsui20 Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should just be try_new? That would be consistent with a lot of our other canonical array constructors. And if you want to be extra consistent you can pull out a validate method, but that's not strictly necessary.

Comment on lines +83 to +97
if let Some(validity_len) = validity.maybe_len() {
vortex_ensure!(
validity_len == len,
"validity length must match metadata length"
);
}
if let Some(ref v) = value {
vortex_ensure!(v.len() == len, "value length must match metadata length");
}
if let Some(ref tv) = typed_value {
vortex_ensure!(
tv.len() == len,
"typed_value length must match metadata length"
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vortex_ensure_eq!

We really should just plumb through the codebase and replace all instances of these...

Validity::from(BitBuffer::from(nulls.inner().clone()))
}
})
.unwrap_or(Validity::NonNullable);
Copy link
Contributor

@connortsui20 connortsui20 Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed on the arrow side that None implies non-nullable? I feel like I remember running some oddness around this before...

Maybe worth a comment if this is totally fine

Edit: Actually looking down below it seems like both Validity::NonNullable | Validity::AllValid => None so I'm not actually sure what to do here?

Edit again: We have this code in vortex-array/src/arrow/convert.rs, I think we must pass nullability into from_arrow_variant to make this correct?

fn nulls(nulls: Option<&NullBuffer>, nullable: bool) -> Validity {
    if nullable {
        nulls
            .map(|nulls| {
                if nulls.null_count() == nulls.len() {
                    Validity::AllInvalid
                } else {
                    Validity::from(BitBuffer::from(nulls.inner().clone()))
                }
            })
            .unwrap_or_else(|| Validity::AllValid)
    } else {
        assert!(nulls.map(|x| x.null_count() == 0).unwrap_or(true));
        Validity::NonNullable
    }
}

Comment on lines +164 to +165
let pv =
ParquetVariantArray::try_new_with_validity(validity, metadata, value, typed_value)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it makes sense to just have try_new that takes validity, and also we could consider having a new_unchecked that does no validation here instead since we trust that the arrow array is well-formed (lengths are correct).

Comment on lines +172 to +185
let nulls = match &self.validity {
Validity::NonNullable | Validity::AllValid => None,
Validity::AllInvalid => Some(NullBuffer::new_null(len)),
Validity::Array(array) => {
let mask = array.clone().execute::<Mask>(ctx)?;
match mask {
Mask::AllTrue(_) => None,
Mask::AllFalse(l) => Some(NullBuffer::new_null(l)),
Mask::Values(values) => Some(NullBuffer::from(
arrow_buffer::BooleanBuffer::from(values.bit_buffer().clone()),
)),
}
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have use vortex::array::arrow::executor::validity::to_arrow_null_buffer; for this which is pub(super) right now but arguably should be pub.

Comment on lines +48 to +49
ParquetVariantArray::try_new_with_validity(validity, metadata, value, typed_value)?
.into_array(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about an unchecked constructor? If we're just validating lengths it's probably not worth it...

Comment on lines +131 to +145
match (&array.value, &other.value) {
(Some(a), Some(b)) => {
if !a.array_eq(b, precision) {
return false;
}
}
(None, None) => {}
_ => return false,
}
match (&array.typed_value, &other.typed_value) {
(Some(a), Some(b)) => a.array_eq(b, precision),
(None, None) => true,
_ => false,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do array equality via structural equality or value equality? Since we can have different representations for the same data we might want to just do a loop of scalar_at in here (after the validity and metadata checks).


fn scalar_from_variant_storage(
metadata: &[u8],
value: Option<&vortex_array::ArrayRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import?

Comment on lines +56 to +57
pub(crate) value: Option<ArrayRef>,
pub(crate) typed_value: Option<ArrayRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crazy idea after a quick pass over everything, is it worth having a dedicated enum that enforces at least 1 of these exist? There's a lot of logic that has to check for that everywhere (including in some free standing functions). Maybe that's overkill though and introduces more complexity than it is worth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a good idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants