Conversation
Merging this PR will not alter performance
I think in that case it is a struct, because they are named.
## Summary

Filling in missing parts of making Variant a canonical array. It is still not fully supported, but this is a step towards it. These changes started as part of #7130, but I figured they are just noise in that already too-big PR.

## API Changes

Makes Variant officially a canonical array and type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
connortsui20 left a comment:

Looks good! I gave it a quick pass but haven't looked at the scalar operations yet.
```rust
/// Creates a Parquet Variant array with explicit outer validity.
pub fn try_new_with_validity(
```
I feel like this should just be `try_new`? That would be consistent with a lot of our other canonical array constructors. And if you want to be extra consistent you can pull out a `validate` method, but that's not strictly necessary.
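For illustration, a minimal sketch of the suggested shape, using stand-in types in place of vortex's `Validity` / `ArrayRef` (all names and signatures here are assumptions, not the real API):

```rust
// Hypothetical sketch of the suggested constructor split, with stand-in types
// in place of vortex's Validity / ArrayRef. Names are assumptions.
#[derive(Debug)]
struct VError(String);

struct Array {
    len: usize,
}

struct ParquetVariantArray {
    validity_len: Option<usize>,
    metadata: Array,
    value: Option<Array>,
}

impl ParquetVariantArray {
    // Validation pulled out into its own method, as suggested.
    fn validate(
        validity_len: Option<usize>,
        metadata: &Array,
        value: Option<&Array>,
    ) -> Result<(), VError> {
        let len = metadata.len;
        if let Some(vl) = validity_len {
            if vl != len {
                return Err(VError("validity length must match metadata length".into()));
            }
        }
        if let Some(v) = value {
            if v.len != len {
                return Err(VError("value length must match metadata length".into()));
            }
        }
        Ok(())
    }

    // `try_new` taking validity directly, consistent with other constructors.
    fn try_new(
        validity_len: Option<usize>,
        metadata: Array,
        value: Option<Array>,
    ) -> Result<Self, VError> {
        Self::validate(validity_len, &metadata, value.as_ref())?;
        Ok(Self { validity_len, metadata, value })
    }
}
```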
```rust
if let Some(validity_len) = validity.maybe_len() {
    vortex_ensure!(
        validity_len == len,
        "validity length must match metadata length"
    );
}
if let Some(ref v) = value {
    vortex_ensure!(v.len() == len, "value length must match metadata length");
}
if let Some(ref tv) = typed_value {
    vortex_ensure!(
        tv.len() == len,
        "typed_value length must match metadata length"
    );
}
```
`vortex_ensure_eq!`

We really should just plumb through the codebase and replace all instances of these...
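As a toy sketch of the pattern being suggested, here is a stand-in `ensure_eq!` macro modeled on `assert_eq!` (the real `vortex_ensure_eq!` signature is an assumption here):

```rust
// Toy sketch of an ensure_eq-style macro; the real vortex_ensure_eq!
// signature is assumed to be analogous to assert_eq!.
macro_rules! ensure_eq {
    ($left:expr, $right:expr, $msg:expr) => {
        if $left != $right {
            return Err(format!("{}: {} != {}", $msg, $left, $right));
        }
    };
}

fn check_lengths(value_len: usize, typed_value_len: usize, len: usize) -> Result<(), String> {
    // Each comparison collapses to one line, and the error message
    // automatically carries both compared values.
    ensure_eq!(value_len, len, "value length must match metadata length");
    ensure_eq!(typed_value_len, len, "typed_value length must match metadata length");
    Ok(())
}
```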
```rust
        Validity::from(BitBuffer::from(nulls.inner().clone()))
    }
})
.unwrap_or(Validity::NonNullable);
```
Is it guaranteed on the arrow side that `None` implies non-nullable? I feel like I remember running into some oddness around this before... Maybe worth a comment if this is totally fine.

Edit: Actually, looking down below, it seems like both `Validity::NonNullable | Validity::AllValid => None`, so I'm not actually sure what to do here?

Edit again: We have this code in `vortex-array/src/arrow/convert.rs`; I think we must pass nullability into `from_arrow_variant` to make this correct:
```rust
fn nulls(nulls: Option<&NullBuffer>, nullable: bool) -> Validity {
    if nullable {
        nulls
            .map(|nulls| {
                if nulls.null_count() == nulls.len() {
                    Validity::AllInvalid
                } else {
                    Validity::from(BitBuffer::from(nulls.inner().clone()))
                }
            })
            .unwrap_or_else(|| Validity::AllValid)
    } else {
        assert!(nulls.map(|x| x.null_count() == 0).unwrap_or(true));
        Validity::NonNullable
    }
}
```

```rust
let pv =
    ParquetVariantArray::try_new_with_validity(validity, metadata, value, typed_value)?;
```
Yeah, I think it makes sense to just have `try_new` that takes validity, and we could also consider having a `new_unchecked` that does no validation here instead, since we trust that the arrow array is well-formed (lengths are correct).
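A small sketch of how an unchecked constructor could sit next to the checked one, with stand-in types (the names and shapes here are assumptions, not the real vortex API):

```rust
// Hypothetical sketch: a checked `try_new` next to a `new_unchecked` that
// only validates in debug builds. Stand-in types; names are assumptions.
struct Metadata {
    len: usize,
}

struct VariantArray {
    metadata: Metadata,
    value_len: usize,
}

impl VariantArray {
    fn try_new(metadata: Metadata, value_len: usize) -> Result<Self, String> {
        if value_len != metadata.len {
            return Err("value length must match metadata length".to_string());
        }
        Ok(Self { metadata, value_len })
    }

    // For callers (like arrow conversion) that already guarantee lengths
    // agree: skip validation in release builds, keep a cheap debug_assert.
    fn new_unchecked(metadata: Metadata, value_len: usize) -> Self {
        debug_assert_eq!(value_len, metadata.len);
        Self { metadata, value_len }
    }
}
```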
```rust
let nulls = match &self.validity {
    Validity::NonNullable | Validity::AllValid => None,
    Validity::AllInvalid => Some(NullBuffer::new_null(len)),
    Validity::Array(array) => {
        let mask = array.clone().execute::<Mask>(ctx)?;
        match mask {
            Mask::AllTrue(_) => None,
            Mask::AllFalse(l) => Some(NullBuffer::new_null(l)),
            Mask::Values(values) => Some(NullBuffer::from(
                arrow_buffer::BooleanBuffer::from(values.bit_buffer().clone()),
            )),
        }
    }
};
```
We have `use vortex::array::arrow::executor::validity::to_arrow_null_buffer;` for this, which is `pub(super)` right now but arguably should be `pub`.
```rust
ParquetVariantArray::try_new_with_validity(validity, metadata, value, typed_value)?
    .into_array(),
```
Do we care about an unchecked constructor? If we're just validating lengths, it's probably not worth it...
```rust
match (&array.value, &other.value) {
    (Some(a), Some(b)) => {
        if !a.array_eq(b, precision) {
            return false;
        }
    }
    (None, None) => {}
    _ => return false,
}
match (&array.typed_value, &other.typed_value) {
    (Some(a), Some(b)) => a.array_eq(b, precision),
    (None, None) => true,
    _ => false,
}
```
Do we want to do array equality via structural equality or value equality? Since we can have different representations for the same data, we might want to just do a loop of `scalar_at` in here (after the validity and metadata checks).
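A toy illustration of the distinction: two stand-in representations that are structurally different but hold the same values, compared with an element-wise loop (`Repr` and `scalar_at` here are stand-ins, not the real vortex API):

```rust
// Toy sketch: structural equality can reject arrays that hold the same
// values in different representations; an element-wise loop does not.
enum Repr {
    Plain(Vec<i64>),
    Constant { value: i64, len: usize }, // same values, different layout
}

impl Repr {
    fn len(&self) -> usize {
        match self {
            Repr::Plain(v) => v.len(),
            Repr::Constant { len, .. } => *len,
        }
    }

    // Stand-in for scalar_at: fetch the logical value at index i.
    fn scalar_at(&self, i: usize) -> i64 {
        match self {
            Repr::Plain(v) => v[i],
            Repr::Constant { value, .. } => *value,
        }
    }
}

// Value equality: compare element by element, ignoring representation.
fn value_eq(a: &Repr, b: &Repr) -> bool {
    a.len() == b.len() && (0..a.len()).all(|i| a.scalar_at(i) == b.scalar_at(i))
}
```

Here `Repr::Plain(vec![7, 7, 7])` and `Repr::Constant { value: 7, len: 3 }` come out value-equal even though a structural comparison would call them different.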
```rust
fn scalar_from_variant_storage(
    metadata: &[u8],
    value: Option<&vortex_array::ArrayRef>,
```

```rust
pub(crate) value: Option<ArrayRef>,
pub(crate) typed_value: Option<ArrayRef>,
```
Crazy idea after a quick pass over everything: is it worth having a dedicated enum that enforces that at least one of these exists? There's a lot of logic that has to check for that everywhere (including in some free-standing functions). Maybe that's overkill though and introduces more complexity than it's worth?
Definitely a good idea
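A hedged sketch of what such an enum could look like, with a stand-in `ArrayRef` (all names here are hypothetical, not proposed API):

```rust
// Hypothetical sketch of an enum guaranteeing at least one of `value` /
// `typed_value` exists. ArrayRef is a stand-in; names are assumptions.
#[derive(Clone, Debug, PartialEq)]
struct ArrayRef(&'static str); // stand-in for vortex_array::ArrayRef

#[derive(Debug)]
enum VariantValues {
    ValueOnly(ArrayRef),
    TypedOnly(ArrayRef),
    Both { value: ArrayRef, typed_value: ArrayRef },
}

impl VariantValues {
    // Callers no longer need to handle a (None, None) case anywhere.
    fn value(&self) -> Option<&ArrayRef> {
        match self {
            VariantValues::ValueOnly(v) | VariantValues::Both { value: v, .. } => Some(v),
            VariantValues::TypedOnly(_) => None,
        }
    }

    fn typed_value(&self) -> Option<&ArrayRef> {
        match self {
            VariantValues::TypedOnly(tv)
            | VariantValues::Both { typed_value: tv, .. } => Some(tv),
            VariantValues::ValueOnly(_) => None,
        }
    }
}
```

The invariant "at least one is present" then holds by construction, instead of being re-checked in every function that touches the pair.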

## Summary

This PR introduces a new encoding to support Arrow's canonical Variant extension type, but does not integrate it with anything else.
It does include basic pieces like slice/take/filter, and it can (or at least should) roundtrip with the equivalent Arrow array.

## Testing

The encoding includes a set of basic tests, both for making sure it roundtrips with Arrow and for the various nullability cases.