When querying a Delta table from DuckDB with a where clause on a non-nullable partition field, I get the following error from delta-kernel-rs:
Error: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: '<table_path>'): Hit error: 2 (ArrowError) with message (Json error: whilst decoding field 'nullCount': Encountered unmasked nulls in non-nullable StructArray child: Field { name: "partition_col", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }
0: delta_kernel::error::Error::with_backtrace
1: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
2: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
3: core::iter::adapters::try_process
4: delta_kernel::engine::arrow_utils::parse_json
5: <delta_kernel::engine::default::json::DefaultJsonHandler<E> as delta_kernel::JsonHandler>::parse_json
6: delta_kernel::scan::data_skipping::DataSkippingFilter::apply
7: delta_kernel::scan::log_replay::LogReplayScanner::process_scan_batch
8: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
9: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
10: <core::iter::adapters::flatten::Flatten<I> as core::iter::traits::iterator::Iterator>::next
11: kernel_scan_data_next
To Reproduce
I'm not that familiar with delta-kernel-rs itself so don't have a repro that only uses delta-kernel-rs, but this demonstrates the problem using DuckDB from Python:
The error happens because the nullability of the null count is derived from the nullability of the column, but because the column is a partition column, it has no entries in the statistics:
I made a partial fix that gets past the error for nullCount:
--- a/kernel/src/scan/data_skipping.rs
+++ b/kernel/src/scan/data_skipping.rs
@@ -77,6 +77,22 @@ impl DataSkippingFilter {
             ) -> Option<Cow<'a, PrimitiveType>> {
                 Some(Cow::Owned(PrimitiveType::Long))
             }
+
+            fn transform_struct_field(&mut self, field: &'a StructField) -> Option<Cow<'a, StructField>> {
+                // Change any struct fields to be nullable, as eg. a non-nullable field
+                // used for partitioning won't have any null counts
+                use Cow::*;
+                let field = match self.transform(&field.data_type)? {
+                    Borrowed(_) => Borrowed(field),
+                    Owned(new_data_type) => Owned(StructField {
+                        name: field.name.clone(),
+                        data_type: new_data_type,
+                        nullable: true,
+                        metadata: field.metadata.clone(),
+                    }),
+                };
+                Some(field)
+            }
         }
         let nullcount_schema = NullCountStatsTransform
             .transform_struct(&referenced_schema)?
But then I get the same error for the minValues field, and I'm not sure how best to handle that, as we probably only want to treat partition columns as nullable. And this feels a bit like treating the symptom rather than the cause; delta-kernel-rs doesn't really need to parse stats at all if filtering on a partition column.
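To make the failure mode concrete, here is a minimal standalone sketch of it. It does not use delta-kernel-rs or Arrow: `StatField` and `validate_null_counts` are hypothetical names invented for illustration, and the check only imitates in spirit the validation that rejects a missing child of a non-nullable struct field.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one field of a stats schema
/// (not the kernel's `StructField`).
struct StatField {
    name: &'static str,
    nullable: bool,
}

/// Rejects the stats if a non-nullable field has no entry, which is the
/// situation a partition column ends up in when nullability is copied
/// from the table schema.
fn validate_null_counts(
    schema: &[StatField],
    null_count: &HashMap<&str, i64>,
) -> Result<(), String> {
    for field in schema {
        if !field.nullable && !null_count.contains_key(field.name) {
            return Err(format!(
                "missing value for non-nullable field '{}'",
                field.name
            ));
        }
    }
    Ok(())
}
```

With a `nullCount` map that only contains the data column, a schema that marks `partition_col` as non-nullable is rejected, while the same schema with `nullable: true` is accepted.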
Good find. I suspect this same problem would also impact min/max stats for non-nullable columns that lack stats, if other stats are available. We probably need to force all columns in the stats schema to be nullable. Partition pruning (in progress effort) would not have this problem, so we don't want to mess with the physical schema directly. Instead, DataSkippingFilter::new should transform referenced_schema into an all-nullable stats_schema, which feeds min/max and also becomes input for nullcount_schema.
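A rough sketch of the suggested transformation, assuming simplified schema types: `DataType`, `Field`, `to_nullable` and `stats_schema` below are hypothetical stand-ins, not the kernel's own types or its schema-transform trait. The idea is only that the stats schema is a copy of the referenced schema with every field, including nested ones, marked nullable, while the original schema is left untouched.

```rust
/// Hypothetical, simplified schema types; the kernel's own `StructField`
/// and `DataType` carry more information (metadata, arrays, maps, ...).
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Long,
    String,
    Struct(Vec<Field>),
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Returns a copy of `field` with it and every nested field marked nullable.
fn to_nullable(field: &Field) -> Field {
    let data_type = match &field.data_type {
        DataType::Struct(children) => {
            DataType::Struct(children.iter().map(to_nullable).collect())
        }
        other => other.clone(),
    };
    Field {
        name: field.name.clone(),
        data_type,
        nullable: true,
    }
}

/// Builds an all-nullable stats schema from the referenced columns,
/// leaving the input schema unchanged.
fn stats_schema(referenced: &[Field]) -> Vec<Field> {
    referenced.iter().map(to_nullable).collect()
}
```

Working on a copy keeps the physical schema as declared, which matches the point above about not touching it directly; only the schema used to parse stats is relaxed.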
Fixes #698
## What changes are proposed in this pull request?
Updates the `DataSkippingFilter` to treat all columns as nullable for
the purpose of parsing stats, as suggested in
#698 (comment).
This is particularly important for partition columns, which won't have
values present in stats. But stats are usually only stored for the
first 32 columns, so we shouldn't rely on stats being present for
non-partition fields either.
## How was this change tested?
I've added a new unit test.
I've also tested building duckdb-delta with this change (cherry-picked
onto 0.6.1) and verified that the code in #698 now works.
Describe the bug
This is related to this DuckDB-Delta issue: duckdb/duckdb-delta#96
Tested with deltalake v0.24.0, duckdb v1.2.0.
Expected behavior
No error
Additional context
delta-kernel-rs/kernel/src/scan/data_skipping.rs
Lines 81 to 83 in e6aefda