Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Encountered unmasked nulls" error parsing stats for a non-nullable partition column #698

Closed
adamreeve opened this issue Feb 14, 2025 · 1 comment · Fixed by #700
Closed
Labels
bug Something isn't working

Comments

@adamreeve
Copy link
Contributor

Describe the bug

This is related to this DuckDB-Delta issue: duckdb/duckdb-delta#96

When querying a Delta table from DuckDB with a where clause on a non-nullable partition field, I get the following error from delta-kernel-rs:

Error: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: '<table_path>'): Hit error: 2 (ArrowError) with message (Json error: whilst decoding field 'nullCount': Encountered unmasked nulls in non-nullable StructArray child: Field { name: "partition_col", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }
   0: delta_kernel::error::Error::with_backtrace
   1: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   2: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   3: core::iter::adapters::try_process
   4: delta_kernel::engine::arrow_utils::parse_json
   5: <delta_kernel::engine::default::json::DefaultJsonHandler<E> as delta_kernel::JsonHandler>::parse_json
   6: delta_kernel::scan::data_skipping::DataSkippingFilter::apply
   7: delta_kernel::scan::log_replay::LogReplayScanner::process_scan_batch
   8: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
   9: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
  10: <core::iter::adapters::flatten::Flatten<I> as core::iter::traits::iterator::Iterator>::next
  11: kernel_scan_data_next

To Reproduce

I'm not that familiar with delta-kernel-rs itself so don't have a repro that only uses delta-kernel-rs, but this demonstrates the problem using DuckDB from Python:

import duckdb
from deltalake import write_deltalake
import pyarrow as pa


delta_dir = "./delta_table"

schema = pa.schema([
    pa.field("partition_col", pa.string(), nullable=False),
    pa.field("x", pa.float32(), nullable=True),
])

for partition_col in ["AB", "CD", "EF"]:
    num_rows = 1_000
    table = pa.Table.from_pydict({
        'partition_col': pa.array([partition_col] * num_rows, type=pa.string()),
        'x': pa.array([i / 100.0 for i in range(num_rows)], type=pa.float32()),
    })
    write_deltalake(
            table_or_uri=delta_dir,
            data=table,
            partition_by=["partition_col"],
            mode='append',
            schema=schema)

results = duckdb.query(f"SELECT * FROM delta_scan('{delta_dir}') WHERE partition_col = 'CD'")

Tested with deltalake v0.24.0, duckdb v1.2.0.

Expected behavior

No error

Additional context

The error happens because the nullability of the null count is derived from the nullability of the column, but because the column is a partition column, it has no entries in the statistics:

let nullcount_schema = NullCountStatsTransform
.transform_struct(&referenced_schema)?
.into_owned();

I made a partial fix that gets past the error for nullCount:

--- a/kernel/src/scan/data_skipping.rs
+++ b/kernel/src/scan/data_skipping.rs
@@ -77,6 +77,22 @@ impl DataSkippingFilter {
             ) -> Option<Cow<'a, PrimitiveType>> {
                 Some(Cow::Owned(PrimitiveType::Long))
             }
+
+            fn transform_struct_field(&mut self, field: &'a StructField) -> Option<Cow<'a, StructField>> {
+                // Change any struct fields to be nullable, as eg. a non-nullable field
+                // used for partitioning won't have any null counts
+                use Cow::*;
+                let field = match self.transform(&field.data_type)? {
+                    Borrowed(_) => Borrowed(field),
+                    Owned(new_data_type) => Owned(StructField {
+                        name: field.name.clone(),
+                        data_type: new_data_type,
+                        nullable: true,
+                        metadata: field.metadata.clone(),
+                    }),
+                };
+                Some(field)
+            }
         }
         let nullcount_schema = NullCountStatsTransform
             .transform_struct(&referenced_schema)?

But then I get the same error for the minValues field an I'm not sure how best to handle that, as we probably only want to treat partition columns as nullable. And this feels a bit like treating the symptom rather than the cause; delta-kernel-rs doesn't really need to parse stats at all if filtering on a partition column.

@scovich
Copy link
Collaborator

scovich commented Feb 14, 2025

Good find. I suspect this same problem would also impact min/max stats for non-nullable columns that lack stats, if other stats are available. We probably need to force all columns in the stats schema to be nullable. Partition pruning (in progress effort) would not have this problem, so we don't want to mess with the physical schema directly. Instead, DataSkippingFilter::new should transform referenced_schema into an all-nullable stats_schema, which feeds min/max and also becomes input for nullcount_schema.

zachschuermann pushed a commit that referenced this issue Feb 19, 2025
Fixes #698

## What changes are proposed in this pull request?

Updates the `DataSkippingFilter` to treat all columns as nullable for
the purpose of parsing stats, as suggested in
#698 (comment).

This is particularly important for partition columns, which won't have
values present in stats. But stats are also only usually stored for the
first 32 columns, so we shouldn't rely on stats being present for
non-partition fields either.

## How was this change tested?

I've added a new unit test.

I've also tested building duckdb-delta with this change (cherry-picked
onto 0.6.1) and verified that the code in #698 now works.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants