Unable to perform WHERE queries on partitioned column #96
I experience the same. Maybe this has to do with the fact that the partition column does not exist in the underlying Parquet files and must be read from the transaction log instead.
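If that's the case, it should be visible on disk. Here's a minimal sketch of such a check, assuming a table laid out like the reproduction later in this thread (the `./delta_table` path and glob patterns are assumptions):

```python
import glob
import json

import pyarrow.parquet as pq

# Sketch: inspect a partitioned Delta table written by delta-rs.
data_file = glob.glob("./delta_table/partition_col=*/*.parquet")[0]

# The partition column is absent from the Parquet file's own schema...
print(pq.read_schema(data_file))

# ...and is instead recorded per file in the transaction log's add actions.
log_file = sorted(glob.glob("./delta_table/_delta_log/*.json"))[0]
with open(log_file) as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:
            print(action["add"]["partitionValues"])  # e.g. {'partition_col': 'AB'}
```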
I was able to modify the metadata of delta tables produced by delta-rs so duckdb can filter on the partition column now. As for the root cause, I'm not sure who's wrong here: delta-kernel or delta-rs. @samansmink what is your experience with delta-kernel? Are bugs fixed quickly? Otherwise @jorritsandbrink maybe we should file this in delta-rs; they were AFAIK pretty responsive.
The delta-rs team is super responsive; you can open a PR there and they will discuss whether it should move to delta-kernel. I've personally added the partition column into the data directly, but I still have a weird issue when filtering on it (despite it also being in the data); a sketch of that workaround follows below.
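For what it's worth, a minimal sketch of that kind of workaround, assuming it means duplicating the partition value into an ordinary data column (the `partition_col_copy` name and table contents here are hypothetical):

```python
import pyarrow as pa
from deltalake import write_deltalake

table = pa.Table.from_pydict({
    "partition_col": pa.array(["AB"] * 3, type=pa.string()),
    "x": pa.array([0.1, 0.2, 0.3], type=pa.float32()),
})

# Duplicate the partition value into a plain data column.
table = table.append_column("partition_col_copy", table["partition_col"])

# 'partition_col' still drives partitioning; 'partition_col_copy' stays inside
# the Parquet data files and can be filtered without touching partition metadata.
write_deltalake("./delta_table_workaround", table,
                partition_by=["partition_col"], mode="append")
```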
Here's a minimal reproduction of the issue:

```python
import duckdb
import pyarrow as pa
from deltalake import write_deltalake

delta_dir = "./delta_table"

schema = pa.schema([
    pa.field("partition_col", pa.string(), nullable=False),
    pa.field("x", pa.float32(), nullable=True),
])

for partition_col in ["AB", "CD", "EF"]:
    num_rows = 1_000
    table = pa.Table.from_pydict({
        'partition_col': pa.array([partition_col] * num_rows, type=pa.string()),
        'x': pa.array([i / 100.0 for i in range(num_rows)], type=pa.float32()),
    })
    write_deltalake(
        table_or_uri=delta_dir,
        data=table,
        partition_by=["partition_col"],
        mode='append',
        schema=schema)

results = duckdb.query(f"SELECT * FROM delta_scan('{delta_dir}') WHERE partition_col = 'CD'")
```

Tested with deltalake v0.24.0, duckdb v1.2.0. This depends on the partition column being non-nullable, as the nullability of the null count is derived from the nullability of the partition column: https://github.com/delta-io/delta-kernel-rs/blob/e6aefda96a620ae4b4fb2752afc20e175322a392/kernel/src/scan/data_skipping.rs#L81-L83
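Given that dependence on nullability, one possible workaround sketch is to declare the partition column as nullable when writing, so the null-count field delta-kernel derives from it is nullable too (whether that's acceptable depends on your schema requirements):

```python
import pyarrow as pa

# Sketch: a nullable partition column yields a nullable nullCount field in the
# kernel's stats schema, so the missing per-file stats no longer violate the
# non-nullable StructArray child constraint.
schema = pa.schema([
    pa.field("partition_col", pa.string(), nullable=True),  # was nullable=False
    pa.field("x", pa.float32(), nullable=True),
])
```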
I think this is definitely a delta-kernel bug. It's redundant to write the min/max/nullCount stats for partition columns as they're always a constant value for the file, so I wouldn't expect delta-rs to write them. Plus, if delta-rs was changed to work around this behaviour, then every other Delta writing implementation would also need to be fixed. It doesn't look like this has been reported in delta-kernel already, so I've reported it at delta-io/delta-kernel-rs#698.
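To make the redundancy concrete, here's a small sketch (assuming the `./delta_table` layout from the reproduction above) that prints the per-file stats from the transaction log; the partition column is missing from `nullCount`, which is exactly the null value the kernel's non-nullable schema rejects:

```python
import glob
import json

# Sketch: print per-file stats from the repro table's transaction log.
# delta-rs omits partition columns from minValues/maxValues/nullCount: their
# value is constant per file and already recorded in partitionValues.
for path in sorted(glob.glob("./delta_table/_delta_log/*.json")):
    with open(path) as f:
        for line in f:
            action = json.loads(line)
            stats_json = action.get("add", {}).get("stats")
            if stats_json:
                stats = json.loads(stats_json)  # stats is a serialized JSON string
                print(stats.get("nullCount"))   # no 'partition_col' key here
```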
I have a dataset partitioned on a specific column using delta-rs. I encounter an exception when I execute a SELECT query with a WHERE clause targeting the partition column.

Query:

```sql
select * from delta_scan('D://partitioned') WHERE PartitionColumn = 1 LIMIT 10;
```

Error:

```
IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'D://partitioned/'): Hit error: 2 (ArrowError) with message (Json error: whilst decoding field 'minValues': Encountered unmasked nulls in non-nullable StructArray child: Field { name: "PartitionColumn", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })
```

Other queries on this table work, including queries with a WHERE clause targeting other columns.