Filtering with OR condition causes all files to be scanned #157

Open
eeroel opened this issue Feb 13, 2025 · 1 comment
eeroel commented Feb 13, 2025

Hi,

I have a table where the data is clustered so that each file holds a different value of the (integer) column foo, which I want to filter on. With a simple filter, where foo = 1, file skipping works and only one file is scanned. With an OR filter over multiple values, where foo = 1 OR foo = 2, all files are scanned instead. This seems to affect OR only, since where foo > 9 and foo < 12 also skips files. Here's a reproducible example (it requires polars for writing the table):

import duckdb
import polars as pl

for file in range(50):
    df = pl.DataFrame(
        [
            pl.Series("foo", [file], dtype=pl.Int64),
        ]
    )
    df.write_delta("test_table4", mode="append")


duckdb.execute("force install delta from core_nightly; UPDATE EXTENSIONS; load delta;").fetchall()
duckdb.execute("ATTACH './test_table4' AS my_delta_table (TYPE delta)")

# this scans only a subset of files as expected
res = duckdb.execute("""
    explain analyze SELECT COUNT(*) from my_delta_table
    where foo = 10
""").fetchall()
print(res[0][1])

# this scans all files
res = duckdb.execute("""
    explain analyze SELECT COUNT(*) from my_delta_table
    where foo = 10 or foo = 11
""").fetchall()
print(res[0][1])

# this skips files
res = duckdb.execute("""
    explain analyze SELECT COUNT(*) from my_delta_table
    where foo > 9 and foo < 12
""").fetchall()
print(res[0][1])
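
A possible stopgap, sketched under assumptions: since the range query above does skip files, an OR over consecutive integer values could be rewritten as a single range before it is sent to DuckDB. The helper name contiguous_range_predicate is hypothetical (it is not part of duckdb or the delta extension), and it only builds the WHERE text; whether the rewritten query skips files in a given setup still has to be checked with explain analyze.

```python
def contiguous_range_predicate(column, values):
    """Build 'col >= lo AND col <= hi' for an unbroken run of integers.

    Hypothetical helper for illustration only. Returns None when the
    values have gaps (or are empty), because a single range would then
    match rows that the original OR filter excludes.
    """
    distinct = sorted(set(values))
    if not distinct:
        return None
    lo, hi = distinct[0], distinct[-1]
    # An unbroken run of integers has exactly hi - lo + 1 distinct members.
    if hi - lo + 1 != len(distinct):
        return None
    return f"{column} >= {lo} AND {column} <= {hi}"


# Equivalent to "foo = 10 or foo = 11" for an integer column.
where_clause = contiguous_range_predicate("foo", [10, 11])
print(where_clause)
```

This only helps when the wanted values are consecutive; for values with gaps the helper returns None and the original OR filter has to be used as-is.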

samansmink (Collaborator) commented

Thanks for reporting, @eeroel. This is expected for now. The delta-kernel-rs FFI does not currently implement pushing down OR filters. Once it does, I'll make sure to fix it!
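
To make the observed behaviour concrete, here is a toy model of file skipping based on per-file min/max statistics. It is not the delta-kernel-rs or DuckDB implementation: FileStats, may_contain, files_to_scan and the or_supported flag are all made up for illustration, and the 50 single-value files simply mirror the reproduction above. The assumption is that a predicate which cannot be pushed down leaves the reader with nothing to prune on, so every file is read.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for the integer column foo."""
    path: str
    min_foo: int
    max_foo: int


def contains_or(pred):
    """True if the predicate tree has an OR node anywhere."""
    if pred[0] == "or":
        return True
    if pred[0] == "and":
        return any(contains_or(p) for p in pred[1:])
    return False


def may_contain(stats, pred):
    """Conservative check: could any row in this file match pred?"""
    kind = pred[0]
    if kind == "eq":
        return stats.min_foo <= pred[1] <= stats.max_foo
    if kind == "gt":
        return stats.max_foo > pred[1]
    if kind == "lt":
        return stats.min_foo < pred[1]
    if kind == "and":
        return all(may_contain(stats, p) for p in pred[1:])
    if kind == "or":
        return any(may_contain(stats, p) for p in pred[1:])
    raise ValueError(f"unknown predicate kind: {kind}")


def files_to_scan(files, pred, or_supported):
    """Return the files a reader would open for pred.

    Assumption: when OR cannot be pushed down, the whole predicate is
    dropped from pruning and all files are scanned.
    """
    if not or_supported and contains_or(pred):
        return list(files)
    return [f for f in files if may_contain(f, pred)]


# 50 files, each holding a single value of foo, as in the reproduction.
files = [FileStats(f"part-{i}.parquet", i, i) for i in range(50)]

eq_filter = ("eq", 10)
or_filter = ("or", ("eq", 10), ("eq", 11))
range_filter = ("and", ("gt", 9), ("lt", 12))

print(len(files_to_scan(files, eq_filter, or_supported=False)))
print(len(files_to_scan(files, or_filter, or_supported=False)))
print(len(files_to_scan(files, range_filter, or_supported=False)))
print(len(files_to_scan(files, or_filter, or_supported=True)))
```

In this model the OR filter is prunable in principle (a file survives if either branch could match), so the full scan comes from the missing pushdown support and not from the statistics themselves.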
