
delta_scan is much slower than deltalake from delta-rs #141

Open
mrjsj opened this issue Jan 26, 2025 · 1 comment

Comments


mrjsj commented Jan 26, 2025

I have a Delta table in a Cloudflare R2 bucket (S3-compatible), and basic queries through delta_scan take significantly longer than the equivalent read via deltalake.DeltaTable from delta-rs. I've tried both the r2 and the s3 httpfs protocols in DuckDB, but there's no difference.

[screenshots: timing comparison of the two reads]

The layout of the table is:

  • 25 files
  • total size of 2.8 GB
  • ~2,500,000 records
  • 48 columns, but with many nested structs, so about 470 fields in total

However, I have also reproduced this with much simpler test tables that have only a few rows and flat columns.


mrjsj commented Jan 27, 2025

I can't figure out how to profile the delta-rs side for comparison, but delta_scan appears to issue an excessive number of GET requests against the bucket.
I ran EXPLAIN ANALYZE and got these results:

With delta_scan:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
     explain analyze select * from delta_scan('r2://[redacted]') WHERE updated_at > '2025-01-15 00:00:00' LIMIT 100 

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 2.6 MiB            ││
││            out: 0 bytes           ││
││              #HEAD: 2             ││
││             #GET: 974             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 106.08s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘

I then tried the same query with read_parquet over the table's files directly and got identical HTTP stats, so the Parquet scan itself, rather than the Delta log handling, appears to be the culprit.

With read_parquet:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
     explain analyze select * from read_parquet(['[redacted]','[redacted]','[redacted]',...,'[redacted]']) WHERE updated_at > '2025-01-15 00:00:00' LIMIT 100 
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 2.6 MiB            ││
││            out: 0 bytes           ││
││              #HEAD: 2             ││
││             #GET: 974             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 96.68s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
