
delta_scan is much slower than deltalake from delta-rs #141

Open
mrjsj opened this issue Jan 26, 2025 · 1 comment

Comments


mrjsj commented Jan 26, 2025

I have a Delta table in a Cloudflare R2 bucket (S3-compatible), and basic queries through delta_scan take significantly longer than the equivalent read via deltalake.DeltaTable from delta-rs. I've tried both the r2 and the s3 httpfs protocols in DuckDB, but there's no difference.

[screenshots: timing comparison of the two reads]

The layout of the table is:

  • 25 files
  • total size of 2.8 GB
  • ~2,500,000 records
  • 48 columns, but with many nested structs, so about 470 fields in total

However, I have also reproduced this with much simpler test tables that have only a few rows and flat columns.


mrjsj commented Jan 27, 2025

I can't figure out how to profile the delta-rs side for comparison, but delta_scan appears to issue an excessive number of GET requests against the bucket.
I ran EXPLAIN ANALYZE and got these results:

With delta_scan:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
     explain analyze select * from delta_scan('r2://[redacted]') WHERE updated_at > '2025-01-15 00:00:00' LIMIT 100 

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 2.6 MiB            ││
││            out: 0 bytes           ││
││              #HEAD: 2             ││
││             #GET: 974             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 106.08s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘

I then tried the same query with read_parquet over the table's files directly and got identical HTTP stats, so the Parquet scan itself, rather than the Delta log handling, appears to be the culprit.

With read_parquet:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
     explain analyze select * from read_parquet(['[redacted]','[redacted]','[redacted]',...,'[redacted]']) WHERE updated_at > '2025-01-15 00:00:00' LIMIT 100 
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 2.6 MiB            ││
││            out: 0 bytes           ││
││              #HEAD: 2             ││
││             #GET: 974             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 96.68s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
