Description
Environment
Delta-rs version: python v0.16.0
Binding: Python
Environment:
- Cloud provider: AWS
Bug
What happened:
On three of our large Delta tables (out of 500; most of the others are a lot smaller) I've repeatedly been seeing errors related to an S3 timeout. I get two kinds of errors: one indicates a failure reading data from a parquet file, the other is much more generic. Here they are:
Failed to parse parquet: Parquet error: AsyncChunkReader::get_bytes error: Generic S3 error: Error after 10 retries in 174.881515313s, max_retries:10, retry_timeout:180s, source:error sending request for url (https://s3.us-east-1.amazonaws.com/...redacted...some-data-file.parquet): operation timed out
AND
ERROR - Error deleting events from s3a://path-to-delta-table: Failed to parse parquet: Parquet error: AsyncChunkReader::get_bytes error: Generic S3 error: request or response body error: operation timed out
This can happen in various contexts, but the prominent example in my case is needing to scan the entire Delta table for records matching a value and deleting those matches. I log a line right before this delete op runs, and the time delta between that log line and the error is usually around 5 minutes. This time delta is relevant.
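For concreteness, here is a minimal sketch of the kind of delete (and the log line around it) that hits this for us, using the Python deltalake API; the table URI and the predicate column are placeholders, not our real values:

```python
import logging
import time

from deltalake import DeltaTable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Placeholder URI/predicate; the real table is large enough that this
# predicate forces a scan over the whole table before files are rewritten.
dt = DeltaTable("s3a://path-to-delta-table")

log.info("starting delete")  # the log line mentioned above
start = time.monotonic()
dt.delete("event_id = 'some-value'")
log.info("delete finished after %.1fs", time.monotonic() - start)
```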
The second one in particular comes up a lot. This is on a table that gets around 250,000 rows added every hour and is partitioned by date (e.g. 2024-03-18) and hour (e.g. 15).
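For reference, the hourly appends look roughly like the sketch below (assuming write_deltalake from the Python package; the dataframe and column names are illustrative, and the unsafe-rename option stands in for however S3 write locking is actually configured):

```python
import pandas as pd
from deltalake import write_deltalake

# Illustrative hourly batch (~250,000 rows in reality) with the two
# partition columns, e.g. date=2024-03-18 and hour=15.
batch = pd.DataFrame(
    {
        "event_id": ["a", "b"],
        "date": ["2024-03-18", "2024-03-18"],
        "hour": [15, 15],
    }
)

write_deltalake(
    "s3a://path-to-delta-table",
    batch,
    mode="append",
    partition_by=["date", "hour"],
    # Illustrative only: a real setup would use a DynamoDB locking provider
    # or this flag for single-writer S3 tables.
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)
```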
My initial thought was that this was due to a failure to download a larger file such as a checkpoint; but those are "only" 668 MB, which is certainly not unheard of, and the various AWS SDKs will gladly move larger files than that.
Our current theory is instead that this timeout indicates the hyper client makes a request, the library then does some in-memory work (such as scanning a large Delta table) that takes a long time, and by the time delta-rs is ready to reach out to S3 again it has exceeded some kind of keep-alive timeout on the pooled connection and fails. This is why the 5-minute time delta I mentioned earlier is important.
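If that theory holds, one thing worth trying (a sketch only, not a confirmed fix) is tuning the HTTP client options that, as I understand it, delta-rs forwards to object_store via storage_options; whether timeout and pool_idle_timeout are honored here, and what values make sense, are assumptions on my part:

```python
from deltalake import DeltaTable

# Assumed to be passed through to object_store's HTTP client config;
# values are humantime-style duration strings.
storage_options = {
    "timeout": "300s",           # allow individual requests more time
    "pool_idle_timeout": "15s",  # drop idle pooled connections before the server does
}

dt = DeltaTable("s3a://path-to-delta-table", storage_options=storage_options)
dt.delete("event_id = 'some-value'")
```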
This seems related to this comment: delta-rs/crates/aws/src/storage.rs, lines 401 to 406 at commit 785667e.
And finally, I think this sort of timeout is expected from the hyper client, per this (admittedly old) ticket, which describes this exact situation in another context: hyperium/hyper#763
What you expected to happen:
Even on long-running operations, such as a scan over the whole table for deletion, the operation should eventually succeed, barring other issues.
How to reproduce it:
IF I AM CORRECT: perform a long-running Delta operation, such as a delete on a table with millions of rows, so that the client has to reach out to S3 again after a long in-memory phase.
More details:
Note this is just a theory. I wanted to get this issue out there to see what others think; I'm going to dig in and see if I can figure out whether it's an issue with the library or something else.