Description
Environment
Delta-rs version: python v0.16.0
Binding: Python
Environment:
- Cloud provider: AWS
Bug
What happened:
On three of our large Delta tables (out of 500; most of the others are a lot smaller) I've repeatedly been seeing errors related to an S3 timeout. I get two kinds of errors: one indicates a failure reading data from a parquet file, the other is much more generic. Here they are:
Failed to parse parquet: Parquet error: AsyncChunkReader::get_bytes error: Generic S3 error: Error after 10 retries in 174.881515313s, max_retries:10, retry_timeout:180s, source:error sending request for url (https://s3.us-east-1.amazonaws.com/...redacted...some-data-file.parquet): operation timed out
AND
ERROR - Error deleting events from s3a://path-to-delta-table: Failed to parse parquet: Parquet error: AsyncChunkReader::get_bytes error: Generic S3 error: request or response body error: operation timed out
This can happen in various contexts, but the prominent example in my case is needing to scan the entire Delta table for records matching a value and deleting those matches. I log a line right before this delete op runs, and the time delta between that log line and the error is usually around 5 minutes. This time delta is relevant.
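For concreteness, here is a minimal sketch of the kind of delete (and the log line around it) that hits this for us, using the Python deltalake API; the table URI and the predicate column are placeholders, not our real values:

```python
import logging
import time

from deltalake import DeltaTable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Placeholder URI/predicate; the real table is large enough that this
# predicate forces a scan over the whole table before files are rewritten.
dt = DeltaTable("s3a://path-to-delta-table")

log.info("starting delete")  # the log line mentioned above
start = time.monotonic()
dt.delete("event_id = 'some-value'")
log.info("delete finished after %.1fs", time.monotonic() - start)
```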
The second one in particular comes up a lot. This is on a table that gets around 250,000 rows added every hour and is partitioned by date (e.g. 2024-03-18) and hour (e.g. 15).
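For reference, the hourly appends look roughly like the sketch below (assuming write_deltalake from the Python package; the dataframe and column names are illustrative, and the unsafe-rename option stands in for however S3 write locking is actually configured):

```python
import pandas as pd
from deltalake import write_deltalake

# Illustrative hourly batch (~250,000 rows in reality) with the two
# partition columns, e.g. date=2024-03-18 and hour=15.
batch = pd.DataFrame(
    {
        "event_id": ["a", "b"],
        "date": ["2024-03-18", "2024-03-18"],
        "hour": [15, 15],
    }
)

write_deltalake(
    "s3a://path-to-delta-table",
    batch,
    mode="append",
    partition_by=["date", "hour"],
    # Illustrative only: a real setup would use a DynamoDB locking provider
    # or this flag for single-writer S3 tables.
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)
```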
My initial thought was that this was due to a failure to download a larger file such as a checkpoint; but those are "only" 668 MB, which is certainly not unheard of, and the various AWS SDKs will gladly move larger files than that.
Our current theory is instead that this timeout indicates the hyper client makes a request, the library then does some in-memory work (such as scanning a large Delta table) that takes a long time, and by the time delta-rs is ready to reach out to S3 again it has exceeded some kind of keep-alive timeout on the pooled connection and fails. This is why the 5-minute time delta I mentioned earlier is important.
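If that theory holds, one thing worth trying (a sketch only, not a confirmed fix) is tuning the HTTP client options that, as I understand it, delta-rs forwards to object_store via storage_options; whether timeout and pool_idle_timeout are honored here, and what values make sense, are assumptions on my part:

```python
from deltalake import DeltaTable

# Assumed to be passed through to object_store's HTTP client config;
# values are humantime-style duration strings.
storage_options = {
    "timeout": "300s",           # allow individual requests more time
    "pool_idle_timeout": "15s",  # drop idle pooled connections before the server does
}

dt = DeltaTable("s3a://path-to-delta-table", storage_options=storage_options)
dt.delete("event_id = 'some-value'")
```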
This seems related to this comment: delta-rs/crates/aws/src/storage.rs, lines 401 to 406 at commit 785667e.
And finally, I think this sort of timeout is expected from the hyper client, per this (admittedly old) ticket, which describes this exact situation in another context: hyperium/hyper#763
What you expected to happen:
Even on long-running operations, such as a scan over the whole table for deletion, the operation should eventually succeed, barring other issues.
How to reproduce it:
IF I AM CORRECT: perform a long-running Delta operation, such as a delete on a table with millions of rows, so that the client has to reach out to S3 again after a long in-memory phase.
More details:
Note this is just a theory. I wanted to get this issue out there to see what others think; I'm going to dig in and see if I can figure out whether it's an issue with the library or something else.