Caching for frequently accessed data files #135

Open

igor-lobanov-maersk opened this issue Jan 15, 2025 · 10 comments

Comments

@igor-lobanov-maersk

I have a scenario when I need to provide a lookup API on top of a delta lake table, and I'm considering duckdb straight on top of ADLS. I have a conceptual question regarding delta scan implementation for which I cannot find any technical details documented, so would appreciate your input.

Most API calls will be clustered around a small subset of the data, so I'm likely going to have a few 'hot' data files getting most of the traffic. I wonder if duckdb does (or can be configured to) cache recently accessed data files of a delta lake table, so that the number of blob reads to ADLS, and with it the likelihood of Azure API request throttling, is reduced?
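
To make this concrete, the shape of the setup I'm considering is roughly the following (account name, container and table path are placeholders, and I'm assuming the credential chain auth from the azure extension docs):

    INSTALL azure; LOAD azure;
    INSTALL delta; LOAD delta;

    -- authenticate to ADLS via the azure extension
    CREATE SECRET az (
        TYPE AZURE,
        PROVIDER CREDENTIAL_CHAIN,
        ACCOUNT_NAME 'myaccount'
    );

    -- a point lookup serving a single API call
    SELECT *
    FROM delta_scan('abfss://container/path/to/table')
    WHERE key = 'some-id';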

@djouallah

No, it is not supported, unfortunately 😔

@samansmink
Collaborator

This is indeed not yet supported; a filesystem-level cache in DuckDB is really high on my wishlist, though. Having DuckDB + a buffer-managed (i.e. disk-offloadable) FS-level cache + delta sounds like a super sweet setup.

@igor-lobanov-maersk
Author

Thanks @djouallah and @samansmink, I am glad to hear that it's on the list!

It does look like duckdb supports a forward HTTP proxy. Would the delta extension honour the HTTP proxy settings when accessing the blob storage? A wild idea I have is to try putting a forward proxy like squid next to duckdb, give it a large disk cache, and see if it can reduce round-trip times to the blob storage. Does this sound like a workable approach?
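
Concretely, I was picturing something like the following, assuming the http_proxy settings documented for DuckDB's HTTP stack also apply to the IO done by the delta extension (which is exactly the part I'm not sure about):

    -- route DuckDB's HTTP requests through a local squid instance
    SET http_proxy = 'localhost:3128';
    -- only needed if the proxy requires authentication
    SET http_proxy_username = 'user';
    SET http_proxy_password = 'pass';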

@samansmink
Collaborator

@igor-lobanov-maersk we actually use squid for testing azure, see this and that.

I suspect this doesn't work right now on delta though, because I think we'd have to forward the proxy config to the delta kernel, since it does its own IO right now.

It would make for an interesting experiment for sure!

@igor-lobanov-maersk
Author

Nice, thanks @samansmink. Do you know if there is a way to configure a forward proxy at the delta kernel level? A quick check in the delta kernel repo did not yield anything obviously useful, but this is far from conclusive.

@igor-lobanov-maersk
Author

For the record, I raised this as a feature request with the delta kernel team as delta-io/delta-kernel-rs#649.

@samansmink could you give any indication whether supporting an FS-offloadable cache for duckdb-delta is on the roadmap?

@roeap

roeap commented Jan 18, 2025

Hey all - came over here for some context for the kernel-rs issue.

I suspect this doesn't work right now on delta though, because I think we'd have to forward the proxy config to the delta kernel, since it does its own IO right now.

In principle all IO is (or can be) handled by the Engine implementation. While I am not great at reading C++ code, am I correct in assuming that we are currently using some implementations from the default engine in duckdb-delta?

duckdb-delta/CMakeLists.txt

Lines 177 to 178 in 7f8cc36

# Add the default client
add_compile_definitions(DEFINE_DEFAULT_ENGINE)

In this case the object store interaction would be handled by the object_store crate, which does include options to configure a proxy, though these are in fact not exposed.

Turns out you can just pass in the http client config as part of the "regular" configuration options.

https://github.com/apache/arrow-rs/blob/af777cd53e56f8382382137b6e08af249c475397/object_store/src/client/mod.rs#L146-L174

Not sure how the config is wired through duckdb though :).

@igor-lobanov-maersk
Author

I finally got around to doing some experiments with Azure blob storage and the delta extension. Good news: it seems that delta-kernel-rs honours the HTTPS_PROXY environment variable on Linux. In a basic setup with mitmproxy I can see HEAD and GET requests to table metadata and individual parquet files appearing in the web console. I'll try with squid next.
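
For the record, the experiment itself was trivial; with mitmproxy listening on its default port, it boiled down to this (the table path is a placeholder):

    -- HTTPS_PROXY=http://localhost:8080 was exported before starting duckdb
    SELECT count(*)
    FROM delta_scan('abfss://container/path/to/table');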

@igor-lobanov-maersk
Author

Having looked at the traces of GET requests for the data files, I learned that the Azure Storage API client used by the duckdb-delta extension relies on the non-standard x-ms-range header for specifying the byte range to fetch. This makes it much harder to use a standard forward proxy to cache blobs. Here's an example:

[Screenshot: mitmproxy trace of a GET request for a data file using the x-ms-range header]

Interestingly, GET requests to metadata checkpoints in _delta_log use normal Range headers:

[Screenshot: mitmproxy trace of a GET request to a _delta_log checkpoint using the standard Range header]

It seems that getting this to work with a caching forward proxy would require some complex request rewriting, and I am no longer sure it is worth getting into that, unless I can somehow convince the API client to always use the Range header for the data files, accepting the consequence that this may not work for files larger than 4 GiB. However, I could not figure out which part of the code between duckdb-azure, duckdb-delta, arrow-rs/object_store and delta-kernel-rs is responsible for generating the request headers. Does anyone know where to look for that? Perhaps @roeap or @samansmink would know for sure.

@samansmink
Collaborator

@samansmink could you give any indication whether supporting an FS-offloadable cache for duckdb-delta is on the roadmap?

It's on an internal wish-list of features, but I can't give any concrete timeline.

Does anyone know where to look for that?

So when querying a Delta Table on Azure using this extension, IO will be performed by:

  • The default client of delta-kernel-rs (using the object_store crate, as @roeap mentioned)
  • The duckdb azure extension using the azure SDK.

If adding a generic http proxy cache turns out to be complex or to require significant reworks of the azure extension, I would say it's likely not worth it. A cache at the DuckDB filesystem level seems the most natural to me and the way forward here.
