
Codec pipeline memory usage #2904

Open
TomAugspurger opened this issue Mar 10, 2025 · 6 comments
Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)

Comments

@TomAugspurger (Contributor) commented Mar 10, 2025

We discussed memory usage on Friday's community call. I started looking into it at https://github.com/TomAugspurger/zarr-python-memory-benchmark.

https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-uncompressed.html has the memray flamegraph for reading an uncompressed array (400 MB total, split into 10 chunks of 40 MB each). I think the optimal memory usage here is about 400 MB. Our peak memory is about 2x that.

https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-compressed.html has the zstd compressed version. Peak memory is about 1.1 GiB.

I haven't looked too closely at the code, but I wonder if we could be smarter about a few things in certain cases:

  1. For the uncompressed case, we might be able to do a readinto directly into (an appropriate slice of) the out array. We might need to expand the Store API with some kind of readinto, where the caller provides the buffer to read into rather than the store allocating new memory.
  2. For the compressed case, we might be able to improve things once we know the size of the output buffers. I see that numcodecs' zstd.decode takes an output buffer here that we could maybe use. And past that point, maybe all the codecs could reuse one or two buffers rather than allocating a new buffer for each stage of the codec (one buffer if everything can be done in place, two buffers if something can't).
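To make the first idea concrete, here's a minimal sketch of what a read-into-style store could look like. The names `ReadIntoStore` and `getinto` are hypothetical, not part of the current Zarr Store API; a real implementation would sit behind the existing store abstractions.

```python
# Sketch of a hypothetical read-into Store API for the uncompressed case.
# `ReadIntoStore` and `getinto` are illustrative names, not Zarr APIs.
import io

class ReadIntoStore:
    """Toy store backed by a dict of bytes, supporting read-into semantics."""

    def __init__(self, chunks: dict[str, bytes]):
        self._chunks = chunks

    def getinto(self, key: str, out: memoryview) -> int:
        # A file-backed store would call file.readinto(out) here, writing
        # directly into the caller-provided buffer instead of allocating.
        return io.BytesIO(self._chunks[key]).readinto(out)

# Simulate a 1-D array of 8 bytes split into two 4-byte chunks.
store = ReadIntoStore({"c0": b"\x00\x01\x02\x03", "c1": b"\x04\x05\x06\x07"})
out = bytearray(8)              # pre-allocated output "array"
view = memoryview(out)
store.getinto("c0", view[0:4])  # each chunk lands in its slice of the output
store.getinto("c1", view[4:8])
assert bytes(out) == bytes(range(8))  # no per-chunk intermediate buffers
```

Peak memory here is just the output buffer, since each chunk is read directly into its slice.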

I'm not too familiar with the codec pipeline stuff, but will look into this as I have time. Others should feel free to take this if someone gets an itch though. There's some work to be done :)
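The buffer-reuse idea in point 2 could look roughly like the following ping-pong scheme: each stage reads from one buffer and writes into the other, then the roles swap, so no stage allocates fresh memory. The stages here are toy byte transforms standing in for real codecs.

```python
# Two reusable buffers shared across all pipeline stages (a sketch, not the
# actual Zarr codec pipeline). Stages alternate between the two buffers.
def stage_add1(src: memoryview, dst: memoryview) -> None:
    for i in range(len(src)):
        dst[i] = (src[i] + 1) % 256

def stage_xor(src: memoryview, dst: memoryview) -> None:
    for i in range(len(src)):
        dst[i] = src[i] ^ 0xFF

data = bytes(range(4))
buf_a = bytearray(data)        # buffer 1: holds the input initially
buf_b = bytearray(len(data))   # buffer 2: scratch space

src, dst = memoryview(buf_a), memoryview(buf_b)
for stage in (stage_add1, stage_xor):
    stage(src, dst)
    src, dst = dst, src        # swap: this stage's output feeds the next stage

result = bytes(src)
assert result == bytes([254, 253, 252, 251])
```

This only works when every stage produces output the same size as its input; variable-size stages (like decompression) would need the output sizes known up front, as the comment above notes.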

@TomAugspurger TomAugspurger changed the title Memory usage Codec pipeline memory usage Mar 10, 2025
@TomAugspurger TomAugspurger added the performance Potential issues with Zarr performance (I/O, memory, etc.) label Mar 10, 2025
@TomAugspurger (Contributor, Author) commented:
https://github.com/TomAugspurger/zarr-python-memory-benchmark/blob/4039ba687452d65eef081bce1d4714165546422a/sol.py#L41 has a POC for using readinto to read an uncompressed zarr dataset into a pre-allocated buffer. https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/3567246b852d7adacbc10f32a58b0b3f6ac3d50b/reports/memray-flamegraph-sol-read-uncompressed.html shows that it takes almost exactly the size of the output ndarray (so essentially no overhead from Zarr).

https://github.com/TomAugspurger/zarr-python-memory-benchmark/blob/4039ba687452d65eef081bce1d4714165546422a/sol.py#L63 shows an example reading a Zstd compressed dataset. https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/3567246b852d7adacbc10f32a58b0b3f6ac3d50b/reports/memray-flamegraph-sol-read-compressed.html shows that the peak memory usage is approximately the size of the compressed dataset plus the output ndarray (this does all the decompression first; we could decompress chunks sequentially to lower the peak memory usage).

This ignores some complications around slices that don't align with Zarr chunk boundaries, but it's maybe enough to show that we could do better.
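The "decompress chunks sequentially" idea above can be sketched with stdlib zlib standing in for zstd (zstd isn't in the standard library): decompress one chunk at a time directly into the right slice of the output buffer, so only one chunk's worth of compressed data is live at any moment.

```python
# Sequential per-chunk decompression into a pre-allocated output buffer.
# zlib stands in for zstd here; the pattern is the same.
import zlib

chunk_size = 4
raw = bytes(range(8))
compressed = [zlib.compress(raw[i:i + chunk_size])
              for i in range(0, len(raw), chunk_size)]

out = bytearray(len(raw))
view = memoryview(out)
for i, blob in enumerate(compressed):
    # Decompress one chunk at a time; peak memory is one chunk of
    # compressed input plus the output array, not all chunks at once.
    view[i * chunk_size:(i + 1) * chunk_size] = zlib.decompress(blob)

assert bytes(out) == raw
```

With real zstd via numcodecs, the `out=` argument to `decode` could (as noted above) skip even the per-chunk scratch allocation.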

@tomwhite (Contributor) commented:
Thanks for doing this work @TomAugspurger! Coincidentally, I've been looking at memory overheads for Zarr storage operations across different filesystems (local/cloud), compression settings, and Zarr versions: https://github.com/tomwhite/memray-array

> This ignores some complications around slices that don't align with Zarr chunk boundaries, but it's maybe enough to show that we could do better.

Just reducing the number of buffer copies for aligned slices would be a big win for everyone who uses Zarr, since it would improve performance and reduce memory pressure. Hopefully similar techniques could be used for cloud storage too.

@TomAugspurger (Contributor, Author) commented:
Very cool!

> [from https://github.com/tomwhite/memray-array] Reads with no compression incur a single copy from local files, but two copies from S3. This seems to be because the S3 libraries read lots of small blocks then join them into a larger one, whereas local files can be read in one go into a single buffer.

I was wondering about this while looking into the performance of obstore and KvikIO. KvikIO lets the caller provide the out buffer that the data are read into, which avoids the small intermediate buffer allocations and the memcopies into the final output buffer. Probably worth looking into at some point.
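The copy-count difference described above can be illustrated with stdlib `io`: the read-and-join pattern (like the S3 libraries) keeps all the small blocks plus the joined result alive at once, while a single `readinto` (KvikIO-style) fills a caller-owned buffer directly.

```python
# Two read patterns: many small reads joined (extra copy) vs. a single
# readinto into a caller-provided buffer. BytesIO stands in for a file/S3 object.
import io

data = bytes(range(256))

# Pattern 1: read small blocks, then join. The blocks and the joined copy
# are both alive at the join, so peak memory is ~2x the payload.
src = io.BytesIO(data)
blocks = [src.read(64) for _ in range(4)]
joined = b"".join(blocks)

# Pattern 2: read directly into a pre-allocated buffer. One allocation,
# no join step.
out = bytearray(len(data))
n = io.BytesIO(data).readinto(memoryview(out))

assert joined == bytes(out) == data
assert n == len(data)
```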

@tomwhite (Contributor) commented:
I wonder if any of the memory management machinery that has been developed for Apache Arrow would be of use here?

@TomAugspurger (Contributor, Author) commented:
I looked into implementing this today and it'll be a decent amount of effort. There are some issues in the interface provided by the codec pipeline ABC (read takes an out buffer, but decode doesn't) and I got pretty lost in the codec_pipeline implementation (so many iterables of tuples!). I'm not sure where the best place to start is.

Beyond the codec pipeline, I think we'll also need to update the Store and Codec interfaces to add APIs for reading / decoding into an out buffer. This probably has to be opt in (we can't have codecs / stores silently not using an out buffer).
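One way the opt-in could work is capability detection: the pipeline only takes the out-buffer path when a store advertises support, and falls back to the allocating path otherwise. The names `SupportsGetInto` and `getinto` below are hypothetical, not current Zarr APIs.

```python
# Sketch of opt-in out-buffer support via a runtime-checkable Protocol.
# Stores that don't implement `getinto` silently get the fallback path.
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsGetInto(Protocol):
    def getinto(self, key: str, out: memoryview) -> int: ...

def fetch(store, key: str, out: bytearray) -> None:
    if isinstance(store, SupportsGetInto):
        store.getinto(key, memoryview(out))  # zero-copy path (opt-in)
    else:
        out[:] = store.get(key)              # fallback: allocate, then copy

class PlainStore:
    """A store without getinto; fetch() must use the fallback."""
    def __init__(self, d):
        self._d = d
    def get(self, key):
        return self._d[key]

out = bytearray(4)
fetch(PlainStore({"k": b"abcd"}), "k", out)
assert bytes(out) == b"abcd"
```

The same shape would apply to codecs: a `decode_into`-style method that the pipeline uses only when present.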

@dcherian (Contributor) commented Apr 4, 2025

> and I got pretty lost in the codec_pipeline implementation (so many iterables of tuples!)

Not the first person! I did make it out alive, but only barely.
