Consider changing default consolidated=None to False for zarr_version=3 in to_zarr() #10122

Open
aladinor opened this issue Mar 13, 2025 · 5 comments
Labels
topic-zarr Related to zarr storage library

Comments

@aladinor
Contributor

What is your issue?

Currently, in Dataset.to_zarr(), the consolidated parameter defaults to None, which means that xarray attempts to consolidate metadata by default. However, when using zarr_version=3, consolidated metadata is not required and might create issues. See this discussion

Would it make sense to change the default to False when zarr_version=3 is set, given that None currently implies metadata consolidation?
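To make the proposed rule concrete, here is a minimal sketch of how the tri-state `consolidated` argument could be resolved. The helper name `resolve_consolidated` is hypothetical and is not part of xarray; it only models the behaviour described above, where `None` currently implies consolidation.

```python
from typing import Optional


def resolve_consolidated(consolidated: Optional[bool], zarr_version: Optional[int]) -> bool:
    """Hypothetical helper: turn the tri-state `consolidated` argument into a bool."""
    if consolidated is not None:
        # An explicit True/False from the caller always wins.
        return consolidated
    # Proposed rule from this issue: do not consolidate by default for Zarr v3.
    # Any other version keeps today's behaviour (None implies consolidation).
    return zarr_version != 3


print(resolve_consolidated(None, 2))  # True
print(resolve_consolidated(None, 3))  # False
print(resolve_consolidated(True, 3))  # True
```

Until a default changes, passing `consolidated=False` explicitly to `to_zarr()` has the same effect for a single call.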

@aladinor aladinor added the needs triage Issue that has not been reviewed by xarray team member label Mar 13, 2025
@max-sixty
Collaborator

Interesting that @rabernat sounds less keen on consolidated metadata at that link

From afar, I had thought that consolidated metadata was seen as good! In particular, that zarr stores with lots of files were much faster with this enabled, particularly on blob or network-attached storage

I had also thought that appending to a zarr from xarray updated the consolidated metadata. I guess that's wrong?

@TomNicholas TomNicholas added topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 13, 2025
@TomNicholas
Member

Coincidentally, @jhamman and I were just talking about how we feel that consolidated metadata should no longer be the default in general, not just for zarr 3! The logic is:

  • Consolidated metadata never made it into the zarr spec, not even the v3 spec, so our current default writing behaviour immediately encourages you to go off spec and our current default reading behaviour will chastise you for doing something spec-compliant.
  • It is only of use for certain stores (i.e. helping with latency of cloud object stores and helping with traversability of http stores) but our default behaviour is to write consolidated metadata even for stores where it has no use, e.g. MemoryStore and local directory store.
  • The main argument for it is to reduce latency when opening deeply nested cloud object stores, but recent improvements to zarr-python by @d-v-b now mean this is a lot faster even without consolidated metadata, lessening the need for it.

We propose changing consolidated metadata to be opt-in (both write and read), following a deprecation cycle.
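For readers unfamiliar with why consolidation matters on high-latency stores, the toy model below counts metadata reads with and without a consolidated document. It is a sketch under stated assumptions: `CountingStore`, the key names, and the 50-array hierarchy are all made up for illustration and do not follow the Zarr spec or any real store API. Real readers can also issue these reads concurrently, which is where the recent zarr-python speed-ups come from.

```python
import json


class CountingStore:
    """Toy in-memory key-value store that counts reads (stand-in for object storage)."""

    def __init__(self, data):
        self.data = dict(data)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.data[key]


# Toy hierarchy: one root group plus 50 arrays, one metadata document per node.
# Key names are illustrative only.
nodes = ["root"] + [f"root/var{i}" for i in range(50)]
data = {f"{n}/meta.json": json.dumps({"node": n}) for n in nodes}
# Consolidated variant: every node's metadata copied into a single document.
data["consolidated.json"] = json.dumps({n: {"node": n} for n in nodes})


def open_unconsolidated(store):
    # One read per node. A real reader would also need listing calls to
    # discover the nodes; this sketch assumes they are already known.
    return {n: json.loads(store.get(f"{n}/meta.json")) for n in nodes}


def open_consolidated(store):
    # A single read returns the metadata for the whole hierarchy.
    return json.loads(store.get("consolidated.json"))


a, b = CountingStore(data), CountingStore(data)
assert open_unconsolidated(a) == open_consolidated(b)
print(a.reads, b.reads)  # 51 1
```

On a local disk or in memory each read is cheap, so the difference hardly matters; on an object store each read is a network round trip.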

@max-sixty
Collaborator

Awesome blog post! Thanks a lot. +1 from me, AFAIU

@shoyer
Member

shoyer commented Mar 20, 2025

I am not convinced that stopping Xarray's use of consolidated metadata by default would be a service to our users. My main concern is that there are no good alternatives that achieve comparable performance. In the long term, I think something like Icechunk might solve this problem, but for now, even with @d-v-b's very impressive speed-ups, it is still too slow to open cloud-based Zarr stores without consolidated metadata. For example, when I try the example in Earthmover's blog post in Google Colab, it takes 12 seconds with consolidated=False vs 900 ms with consolidated=True.

Two changes that I do think would make sense to improve user experience with consolidated metadata:

  1. We could update the default heuristics to only use consolidated metadata for stores where it is needed, e.g., for stores that access data from remote object stores.
  2. Xarray could ensure that appending to a Zarr store with mode='a' updates consolidated metadata by default. This would avoid the consistency issues discussed in Nested groups not listed in zarr store (?) zarr-developers/zarr-python#2830.
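The consistency issue behind the second point can be sketched with a toy store. Everything here is an assumption made for illustration: the store is a plain dict, and `consolidate`, `list_nodes`, and `append_node` are hypothetical helpers rather than xarray or zarr-python functions. The sketch shows a reader that trusts the consolidated snapshot missing a node that was appended without re-consolidating.

```python
import json

# Toy store: a dict mapping keys to JSON text. Key names are illustrative only.
store = {"a/meta.json": json.dumps({"node": "a"})}


def consolidate(store):
    """Write a snapshot of all per-node metadata under a single key."""
    snapshot = {k: json.loads(v) for k, v in store.items() if k.endswith("/meta.json")}
    store["consolidated.json"] = json.dumps(snapshot)


def list_nodes(store):
    """Reader that trusts the snapshot whenever one exists."""
    if "consolidated.json" in store:
        return sorted(json.loads(store["consolidated.json"]))
    return sorted(k for k in store if k.endswith("/meta.json"))


def append_node(store, name, reconsolidate):
    """Add a node, optionally refreshing the snapshot afterwards."""
    store[f"{name}/meta.json"] = json.dumps({"node": name})
    if reconsolidate:
        consolidate(store)


consolidate(store)
append_node(store, "b", reconsolidate=False)
stale = list_nodes(store)  # the snapshot predates "b", so "b" is invisible
append_node(store, "c", reconsolidate=True)
fresh = list_nodes(store)  # the refreshed snapshot sees every node

print(stale)  # ['a/meta.json']
print(fresh)  # ['a/meta.json', 'b/meta.json', 'c/meta.json']
```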

To respond to Tom's specific points:

  • our current default writing behaviour immediately encourages you to go off spec and our current default reading behaviour will chastise you for doing something spec-compliant.

The other way to fix this would be to stop chastising users :). Yes, consolidated metadata is off spec, but unlike other Zarr extensions it doesn't result in creating data that isn't readable by other Zarr implementations. It only runs the risk of being potentially inconsistent.

  • It is only of use for certain stores (i.e. helping with latency of cloud object stores and helping with traversability of http stores) but our default behaviour is to write consolidated metadata even for stores where it has no use, e.g. MemoryStore and local directory store.

This is a fair concern, but such stores are rarely used in my experience. Distributed filesystems & cloud stores are the norm with Zarr.

  • The main argument for it is to reduce latency when opening deeply nested cloud object stores, but recent improvements to zarr-python by @d-v-b now mean this is a lot faster even without consolidated metadata, lessening the need for it.

Yes, this is a significant improvement, but as noted above, opening large Zarr stores without consolidated metadata is still too slow (12 vs 1 second).

@TomNicholas
Member

TomNicholas commented Mar 28, 2025

Thanks for the input @shoyer, those are all great points.

After further discussion with @jhamman I've made an alternative suggestion for how to remove this annoyance through upstream changes in zarr-python instead - see zarr-developers/zarr-python#2937.

tl;dr: Whether or not one should be using consolidated metadata is fundamentally a property of the store, so there should be an actual property on zarr.Store which expresses this preference for each implementation. Then xarray can just quietly look at that instead of making users think about it.

I think that should allow us to keep reading/writing consolidated metadata by default for stores that benefit from it, whilst not using it for stores which don't, without the user having to understand and specify which is which.
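A rough sketch of what a store-level preference could look like. Nothing here is real zarr-python API: the attribute name `prefers_consolidated_metadata`, the `Store` stand-in class, and the two toy subclasses are all hypothetical, and the actual design is whatever comes out of zarr-developers/zarr-python#2937.

```python
from typing import Optional


class Store:
    """Stand-in base class, not the real zarr-python store class."""

    # Hypothetical attribute name; the upstream proposal may choose another.
    prefers_consolidated_metadata: bool = False


class ToyLocalStore(Store):
    """Low-latency store: per-node metadata reads are cheap."""

    prefers_consolidated_metadata = False


class ToyRemoteStore(Store):
    """High-latency store: one consolidated read avoids many round trips."""

    prefers_consolidated_metadata = True


def use_consolidated(store: Store, consolidated: Optional[bool] = None) -> bool:
    """Honour an explicit user choice, otherwise defer to the store."""
    if consolidated is not None:
        return consolidated
    return store.prefers_consolidated_metadata


print(use_consolidated(ToyLocalStore()))          # False
print(use_consolidated(ToyRemoteStore()))         # True
print(use_consolidated(ToyRemoteStore(), False))  # False
```

With something like this, `consolidated=None` could keep meaning "do the sensible thing" while the definition of sensible lives with each store implementation.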


4 participants