Consider changing default consolidated=None to False for zarr_version=3 in to_zarr()
#10122
Comments
Interesting that @rabernat sounds less keen on consolidated metadata at that link. From afar, I had thought that consolidated metadata was seen as good! In particular, I thought that zarr stores with lots of files were much faster with this enabled, particularly on blob or network-attached storage. I had also thought that appending to a zarr from xarray updated the consolidated metadata. I guess that's wrong?
Coincidentally, @jhamman and I were just talking about how we feel that consolidated metadata should no longer be the default in general, not just for zarr 3! The logic is:
We propose changing consolidated metadata to be opt-in (both write and read), following a deprecation cycle.
Awesome blog post! Thanks a lot. +1 from me, AFAIU
I am not convinced that stopping Xarray's use of consolidated metadata by default would be a service to our users. My main concern is that there are no good alternatives that achieve comparable performance. In the long term, I think something like Icechunk might solve this problem, but for now, even with @d-v-b's very impressive speed-ups, it is still too slow to open cloud-based Zarr stores without consolidated metadata. For example, when I try the example in Earthmover's blog post in Google Colab, it takes 12 seconds with Two changes that I do think would make sense to improve user experience with consolidated metadata:
To respond to Tom's specific points:
The other way to fix this would be to stop chastising users :). Yes, consolidated metadata is off spec, but unlike other Zarr extensions it doesn't result in creating data that isn't readable by other Zarr implementations. It only runs the risk of being potentially inconsistent.
This is a fair concern, but such stores are rarely used in my experience. Distributed filesystems & cloud stores are the norm with Zarr.
Yes, this is a significant improvement, but as noted above, opening large Zarr stores without consolidated metadata is still too slow (12 vs 1 second).
Thanks for the input @shoyer, those are all great points. After further discussion with @jhamman I've made an alternative suggestion for how to remove this annoyance through upstream changes in zarr-python instead: see zarr-developers/zarr-python#2937. tl;dr: Whether or not one should be using consolidated metadata is fundamentally a property of the store, so there should be an actual property on I think that should allow us to keep reading/writing consolidated metadata by default for stores that benefit from it, whilst not using it for stores which don't, without the user having to understand and specify which is which.
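The store-driven default suggested in the comment above could look roughly like the sketch below. This is only an illustration under stated assumptions: the attribute name `prefers_consolidated_metadata`, the two store classes, and the helper `should_use_consolidated` are all hypothetical and are not taken from zarr-python or xarray. It also assumes that a high-latency remote store benefits from consolidated metadata while a low-latency in-memory store does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteStoreSketch:
    """Hypothetical high-latency store (e.g. object storage)."""
    # Hypothetical attribute name, not a real zarr-python property.
    prefers_consolidated_metadata: bool = True


@dataclass
class InMemoryStoreSketch:
    """Hypothetical low-latency store."""
    prefers_consolidated_metadata: bool = False


def should_use_consolidated(store, consolidated: Optional[bool] = None) -> bool:
    """Pick a consolidated-metadata setting for a store.

    An explicit True/False from the caller always wins; None defers to
    the (hypothetical) preference advertised by the store itself.
    """
    if consolidated is not None:
        return consolidated
    return store.prefers_consolidated_metadata
```

With this shape, a caller who passes nothing gets whatever the store advertises, and passing `consolidated=True` or `consolidated=False` still overrides it.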
What is your issue?
Currently, in Dataset.to_zarr(), the consolidated parameter defaults to None, which means that xarray attempts to consolidate metadata by default. However, when using zarr_version=3, consolidated metadata is not required and might create issues. See this discussion.
Would it make sense to change the default to False when zarr_version=3 is set, given that None currently implies metadata consolidation?
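The proposed default can be written down as a small decision function. This is a minimal sketch of the behaviour being asked for, not xarray's actual implementation: the helper name `resolve_consolidated` and its `zarr_format` argument are hypothetical and exist only for illustration.

```python
from typing import Optional


def resolve_consolidated(consolidated: Optional[bool], zarr_format: int) -> bool:
    """Hypothetical helper: decide whether to write consolidated metadata.

    An explicit True/False is respected as-is. When the caller leaves
    the value as None, the proposal is to skip consolidation for
    Zarr format 3 and keep consolidating for format 2.
    """
    if consolidated is not None:
        return consolidated  # explicit user choice always wins
    return zarr_format != 3  # proposed default: off for v3, on otherwise
```

Independently of any change to the default, `consolidated` is an existing keyword of Dataset.to_zarr(), so passing `consolidated=False` explicitly already opts out.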