
Consolidated metadata preferences on a Store-specific basis #2937

Open
TomNicholas opened this issue Mar 28, 2025 · 8 comments

Comments

@TomNicholas
Member

TomNicholas commented Mar 28, 2025

Problem

The consolidated metadata abstraction is leaky, in the sense that users are often forced to make explicit choices about something that should ideally just be an automatic hidden optimization.

For example, when interacting with zarr via xarray, users often have to pass consolidated=True/False in order to benefit from it or avoid warnings. This is annoying as it adds boilerplate kwargs to every single xarray.open_zarr() and Dataset.to_zarr() call. It also comes up for icechunk, which doesn't need explicit consolidated metadata (as it effectively has its own implementation of consolidated metadata).

Coming up with a general one-size-fits-all rule for consolidated metadata doesn't work - see the differing opinions in pydata/xarray#10122.

The problem is that fundamentally whether or not to try to use consolidated metadata is a store-implementation-specific choice. For some stores it's really important (cloud stores), for some it doesn't really matter (local stores), and for some it's not implemented (or perhaps not even implementable at all!).

Proposed solution

We teach the Store implementation to know whether or not it wants you to use consolidated metadata, so libraries like xarray can ask the store for its preference.

We could do this by adding a new property to the Store ABC:

from abc import ABC, abstractmethod

class Store(ABC):
    @property
    @abstractmethod
    def supports_consolidated_metadata(self) -> bool:
        """Does the store support consolidated metadata?"""
        ...

This could be False for subclasses by default, but True for e.g. FsspecStore or IcechunkStore.

That way xarray can learn what the expected value of the consolidated kwarg should be. The user could then override that value by passing consolidated explicitly, but xarray would be able to default to the sensible choice without explicit specification by the user.

I think that should allow xarray to keep reading/writing consolidated metadata by default for stores that benefit from it, whilst not using it for stores which don't, without the user having to understand and specify which is which.
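To make the proposed defaulting concrete, here is a minimal sketch of how a client library like xarray could resolve the `consolidated` kwarg from the store's advertised preference. All names here are illustrative, not actual zarr-python or xarray API; `supports_consolidated_metadata` is the property proposed above.

```python
def resolve_consolidated(store, consolidated=None):
    """Return the effective value of the ``consolidated`` kwarg.

    An explicit user choice always wins; otherwise fall back to the
    store's own advertised preference.
    """
    if consolidated is not None:
        return consolidated
    # default to False for stores that don't advertise a preference
    return getattr(store, "supports_consolidated_metadata", False)


# Illustrative stand-ins for real store classes:
class LocalStore:
    # a low-latency store that sees little benefit from consolidation
    supports_consolidated_metadata = False


class CloudStore:
    # a high-latency store that benefits from consolidation
    supports_consolidated_metadata = True


assert resolve_consolidated(CloudStore()) is True
assert resolve_consolidated(LocalStore()) is False
# the user's explicit choice always overrides the store's preference
assert resolve_consolidated(CloudStore(), consolidated=False) is False
```

The key design point is that the decision lives in the store, while the user-facing override stays in the client library's signature.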

I don't think this requires any changes to the zarr spec, because consolidated metadata is currently not in the spec.

cc @d-v-b @jhamman @shoyer @ianhi @aladinor

P.S. There is a similar issue for passing zarr_version, which could be fixed in a similar way, but I think that deserves its own issue.

@max-sixty

one small point to add, without wanting to complicate an already complicated issue:

NFS is commonly used by folks (including by me!) storing huge zarr archives. and NFS:

  1. looks like a file-system, by design
  2. has much higher latency than a local SSD — small single-digit ms are not uncommon

so you'd want consolidated metadata, assuming the numbers that @shoyer quotes in pydata/xarray#10122 are representative

so purely discriminating by the store may not be sufficient for all cases

maaaaybe the code could check for block size and use that to estimate whether it's nfs vs local, but could also be confusing

(tbc, this isn't dispositive, just something to weigh)

@d-v-b
Contributor

d-v-b commented Mar 28, 2025

For some stores it's really important (cloud stores), for some it doesn't really matter (local stores), and for some it's not implemented (or even not even implementable at all!)

which stores cannot support consolidated metadata?

@TomNicholas
Member Author

so purely discriminating by the store may not be sufficient for all cases

Interesting @max-sixty - I still think this is a store-level concern though.

maaaaybe the code could check for block size and use that to estimate whether it's nfs vs local, but could also be confusing

Something like that could maybe be done in the implementation of the property / store subclass, but my argument is simply that client code such as xarray should be told what to do by the store.


which stores cannot support consolidated metadata?

I don't know, hence why I didn't write an example 😅 I just thought I heard @jhamman say that earlier. I don't think the difference between "does not support", and "could never support" matters for my argument though.

@d-v-b
Contributor

d-v-b commented Mar 28, 2025

I'm not sure this is actually something that should be a property of a store?

  • All stores can support consolidated metadata (it's just another key: value pair). So supports_consolidated_metadata taken literally should always be True.
  • FSSpecStore wraps fsspec, which supports cloud storage or the local filesystem (and many other file systems). So I don't think a class attribute on FSSpecStore could express your intent.

Instead I feel like your problem would be better solved by a function that takes a store class instance and returns a boolean: True for use consolidated metadata, False for not using it. And I think this function could exist in xarray, no?

@jhamman
Member

jhamman commented Mar 28, 2025

which stores cannot support consolidated metadata?

Icechunk for one! There is little value to supporting it in the ZipStore as well (though it technically works).

@shoyer
Contributor

shoyer commented Mar 29, 2025

As a subtle variation of this idea, what if consolidated metadata were implemented by specific Stores, rather than merely something that might be indirectly supported by stores?

For example:

from zarr import consolidate_metadata

class ConsolidatedFsspecStore(FsspecStore):
    def close(self):
        # re-consolidate on close if any metadata was written
        if self._metadata_changed():
            consolidate_metadata(self)

There is probably something I'm missing about why this wouldn't work, but it could make for a simpler user experience.

@d-v-b
Contributor

d-v-b commented Mar 29, 2025

@shoyer one reason why that would be hard today is because our Stores are generic key-value engines, with no formal representation of "metadata". This is just our current state, not a rule we have to follow. But I think adding this method to zarr groups would work, provided the consuming library is happy using that group as a context manager.

@d-v-b
Contributor

d-v-b commented Mar 31, 2025

@jhamman explained the fundamental issue to me and I think I understand why my initial suggestion of a function that maps stores to bools wouldn't work: if someone (e.g., earthmover) defines their own Store that doesn't support consolidated metadata, then a function defined within zarr-python won't be able to infer anything useful about that externally-defined Store.

so I agree that individual store implementations need to carry around information about whether they support consolidated metadata. Here is one idea to solve this:

  • we imbue our storage layer with information about zarr specs (this means either changing the store ABC or introducing a new class that specializes stores to only support zarr v2 or v3 IO operations. personal preference would be to start with option 2).
  • we add an attribute to stores / new zarr IO class that expresses which metadata keys they can retrieve
  • we change the v3 consolidated metadata implementation to use a separate .zmetadata file instead of in-attributes consolidated metadata
  • whether a store supports consolidated metadata == whether that store instance contains .zmetadata in the list of metadata keys it can fetch.
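The last two bullets can be sketched as follows. This is a hypothetical shape for the proposed version-specialized IO class; none of these names exist in zarr-python today, and the `.zmetadata` key follows the separate-file layout proposed in the third bullet.

```python
class ZarrV3IO:
    """Illustrative version-specialized IO layer wrapping a raw store.

    The wrapper advertises which metadata keys the underlying store can
    fetch; "supports consolidated metadata" then reduces to membership
    of ".zmetadata" in that set.
    """

    def __init__(self, fetchable_metadata_keys):
        self.fetchable_metadata_keys = frozenset(fetchable_metadata_keys)

    @property
    def supports_consolidated_metadata(self) -> bool:
        return ".zmetadata" in self.fetchable_metadata_keys


# A plain store can serve the consolidated-metadata key...
plain = ZarrV3IO({"zarr.json", ".zmetadata"})
# ...while an icechunk-like store, which has its own consolidation
# mechanism, simply never lists it.
icechunk_like = ZarrV3IO({"zarr.json"})

assert plain.supports_consolidated_metadata is True
assert icechunk_like.supports_consolidated_metadata is False
```

The nice property of this formulation is that externally defined stores opt in or out just by declaring their fetchable keys, with no central registry in zarr-python.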
