
refactor v3 data types #2874

Open
wants to merge 80 commits into main

Conversation

Contributor

d-v-b commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR refactors data types for the zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, and confine the v2 / v3 differences to the places where they actually vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different classes also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for converting its scalars to and from JSON.
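As a sketch of the idea (the class and method names below are illustrative, not the API introduced by this PR), a parametrized data type could look like this:

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class FixedLengthUnicode:
    """Hypothetical parametric dtype: one class, many lengths."""

    length: int  # the parameter that a plain string enum cannot express

    def to_native_dtype(self) -> np.dtype:
        # map the zarr-level description to a concrete in-memory dtype
        return np.dtype(f"U{self.length}")

    def default_value(self) -> str:
        # each dtype class owns its default scalar
        return ""

    def to_json_value(self, value: str) -> str:
        # each dtype class owns its scalar <-> JSON conversion
        return str(value)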

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

github-actions bot added the "needs release notes" label Feb 28, 2025
Contributor Author

d-v-b commented Feb 28, 2025

Copying a comment @nenb made in this Zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc., as instances of the same class. If instead we bind a concrete numpy dtype as a class attribute, we need a separate class for U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
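For illustration, the parametric-type vs. parametrized-instance distinction is visible directly in numpy:

import numpy as np

# "U" is the character code shared by every fixed-length unicode dtype;
# the length is the parameter that turns it into a concrete instance.
np.dtype("U2").char               # 'U'
np.dtype("U3").char               # 'U'
np.dtype("U2") == np.dtype("U3")  # False -- different lengths, same parametric type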

name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]] # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
Contributor Author

Child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (which gets assigned automatically from the generic type parameter), a string kind (we use this for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that matches a dtype to its default value, but we could potentially do that).
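For illustration, a hypothetical child class following this pattern might look like the following (the name Int8 and its attribute values are illustrative; np.dtypes requires a recent numpy):

import numpy as np


class Int8(DTypeWrapper[np.dtypes.Int8DType, np.int8]):
    name = "int8"               # feeds into the zarr metadata
    kind = "numeric"            # internal classification of the scalar
    default_value = np.int8(0)  # the per-dtype default scalar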

Comment on lines 268 to 283
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
    kind = "numeric"

    @classmethod
    def from_dtype(cls, dtype: TDType) -> Self:
        return cls()

    def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
        return int(data)

    def from_json_value(
        self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
    ) -> TScalar:
        if check_json_int(data):
            return self.to_dtype(endianness=endianness).type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")
Contributor Author

I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
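For example, the concrete integer wrappers can then reduce to something like this (hypothetical names, building on the IntWrapperBase snippet above):

import numpy as np


class Int16(IntWrapperBase[np.dtypes.Int16DType, np.int16]):
    name = "int16"


class Int32(IntWrapperBase[np.dtypes.Int32DType, np.int32]):
    name = "int32"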

object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))
# TODO: fix these when we don't get hashability errors for certain numpy dtypes
Contributor Author

Note to fix this. I think the LRU store cache was attempting to hash a non-hashable numpy dtype, and this caused very hard-to-debug errors.
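For context, functools.lru_cache hashes its arguments to build a cache key, so an unhashable argument fails at call time, often far from the code that constructed the offending object. A minimal illustration (the function name is a stand-in, not the actual method):

from functools import lru_cache


@lru_cache
def get_spec(key):  # hypothetical stand-in for the cached methods above
    return key


get_spec((64, "int16"))     # fine: tuples of hashables are hashable
get_spec({"shape": (64,)})  # TypeError: unhashable type: 'dict'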

Contributor

Would be good to open an issue and link to it in this comment.


nenb commented Mar 3, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but won't be part of the dtype on disk (as required by the spec).
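A minimal sketch of that proposal, with illustrative names: the endianness lives on the wrapper instance for in-memory use, and is dropped when the dtype is serialized to metadata.

from dataclasses import dataclass
from typing import Literal

import numpy as np


@dataclass(frozen=True)
class Int16:
    # in-memory detail only; not written to the v3 metadata document
    endianness: Literal["little", "big"] = "little"

    def to_dtype(self) -> np.dtype:
        return np.dtype(">i2" if self.endianness == "big" else "<i2")

    def to_json(self) -> str:
        return "int16"  # the on-disk name carries no endianness, per the v3 spec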

Contributor Author

d-v-b commented Mar 4, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but won't be part of the dtype on disk (as required by the spec).

Thanks for the summary! I have implemented the proposed solution.

d-v-b mentioned this pull request Mar 5, 2025

nenb commented Mar 24, 2025

I'm taking this out of draft, as I think the basic design is settled, but I still have some changes I need to make before this can be considered mergeable.

The config

I'm not happy with the config as it stands in this PR.

In main we put JSON forms of codecs in the config, keyed by a classification of a data type, e.g. "string" | "bytes" | "numeric". This requires a function that takes a data type and returns its classification. With this PR we will need a new version of that function, and a new set of categories for data types, because presumably fixed-length strings should have a different default encoding from variable-length strings, suggesting the need for "vlen-string" and "string" categories.

In this PR I have put the names of data types in the config. This is unambiguous but clunky. It requires mangling the names of the data types, because any data type with a "." in its name, like "numpy.datetime64", breaks donfig, so I replace "." characters with "__" for config purposes. We need to find something better than this.
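The mangling described above amounts to something like this (hypothetical helper, shown only to illustrate the workaround):

def dtype_config_key(dtype_name: str) -> str:
    # data type names containing "." break donfig, so "." is replaced
    # with "__" for config purposes
    return dtype_name.replace(".", "__")


dtype_config_key("numpy.datetime64")  # -> "numpy__datetime64"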

Signature of from_json / from_dtype

These signatures need to be widened. Specifically, I need to add a parameter for the codec pipeline (see the sketch after this list). There are two reasons for this:

  • Creating a zarr data type from np.dtype('O') is ambiguous until you know the codecs used for the array.
  • Creating a zarr data type from a zarr v3 array metadata document is ambiguous if that data type is sensitive to byte order. We need to know the value of the bytes codec before we can assign an endianness to the in-memory representation of the data type, so we need to know the codecs for this too.
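A rough sketch of the widened signatures (parameter names are illustrative, not the final API):

from typing import Any

import numpy as np


class ZDType:  # sketch only
    @classmethod
    def from_dtype(cls, dtype: np.dtype, *, codecs: tuple[Any, ...] = ()) -> "ZDType":
        # the codecs may be needed to disambiguate e.g. np.dtype('O')
        ...

    @classmethod
    def from_json(cls, data: Any, *, zarr_format: int, codecs: tuple[Any, ...] = ()) -> "ZDType":
        # the bytes codec determines the in-memory endianness of the result
        ...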

entry points

I haven't set up a test for registering a data type via an entry point. This needs to be done. (Update: this is done in def5eb2.)

docs

The docs need to cover the new data type API and the process for registering user-defined data types.

typing

I had to add enough type: ignore directives to make me think I'm doing something fundamentally wrong with the type structure. The basic problems seem to stem from my use of generics. I'd love some help in this area. Check out src/zarr/core/dtype/wrapper.py if you are interested.

@d-v-b Could you point out exactly where the types were causing an issue? I noticed quite a few in src/zarr/core/dtype/_numpy.py, but on deeper inspection, all of these type issues seemed to be deliberate, and not obviously related to the use of generics. Could you point out one or two examples that were causing problems?

We chatted on Zulip about the config. The TL;DR for folks not on Zulip: we concluded that you would try to avoid touching the config module in this PR as much as possible (which would hopefully mean that the issues above are no longer relevant).

I also had a (non-blocking) question on the performance implications of creating these new dtypes from the zarr metadata. The lookup logic in src/zarr/core/dtype/__init__.py seems like it could be relatively expensive (in the sense that it may need to do a lot of Python lookups and checks just to get the dtype for each chunk). Any thoughts on whether this may introduce a noticeable overhead, and whether it's something that could be (easily) improved on in the future?

Contributor Author

d-v-b commented Mar 25, 2025

@d-v-b Could you point out exactly where the types were causing an issue? I noticed quite a few in src/zarr/core/dtype/_numpy.py, but on deeper inspection, all of these type issues seemed to be deliberate, and not obviously related to the use of generics. Could you point out one or two examples that were causing problems?

ZDType takes two type parameters. Previously, they were not covariant, which was causing problems for me, because covariance is required for statements like this:

class ZDType(Generic[D, S]):
    pass

x: ZDType[AllDtypes, AllScalars]
x = ZDType[IntDtype, IntScalar]()  # if D and S are not covariant, mypy complains here

The two type parameters of ZDType are now covariant, which means mypy is happy with some comparisons that I had previously hidden behind type: ignore statements. Because of the rules around covariant type parameters, we can no longer use either of the two type parameters to annotate method parameters. But I figured this was worth it. If necessary we should continue this chat on Zulip because I need to set up a fair bit of boilerplate to illustrate this stuff.
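For reference, a minimal sketch of how covariant type parameters are declared (illustrative names, not the actual contents of wrapper.py):

from typing import Generic, TypeVar

import numpy as np

D_co = TypeVar("D_co", bound=np.dtype, covariant=True)    # wrapped dtype type
S_co = TypeVar("S_co", bound=np.generic, covariant=True)  # wrapped scalar type


class ZDType(Generic[D_co, S_co]):
    # Covariant parameters may appear in return positions but not as method
    # parameter annotations, which is the trade-off mentioned above.
    ...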

I also had a (non-blocking) question on the performance implications of creating these new dtypes from the zarr metadata. The lookup logic in src/zarr/core/dtype/__init__.py seems like it could be relatively expensive (in the sense that it may need to do a lot of Python lookups and checks just to get the dtype for each chunk). Any thoughts on whether this may introduce a noticeable overhead, and whether it's something that could be (easily) improved on in the future?

The lookup logic you see in src/zarr/core/dtype/__init__.py is to resolve native data types provided by a user, or JSON data types contained in metadata documents. This occurs once per array, and before we start fetching any chunks. When we start fetching chunks we already know what dtype to expect, based on the metadata document, so we shouldn't be doing any dtype resolution at that point.

Contributor Author

d-v-b commented Mar 25, 2025

A few notable changes added yesterday:

  • When creating an array, the endianness in the BytesCodec is now sensitive to the endianness of the data type provided by the user.
    • If the user didn't specify a serializer explicitly, the BytesCodec is adjusted to match the endianness of the dtype.
    • If the user does specify a serializer explicitly, and its endianness does not match the dtype's endianness, the user sees a warning indicating that the serializer's endianness takes priority. We want to ensure as much as possible that create_array(dtype=x).dtype == x.
  • Docs are reworked. Instead of doing pseudo-API docs for the ZDType class, I link to the autogenerated API docs for that class.
  • Some changes to the structure of ZDType that let me use a bit more inheritance when defining the numpy data types. The volume of boilerplate was becoming a source of friction. A notable addition is a cast_value method that takes a python object and returns an instance of the wrapped scalar type.
  • The two type parameters of ZDType are now covariant, which means mypy is happy with some comparisons that I had previously hidden behind type: ignore statements.

I have added some dtype round-trip tests that are currently failing at an interesting place -- variable length strings. I'm hoping to get this sorted out today.

Contributor Author

d-v-b commented Mar 25, 2025

I have a question about numpy object dtype arrays. In zarr-python 2.x, numpy object dtype arrays could contain variable-length strings (object dtype + vlen utf8 codec), variable-length arrays (object dtype + vlen array codec), or arbitrary python objects.

Do we want to support variable-length arrays in zarr-python 3? An equivalent question is "do we want to always map numpy object arrays to variable length strings?"

Right now, there's a 1:1 mapping between numpy data types and zarr data types. Allowing variable-length arrays will break this relationship, because both variable-length strings and variable-length arrays will potentially use the Object dtype (depending on the numpy version).

In terms of this PR, to support vlen arrays, I would need to change the function signatures of the data type resolution methods to take additional information about chunk encoding. But we can also do that at any time down the road.
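The ambiguity is easy to see in numpy itself:

import numpy as np

strings = np.array(["a", "bc", "def"], dtype=object)           # variable-length strings
ragged = np.array([np.arange(2), np.arange(3)], dtype=object)  # variable-length arrays

strings.dtype == ragged.dtype  # True -- both are np.dtype('O'), so the dtype alone
                               # cannot tell us which zarr data type is intended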

Contributor

LDeakin commented Mar 25, 2025

  • If the user does specify a serializer explicitly, and its endianness does not match the dtype's endianness, then a ValueError is raised. We want to ensure as much as possible that create_array(dtype=x).dtype == x.

The in-memory endianness and the on-disk endianness seem like separate concerns; why impose this restriction? Also, have you considered that codecs like AsType, Quantize, FixedScaleOffset etc. could change the dtype (and in-memory endianness) of a chunk before it reaches the bytes codec anyway?

Contributor Author

d-v-b commented Mar 25, 2025

  • If the user does specify a serializer explicitly, and its endianness does not match the dtype's endianness, then a ValueError is raised. We want to ensure as much as possible that create_array(dtype=x).dtype == x.

The in-memory endianness and the on-disk endianness seem like separate concerns; why impose this restriction? Also, have you considered that codecs like AsType, Quantize, FixedScaleOffset etc. could change the dtype (and in-memory endianness) of a chunk before it reaches the bytes codec anyway?

I just edited that comment to indicate that now the user sees a warning instead of an exception.

Also, have you considered that codecs like AsType, Quantize, FixedScaleOffset etc. could change the dtype (and in-memory endianness) of a chunk before it reaches the bytes codec anyway?

I have considered it but there's not really anything we can do about this in zarr-python right now. I think this is a broader problem that the spec needs to solve.

Contributor Author

d-v-b commented Mar 25, 2025

The in-memory endianness and the on-disk endianness seem like separate concerns; why impose this restriction?

Maybe I'm misunderstanding the issue, but my operating assumption has been that when people create an array with a big-endian data type, they expect to open that array, index it, and get arrays with the same big-endian data type they requested. I don't think we have anywhere to store the in-memory endianness in zarr-python, so the only place left is on-disk, via the codec configuration for v3 data or the dtype for v2 data.

Contributor Author

d-v-b commented Mar 25, 2025

I have considered it but there's not really anything we can do about this in zarr-python right now. I think this is a broader problem that the spec needs to solve.

Correction: @normanrz showed me in the Zulip topic #Zarr > extensible dtypes that the array-array codecs do implement methods that allow resolution of their dtype transformation, so I will try to use this information properly in this PR.

github-actions bot removed the "needs release notes" label Mar 25, 2025
Contributor Author

d-v-b commented Mar 28, 2025

I think this is ready for final reviews. Some notable recent changes:

  • I reverted the link between the on-disk endianness (specified by the endian attribute of the bytes codec) and the in-memory endianness of the data type. In-memory endianness and on-disk endianness are now independent; thanks to @LDeakin and @normanrz for helping me with this.
  • The config has been altered. For the purposes of chunk encoding we previously categorized data types as "string", "numeric", or "bytes". As of this PR there are two categories, "variable-length-string" and "default" (see the sketch after this list). Only variable-length strings actually need a special default chunk encoding scheme; all the other types are fixed-length and can use the same default chunk encoding. If we add the variable-length arrays supported in 2.x, we would have to revisit this.
  • I haven't added timedeltas. I don't know if there's anything particularly complex about that dtype.
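A hypothetical illustration of the two-category scheme described above (the keys and values are illustrative, not the actual config):

default_chunk_encoding = {
    # only variable-length strings need a special default encoding
    "variable-length-string": {"serializer": "vlen-utf8", "compressors": ("zstd",)},
    # every fixed-length data type shares the same default
    "default": {"serializer": "bytes", "compressors": ("zstd",)},
}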

I don't think the names of the new-to-zarr-v3 data types are settled. It might make sense to have separate PRs for each data type (e.g., numpy fixed-length null-terminated bytes, fixed-length non-null-terminated bytes, datetime64, fixed-length unicode, variable-length string, etc.).

Contributor

dstansby left a comment


I haven't had time to review all of this (especially the actual implementation), but I've left plenty of comments/requests that I think are worthwhile anyway.

Two points not covered in my inline comments:

  • Are there any breaking changes in the config? If so we need to work out how to deal with them so they're nicely deprecated before the old values stop working.
  • ZDType is imported in several places from the non-public part of the API (zarr.core). This should be changed so it's imported from the public part of the API. Even better, the definition should just move out of zarr.core to zarr.dtype to avoid errors like this.

@@ -0,0 +1,2 @@
Adds zarr-specific data type classes. This replaces the direct use of numpy data types for zarr
v2 and a fixed set of string enums for zarr v3. For more on this new feature, see the `documentation <https://zarr.readthedocs.io/en/stable/user-guide/data_types.html>`_
Contributor

Instead of hard linking, this should use a sphinx anchor so it always references the current built docs (I found this when looking at the PR doc build, where this link doesn't resolve properly)

Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000 (95.4M)
No. bytes stored : 3981552
No. bytes stored : 3981473
Contributor

Do you know why the number of bytes has changed here? Does that mean the data/bytes being stored has changed somehow?

Contributor Author

I think it's because the bytes codec no longer specifies endianness, so the JSON document is slightly smaller, but I haven't confirmed this.

@@ -43,39 +43,30 @@ This is the current default configuration::

>>> zarr.config.pprint()
{'array': {'order': 'C',
'v2_default_compressor': {'bytes': {'checksum': False,
Contributor

If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?

Contributor Author

No, the config in this PR has undergone breaking changes compared to main. We could make those changes backwards-compatible and add deprecation warnings for deprecated keys, but this will require some effort.

----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
Contributor

Zarr is heavily influenced

Do you mean the data format, or Zarr-Python here? Would be good to clarify.

Contributor Author

both are true



# TODO: find a better name for this function
def get_data_type_from_native_dtype(dtype: npt.DTypeLike) -> ZDType[_BaseDType, _BaseScalar]:
Contributor

Suggested change
def get_data_type_from_native_dtype(dtype: npt.DTypeLike) -> ZDType[_BaseDType, _BaseScalar]:
def zarr_dtype_from_numpy_dtype(dtype: npt.DTypeLike) -> ZDType[_BaseDType, _BaseScalar]:

?

Contributor Author

we don't want to over-fit to numpy data types.

Contributor Author

If we add support for non-numpy data types (like Arrow data types), then we don't have to rename this function.
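A usage sketch, assuming the signature shown in the diff above (anything numpy-dtype-like is accepted):

import numpy as np

zdtype = get_data_type_from_native_dtype("uint8")          # from a string
zdtype = get_data_type_from_native_dtype(np.dtype(">i2"))  # from a numpy dtype instance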

# This branch assumes that the data type has been specified in the JSON form
# but it's also possible for numpy data types to be specified as dictionaries, which will
# cause an error in the `get_data_type_from_json`, but that's ok for now
return get_data_type_from_json(dtype, zarr_format=zarr_format) # type: ignore[arg-type]
Contributor

It looks like the case where dtype is a dict needs a test (codecov complains on this line)

Contributor Author

good catch, I can try to create such a case

@@ -1007,7 +1007,7 @@ async def test_asyncgroup_create_array(
assert subnode.dtype == dtype
# todo: fix the type annotation of array.metadata.chunk_grid so that we get some autocomplete
# here.
assert subnode.metadata.chunk_grid.chunk_shape == chunk_shape
assert subnode.chunk_grid.chunk_shape == chunk_shape
Contributor

Is this change related to library changes in this PR?

Contributor Author

yes, I removed the chunk_grid attribute from ArrayV2Metadata

@@ -503,7 +504,7 @@ async def test_consolidated_metadata_backwards_compatibility(
async def test_consolidated_metadata_v2(self):
store = zarr.storage.MemoryStore()
g = await AsyncGroup.from_store(store, attributes={"key": "root"}, zarr_format=2)
dtype = "uint8"
dtype = parse_data_type("uint8", zarr_format=2)
Contributor

Is it not possible to just pass "uint8" any more?

Contributor Author

It is possible, I'm just not doing that in this test.

@@ -219,7 +217,7 @@ async def test_read_consolidated_metadata(
fill_value=0,
chunks=(730,),
attributes={"_ARRAY_DIMENSIONS": ["time"], "dataset": "NMC Reanalysis"},
dtype=np.dtype("int16"),
dtype=Int16(),
Contributor

This is a breaking API change, right, because the type of dtype that is returned in the ArrayV2Metadata object has changed? If so, this needs a clear changelog entry communicating this change.

Contributor Author

Yes, it is breaking, and to be expected, since the goal of this PR is to introduce a new model for data types, which necessarily replaces the old one. I can make this clearer in the release notes.

d-v-b and others added 3 commits March 28, 2025 16:10