refactor v3 data types #2874
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
Adds zarr-specific data type classes. This replaces the direct use of numpy data types for zarr | ||
v2 and a fixed set of string enums for zarr v3. For more on this new feature, see the `data types user guide <https://zarr.readthedocs.io/en/stable/user-guide/data_types.html>`_ |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -43,39 +43,30 @@ This is the current default configuration:: | |
|
||
>>> zarr.config.pprint() | ||
{'array': {'order': 'C', | ||
'v2_default_compressor': {'bytes': {'checksum': False, | ||
Review comment: If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?
Reply: No, the config in this PR has undergone breaking changes compared to |
||
'id': 'zstd', | ||
'level': 0}, | ||
'numeric': {'checksum': False, | ||
'id': 'zstd', | ||
'level': 0}, | ||
'string': {'checksum': False, | ||
'v2_default_compressor': {'default': {'checksum': False, | ||
'id': 'zstd', | ||
'level': 0}}, | ||
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}], | ||
'numeric': None, | ||
'raw': None, | ||
'string': [{'id': 'vlen-utf8'}]}, | ||
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}], | ||
'numeric': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'variable-length-string': {'checksum': False, | ||
'id': 'zstd', | ||
'level': 0}}, | ||
'v2_default_filters': {'default': None, | ||
'variable-length-string': [{'id': 'vlen-utf8'}]}, | ||
'v3_default_compressors': {'default': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}], | ||
'string': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}]}, | ||
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []}, | ||
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'}, | ||
'numeric': {'configuration': {'endian': 'little'}, | ||
'name': 'bytes'}, | ||
'string': {'name': 'vlen-utf8'}}, | ||
'write_empty_chunks': False}, | ||
'async': {'concurrency': 10, 'timeout': None}, | ||
'buffer': 'zarr.core.buffer.cpu.Buffer', | ||
'codec_pipeline': {'batch_size': 1, | ||
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'}, | ||
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec', | ||
'variable-length-string': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}]}, | ||
'v3_default_filters': {'default': [], 'variable-length-string': []}, | ||
'v3_default_serializer': {'default': {'configuration': {'endian': 'little'}, | ||
'name': 'bytes'}, | ||
'variable-length-string': {'name': 'vlen-utf8'}}, | ||
'write_empty_chunks': False}, | ||
'async': {'concurrency': 10, 'timeout': None}, | ||
'buffer': 'zarr.core.buffer.cpu.Buffer', | ||
'codec_pipeline': {'batch_size': 1, | ||
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'}, | ||
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec', | ||
'bytes': 'zarr.codecs.bytes.BytesCodec', | ||
'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec', | ||
'endian': 'zarr.codecs.bytes.BytesCodec', | ||
|
@@ -85,7 +76,7 @@ This is the current default configuration:: | |
'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec', | ||
'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec', | ||
'zstd': 'zarr.codecs.zstd.ZstdCodec'}, | ||
'default_zarr_format': 3, | ||
'json_indent': 2, | ||
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer', | ||
'threading': {'max_workers': None}} | ||
'default_zarr_format': 3, | ||
'json_indent': 2, | ||
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer', | ||
'threading': {'max_workers': None}} |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,110 @@ | ||||||||
Data types | ||||||||
Review comment: This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the
Reply: No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case. |
||||||||
========== | ||||||||
|
||||||||
Zarr's data type model | ||||||||
---------------------- | ||||||||
|
||||||||
Every Zarr array has a "data type", which defines the meaning and physical layout of the | ||||||||
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and | ||||||||
Review comment: Do you mean the data format, or Zarr-Python here? Would be good to clarify.
Reply: Both are true. |
||||||||
Zarr-Python supports creating arrays with Numpy data types:: | ||||||||
|
||||||||
>>> import zarr | ||||||||
>>> import numpy as np | ||||||||
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8')) | ||||||||
>>> z | ||||||||
<Array memory:... shape=(10,) dtype=uint8> | ||||||||
|
||||||||
Unlike Numpy arrays, Zarr arrays are designed to be persisted to storage and read by Zarr implementations in different programming languages. | ||||||||
This means Zarr data types must be interpreted correctly when clients read an array. So each Zarr data type defines a procedure for | ||||||||
encoding/decoding that data type to/from Zarr array metadata, and also encoding/decoding **instances** of that data type to/from | ||||||||
array metadata. These serialization procedures depend on the version of the Zarr format used. | ||||||||
|
||||||||
Data types in Zarr version 2 | ||||||||
----------------------------- | ||||||||
|
||||||||
Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well. | ||||||||
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype:: | ||||||||
|
||||||||
>>> import zarr | ||||||||
>>> import numpy as np | ||||||||
>>> import json | ||||||||
>>> store = {} | ||||||||
Review comment, with a suggested change: to break it up a bit? |
||||||||
>>> np_dtype = np.dtype('int64') | ||||||||
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2) | ||||||||
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"] | ||||||||
d-v-b marked this conversation as resolved.
|
||||||||
>>> assert dtype_meta == np_dtype.str # True | ||||||||
Review comment, with a suggested change: Instead of claiming this assert is true in a comment, it would be nicer to show it explicitly.
|
||||||||
>>> dtype_meta | ||||||||
'<i8' | ||||||||
|
||||||||
.. note:: | ||||||||
The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example, | ||||||||
in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information. | ||||||||
|
||||||||
In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also | ||||||||
defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers, | ||||||||
Review comment, with a suggested change: I scratched my head for a bit wondering why a scalar representation was needed, before realising (I think I'm right?). I'm not sure my suggestion is very well written, but something similar to explain why here might be nice?
Reply: The scalar representation is only used for the fill value metadata, so I will say as much in the docs. |
||||||||
as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings. | ||||||||
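As a minimal sketch of the scalar encoding described above, and not Zarr-Python's actual implementation, the following shows how a float fill value could be made JSON-safe. The helper names ``encode_float_fill_value`` and ``decode_float_fill_value`` are hypothetical. The strings ``"NaN"``, ``"Infinity"``, and ``"-Infinity"`` are assumed from the fill value encoding in the Zarr V2 specification; the text above only says "special strings".

```python
import json
import math
from typing import Union


def encode_float_fill_value(value: float) -> Union[float, str]:
    """Return a JSON-safe form of a float (hypothetical helper)."""
    # JSON has no literal for NaN or infinity, so use strings for those.
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_float_fill_value(data: Union[float, str]) -> float:
    """Invert encode_float_fill_value (hypothetical helper)."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    if isinstance(data, str):
        return special[data]
    return float(data)


# Finite floats pass through as JSON numbers; non-finite ones become strings.
encoded = [encode_float_fill_value(v) for v in (1.5, math.nan, math.inf, -math.inf)]
document = json.dumps({"fill_value_examples": encoded})
```

Because the non-finite values are plain strings, the resulting document is valid JSON and can be read by any JSON parser, which matters for arrays that are read by other Zarr implementations.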
|
||||||||
Data types in Zarr version 3 | ||||||||
----------------------------- | ||||||||
|
||||||||
Zarr V3 brings several key changes to how data types are represented: | ||||||||
Review comment: Might be nice to link out the relevant bit of the Zarr spec here?
Reply: Good idea. |
||||||||
|
||||||||
- Zarr V3 identifies the basic data types as strings like ``int8``, ``int16``, etc. In Zarr V2 ``int8`` would be represented as ``|i1``, and ``int16`` would be ``>i2`` **or** ``<i2``, depending on the endianness. | ||||||||
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte data types would be stored in ``JSON`` with an encoding that included endianness. Instead, | ||||||||
Zarr V3 requires that endianness, where applicable, is specified in the ``codecs`` attribute of array metadata. | ||||||||
- Zarr V3 data types can also take the form of a ``JSON`` object like | ||||||||
``{"name": "foo", "configuration": {"parameter": "value"}}``. This structure facilitates specifying data types that take parameters. | ||||||||
Review comment: Could you give an example of a data type that takes parameters?
Reply: No, I cannot! The Zarr V3 spec does not define any data types that use the |
||||||||
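The differences listed above can be illustrated with a small translation sketch. It is not part of Zarr-Python: ``v2_to_v3``, ``bytes_codec_metadata``, and the lookup tables are hypothetical names, and the sketch handles only fixed-size integer and float types. It assumes that the ``bytes`` codec may omit its ``configuration`` when endianness does not apply.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical lookup tables for this sketch; not Zarr-Python's API.
_BYTE_ORDER = {"<": "little", ">": "big", "|": None}
_KIND = {"i": "int", "u": "uint", "f": "float"}


class V3DataType(NamedTuple):
    name: str               # Zarr V3 identifier, e.g. "int16"
    endian: Optional[str]   # moves out of the data type and into the codecs


def v2_to_v3(v2_dtype: str) -> V3DataType:
    """Split a V2 string such as '<i8' into a V3 name and an endianness."""
    match = re.fullmatch(r"([<>|])([iuf])(\d+)", v2_dtype)
    if match is None:
        raise ValueError(f"unsupported Zarr V2 data type: {v2_dtype!r}")
    order, kind, size = match.groups()
    # V2 counts bytes ('i8'), V3 counts bits ('int64').
    return V3DataType(name=f"{_KIND[kind]}{int(size) * 8}", endian=_BYTE_ORDER[order])


def bytes_codec_metadata(endian: Optional[str]) -> dict:
    """Build the JSON form of a 'bytes' codec entry carrying the endianness."""
    if endian is None:
        return {"name": "bytes"}
    return {"name": "bytes", "configuration": {"endian": endian}}


int64 = v2_to_v3("<i8")
int8 = v2_to_v3("|i1")
codec = bytes_codec_metadata(int64.endian)
```

Here ``"<i8"`` becomes the name ``int64`` plus a separate ``little`` endianness that ends up in the codec entry, while ``"|i1"`` becomes ``int8`` with no endianness at all.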
|
||||||||
|
||||||||
Data types in Zarr-Python | ||||||||
------------------------- | ||||||||
|
||||||||
The two Zarr formats that Zarr-Python supports specify data types differently: | ||||||||
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version | ||||||||
3 are encoded as either strings or ``JSON`` objects, | ||||||||
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types. | ||||||||
|
||||||||
To abstract over these syntactical and semantic differences, Zarr-Python uses a class called `ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to wrap native data types (e.g., Numpy data types) and provide Zarr V2 and Zarr V3 compatibility routines. | ||||||||
Review comment, with a suggested change: Is there a reason to use 'native' (new jargon) instead of Numpy if native just means Numpy?
Reply: Part of the goal of this PR is to abstract over any array data type, including but not limited to Numpy data types. |
||||||||
Each data type supported by Zarr-Python is modeled by a subclass of ``ZDType``, which provides an API for the following operations: | ||||||||
|
||||||||
- Wrapping / unwrapping a native data type | ||||||||
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata. | ||||||||
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata. | ||||||||
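To make the three operations above concrete, here is a toy stand-in written with the standard library only. ``ToyInt8`` is hypothetical and does not subclass the real ``ZDType``; it borrows the method names ``to_json``, ``to_json_value``, and ``from_json_value`` from the usage example in this guide, and represents the wrapped native data type as a plain string.

```python
from dataclasses import dataclass
from typing import Literal

ZarrFormat = Literal[2, 3]


@dataclass(frozen=True)
class ToyInt8:
    """Hypothetical stand-in for a ZDType subclass; not Zarr-Python's API."""

    # The wrapped "native" data type, shown here as a plain string.
    native_name: str = "int8"

    def to_json(self, zarr_format: ZarrFormat) -> str:
        # Encode the data type itself for array metadata.
        if zarr_format == 2:
            return "|i1"   # Numpy-style string used by Zarr V2
        if zarr_format == 3:
            return "int8"  # plain name used by Zarr V3
        raise ValueError(f"unknown zarr_format: {zarr_format!r}")

    def to_json_value(self, value: int, zarr_format: ZarrFormat) -> int:
        # Encode a scalar; integers are stored as JSON numbers.
        if not -128 <= value <= 127:
            raise ValueError(f"{value} does not fit in int8")
        return int(value)

    def from_json_value(self, data: int, zarr_format: ZarrFormat) -> int:
        # Decode a scalar read back from array metadata.
        return self.to_json_value(int(data), zarr_format)


toy = ToyInt8()
v2_name = toy.to_json(zarr_format=2)
v3_name = toy.to_json(zarr_format=3)
round_trip = toy.from_json_value(toy.to_json_value(42, zarr_format=3), zarr_format=3)
```

The point of the design is that callers ask one object for either representation, so the format-specific details stay inside the data type class.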
|
||||||||
|
||||||||
Example Usage | ||||||||
~~~~~~~~~~~~~ | ||||||||
|
||||||||
.. code-block:: python | ||||||||
|
||||||||
import numpy as np | ||||||||
from zarr.core.dtype.wrapper import Int8 | ||||||||
|
||||||||
# Create a ZDType instance from a native dtype | ||||||||
int8 = Int8.from_dtype(np.dtype('int8')) | ||||||||
|
||||||||
# Convert back to native dtype | ||||||||
native_dtype = int8.to_dtype() | ||||||||
assert native_dtype == np.dtype('int8') | ||||||||
|
||||||||
# Get the default value | ||||||||
default_value = int8.default_value() | ||||||||
assert default_value == np.int8(0) | ||||||||
|
||||||||
# Serialize to JSON | ||||||||
json_representation = int8.to_json(zarr_format=3) | ||||||||
|
||||||||
# Serialize a scalar value | ||||||||
json_value = int8.to_json_value(42, zarr_format=3) | ||||||||
assert json_value == 42 | ||||||||
|
||||||||
# Deserialize a scalar value | ||||||||
scalar_value = int8.from_json_value(42, zarr_format=3) | ||||||||
assert scalar_value == np.int8(42) | ||||||||
Review comment on lines +78 to +102: This should be changed to a doctest like in other guides. Then the output would be shown/tested. |
||||||||
|
||||||||
Custom Data Types | ||||||||
~~~~~~~~~~~~~~~~~ | ||||||||
|
||||||||
Users can define custom data types by subclassing ``ZDType`` and implementing the required methods. | ||||||||
Once defined, the custom data type can be registered with Zarr-Python to enable seamless integration with the library. | ||||||||
|
||||||||
<TODO: example of defining a custom data type> | ||||||||
Review comment, with a suggested change: I would get rid of this line, and open an issue to keep track of this.
Review comment: This is a super nice guide! |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,6 +8,7 @@ User guide | |
|
||
installation | ||
arrays | ||
data_types | ||
groups | ||
attributes | ||
storage | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -52,7 +52,7 @@ a chunk shape is based on simple heuristics and may be far from optimal. E.g.:: | |
|
||
>>> z4 = zarr.create_array(store={}, shape=(10000, 10000), chunks='auto', dtype='int32') | ||
>>> z4.chunks | ||
(625, 625) | ||
(313, 625) | ||
Review comment: The fact that automatic chunk determination has changed in some cases should be documented in a changelog entry.
Reply: That would require figuring out why the automatic chunk determination changed, which I have not done yet. |
||
|
||
If you know you are always going to be loading the entire array into memory, you | ||
can turn off chunks by providing ``chunks`` equal to ``shape``, in which case there | ||
|
@@ -91,15 +91,15 @@ To use sharding, you need to specify the ``shards`` parameter when creating the | |
>>> z6.info | ||
Type : Array | ||
Zarr format : 3 | ||
Data type : DataType.uint8 | ||
Data type : UInt8() | ||
Shape : (10000, 10000, 1000) | ||
Shard shape : (1000, 1000, 1000) | ||
Chunk shape : (100, 100, 100) | ||
Order : C | ||
Read-only : False | ||
Store type : MemoryStore | ||
Filters : () | ||
Serializer : BytesCodec(endian=<Endian.little: 'little'>) | ||
Serializer : BytesCodec(endian=None) | ||
Compressors : (ZstdCodec(level=0, checksum=False),) | ||
No. bytes : 100000000000 (93.1G) | ||
|
||
|
@@ -121,7 +121,7 @@ ratios, depending on the correlation structure within the data. E.g.:: | |
>>> c.info_complete() | ||
Type : Array | ||
Zarr format : 3 | ||
Data type : DataType.int32 | ||
Data type : Int32(endianness='little') | ||
Shape : (10000, 10000) | ||
Chunk shape : (1000, 1000) | ||
Order : C | ||
|
@@ -140,7 +140,7 @@ ratios, depending on the correlation structure within the data. E.g.:: | |
>>> f.info_complete() | ||
Type : Array | ||
Zarr format : 3 | ||
Data type : DataType.int32 | ||
Data type : Int32(endianness='little') | ||
Shape : (10000, 10000) | ||
Chunk shape : (1000, 1000) | ||
Order : F | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
from __future__ import annotations | ||
|
||
from abc import abstractmethod | ||
from typing import TYPE_CHECKING, Any, Generic, TypeVar | ||
from typing import TYPE_CHECKING, Generic, TypeVar | ||
|
||
from zarr.abc.metadata import Metadata | ||
from zarr.core.buffer import Buffer, NDBuffer | ||
|
@@ -12,11 +12,10 @@ | |
from collections.abc import Awaitable, Callable, Iterable | ||
from typing import Self | ||
|
||
import numpy as np | ||
|
||
from zarr.abc.store import ByteGetter, ByteSetter | ||
from zarr.core.array_spec import ArraySpec | ||
from zarr.core.chunk_grids import ChunkGrid | ||
from zarr.core.dtype.wrapper import ZDType, _BaseDType, _BaseScalar | ||
from zarr.core.indexing import SelectorTuple | ||
|
||
__all__ = [ | ||
|
@@ -93,7 +92,13 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self: | |
""" | ||
return self | ||
|
||
def validate(self, *, shape: ChunkCoords, dtype: np.dtype[Any], chunk_grid: ChunkGrid) -> None: | ||
def validate( | ||
self, | ||
*, | ||
shape: ChunkCoords, | ||
dtype: ZDType[_BaseDType, _BaseScalar], | ||
Review comment: Given this is a type on a public part of the API, I think |
||
chunk_grid: ChunkGrid, | ||
) -> None: | ||
"""Validates that the codec configuration is compatible with the array metadata. | ||
Raises errors when the codec configuration is not compatible. | ||
|
||
|
@@ -285,7 +290,9 @@ def supports_partial_decode(self) -> bool: ... | |
def supports_partial_encode(self) -> bool: ... | ||
|
||
@abstractmethod | ||
def validate(self, *, shape: ChunkCoords, dtype: np.dtype[Any], chunk_grid: ChunkGrid) -> None: | ||
def validate( | ||
self, *, shape: ChunkCoords, dtype: ZDType[_BaseDType, _BaseScalar], chunk_grid: ChunkGrid | ||
) -> None: | ||
"""Validates that all codec configurations are compatible with the array metadata. | ||
Raises errors when a codec configuration is not compatible. | ||
|
||
|
Review comment: Do you know why the number of bytes has changed here? Does that mean the data/bytes being stored has changed somehow?
Reply: I think it's because the bytes codec no longer specifies endianness, so the JSON document is slightly smaller, but I haven't confirmed this.