refactor v3 data types #2874

Open
wants to merge 80 commits into
base: main
80 commits
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
8e6924d
Update changes/2874.feature.rst
d-v-b Mar 28, 2025
25b1527
Update docs/user-guide/data_types.rst
d-v-b Mar 28, 2025
0a5d14e
Update docs/user-guide/data_types.rst
d-v-b Mar 28, 2025
2 changes: 2 additions & 0 deletions changes/2874.feature.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Adds zarr-specific data type classes. This replaces the direct use of numpy data types for zarr
v2 and a fixed set of string enums for zarr v3. For more on this new feature, see the `data types user guide <https://zarr.readthedocs.io/en/stable/user-guide/data_types.html>`_.
14 changes: 7 additions & 7 deletions docs/user-guide/arrays.rst
Original file line number Diff line number Diff line change
@@ -182,7 +182,7 @@ which can be used to print useful diagnostics, e.g.::
>>> z.info
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -199,7 +199,7 @@ prints additional diagnostics, e.g.::
>>> z.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -246,7 +246,7 @@ built-in delta filter::
The default compressor can be changed using Zarr's
:ref:`user-guide-config`, e.g.::

>>> with zarr.config.set({'array.v2_default_compressor.numeric': {'id': 'blosc'}}):
>>> with zarr.config.set({'array.v2_default_compressor.default': {'id': 'blosc'}}):
... z = zarr.create_array(store={}, shape=(100000000,), chunks=(1000000,), dtype='int32', zarr_format=2)
>>> z.filters
()
@@ -286,7 +286,7 @@ Here is an example using a delta filter with the Blosc compressor::
>>> z.info
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -600,18 +600,18 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
>>> a.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.uint8
Data type : UInt8()
Shape : (10000, 10000)
Shard shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
Read-only : False
Store type : LocalStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Serializer : BytesCodec(endian=None)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000 (95.4M)
No. bytes stored : 3981552
No. bytes stored : 3981473
Contributor

Do you know why the number of bytes has changed here? Does that mean the data/bytes being stored has changed somehow?

Contributor Author

I think it's because the bytes codec no longer specifies endianness, so the JSON document is slightly smaller, but I haven't confirmed this.

Storage ratio : 25.1
Shards Initialized : 100
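
On the byte-count question in the thread above: the author's explanation is unconfirmed, but it is easy to check that dropping the ``endian`` configuration from a codec entry does shrink the serialized metadata. The sketch below is an illustration only. The two dictionaries are hypothetical fragments rather than the real ``zarr.json`` contents, and ``indent=2`` is assumed to mirror the ``json_indent`` default shown in the config docs.

```python
import json

# Hypothetical codec entries, for illustration only; a real zarr.json
# document contains many more fields around these.
codec_with_endian = {"name": "bytes", "configuration": {"endian": "little"}}
codec_without_endian = {"name": "bytes"}


def encoded_size(obj: dict, indent: int = 2) -> int:
    """Return the size in bytes of the JSON text for ``obj``."""
    return len(json.dumps(obj, indent=indent).encode("utf-8"))


saved = encoded_size(codec_with_endian) - encoded_size(codec_without_endian)
print(f"bytes saved per codec entry: {saved}")
```

The exact difference in stored bytes depends on how the entry is nested and indented inside the full metadata document, so this does not reproduce the specific numbers in the diff.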

59 changes: 25 additions & 34 deletions docs/user-guide/config.rst
Original file line number Diff line number Diff line change
@@ -43,39 +43,30 @@ This is the current default configuration::

>>> zarr.config.pprint()
{'array': {'order': 'C',
'v2_default_compressor': {'bytes': {'checksum': False,
Contributor

If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?

Contributor Author

no, the config in this PR has undergone breaking changes compared to main. We could make those changes backwards-compatible and add deprecation warnings to deprecated keys but this will require some effort.

'id': 'zstd',
'level': 0},
'numeric': {'checksum': False,
'id': 'zstd',
'level': 0},
'string': {'checksum': False,
'v2_default_compressor': {'default': {'checksum': False,
'id': 'zstd',
'level': 0}},
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}],
'numeric': None,
'raw': None,
'string': [{'id': 'vlen-utf8'}]},
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'numeric': [{'configuration': {'checksum': False,
'level': 0},
'variable-length-string': {'checksum': False,
'id': 'zstd',
'level': 0}},
'v2_default_filters': {'default': None,
'variable-length-string': [{'id': 'vlen-utf8'}]},
'v3_default_compressors': {'default': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []},
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'},
'numeric': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
'codec_pipeline': {'batch_size': 1,
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
'variable-length-string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'default': [], 'variable-length-string': []},
'v3_default_serializer': {'default': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'variable-length-string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
'codec_pipeline': {'batch_size': 1,
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
'bytes': 'zarr.codecs.bytes.BytesCodec',
'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec',
'endian': 'zarr.codecs.bytes.BytesCodec',
@@ -85,7 +76,7 @@ This is the current default configuration::
'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec',
'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec',
'zstd': 'zarr.codecs.zstd.ZstdCodec'},
'default_zarr_format': 3,
'json_indent': 2,
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
'threading': {'max_workers': None}}
'default_zarr_format': 3,
'json_indent': 2,
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
'threading': {'max_workers': None}}
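
The review thread on this hunk mentions that the renamed config keys could be made backwards-compatible with deprecation warnings. A minimal sketch of that idea follows. ``migrate_defaults`` and the mapping tables are hypothetical and not part of Zarr-Python. The ``numeric`` to ``default`` and ``string`` to ``variable-length-string`` renames are inferred from this diff, and treating ``bytes`` and ``raw`` as dropped keys is an assumption.

```python
import warnings
from typing import Any

# Assumed renames, inferred from the old and new default configs in this diff.
_RENAMED_KEYS = {"numeric": "default", "string": "variable-length-string"}
# Assumption: these old keys have no visible replacement in the new config.
_DROPPED_KEYS = {"bytes", "raw"}


def migrate_defaults(section: dict[str, Any]) -> dict[str, Any]:
    """Translate one old-style per-category section to the new key names."""
    migrated: dict[str, Any] = {}
    for key, value in section.items():
        if key in _RENAMED_KEYS:
            new_key = _RENAMED_KEYS[key]
            warnings.warn(
                f"config key {key!r} is deprecated; use {new_key!r} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # An explicitly provided new-style key takes precedence.
            migrated.setdefault(new_key, value)
        elif key in _DROPPED_KEYS:
            warnings.warn(
                f"config key {key!r} is no longer used and was ignored",
                DeprecationWarning,
                stacklevel=2,
            )
        else:
            migrated[key] = value
    return migrated


old_section = {"numeric": {"id": "blosc"}, "string": {"id": "zstd", "level": 0}}
print(migrate_defaults(old_section))
```

A shim like this would let ``zarr.config.set({'array.v2_default_compressor.numeric': ...})`` keep working for a deprecation period, at the cost of carrying the mapping table.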
6 changes: 3 additions & 3 deletions docs/user-guide/consolidated_metadata.rst
Original file line number Diff line number Diff line change
@@ -47,7 +47,7 @@ that can be used.:
>>> from pprint import pprint
>>> pprint(dict(sorted(consolidated_metadata.items())))
{'a': ArrayV3Metadata(shape=(1,),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -60,7 +60,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'b': ArrayV3Metadata(shape=(2, 2),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -73,7 +73,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'c': ArrayV3Metadata(shape=(3, 3, 3),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(3, 3, 3)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
110 changes: 110 additions & 0 deletions docs/user-guide/data_types.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
Data types
Member

This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?

Contributor Author

No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.

==========

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
Contributor

Zarr is heavily influenced

Do you mean the data format, or Zarr-Python here? Would be good to clarify.

Contributor Author

both are true

Zarr-Python supports creating arrays with Numpy data types::

>>> import zarr
>>> import numpy as np
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
>>> z
<Array memory:... shape=(10,) dtype=uint8>

Unlike Numpy arrays, Zarr arrays are designed to be persisted to storage and read by Zarr implementations in different programming languages.
This means Zarr data types must be interpreted correctly when clients read an array. So each Zarr data type defines a procedure for
encoding/decoding that data type to/from Zarr array metadata, and also encoding/decoding **instances** of that data type to/from
array metadata. These serialization procedures depend on the version of the Zarr format used.

Data types in Zarr version 2
-----------------------------

Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well.
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype::

>>> import zarr
>>> import numpy as np
>>> import json
>>> store = {}
Contributor

Suggested change
>>> store = {}
>>>
>>> store = {}

to break it up a bit?

>>> np_dtype = np.dtype('int64')
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
>>> assert dtype_meta == np_dtype.str # True
Contributor

Instead of claiming this assert is true in a comment, it would be nicer just to show it explicitly

Suggested change
>>> assert dtype_meta == np_dtype.str # True
>>> np_dtype.str
'<i8'

>>> dtype_meta
'<i8'

.. note::
The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information.
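
The byte-order prefix described in the note can be inspected with NumPy alone. The following sketch uses only the NumPy ``dtype`` API and shows how the ``str`` attribute encodes byte order (``<`` little, ``>`` big, ``|`` not applicable) along with the kind and item size. The dtypes are spelled with explicit byte order so the result does not depend on the host machine.

```python
import numpy as np

# Explicit byte-order prefixes keep the output platform-independent.
for spec in ("int8", "<i2", ">i2", "<i8", "<f4"):
    dt = np.dtype(spec)
    print(f"{spec:>5} -> str={dt.str!r} itemsize={dt.itemsize}")

# Single-byte types have no byte order, so NumPy reports '|'.
assert np.dtype("int8").str == "|i1"

# The same logical type has two distinct Zarr V2 identifiers.
assert np.dtype("<i2").str == "<i2"
assert np.dtype(">i2").str == ">i2"

# The string round-trips back to an equal dtype.
assert np.dtype(np.dtype(">i2").str) == np.dtype(">i2")
```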

In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also
defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers,
Contributor

Suggested change
defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers,
defines a metadata representation of scalars associated with that data type (that can be used e.g., for storing fill values in the metadata). Integers are stored as ``JSON`` numbers,

I scratched my head for a bit wondering why a scalar representation was needed, before realising (I think I'm right?). I'm not sure my suggestion is very well written, but something similar to explain why here might be nice?

Contributor Author

the scalar representation is only used for the fill value metadata, so I will say as much in the docs

as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
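
The special-string rule for floats can be sketched in a few lines. The helpers below are hypothetical and are not Zarr-Python functions. They assume the strings ``"NaN"``, ``"Infinity"``, and ``"-Infinity"`` used by the Zarr specifications for float fill values.

```python
from __future__ import annotations

import json
import math


def encode_float_fill_value(value: float) -> float | str:
    """Map a float to a JSON-safe value (hypothetical helper)."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_float_fill_value(data: float | str) -> float:
    """Invert encode_float_fill_value (hypothetical helper)."""
    if isinstance(data, str):
        special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
        return special[data]
    return float(data)


# Standard JSON has no NaN or infinity literals, hence the string encoding.
for value in (1.5, math.nan, math.inf, -math.inf):
    encoded = json.dumps(encode_float_fill_value(value))
    decoded = decode_float_fill_value(json.loads(encoded))
    print(f"{value!r:>5} -> {encoded} -> {decoded!r}")
```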

Data types in Zarr version 3
-----------------------------

Zarr V3 brings several key changes to how data types are represented:
Contributor

Might be nice to link out the relevant bit of the Zarr spec here?

Contributor Author

good idea


- Zarr V3 identifies the basic data types as strings like ``int8``, ``int16``, etc. In Zarr V2 ``int8`` would be represented as ``|i1``, and ``int16`` as ``>i2`` **or** ``<i2``, depending on the endianness.
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte data types would be stored in ``JSON`` with an encoding that included endianness. Instead,
Zarr V3 requires that endianness, where applicable, is specified in the ``codecs`` attribute of array metadata.
- Zarr V3 data types can also take the form of a ``JSON`` object like
``{"name": "foo", "configuration": {"parameter": "value"}}``. This structure facilitates specifying data types that take parameters.
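
The two metadata shapes listed above (a bare string, or an object with ``name`` and ``configuration``) can be normalized with a small function. This is a sketch with hypothetical helper names, not the Zarr-Python implementation. ``v2_to_v3_name`` relies on NumPy's ``dtype.name`` and is only meaningful for simple numeric types.

```python
from __future__ import annotations

from typing import Any

import numpy as np


def parse_v3_data_type(meta: str | dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Normalize V3 data type metadata to (name, configuration)."""
    if isinstance(meta, str):
        return meta, {}
    return meta["name"], dict(meta.get("configuration", {}))


def v2_to_v3_name(v2_dtype: str) -> str:
    """Drop the byte order from a V2 identifier such as '>i2'."""
    return np.dtype(v2_dtype).name


print(parse_v3_data_type("int8"))
print(parse_v3_data_type({"name": "foo", "configuration": {"parameter": "value"}}))

# Both V2 spellings of a 16-bit integer collapse to one V3 name.
print(v2_to_v3_name("<i2"), v2_to_v3_name(">i2"), v2_to_v3_name("|i1"))
```

Because the V3 name carries no byte order, the endianness discarded by ``v2_to_v3_name`` has to be recorded elsewhere, which is the role of the ``codecs`` attribute mentioned in the list.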
Contributor

Could you give an example of a data type that takes parameters?

Contributor Author

no I cannot! the zarr v3 spec does not define any data types that use the {name, config} metadata representation (although it does define that representation)



Data types in Zarr-Python
-------------------------

The two Zarr formats that Zarr-Python supports specify data types differently:
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version
3 are encoded as either strings or ``JSON`` objects,
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.

To abstract over these syntactical and semantic differences, Zarr-Python uses a class called `ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to wrap native data types (e.g., Numpy data types) and provide Zarr V2 and Zarr V3 compatibility routines.
Contributor

Suggested change
To abstract over these syntactical and semantic differences, Zarr-Python uses a class called `ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to wrap native data types (e.g., Numpy data types) and provide Zarr V2 and Zarr V3 compatibility routines.
To abstract over these syntactical and semantic differences, Zarr-Python uses a class called `ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to wrap Numpy data types and provide Zarr V2 and Zarr V3 compatibility routines.

Is there a reason to use 'native' (new jargon) instead of Numpy if native just means Numpy?

Contributor Author

part of the goal of this PR is to abstract over any array data type, including but not limited to numpy data types.

Each data type supported by Zarr-Python is modeled by a subclass of ``ZDType``, which provides an API for the following operations:

- Wrapping / unwrapping a native data type
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata.
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata.
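
To make the three operations concrete, here is a toy stand-in that depends only on NumPy. ``ToyInt16`` and all of its method names are hypothetical. It does not subclass ``ZDType`` and is not the API added by this PR. It only illustrates how one object can answer the wrap/unwrap, data type encoding, and scalar encoding questions for both formats.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ToyInt16:
    """Illustrative stand-in for a data type wrapper; NOT the real ZDType."""

    endianness: str = "little"

    @classmethod
    def wrap(cls, dtype: np.dtype) -> ToyInt16:
        # Reject anything that is not a 16-bit signed integer.
        if dtype.kind != "i" or dtype.itemsize != 2:
            raise TypeError(f"expected a 16-bit signed integer dtype, got {dtype}")
        return cls(endianness="little" if dtype.str.startswith("<") else "big")

    def unwrap(self) -> np.dtype:
        return np.dtype("<i2" if self.endianness == "little" else ">i2")

    def dtype_to_json(self, zarr_format: int) -> str:
        if zarr_format == 2:
            return self.unwrap().str  # V2 keeps the byte order
        if zarr_format == 3:
            return "int16"  # V3 drops the byte order
        raise ValueError(f"unsupported zarr_format: {zarr_format}")

    def scalar_to_json(self, value: int) -> int:
        return int(value)

    def scalar_from_json(self, data: int) -> np.int16:
        return np.int16(data)


wrapped = ToyInt16.wrap(np.dtype(">i2"))
print(wrapped.dtype_to_json(zarr_format=2))  # >i2
print(wrapped.dtype_to_json(zarr_format=3))  # int16
```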


Example Usage
~~~~~~~~~~~~~

.. code-block:: python

import numpy as np

from zarr.core.dtype.wrapper import Int8

# Create a ZDType instance from a native dtype
int8 = Int8.from_dtype(np.dtype('int8'))

# Convert back to native dtype
native_dtype = int8.to_dtype()
assert native_dtype == np.dtype('int8')

# Get the default value
default_value = int8.default_value()
assert default_value == np.int8(0)

# Serialize to JSON
json_representation = int8.to_json(zarr_format=3)

# Serialize a scalar value
json_value = int8.to_json_value(42, zarr_format=3)
assert json_value == 42

# Deserialize a scalar value
scalar_value = int8.from_json_value(42, zarr_format=3)
assert scalar_value == np.int8(42)
Comment on lines +78 to +102
Contributor

This should be changed to a doctest like in other guides. Then the output would be shown/tested.


Custom Data Types
~~~~~~~~~~~~~~~~~

Users can define custom data types by subclassing ``ZDType`` and implementing the required methods.
Once defined, the custom data type can be registered with Zarr-Python to enable seamless integration with the library.

<TODO: example of defining a custom data type>
Contributor

Suggested change
<TODO: example of defining a custom data type>

I would get rid of this line, and open an issue to keep track of this.

Contributor

This is a super nice guide!

4 changes: 2 additions & 2 deletions docs/user-guide/groups.rst
Original file line number Diff line number Diff line change
@@ -128,7 +128,7 @@ property. E.g.::
>>> bar.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int64
Data type : Int64(endianness='little')
Shape : (1000000,)
Chunk shape : (100000,)
Order : C
@@ -144,7 +144,7 @@ property. E.g.::
>>> baz.info
Type : Array
Zarr format : 3
Data type : DataType.float32
Data type : Float32(endianness='little')
Shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@ User guide

installation
arrays
data_types
groups
attributes
storage
10 changes: 5 additions & 5 deletions docs/user-guide/performance.rst
Original file line number Diff line number Diff line change
@@ -52,7 +52,7 @@ a chunk shape is based on simple heuristics and may be far from optimal. E.g.::

>>> z4 = zarr.create_array(store={}, shape=(10000, 10000), chunks='auto', dtype='int32')
>>> z4.chunks
(625, 625)
(313, 625)
Contributor

The fact that automatic chunk determination has changed in some cases should be documented in a changelog entry.

Contributor Author

that would require figuring out why the automatic chunk determination changed, which I have not done yet


If you know you are always going to be loading the entire array into memory, you
can turn off chunks by providing ``chunks`` equal to ``shape``, in which case there
@@ -91,15 +91,15 @@ To use sharding, you need to specify the ``shards`` parameter when creating the
>>> z6.info
Type : Array
Zarr format : 3
Data type : DataType.uint8
Data type : UInt8()
Shape : (10000, 10000, 1000)
Shard shape : (1000, 1000, 1000)
Chunk shape : (100, 100, 100)
Order : C
Read-only : False
Store type : MemoryStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Serializer : BytesCodec(endian=None)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000000 (93.1G)

@@ -121,7 +121,7 @@ ratios, depending on the correlation structure within the data. E.g.::
>>> c.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -140,7 +140,7 @@ ratios, depending on the correlation structure within the data. E.g.::
>>> f.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : F
17 changes: 12 additions & 5 deletions src/zarr/abc/codec.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
from __future__ import annotations

from abc import abstractmethod
from typing import TYPE_CHECKING, Any, Generic, TypeVar
from typing import TYPE_CHECKING, Generic, TypeVar

from zarr.abc.metadata import Metadata
from zarr.core.buffer import Buffer, NDBuffer
@@ -12,11 +12,10 @@
from collections.abc import Awaitable, Callable, Iterable
from typing import Self

import numpy as np

from zarr.abc.store import ByteGetter, ByteSetter
from zarr.core.array_spec import ArraySpec
from zarr.core.chunk_grids import ChunkGrid
from zarr.core.dtype.wrapper import ZDType, _BaseDType, _BaseScalar
from zarr.core.indexing import SelectorTuple

__all__ = [
@@ -93,7 +92,13 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
"""
return self

def validate(self, *, shape: ChunkCoords, dtype: np.dtype[Any], chunk_grid: ChunkGrid) -> None:
def validate(
self,
*,
shape: ChunkCoords,
dtype: ZDType[_BaseDType, _BaseScalar],
Contributor

Given this is a type on a public part of the API, I think _BaseDtype and _BaseScalar should be made public and documented?

chunk_grid: ChunkGrid,
) -> None:
"""Validates that the codec configuration is compatible with the array metadata.
Raises errors when the codec configuration is not compatible.

@@ -285,7 +290,9 @@ def supports_partial_decode(self) -> bool: ...
def supports_partial_encode(self) -> bool: ...

@abstractmethod
def validate(self, *, shape: ChunkCoords, dtype: np.dtype[Any], chunk_grid: ChunkGrid) -> None:
def validate(
self, *, shape: ChunkCoords, dtype: ZDType[_BaseDType, _BaseScalar], chunk_grid: ChunkGrid
) -> None:
"""Validates that all codec configurations are compatible with the array metadata.
Raises errors when a codec configuration is not compatible.
