refactor v3 data types #2874

Open · wants to merge 80 commits into base: main
Changes from 54 commits (80 commits total)
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
8e6924d
Update changes/2874.feature.rst
d-v-b Mar 28, 2025
25b1527
Update docs/user-guide/data_types.rst
d-v-b Mar 28, 2025
0a5d14e
Update docs/user-guide/data_types.rst
d-v-b Mar 28, 2025
12 changes: 6 additions & 6 deletions docs/user-guide/arrays.rst
@@ -182,7 +182,7 @@ which can be used to print useful diagnostics, e.g.::
>>> z.info
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -199,7 +199,7 @@ prints additional diagnostics, e.g.::
>>> z.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -286,7 +286,7 @@ Here is an example using a delta filter with the Blosc compressor::
>>> z.info
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -600,18 +600,18 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
>>> a.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.uint8
Data type : uint8
Shape : (10000, 10000)
Shard shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
Read-only : False
Store type : LocalStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Serializer : BytesCodec(endian=None)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000 (95.4M)
No. bytes stored : 3981552
No. bytes stored : 3981473
Contributor: Do you know why the number of bytes has changed here? Does that mean the data/bytes being stored has changed somehow?

Contributor Author (d-v-b): I think it's because the bytes codec no longer specifies endianness, so the JSON document is slightly smaller, but I haven't confirmed this.

Storage ratio : 25.1
Shards Initialized : 100

6 changes: 3 additions & 3 deletions docs/user-guide/consolidated_metadata.rst
@@ -47,7 +47,7 @@ that can be used.:
>>> from pprint import pprint
>>> pprint(dict(sorted(consolidated_metadata.items())))
{'a': ArrayV3Metadata(shape=(1,),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -60,7 +60,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'b': ArrayV3Metadata(shape=(2, 2),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -73,7 +73,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'c': ArrayV3Metadata(shape=(3, 3, 3),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(3, 3, 3)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
189 changes: 189 additions & 0 deletions docs/user-guide/data_types.rst
@@ -0,0 +1,189 @@
Data types
==========

Member: This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?

Contributor Author (d-v-b): No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
Zarr-Python supports creating arrays with Numpy data types::

>>> import zarr
>>> import numpy as np
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
>>> z
<Array memory:... shape=(10,) dtype=uint8>

Contributor (on "Zarr is heavily influenced"): Do you mean the data format, or Zarr-Python here? Would be good to clarify.

Contributor Author (d-v-b): both are true

Unlike Numpy arrays, Zarr arrays are designed to be persisted to storage and read by Zarr implementations in different programming languages.
This means Zarr data types must be interpreted correctly when clients read an array. So each Zarr data type defines a procedure for
encoding / decoding that data type to / from Zarr array metadata, and also encoding / decoding **instances** of that data type to / from
array metadata. These serialization procedures depend on the Zarr format.

Data types in Zarr version 2
-----------------------------

Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well.
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype:

>>> import zarr
>>> import numpy as np
>>> import json
>>> store = {}
>>> np_dtype = np.dtype('int64')
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
>>> assert dtype_meta == np_dtype.str # True
>>> dtype_meta
'<i8'

Contributor (suggested change): insert a blank ``>>>`` line before ``>>> store = {}``, to break it up a bit?

.. note::
The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information.

In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also
defines a metadata representation for scalars associated with that data type (used, for example, to store fill values in array metadata). Integers are stored as ``JSON`` numbers,
as are floats, with the caveat that `NaN`, positive infinity, and negative infinity are stored as special strings.

Contributor (suggested change): add "(that can be used e.g., for storing fill values in the metadata)" -- I scratched my head for a bit wondering why a scalar representation was needed, before realising (I think I'm right?). I'm not sure my suggestion is very well written, but something similar to explain why might be nice?

Contributor Author (d-v-b): the scalar representation is only used for the fill value metadata, so I will say as much in the docs
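
For example, here is a sketch of how a ``NaN`` fill value for a float array is stored in Zarr V2 metadata, reusing the imports from the example above (the exact output shown is illustrative)::

>>> store = {}
>>> z = zarr.create_array(store=store, shape=(1,), dtype='float64', fill_value=np.nan, zarr_format=2)
>>> json.loads(store['.zarray'].to_bytes())["fill_value"]
'NaN'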

Data types in Zarr version 3
----------------------------
(note: placeholder text)
* Data type names are different -- Zarr V2 represented the big-endian 16-bit signed integer data type as ``>i2``; Zarr V3 represents the same data type as ``int16``.
* No endianness
* A data type can be encoded in metadata as a string or a ``JSON`` object with the structure ``{"name": <string identifier>, "configuration": {...}}``

Contributor: Is there a document somewhere that explains this decision and/or how endianness should be handled in zarr v3? If so, it should be linked here; if not, perhaps a paragraph or two are warranted in this doc?

Contributor (edit): I see it's dealt with below. It would have been good to have something like "* No endianness; endianness is instead defined as part of the codec pipeline (see below)."

Contributor Author (d-v-b): the v3 section is very placeholder right now. eventually it will get a proper prose form that explains the endianness thing
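
To make the string-encoding bullet above concrete, here is a sketch of reading the V3 metadata name for a simple fixed-size data type, following the same in-memory store pattern as the V2 example (output illustrative)::

>>> store = {}
>>> z = zarr.create_array(store=store, shape=(1,), dtype='int16', zarr_format=3)
>>> json.loads(store['zarr.json'].to_bytes())["data_type"]
'int16'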

Data types in Zarr-Python
-------------------------

Zarr-Python supports two different Zarr formats, and those two formats specify data types in rather different ways:
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version 3 are encoded as either strings or ``JSON`` objects.
Moreover, Zarr V3 data types don't carry any endianness information, unlike Zarr V2 data types.

We aspire for Zarr-Python to eventually be array-library-agnostic.
In the context of data types, this means that we should not design an API that overfits to Numpy's data types.
We will use the term "native data type" to refer to a data type used by any external array library (including Numpy), e.g. ``np.dtypes.Float64DType()``.
We will also use the term "native scalar" or "native scalar type" to refer to a scalar value of a native data type. For example, ``np.float64(0)`` generates a scalar with the data type ``np.dtypes.Float64DType``.

Member: I think it would be useful to define "native data types"

Contributor Author (d-v-b): I added a definition in 4f3381f, let me know what you think

Zarr-Python needs to support the following operations on native data types:

* Round-trip native data types to fields in array metadata documents.
For example, the Numpy data type ``np.dtype('>i2')`` should be saved as ``{..., "dtype" : ">i2"}`` in Zarr V2 metadata.

In Zarr V3 metadata, the same Numpy data type would be saved as ``{..., "data_type": "int16", "codecs": [..., {"name": "bytes", "configuration": {"endian": "big"}}, ...]}``

* Associate a default fill value with a native data type. This is not mandated by the Zarr specifications, but it's convenient for users
to have a useful default. For numeric types like integers and floats the default can be statically set to 0, but for
parametric data types like fixed-length strings the default can only be generated after the data type has been parametrized at runtime.

* Round-trip native scalars to the ``fill_value`` field in Zarr V2 and V3 array metadata documents. The Zarr V2 and V3 specifications
define how scalars of each data type should be stored as JSON in array metadata documents, and in principle each data type
can define this encoding separately.

* Do all of the above for *user-defined data types*. Zarr-Python should support data types added as extensions, so we cannot
hard-code the list of data types. We need to ensure that users can easily (or easily enough) define a Python object
that models their custom data type and register this object with Zarr-Python, so that the above operations all succeed for their
custom data type.

To achieve these goals, Zarr-Python uses a class called :class:`zarr.core.dtype.DTypeWrapper` to wrap native data types. Each data type
supported by Zarr-Python is modeled by a subclass of ``DTypeWrapper``, which has the following structure:

Member: If developers are to subclass the DtypeWrapper class, perhaps we drop the Wrapper and just call it a Dtype? Or DtypeABC?

Contributor: Other alternative names: ZarrDType, CanonicalDType, AbstractDType, LocalDType, UniversalDType, HarmonizedDType, DTypeSpec, CrossLibraryDType. I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.

Contributor Author (d-v-b): The DTypeWrapper class is wrapping / abstracting over / managing creation of a dtype used by the library responsible for creating the in-memory arrays used by zarr-python for reading and writing data. I don't think any of your suggested names capture this behavior. Maybe it's better to avoid attempting to convey the behavior of the class. I like ZarrDtype or ZDtype or DTypeABC. And I think we can ask people who choose to dig into our data type API to tolerate some "computery terms" :)

Contributor Author (d-v-b): as of a2da99a I'm going with ZDType, how does that work for yall?

Contributor: I am against unnecessary abbreviations FWIW - ZarrDtype is my favorite

(attribute) ``dtype_cls``
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``dtype_cls`` attribute is a **class variable** that is bound to a class that can produce
an instance of a native data type. For example, on the ``DTypeWrapper`` used to model the boolean
data type, the ``dtype_cls`` attribute is bound to the numpy bool data type class: ``np.dtypes.BoolDType``.
This attribute is used when we need to create an instance of the native data type, for example when
defining a Numpy array that will contain Zarr data.

It might seem odd that ``DTypeWrapper.dtype_cls`` binds to a *class* that produces a native data type instead of an instance of that native data type --
why not have a ``DTypeWrapper.dtype`` attribute that binds to ``np.dtypes.BoolDType()``? The reason ``DTypeWrapper``
doesn't wrap a concrete data type instance is that data type instances may carry endianness information, but Zarr V3
data types do not. To model Zarr V3 data types, we need endianness to be an **instance variable** which is
defined when creating an instance of the ``DTypeWrapper``. Subclasses of ``DTypeWrapper`` that model data types with
byte order semantics thus have ``endianness`` as an instance variable, and this value can be set when creating an instance of the wrapper.
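
A minimal sketch of this structure, with hypothetical names (``Int16`` here is not the exact class definition used in Zarr-Python)::

    from dataclasses import dataclass
    from typing import ClassVar, Literal

    import numpy as np

    from zarr.core.dtype import DTypeWrapper  # the base class described in this document

    @dataclass(frozen=True)
    class Int16(DTypeWrapper):  # hypothetical subclass
        # class variable bound to the *class* of the native data type, not an instance
        dtype_cls: ClassVar[type[np.dtypes.Int16DType]] = np.dtypes.Int16DType
        _zarr_v3_name: ClassVar[str] = "int16"
        # instance variable: byte order belongs to the wrapper instance, because
        # Zarr V3 data types carry no endianness of their own
        endianness: Literal["little", "big"] | None = "little"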


(attribute) ``_zarr_v3_name``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``_zarr_v3_name`` attribute encodes the canonical name for a data type for Zarr V3. For many data types these names
are defined in the `Zarr V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types>`_. For nearly all of the
data types defined in Zarr V3, this name can be used to uniquely specify a data type. The one exception is the ``r*`` data type,
which is parametrized by a number of bits, and so may take the form ``r8``, ``r16``, ... etc.

Member: what is the thought behind making this a private attribute? If it is required to be implemented, should we make it public?

Contributor Author (d-v-b): the reason it's private is because there's a public method for getting the name of a dtype instance (``get_name``), which takes a ``zarr_format`` parameter. ``_zarr_v3_name`` is the name of the class, but at least in the case of the wonky ``r*`` dtype, the name of the class will never be the name of an actual dtype instance. ``r*`` is the name of the class, but ``r8``, ``r16``, etc. would be the names of the data type instances. I would love to remove support for the ``r*`` dtype, but even if we did, zarr v2 dtypes like ``U4`` would still require us to compute the name based on instance attributes.

(class method) ``from_dtype(cls, dtype) -> Self``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method defines a procedure for safely converting a native dtype instance into an instance of ``DTypeWrapper``. It should perform
validation of its input to ensure that the native dtype is an instance of the ``dtype_cls`` class attribute. For some
data types, additional checks are needed -- for example, in Numpy, "structured" data types and "void" data types use the same class, with different properties.
A ``DTypeWrapper`` that wraps Numpy structured data types must do additional checks to ensure that the input ``dtype`` is actually a structured data type.
If input validation succeeds, this method will call ``_from_dtype_unsafe``.
Contributor: Why even have an unsafe version? Can the check ever be expensive?

Contributor Author (d-v-b): these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.

Contributor Author (d-v-b): here's an example for int32:

    @classmethod
    def from_dtype(cls: type[Self], dtype: _BaseDType) -> Self:
        # We override the base implementation to address a windows-specific, pre-numpy 2 issue where
        # ``np.dtype('i')`` is an instance of ``np.dtypes.IntDType`` that acts like `int32` instead of ``np.dtype('int32')``
        # In this case, ``type(np.dtype('i')) == np.dtypes.Int32DType``  will evaluate to ``True``,
        # despite the two classes being different. Thus we will create an instance of `cls` with the
        # latter dtype, after pulling in the byte order of the input
        if dtype == np.dtypes.Int32DType():
            return cls._from_dtype_unsafe(np.dtypes.Int32DType().newbyteorder(dtype.byteorder))
        else:
            return super().from_dtype(dtype)

    @classmethod
    def _from_dtype_unsafe(cls, dtype: _BaseDType) -> Self:
        byte_order = cast("EndiannessNumpy", dtype.byteorder)
        return cls(endianness=endianness_from_numpy_str(byte_order))

from_dtype has to do some platform-specific input validation to ensure that the dtype instance is actually correct, and _from_dtype_unsafe just creates an instance of the data type

Contributor (quoting the above): I guess my question was more along the lines of why provide the option of doing this if the check is cheap?

Contributor Author (d-v-b): the two operations are logically separable. so my default approach is to separate them. this allows us to write subclasses that only override the input validation step without needing to also override the object creation step.


(method) ``to_dtype(self) -> dtype``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method produces a native data type consistent with the properties of the ``DTypeWrapper``. Together
with ``from_dtype``, this method allows round-trip conversion of a native data type into a wrapper class and then out again.

That is, for some ``DTypeWrapper`` class ``FooWrapper`` that wraps a native data type called ``foo``, ``FooWrapper.from_dtype(instance_of_foo).to_dtype() == instance_of_foo`` should be true.
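
Using the hypothetical ``Int16`` wrapper sketched earlier, this round-trip property would look like (illustrative)::

>>> wrapped = Int16.from_dtype(np.dtype('>i2'))
>>> wrapped.to_dtype() == np.dtype('>i2')
True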

(method) ``to_dict(self) -> dict``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method generates a JSON-serializable representation of the wrapped data type which can be stored in
Zarr metadata.

(method) ``cast_value(self, value: object) -> scalar``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method converts a Python object to an instance of the wrapped data type. It is used for generating the default
value associated with this data type.


(method) ``default_value(self) -> scalar``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method returns the default value for the wrapped data type. Zarr-Python uses this method to generate a default fill value
for an array when a user has not requested one.

Why is this a method and not a static attribute? Although some data types
can have a static default value, parametrized data types like fixed-length strings or structured data types cannot. For these data types,
a default value must be calculated based on the attributes of the wrapped data type.
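
For instance, a parametric fixed-length string wrapper might compute its default from its runtime parameters; a sketch with hypothetical names::

    @dataclass(frozen=True)
    class FixedLengthUnicode(DTypeWrapper):  # hypothetical
        length: int  # set when the data type is parametrized at runtime

        def default_value(self) -> np.str_:
            # the default cannot be a static class attribute because valid
            # scalars depend on ``length``; an empty string is valid for any length
            return np.str_("")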

(class method) ``check_dtype(cls, dtype) -> bool``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This class method checks if a native dtype is compatible with the ``DTypeWrapper`` class. It returns ``True``
if ``dtype`` is compatible with the wrapper class, and ``False`` otherwise. For many data types, this check is as simple
as checking that ``cls.dtype_cls`` matches ``type(dtype)``, i.e. checking that the data type class wrapped
by the ``DTypeWrapper`` is the same as the class of ``dtype``. But there are some data types where this check alone is not sufficient,
in which case this method is overridden so that additional properties of ``dtype`` can be inspected and compared with
the expectations of ``cls``.
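
A sketch of such an override, defined on a hypothetical wrapper around Numpy structured data types (structured and plain "void" dtypes share the same Numpy class)::

    @classmethod
    def check_dtype(cls, dtype: np.dtype) -> bool:
        # the class check alone cannot distinguish structured dtypes from
        # plain void dtypes, so additionally require named fields
        return super().check_dtype(dtype) and dtype.fields is not None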

(class method) ``from_dict(cls, dtype) -> Self``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This class method creates a ``DTypeWrapper`` from an appropriately structured dictionary. The default
implementation first checks that the dictionary has the correct structure, and then uses its data
to instantiate the ``DTypeWrapper`` instance.

(method) ``to_dict(self) -> dict[str, JSON]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Returns a dictionary form of the wrapped data type. This is used prior to writing array metadata.
Contributor: Listed twice


(method) ``get_name(self, zarr_format: Literal[2, 3]) -> str``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method generates a name for the wrapped data type, depending on the Zarr format. If ``zarr_format`` is
2 and the wrapped data type is a Numpy data type, then the Numpy string representation of that data type is returned.
If ``zarr_format`` is 3, then the Zarr V3 name for the wrapped data type is returned. For most data types
the Zarr V3 name will be stored as the ``_zarr_v3_name`` class attribute, but for parametric data types the
name must be computed at runtime based on the parameters of the data type.
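
A sketch for a parametric type, again using the hypothetical ``FixedLengthUnicode`` wrapper (the V3 naming scheme shown is invented for illustration)::

    def get_name(self, zarr_format: Literal[2, 3]) -> str:
        if zarr_format == 2:
            # the Numpy string representation, e.g. "<U4" for a
            # little-endian length-4 unicode dtype
            return self.to_dtype().str
        # for Zarr V3 the name must be computed from instance parameters;
        # this exact scheme is hypothetical
        return f"{self._zarr_v3_name}{self.length}"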


(method) ``to_json_value(self, data: scalar, zarr_format: Literal[2, 3]) -> JSON``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method converts a scalar instance of the data type into a JSON-serializable value.
For some data types like bool and integers this conversion is simple -- just return a JSON boolean
or number -- but other data types define a JSON serialization for scalars that is a bit more involved,
and this JSON serialization depends on the Zarr format.
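
For example, a float wrapper might serialize scalars along these lines (a sketch; the special-string handling mirrors the rules described in the V2 section above)::

    import math

    def to_json_value(self, data: float, zarr_format: Literal[2, 3]) -> JSON:
        # NaN and the infinities have no JSON number representation,
        # so they are stored as special strings
        if math.isnan(data):
            return "NaN"
        if math.isinf(data):
            return "Infinity" if data > 0 else "-Infinity"
        return float(data)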

(method) ``from_json_value(self, data: JSON, zarr_format: Literal[2, 3]) -> scalar``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert a JSON-serialized scalar to a native scalar. This inverts the operation of ``to_json_value``.
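
The float case again, inverting the sketch above (hypothetical)::

    def from_json_value(self, data: JSON, zarr_format: Literal[2, 3]) -> np.float64:
        # ``float`` understands the special strings "NaN", "Infinity", and
        # "-Infinity", so special cases and plain numbers share one path
        return np.float64(float(data))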

Using a custom data type
------------------------

TODO
4 changes: 2 additions & 2 deletions docs/user-guide/groups.rst
@@ -128,7 +128,7 @@ property. E.g.::
>>> bar.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int64
Data type : int64
Shape : (1000000,)
Chunk shape : (100000,)
Order : C
@@ -144,7 +144,7 @@ property. E.g.::
>>> baz.info
Type : Array
Zarr format : 3
Data type : DataType.float32
Data type : float32
Shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -8,6 +8,7 @@ User guide

installation
arrays
data_types
groups
attributes
storage
10 changes: 5 additions & 5 deletions docs/user-guide/performance.rst
@@ -52,7 +52,7 @@ a chunk shape is based on simple heuristics and may be far from optimal. E.g.::

>>> z4 = zarr.create_array(store={}, shape=(10000, 10000), chunks='auto', dtype='int32')
>>> z4.chunks
(625, 625)
(313, 625)
Contributor: The fact that automatic chunk determination has changed in some cases should be documented in a changelog entry.

Contributor Author (d-v-b): that would require figuring out why the automatic chunk determination changed, which I have not done yet


If you know you are always going to be loading the entire array into memory, you
can turn off chunks by providing ``chunks`` equal to ``shape``, in which case there
@@ -91,15 +91,15 @@ To use sharding, you need to specify the ``shards`` parameter when creating the
>>> z6.info
Type : Array
Zarr format : 3
Data type : DataType.uint8
Data type : uint8
Shape : (10000, 10000, 1000)
Shard shape : (1000, 1000, 1000)
Chunk shape : (100, 100, 100)
Order : C
Read-only : False
Store type : MemoryStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Serializer : BytesCodec(endian=None)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000000 (93.1G)

@@ -121,7 +121,7 @@ ratios, depending on the correlation structure within the data. E.g.::
>>> c.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -140,7 +140,7 @@ ratios, depending on the correlation structure within the data. E.g.::
>>> f.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : F