refactor v3 data types #2874

Open
wants to merge 80 commits into
base: main
Changes from 21 commits
Commits
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
8e6924d
Update changes/2874.feature.rst
d-v-b Mar 28, 2025
25b1527
Update docs/user-guide/data_types.rst
d-v-b Mar 28, 2025
0a5d14e
Update docs/user-guide/data_types.rst
d-v-b Mar 28, 2025
28 changes: 17 additions & 11 deletions src/zarr/api/asynchronous.py
@@ -9,7 +9,13 @@
import numpy.typing as npt
from typing_extensions import deprecated

from zarr.core.array import Array, AsyncArray, create_array, get_array_metadata
from zarr.core.array import (
Array,
AsyncArray,
_get_default_chunk_encoding_v2,
create_array,
get_array_metadata,
)
from zarr.core.array_spec import ArrayConfig, ArrayConfigLike, ArrayConfigParams
from zarr.core.buffer import NDArrayLike
from zarr.core.common import (
@@ -21,7 +27,6 @@
_default_zarr_format,
_warn_order_kwarg,
_warn_write_empty_chunks_kwarg,
parse_dtype,
)
from zarr.core.group import (
AsyncGroup,
@@ -30,7 +35,7 @@
create_hierarchy,
)
from zarr.core.metadata import ArrayMetadataDict, ArrayV2Metadata, ArrayV3Metadata
from zarr.core.metadata.v2 import _default_compressor, _default_filters
from zarr.core.metadata.dtype import get_data_type_from_numpy
from zarr.errors import NodeTypeValidationError
from zarr.storage._common import make_store_path

@@ -428,11 +433,12 @@ async def save_array(
shape = arr.shape
chunks = getattr(arr, "chunks", None) # for array-likes with chunks attribute
overwrite = kwargs.pop("overwrite", None) or _infer_overwrite(mode)
zarr_dtype = get_data_type_from_numpy(arr.dtype)
new = await AsyncArray._create(
store_path,
zarr_format=zarr_format,
shape=shape,
dtype=arr.dtype,
dtype=zarr_dtype,
chunks=chunks,
overwrite=overwrite,
**kwargs,
@@ -978,15 +984,15 @@ async def create(
_handle_zarr_version_or_format(zarr_version=zarr_version, zarr_format=zarr_format)
or _default_zarr_format()
)

dtype_wrapped = get_data_type_from_numpy(dtype)
if zarr_format == 2:
if chunks is None:
chunks = shape
dtype = parse_dtype(dtype, zarr_format)
if not filters:
filters = _default_filters(dtype)
if not compressor:
compressor = _default_compressor(dtype)
default_filters, default_compressor = _get_default_chunk_encoding_v2(dtype_wrapped)
if filters is None:
filters = default_filters
if compressor is None:
compressor = default_compressor
elif zarr_format == 3 and chunk_shape is None: # type: ignore[redundant-expr]
if chunks is not None:
chunk_shape = chunks
@@ -1051,7 +1057,7 @@ async def create(
store_path,
shape=shape,
chunks=chunks,
dtype=dtype,
dtype=dtype_wrapped,
compressor=compressor,
fill_value=fill_value,
overwrite=overwrite,
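
The `create` hunk above replaces the falsy checks (`if not filters`, `if not compressor`) with `is None` checks. A minimal sketch of the behavioural difference follows. It assumes a hypothetical helper pair and a placeholder default value, neither of which is part of zarr-python: with a falsy check an explicitly empty tuple is replaced by the defaults, while an `is None` check keeps it.

```python
from __future__ import annotations

# Placeholder standing in for whatever defaults the library would pick (assumption).
DEFAULT_FILTERS: tuple[str, ...] = ("hypothetical-default-filter",)


def resolve_filters_falsy(filters: tuple[str, ...] | None) -> tuple[str, ...]:
    # Old style: any falsy value, including an empty tuple, falls back to the defaults.
    if not filters:
        return DEFAULT_FILTERS
    return filters


def resolve_filters_is_none(filters: tuple[str, ...] | None) -> tuple[str, ...]:
    # New style: only an omitted value (None) falls back to the defaults.
    if filters is None:
        return DEFAULT_FILTERS
    return filters
```

Under this sketch, a caller who passes `()` to request no filters gets what they asked for only with the `is None` variant.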
6 changes: 3 additions & 3 deletions src/zarr/codecs/_v2.py
@@ -48,15 +48,15 @@ async def _decode_single(
# segfaults and other bad things happening
if chunk_spec.dtype != object:
try:
chunk = chunk.view(chunk_spec.dtype)
chunk = chunk.view(chunk_spec.dtype.unwrap())
except TypeError:
# this will happen if the dtype of the chunk
# does not match the dtype of the array spec e.g. if
# the dtype of the chunk_spec is a string dtype, but the chunk
# is an object array. In this case, we need to convert the object
# array to the correct dtype.

chunk = np.array(chunk).astype(chunk_spec.dtype)
chunk = np.array(chunk).astype(chunk_spec.dtype.unwrap())

elif chunk.dtype != object:
# If we end up here, someone must have hacked around with the filters.
@@ -80,7 +80,7 @@ async def _encode_single(
chunk = chunk_array.as_ndarray_like()

# ensure contiguous and correct order
chunk = chunk.astype(chunk_spec.dtype, order=chunk_spec.order, copy=False)
chunk = chunk.astype(chunk_spec.dtype.unwrap(), order=chunk_spec.order, copy=False)

# apply filters
if self.filters:
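
The `view` / `astype` fallback in `_decode_single` above can be reproduced with plain numpy. This is a sketch only: the array contents and the target dtype are made up for illustration, and the PR's wrapper class is not involved. It shows that reinterpreting an object array with `view` raises `TypeError`, after which an element-wise `astype` conversion succeeds.

```python
import numpy as np

# An object array holding Python strings (illustrative data).
chunk = np.array(["ab", "cd"], dtype=object)
target = np.dtype("<U2")

try:
    # Reinterpreting object references as fixed-width text is not allowed.
    converted = chunk.view(target)
    used_fallback = False
except TypeError:
    # Fall back to an element-wise conversion, which copies the data.
    converted = np.array(chunk).astype(target)
    used_fallback = True
```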
8 changes: 6 additions & 2 deletions src/zarr/codecs/blosc.py
@@ -139,11 +139,15 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
dtype = array_spec.dtype
Member

Calling `to_dtype` here would avoid having to call it twice below.

new_codec = self
if new_codec.typesize is None:
new_codec = replace(new_codec, typesize=dtype.itemsize)
new_codec = replace(new_codec, typesize=dtype.unwrap().itemsize)
if new_codec.shuffle is None:
new_codec = replace(
new_codec,
shuffle=(BloscShuffle.bitshuffle if dtype.itemsize == 1 else BloscShuffle.shuffle),
shuffle=(
BloscShuffle.bitshuffle
if dtype.unwrap().itemsize == 1
else BloscShuffle.shuffle
),
)

return new_codec
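
The review suggestion in this hunk (convert the wrapped dtype once and reuse the result) can be sketched as follows. `WrappedDType` and `pick_blosc_params` are hypothetical stand-ins written for this illustration; only the `unwrap()` method name is taken from the diff, and its real behaviour may differ.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class WrappedDType:
    """Hypothetical stand-in for the PR's dtype wrapper; only `unwrap` is modelled."""

    name: str

    def unwrap(self) -> np.dtype:
        # Build the native numpy dtype from the stored name.
        return np.dtype(self.name)


def pick_blosc_params(dtype: WrappedDType) -> tuple[int, str]:
    # Unwrap once and reuse the native dtype for both decisions.
    native = dtype.unwrap()
    typesize = native.itemsize
    shuffle = "bitshuffle" if native.itemsize == 1 else "shuffle"
    return typesize, shuffle
```

The same selection rule as in the diff applies: single-byte items get bit shuffling, everything else gets byte shuffling.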
12 changes: 3 additions & 9 deletions src/zarr/codecs/bytes.py
@@ -56,7 +56,7 @@ def to_dict(self) -> dict[str, JSON]:
return {"name": "bytes", "configuration": {"endian": self.endian.value}}

def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
if array_spec.dtype.itemsize == 0:
if array_spec.dtype.unwrap().itemsize == 1:
if self.endian is not None:
return replace(self, endian=None)
elif self.endian is None:
@@ -71,14 +71,8 @@ async def _decode_single(
chunk_spec: ArraySpec,
) -> NDBuffer:
assert isinstance(chunk_bytes, Buffer)
if chunk_spec.dtype.itemsize > 0:
if self.endian == Endian.little:
prefix = "<"
else:
prefix = ">"
dtype = np.dtype(f"{prefix}{chunk_spec.dtype.str[1:]}")
else:
dtype = np.dtype(f"|{chunk_spec.dtype.str[1:]}")

dtype = chunk_spec.dtype.with_endianness(self.endian).unwrap()

as_array_like = chunk_bytes.as_array_like()
if isinstance(as_array_like, NDArrayLike):
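
The removed lines in `_decode_single` built the decoding dtype by rewriting the byte-order character of `dtype.str`. As a rough illustration of what a method like `with_endianness` might do, and without claiming this is how the PR implements it, the sketch below compares that string approach with numpy's `dtype.newbyteorder`. Both helper names are hypothetical, and the `itemsize > 0` special case from the old code is left out.

```python
import numpy as np


def with_prefix(dtype: np.dtype, endian: str) -> np.dtype:
    # Mirror of the removed logic: swap the byte-order character in dtype.str.
    prefix = "<" if endian == "little" else ">"
    return np.dtype(f"{prefix}{dtype.str[1:]}")


def with_newbyteorder(dtype: np.dtype, endian: str) -> np.dtype:
    # Same intent through numpy's own API instead of string manipulation.
    return dtype.newbyteorder("<" if endian == "little" else ">")
```

For single-byte types such as `uint8`, numpy treats byte order as not applicable, so both helpers return a dtype equal to the input.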
17 changes: 12 additions & 5 deletions src/zarr/codecs/sharding.py
@@ -50,6 +50,7 @@
get_indexer,
morton_order_iter,
)
from zarr.core.metadata.dtype import DTypeWrapper
from zarr.core.metadata.v3 import parse_codecs
from zarr.registry import get_ndbuffer_class, get_pipeline_class, register_codec

@@ -355,9 +356,10 @@ def __init__(
object.__setattr__(self, "index_location", index_location_parsed)

# Use instance-local lru_cache to avoid memory leaks
object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))
# TODO: fix these when we don't get hashability errors for certain numpy dtypes
Contributor Author

Note to fix this. I think the LRU cache was attempting to hash a non-hashable numpy dtype, which caused errors that were very hard to debug.

Contributor

Would be good to open an issue and link to it in this comment.

Contributor

Is this something that needs fixing before this PR is merged?

# object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
# object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
# object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))

# todo: typedict return type
def __getstate__(self) -> dict[str, Any]:
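
The instance-local `lru_cache` pattern that this hunk comments out, and the hashability problem raised in the review thread, can be illustrated in isolation. `ShardLike` and its method are hypothetical stand-ins, and a plain list is used as the unhashable argument; which numpy dtypes triggered the original errors is not stated in the thread.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ShardLike:
    """Hypothetical stand-in for a frozen codec class."""

    chunk_shape: tuple[int, ...]

    def __post_init__(self) -> None:
        # Store the cached wrapper on the instance rather than on the class.
        object.__setattr__(self, "_chunks_per_shard", lru_cache()(self._chunks_per_shard))

    def _chunks_per_shard(self, shard_shape):
        # Element-wise division of the shard shape by the chunk shape.
        return tuple(s // c for s, c in zip(shard_shape, self.chunk_shape))


codec = ShardLike(chunk_shape=(2, 2))
codec._chunks_per_shard((4, 8))  # computed on the first call
codec._chunks_per_shard((4, 8))  # served from the cache

try:
    # lru_cache hashes its arguments, so an unhashable one is rejected.
    codec._chunks_per_shard([4, 8])
    unhashable_rejected = False
except TypeError:
    unhashable_rejected = True
```

Because the failure surfaces as a `TypeError` from inside the cache wrapper rather than from the wrapped method, it can be confusing to trace, which is consistent with the debugging difficulty described in the thread.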
@@ -402,7 +404,9 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
return replace(self, codecs=evolved_codecs)
return self

def validate(self, *, shape: ChunkCoords, dtype: np.dtype[Any], chunk_grid: ChunkGrid) -> None:
def validate(
self, *, shape: ChunkCoords, dtype: DTypeWrapper[Any, Any], chunk_grid: ChunkGrid
) -> None:
if len(self.chunk_shape) != len(shape):
raise ValueError(
"The shard's `chunk_shape` and array's `shape` need to have the same number of dimensions."
@@ -483,7 +487,10 @@ async def _decode_partial_single(

# setup output array
out = shard_spec.prototype.nd_buffer.create(
shape=indexer.shape, dtype=shard_spec.dtype, order=shard_spec.order, fill_value=0
shape=indexer.shape,
dtype=shard_spec.dtype.unwrap(),
order=shard_spec.order,
fill_value=0,
)

indexed_chunks = list(indexer)
6 changes: 4 additions & 2 deletions src/zarr/core/_info.py
@@ -7,7 +7,9 @@

from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec
from zarr.core.common import ZarrFormat
from zarr.core.metadata.v3 import DataType
from zarr.core.metadata.dtype import DTypeWrapper

# from zarr.core.metadata.v3 import DataType


@dataclasses.dataclass(kw_only=True)
@@ -78,7 +80,7 @@ class ArrayInfo:

_type: Literal["Array"] = "Array"
_zarr_format: ZarrFormat
_data_type: np.dtype[Any] | DataType
_data_type: np.dtype[Any] | DTypeWrapper
_shape: tuple[int, ...]
_shard_shape: tuple[int, ...] | None = None
_chunk_shape: tuple[int, ...] | None = None