refactor v3 data types #2874
Conversation
Copying a comment @nenb made in this zulip discussion:

> A feature of the character code is that it provides a way to distinguish parametric types like …
src/zarr/core/metadata/dtype.py
```python
name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]]  # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
```
Child classes define a string `name` (which feeds into the zarr metadata), a dtype class `dtype_cls` (which gets assigned automatically from the generic type parameter), a string `kind` (we use this for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that matches dtype to default value, but we could potentially do that).
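For illustration, a minimal sketch of what one such child class might look like, assuming the attribute names from the snippet above (the concrete class and values here are hypothetical, not the ones in this PR):

```python
import numpy as np

# Hypothetical child class; DTypeWrapper is the base class from src/zarr/core/metadata/dtype.py.
class BoolWrapper(DTypeWrapper[np.dtypes.BoolDType, np.bool_]):
    name = "bool"              # string name that feeds into the zarr metadata
    kind = "boolean"           # internal classification of scalars (value is an assumption)
    default_value = np.False_  # default fill value for this data type
    # dtype_cls is assigned automatically from the generic type parameter
```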
src/zarr/core/metadata/dtype.py
```python
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
    kind = "numeric"

    @classmethod
    def from_dtype(cls, dtype: TDType) -> Self:
        return cls()

    def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
        return int(data)

    def from_json_value(
        self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
    ) -> TScalar:
        if check_json_int(data):
            return self.to_dtype(endianness=endianness).type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")
```
I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
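For example, under that scheme a concrete signed 16-bit integer wrapper might be little more than the following (a sketch, not the PR's exact code):

```python
import numpy as np

# Hypothetical subclass; IntWrapperBase is the shared base shown in the snippet above.
class Int16Wrapper(IntWrapperBase[np.dtypes.Int16DType, np.int16]):
    name = "int16"
    default_value = np.int16(0)
```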
```python
object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))
# TODO: fix these when we don't get hashability errors for certain numpy dtypes
```
Note to fix this. I think the LRU cache was attempting to hash a non-hashable numpy dtype, and this caused very hard-to-debug errors.
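For context, the failure mode is just standard `functools.lru_cache` behaviour when an argument is unhashable; a minimal, zarr-independent illustration:

```python
from functools import lru_cache

@lru_cache
def chunk_spec_for(key):
    return key

chunk_spec_for(("shape", (10, 10)))  # fine: tuples of hashables are hashable
chunk_spec_for({"shape": (10, 10)})  # TypeError: unhashable type: 'dict'
# The TypeError surfaces at call time, far from where the unhashable object was
# constructed, which is what makes this kind of bug hard to trace back to its source.
```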
Would be good to open an issue and link to it in this comment.
Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big- or little-endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make …
Thanks for the summary! I have implemented the proposed solution.
@d-v-b Could you point out exactly where the types were causing an issue? I noticed quite a few in …

We chatted on Zulip about the config. The TL;DR for folks not on Zulip was that we concluded that you would try not to touch the Config module in this PR as much as possible (and this would hopefully mean that your issues above are no longer relevant).

I also had a (non-blocking) question on the performance implications of creating these new dtypes from the zarr metadata. The lookup logic in …
```python
class ZDType(Generic[D, S]):
    pass

x: ZDType[AllDtypes, AllScalars]
x = ZDType[IntDtype, IntScalar]  # if D and S are not covariant, mypy complains here
```

The two type parameters of …
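A sketch of the covariant declaration being described; the bounds shown here are assumptions, not the PR's exact definitions:

```python
from typing import Any, Generic, TypeVar

import numpy as np

# Covariant type parameters: a ZDType specialised to a narrower dtype/scalar pair is
# accepted wherever a ZDType of a broader pair is expected.
D = TypeVar("D", bound=np.dtype[Any], covariant=True)
S = TypeVar("S", bound=np.generic, covariant=True)

class ZDType(Generic[D, S]):
    pass
```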
The lookup logic you see in …
A few notable changes added yesterday: …
I have added some dtype round-trip tests that are currently failing at an interesting place -- variable-length strings. I'm hoping to get this sorted out today.
I have a question about numpy object dtype arrays. In zarr-python 2.x, numpy object dtype arrays could contain variable-length strings (object dtype + vlen utf8 codec), variable-length arrays (object dtype + vlen array codec), or arbitrary python objects. Do we want to support variable-length arrays in zarr-python 3? An equivalent question is "do we want to always map numpy object arrays to variable-length strings?"

Right now, there's a 1:1 mapping between numpy data types and zarr data types. Allowing variable-length arrays will break this relationship, because both variable-length strings and variable-length arrays will potentially use the object dtype (depending on the numpy version). In terms of this PR, to support vlen arrays, I would need to change the function signatures of the data type resolution methods to take additional information about chunk encoding. But we can also do that at any time down the road.
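For reference, the zarr-python 2.x patterns being described looked roughly like this (a sketch based on the 2.x tutorial, not code from this PR):

```python
import numcodecs
import numpy as np
import zarr  # zarr-python 2.x API

# variable-length strings: object dtype + VLenUTF8 object codec
strings = zarr.empty(3, dtype=object, object_codec=numcodecs.VLenUTF8())
strings[:] = np.array(["a", "bb", "ccc"], dtype=object)

# variable-length (ragged) arrays: the same object dtype, but a VLenArray object codec
ragged = zarr.empty(3, dtype=object, object_codec=numcodecs.VLenArray(int))
ragged[0] = np.arange(3)
```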
The in-memory endianness and the on-disk endianness seem like separate concerns, why impose this restriction? Also, have you considered that codecs like …
I just edited that comment to indicate that now the user sees a warning instead of an exception.
I have considered it but there's not really anything we can do about this in …
Maybe I'm misunderstanding the issue, but my operating assumption has been that when people create an array with a big-endian data type, they expect to open that array, index it, and get arrays with the same big-endian data type they requested. I don't think we have anywhere to store the in-memory endianness in zarr-python, so the only place left is on-disk, via the codec configuration for v3 data or the dtype for v2 data.
Correction: @normanrz showed me in the "#Zarr > extensible dtypes" zulip topic that the array-array codecs do implement methods that allow resolution of their dtype transformation, so I will try to use this information properly in this PR.
I think this is ready for final reviews. Some notable recent changes: …
I don't think the names of the new-to-zarr-v3 data types are settled. It might make sense to have separate PRs for each data type (e.g., numpy fixed-length null-terminated bytes, fixed-length non-null-terminated bytes, datetime64, fixed-length unicode, variable-length string, etc.).
I haven't had time to review all of this (especially the actual implementation), but I've left plenty of comments/requests that I think are worthwhile anyway.
Two points not covered in my inline comments:
- Are there any breaking changes in the config? If so we need to work out how to deal with them so they're nicely deprecated before the old values stop working.
- `ZDType` is imported in several places from the non-public part of the API (`zarr.core`). This should be changed so it's imported from the public part of the API. Even better, the definition should just move out of `zarr.core` to `zarr.dtype` to avoid errors like this.
changes/2874.feature.rst
```
@@ -0,0 +1,2 @@
Adds zarr-specific data type classes. This replaces the direct use of numpy data types for zarr
v2 and a fixed set of string enums for zarr v3. For more on this new feature, see the `documentation <https://zarr.readthedocs.io/en/stable/user-guide/data_types.html>`_
```
Instead of hard linking, this should use a sphinx anchor so it always references the current built docs (I found this when looking at the PR doc build, where this link doesn't resolve properly)
```diff
 Compressors : (ZstdCodec(level=0, checksum=False),)
 No. bytes : 100000000 (95.4M)
-No. bytes stored : 3981552
+No. bytes stored : 3981473
```
Do you know why the number of bytes has changed here? Does that mean the data/bytes being stored has changed somehow?
I think it's because the bytes codec no longer specifies endianness, so the JSON document is slightly smaller, but I haven't confirmed this.
```
@@ -43,39 +43,30 @@ This is the current default configuration::

>>> zarr.config.pprint()
{'array': {'order': 'C',
           'v2_default_compressor': {'bytes': {'checksum': False,
```
If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?
No, the config in this PR has undergone breaking changes compared to `main`. We could make those changes backwards-compatible and add deprecation warnings for deprecated keys, but this will require some effort.
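If backwards compatibility is attempted, one possible shape for it is a small key-migration shim (purely illustrative; the actual old and new config keys are not shown in this thread):

```python
import warnings

def _migrate_config_key(config: dict, old_key: str, new_key: str) -> dict:
    """Map a deprecated config key to its replacement, warning the user."""
    if old_key in config:
        warnings.warn(
            f"Config key {old_key!r} is deprecated; use {new_key!r} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        config.setdefault(new_key, config.pop(old_key))
    return config
```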
```
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
```
> Zarr is heavily influenced

Do you mean the data format, or Zarr-Python here? Would be good to clarify.
both are true
```python
# TODO: find a better name for this function
def get_data_type_from_native_dtype(dtype: npt.DTypeLike) -> ZDType[_BaseDType, _BaseScalar]:
```
```diff
-def get_data_type_from_native_dtype(dtype: npt.DTypeLike) -> ZDType[_BaseDType, _BaseScalar]:
+def zarr_dtype_from_numpy_dtype(dtype: npt.DTypeLike) -> ZDType[_BaseDType, _BaseScalar]:
```
?
we don't want to over-fit to numpy data types.
if we add support for non-numpy data types (like arrow data types) then we don't have to rename this function.
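For illustration, the intended usage of this resolution function is roughly the following (the wrapper class it returns and the `to_dtype` method are assumptions based on snippets elsewhere in this thread):

```python
import numpy as np

# hypothetical round trip through the resolution function discussed above
zdtype = get_data_type_from_native_dtype(np.dtype("int16"))
# zdtype is now a zarr data type object (e.g. an Int16 wrapper) rather than a numpy dtype,
# and the native numpy dtype should be recoverable from it:
native = zdtype.to_dtype()
assert native == np.dtype("int16")
```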
```python
# This branch assumes that the data type has been specified in the JSON form
# but it's also possible for numpy data types to be specified as dictionaries, which will
# cause an error in the `get_data_type_from_json`, but that's ok for now
return get_data_type_from_json(dtype, zarr_format=zarr_format)  # type: ignore[arg-type]
```
It looks like the case where dtype is a dict needs a test (codecov complains on this line)
good catch, I can try to create such a case
```diff
@@ -1007,7 +1007,7 @@ async def test_asyncgroup_create_array(
     assert subnode.dtype == dtype
     # todo: fix the type annotation of array.metadata.chunk_grid so that we get some autocomplete
     # here.
-    assert subnode.metadata.chunk_grid.chunk_shape == chunk_shape
+    assert subnode.chunk_grid.chunk_shape == chunk_shape
```
Is this change related to library changes in this PR?
Yes, I removed the `chunk_grid` attribute from `ArrayV2Metadata`.
```diff
@@ -503,7 +504,7 @@ async def test_consolidated_metadata_backwards_compatibility(
 async def test_consolidated_metadata_v2(self):
     store = zarr.storage.MemoryStore()
     g = await AsyncGroup.from_store(store, attributes={"key": "root"}, zarr_format=2)
-    dtype = "uint8"
+    dtype = parse_data_type("uint8", zarr_format=2)
```
Is it not possible to just pass `"uint8"` any more?
it is possible, i'm just not doing that in this test
```diff
@@ -219,7 +217,7 @@ async def test_read_consolidated_metadata(
     fill_value=0,
     chunks=(730,),
     attributes={"_ARRAY_DIMENSIONS": ["time"], "dataset": "NMC Reanalysis"},
-    dtype=np.dtype("int16"),
+    dtype=Int16(),
```
This is a breaking API change, right, because the type of `dtype` returned in the `ArrayV2Metadata` object has changed? If so, this needs a clear changelog entry communicating this change.
Yes, it is breaking, and that's to be expected since the goal of this PR is to introduce a new model for data types, which necessarily replaces the old one. I can make this more clear in the release notes.
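A hedged sketch of the before/after that downstream code may see (the `Int16` class appears in the test diff above, but the import path and the `to_dtype` method are assumptions):

```python
import numpy as np
from zarr.core.metadata.dtype import Int16  # import path assumed from this PR's file layout

old_style = np.dtype("int16")  # what ArrayV2Metadata.dtype used to hold
new_style = Int16()            # what it holds after this PR, per the test diff above

# the plain numpy dtype should still be recoverable from the wrapper object
assert new_style.to_dtype() == old_style
```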
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.
In `main`, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to/from JSON.

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.
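To make the parametric-dtype point concrete, here is a rough, standalone sketch (names and details are illustrative, not this PR's implementation) of a fixed-length string type that carries its length as a parameter and defines its own default value and JSON scalar conversion -- exactly the things a flat string enum cannot express:

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class FixedLengthUnicode:
    """Illustrative parametric data type."""

    length: int  # the parameter that a single enum string cannot capture

    @property
    def name(self) -> str:
        # e.g. "U10"; the exact metadata spelling is an open question in this PR
        return f"U{self.length}"

    def to_dtype(self) -> np.dtype:
        return np.dtype(f"<U{self.length}")

    def default_value(self) -> str:
        return ""

    def to_json_value(self, value: str) -> str:
        # fixed-length unicode scalars serialize as plain JSON strings
        return str(value)
```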