Replies: 2 comments 2 replies
- I think (2 replies)
- We have put a solid amount of work into a de-attrification effort in this PR: #1660. I think the basic design we are using is sound, but I would appreciate some feedback from other developers. (0 replies)
# Data modelling for `zarr-python`

The zarr specifications define the properties of zarr metadata documents. In order to faithfully implement the specifications, `zarr-python` needs to read and write correct metadata documents, and reject incorrect ones. In API terms, I think this means we need a class per metadata document, where the properties of that class map onto the attributes described in the metadata document, with simple serialization to / from the native format of the metadata document (JSON). I think it's OK if, where it's helpful, the attributes of the metadata class are not themselves JSON serializable -- e.g., I think the v2 `array.dtype` should be represented as an `np.dtype` object, even though it gets serialized to JSON as a string, but maybe this can be a discussion point.
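To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical `ArrayMetadataV2` class with only a `dtype` field (the real v2 document has many more fields): the attribute is a rich `np.dtype` at runtime, but round-trips through the string form the spec requires.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ArrayMetadataV2:
    # hypothetical, heavily truncated v2 metadata class
    dtype: np.dtype  # a rich object at runtime...

    def to_dict(self) -> dict:
        # ...serialized as the string form the v2 spec expects, e.g. "<f8"
        return {"dtype": self.dtype.str}

    @classmethod
    def from_dict(cls, data: dict) -> "ArrayMetadataV2":
        return cls(dtype=np.dtype(data["dtype"]))


meta = ArrayMetadataV2.from_dict({"dtype": "<f8"})
assert meta.to_dict() == {"dtype": "<f8"}
```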
Accordingly, `zarr-python` should not represent data structures (arrays or groups) if those data structures cannot be serialized to spec-compliant metadata documents. Simple enough, but the current version of `zarr-python` does not address this challenge head-on. For example, it was recently possible to create `zarr.Array` instances with irregular chunk sizes, even though this is not valid according to the zarr v2 spec. So we have room for improvement in this area. That's the topic of this discussion.

## type- and value-level modelling for array and group objects
Runtime validation is necessary for correctness, but type annotations are also important for keeping development friction low. I think we need both. Fortunately, the specs aren't too complicated, and they don't change much, so writing validation code and annotating types won't be a big burden.
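As a small illustration of how the two complement each other (a sketch; `parse_shape` is a hypothetical helper, not an existing `zarr-python` function): the function body does the runtime checking, while the signature gives static type checkers something to propagate.

```python
def parse_shape(data: object) -> tuple[int, ...]:
    """Validate an untrusted shape value at runtime."""
    if not isinstance(data, (list, tuple)):
        raise TypeError(f"Expected a list or tuple, got {type(data)}")
    if not all(isinstance(d, int) and d >= 0 for d in data):
        raise ValueError(f"Expected non-negative ints, got {data!r}")
    # the return annotation lets type checkers track the validated result
    return tuple(data)
```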
## python tools for data modelling
There are a variety of python libraries for data modelling:
`dataclasses`, `attrs`, `marshmallow`, `pydantic`, etc. `zarrita` uses `attrs`, and I have worked extensively with `pydantic` in personal projects. In short, these libraries make it very easy to:

- define classes with typed attributes
- validate attribute values at runtime
- serialize instances to / from formats like JSON

Should we add one of these libraries as a dependency for `zarr-python` in v3? Going with `attrs` would be straightforward, since we are bootstrapping our v3 efforts with @normanrz's work in `zarrita`, which uses `attrs`. Others have proposed depending on `pydantic`, and `dataclasses` looks good if we don't want external dependencies.
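As a sketch of what these libraries buy (the `ChunkGrid` class here is made up for illustration), `attrs` generates `__init__`, `__repr__`, and `__eq__`, and runs field validators on construction:

```python
import attrs


@attrs.define
class ChunkGrid:
    # attrs generates __init__/__repr__/__eq__ and enforces the validator
    chunk_shape: tuple[int, ...] = attrs.field(
        validator=attrs.validators.deep_iterable(
            attrs.validators.instance_of(int)
        )
    )


grid = ChunkGrid(chunk_shape=(64, 64))
print(attrs.asdict(grid))  # {'chunk_shape': (64, 64)}
ChunkGrid(chunk_shape=("a", "b"))  # raises TypeError
```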
## my proposal: do it ourselves

I don't think we need a data modelling library for `zarr-python`. First, I don't think it's important for us to quickly create new data models, because the set of things we have to model in `zarr-python` is largely static (i.e., the contents of the zarr specifications). By being on the other side of that tradeoff, we get increased flexibility. As for the rest of the bullet points listed above, we can do all of that with vanilla, undecorated classes. Here is how I am currently approaching this in my v3 WIP branch (note that as of this writing that branch only contains an outline of this strategy):

- Each metadata class (e.g., `array`, `group`) has typed attributes that structurally match a metadata document described in a zarr specification.
- For nested structures, like the `{name: <>, config: <>}`-style attributes in v3, or codecs in v2, each of these attributes is modeled the same way as the broader metadata document.
- Each class has `to_dict` / `from_dict` methods, and `to_json` / `from_json` methods. Because of the constraint described above, nested `to_dict` calls work by the nesting class calling `to_dict` on its nested attributes. There are no other methods. These classes exist as data.
- Each class is paired with a `TypedDict` class that defines the return type / accepted type of its `to_dict` / `from_dict` methods.
- Validation runs in `__setattr__` and `__delattr__`, so instances stay internally consistent (e.g., the `shape` and `chunks` attributes are consistent).
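Here is a minimal sketch of that pattern for a stripped-down v3 group document (hypothetical names and a reduced field set; the real document carries more, e.g. `attributes`): a vanilla class with typed attributes, a companion `TypedDict`, `to_dict` / `from_dict` and `to_json` / `from_json`, and validation wired into `__setattr__` / `__delattr__`.

```python
import json
from typing import TypedDict


class GroupMetadataDict(TypedDict):
    # companion TypedDict: the exact shape of the JSON document
    zarr_format: int
    node_type: str


class GroupMetadata:
    zarr_format: int
    node_type: str

    def __init__(self, zarr_format: int = 3, node_type: str = "group") -> None:
        # these assignments route through __setattr__, so they are validated
        self.zarr_format = zarr_format
        self.node_type = node_type

    def __setattr__(self, name: str, value: object) -> None:
        # every attribute write is checked, keeping instances spec-compliant
        if name == "zarr_format" and value != 3:
            raise ValueError(f"zarr_format must be 3, got {value!r}")
        if name == "node_type" and value != "group":
            raise ValueError(f"node_type must be 'group', got {value!r}")
        super().__setattr__(name, value)

    def __delattr__(self, name: str) -> None:
        raise AttributeError(f"Cannot delete required attribute {name!r}")

    def to_dict(self) -> GroupMetadataDict:
        return {"zarr_format": self.zarr_format, "node_type": self.node_type}

    @classmethod
    def from_dict(cls, data: GroupMetadataDict) -> "GroupMetadata":
        return cls(zarr_format=data["zarr_format"], node_type=data["node_type"])

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, data: str) -> "GroupMetadata":
        return cls.from_dict(json.loads(data))
```

Nothing here depends on a third-party library, and the `TypedDict` gives static type checkers the same information a modelling framework would.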
I think if we use a strategy approximately like this, then we don't have to define our classes according to the rules of a particular data modelling library, but we still expose an API that can be used as scaffolding for someone who does want to integrate zarr into a particular data modelling framework. For example, by making the parsing routines stand-alone functions, `pydantic` users can just import those functions to create `pydantic` models for zarr.
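For instance (a sketch assuming pydantic v2; `parse_node_type` stands in for one of the stand-alone parsing routines described above), a `pydantic` user could reuse zarr's validation logic directly:

```python
import pydantic


def parse_node_type(data: object) -> str:
    # stand-alone parsing routine, importable without any model class
    if data != "group":
        raise ValueError(f"node_type must be 'group', got {data!r}")
    return "group"


class GroupMetadataModel(pydantic.BaseModel):
    node_type: str

    @pydantic.field_validator("node_type")
    @classmethod
    def _check_node_type(cls, value: str) -> str:
        # reuse zarr's parsing logic instead of re-implementing it
        return parse_node_type(value)
```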
I would love to hear thoughts from others about this!