Replies: 2 comments 2 replies
- I think (2 replies)
- We have put a solid amount of work into a de-attrification effort in this PR: #1660. I think the basic design we are using is sound, but I would appreciate some feedback from other developers. (0 replies)
# Data modelling for `zarr-python`

The zarr specifications define the properties of zarr metadata documents. In order to faithfully implement the specifications, `zarr-python` needs to read and write correct metadata documents, and reject incorrect ones. In API terms, I think this means we need a class per metadata document, where the properties of that class map onto the attributes described in the metadata document, with simple serialization to / from the native format of the metadata document (JSON). I think it's OK if, where it's helpful, the attributes of the metadata class are not themselves JSON serializable -- e.g., I think the v2 `array.dtype` should be represented as an `np.dtype` object, even though it gets serialized to JSON as a string, but maybe this can be a discussion point.
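To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical `ArrayMetadataV2` class with only a `dtype` field (the real v2 document has many more fields): the attribute is a rich `np.dtype` at runtime, but round-trips through the string form the spec requires.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ArrayMetadataV2:
    # hypothetical, heavily truncated v2 metadata class
    dtype: np.dtype  # a rich object at runtime...

    def to_dict(self) -> dict:
        # ...serialized as the string form the v2 spec expects, e.g. "<f8"
        return {"dtype": self.dtype.str}

    @classmethod
    def from_dict(cls, data: dict) -> "ArrayMetadataV2":
        return cls(dtype=np.dtype(data["dtype"]))


meta = ArrayMetadataV2.from_dict({"dtype": "<f8"})
assert meta.to_dict() == {"dtype": "<f8"}
```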
Accordingly, `zarr-python` should not represent data structures (arrays or groups) if those data structures cannot be serialized to spec-compliant metadata documents. Simple enough, but the current version of `zarr-python` does not address this challenge head-on. For example, it was recently possible to create `zarr.Array` instances with irregular chunk sizes, even though this is not valid according to the zarr v2 spec. So we have room for improvement in this area. That's the topic of this discussion.

## type- and value-level modelling for array and group objects
Runtime validation is necessary for correctness, but type annotations are also important for keeping development friction low. I think we need both. Fortunately, the specs aren't too complicated, and they don't change much, so writing validation code and annotating types won't be a big burden.
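As a small illustration of how the two complement each other (a sketch; `parse_shape` is a hypothetical helper, not an existing `zarr-python` function): the function body does the runtime checking, while the signature gives static type checkers something to propagate.

```python
def parse_shape(data: object) -> tuple[int, ...]:
    """Validate an untrusted shape value at runtime."""
    if not isinstance(data, (list, tuple)):
        raise TypeError(f"Expected a list or tuple, got {type(data)}")
    if not all(isinstance(d, int) and d >= 0 for d in data):
        raise ValueError(f"Expected non-negative ints, got {data!r}")
    # the return annotation lets type checkers track the validated result
    return tuple(data)
```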
## python tools for data modelling
There are a variety of python libraries for data modelling:
`dataclasses`, `attrs`, `marshmallow`, `pydantic`, etc. `zarrita` uses `attrs`, and I have worked extensively with `pydantic` in personal projects. In short, these libraries make it very easy to:

- define classes with typed attributes
- validate attribute values at runtime
- serialize instances to / from formats like JSON

Should we add one of these libraries as a dependency for `zarr-python` in v3? Going with `attrs` would be straightforward, since we are bootstrapping our v3 efforts with @normanrz's work in `zarrita`, which uses `attrs`. Others have proposed depending on `pydantic`, and `dataclasses` looks good if we don't want external dependencies.
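As a sketch of what these libraries buy (the `ChunkGrid` class here is made up for illustration), `attrs` generates `__init__`, `__repr__`, and `__eq__`, and runs field validators on construction:

```python
import attrs


@attrs.define
class ChunkGrid:
    # attrs generates __init__/__repr__/__eq__ and enforces the validator
    chunk_shape: tuple[int, ...] = attrs.field(
        validator=attrs.validators.deep_iterable(
            attrs.validators.instance_of(int)
        )
    )


grid = ChunkGrid(chunk_shape=(64, 64))
print(attrs.asdict(grid))  # {'chunk_shape': (64, 64)}
ChunkGrid(chunk_shape=("a", "b"))  # raises TypeError
```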
## my proposal: do it ourselves

I don't think we need a data modelling library for `zarr-python`. First, I don't think it's important for us to quickly create new data models, because the set of things we have to model in `zarr-python` is largely static (i.e., the contents of the zarr specifications). By being on the other side of that tradeoff, we get increased flexibility. As for the rest of the bullet points listed above, we can do all of that with vanilla, undecorated classes. Here is how I am currently approaching this in my v3 WIP branch (note that as of this writing that branch only contains an outline of this strategy):

- Each metadata class (e.g., `array`, `group`) has typed attributes that structurally match a metadata document described in a zarr specification.
- For nested structures, like the `{name: <>, config: <>}`-style attributes in v3, or codecs in v2, each of these attributes is modeled the same way as the broader metadata document.
- Each class has `to_dict` / `from_dict` methods, and `to_json` / `from_json` methods. Because of the constraint described above, nested `to_dict` calls work by the nesting class calling `to_dict` on its nested attributes. There are no other methods. These classes exist as data.
- Each class is paired with a `TypedDict` class that defines the return type / accepted type of its `to_dict` / `from_dict` methods.
- Validation runs in `__setattr__` and `__delattr__`, so instances stay internally consistent (e.g., the `shape` and `chunks` attributes are consistent).
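Here is a minimal sketch of that pattern for a stripped-down v3 group document (hypothetical names and a reduced field set; the real document carries more, e.g. `attributes`): a vanilla class with typed attributes, a companion `TypedDict`, `to_dict` / `from_dict` and `to_json` / `from_json`, and validation wired into `__setattr__` / `__delattr__`.

```python
import json
from typing import TypedDict


class GroupMetadataDict(TypedDict):
    # companion TypedDict: the exact shape of the JSON document
    zarr_format: int
    node_type: str


class GroupMetadata:
    zarr_format: int
    node_type: str

    def __init__(self, zarr_format: int = 3, node_type: str = "group") -> None:
        # these assignments route through __setattr__, so they are validated
        self.zarr_format = zarr_format
        self.node_type = node_type

    def __setattr__(self, name: str, value: object) -> None:
        # every attribute write is checked, keeping instances spec-compliant
        if name == "zarr_format" and value != 3:
            raise ValueError(f"zarr_format must be 3, got {value!r}")
        if name == "node_type" and value != "group":
            raise ValueError(f"node_type must be 'group', got {value!r}")
        super().__setattr__(name, value)

    def __delattr__(self, name: str) -> None:
        raise AttributeError(f"Cannot delete required attribute {name!r}")

    def to_dict(self) -> GroupMetadataDict:
        return {"zarr_format": self.zarr_format, "node_type": self.node_type}

    @classmethod
    def from_dict(cls, data: GroupMetadataDict) -> "GroupMetadata":
        return cls(zarr_format=data["zarr_format"], node_type=data["node_type"])

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, data: str) -> "GroupMetadata":
        return cls.from_dict(json.loads(data))
```

Nothing here depends on a third-party library, and the `TypedDict` gives static type checkers the same information a modelling framework would.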
I think if we use a strategy approximately like this, then we don't have to define our classes according to the rules of a particular data modelling library, but we still expose an API that can be used as scaffolding for someone who does want to integrate zarr into a particular data modelling framework. For example, by making the parsing routines stand-alone functions, `pydantic` users can just import those functions to create `pydantic` models for zarr.
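For instance (a sketch assuming pydantic v2; `parse_node_type` stands in for one of the stand-alone parsing routines described above), a `pydantic` user could reuse zarr's validation logic directly:

```python
import pydantic


def parse_node_type(data: object) -> str:
    # stand-alone parsing routine, importable without any model class
    if data != "group":
        raise ValueError(f"node_type must be 'group', got {data!r}")
    return "group"


class GroupMetadataModel(pydantic.BaseModel):
    node_type: str

    @pydantic.field_validator("node_type")
    @classmethod
    def _check_node_type(cls, value: str) -> str:
        # reuse zarr's parsing logic instead of re-implementing it
        return parse_node_type(value)
```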
I would love to hear thoughts from others about this!