- 
                Notifications
    
You must be signed in to change notification settings  - Fork 6
 
          Replace direct jsonschema use with LinkML validation tools
          #1287
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           Thank you for implementing this! I'm looking forward to reviewing the diff later today. I agree about replacing direct JSON Schema accesses with SchemaView usage (and agree about that being done via a separate PR).  | 
    
| errors = {doc["id"]: [r.message for r in report.results]} | ||
| else: | ||
| errors = {f"missing id ({count})": [e.message for e in errors]} | ||
| errors = {f"missing id ({count})": [r.message for r in report.results]} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the validator plugin architecture, I can specify low-level warnings and info, right? these still count as errors here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, individual validation results can have various severities (info, warning, error, fatal). In practice the two validation plugins we're using only ever produce results with the error severity level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Reviewed from iPhone. I'm happy about this evolution of the validation code.
On this branch, I replaced existing use of both the
jsonschemaandfastjsonschemapackages with use of theValidatorclass provided by LinkML along with the custom validation plugin provided bynmdc-schema.Details
Previously,
nmdc-runtimeused different validation approaches in different places in the code:nmdc-schemaJSON Schema artifact via thejsonschemalibrarynmdc-schemaJSON Schema artifact with certain regex patterns stripped out via thejsonschemalibrarynmdc-schemaJSON Schema artifact via thefastjsonschemalibraryThese changes replace all of those with the use of LinkML's
Validatorclass, providing it with theJsonschemaValidationPluginfrom LinkML and theNmdcSchemaValidationPluginfromnmdc-schema.Under the hood,
JsonschemaValidationPluginuses thejsonschemalibrary. In #1152 Donny made a comment about performance. I assume that was also the inspiration for his experimenting (??) with thefastjsonschemapackage. In light of that, a few comments about performance:nmdc-runtimeto confirm that for the/metadata/json:validateendpoint (which does the least other stuff besides validation), using LinkMLValidatorclass imposes very, very little overhead. The majority of the validation time is still spent down in thejsonschemapackage.nmdc-runtimeI did a test of validating all documents in all collections modeled by theDocumentclass using a LinkMLValidatorand the same two plugins. That full validation scan took about 15 minutes on my laptop. Most of the time was spend on the ~20Mfunctional_annotation_aggrecords. In reality, validation done through API requests will be against far, far fewer documents.JsonschemaValidationPlugincould usefastjsonschemabut it doesn't support recent versions of the JSON Schema spec. And LinkML's JSON Schema generator uses features of those later versions.At first I wasn't sure if we could get rid of the ID-pattern-less validation (point 2 above). That's why we did the half-step of #1271. That hasn't been able to be tested in dev yet because of Spin. But the more I looked at the code paths and at what's currently in MongoDB, I reasoned that it was fine to just go ahead and get rid of it.
There are a few places where the
nmdc-runtimecode still uses thenmdc-schemaJSON Schema artifact to do a bit of schema introspection (e.g. finding ID patterns to determine typecodes). I think it would be better to directly inspect the LinkML schema (via aSchemaViewinstance) in those cases, but I'm leaving that as a future task.Finally, I'm honestly not sure what some of the tests in
tests/test_util.pyare even trying to test. But I did my best to make a drop-in replacement of the new validator.Related issue(s)
Fixes #1152
Related subsystem(s)
docsdirectory)Testing
I tested these changes by...
/metadata/json:validate/metadata/json:submit/metadata/changesheets:validate/metadata/changesheets:submitDocumentation
docsdirectory)N/A
Maintainability
study_id: str)# TODOor# FIXMEblackto format all the Python files I created/modified