Replace direct `jsonschema` use with LinkML validation tools #1287

pkalita-lbl · 2025-10-17T18:13:24Z

On this branch, I replaced existing use of both the jsonschema and fastjsonschema packages with use of the Validator class provided by LinkML along with the custom validation plugin provided by nmdc-schema.

Details

Previously, nmdc-runtime used different validation approaches in different places in the code:

1. Using the nmdc-schema JSON Schema artifact via the jsonschema library
2. Using the nmdc-schema JSON Schema artifact with certain regex patterns stripped out via the jsonschema library
3. Using the nmdc-schema JSON Schema artifact via the fastjsonschema library

These changes replace all of those with the use of LinkML's Validator class, providing it with the JsonschemaValidationPlugin from LinkML and the NmdcSchemaValidationPlugin from nmdc-schema.
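To picture the arrangement described above (one validator object that runs several plugins and pools their results), here is a minimal standard-library sketch. It deliberately does not use the LinkML API: `MiniValidator`, `Result`, `required_id_plugin`, and `nmdc_prefix_plugin` are all hypothetical stand-ins, and the two checks are invented for illustration. They are not what JsonschemaValidationPlugin or NmdcSchemaValidationPlugin actually check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Result:
    """Hypothetical stand-in for a single validation result."""

    source: str
    message: str


# In this sketch a "plugin" is just a callable that yields results for one document.
Plugin = Callable[[dict], Iterable[Result]]


def required_id_plugin(doc: dict) -> Iterator[Result]:
    """Generic structural check, standing in for JSON Schema based validation."""
    if "id" not in doc:
        yield Result("required_id_plugin", "'id' is a required property")


def nmdc_prefix_plugin(doc: dict) -> Iterator[Result]:
    """Invented domain-specific check, standing in for a schema-specific plugin."""
    doc_id = doc.get("id")
    if isinstance(doc_id, str) and not doc_id.startswith("nmdc:"):
        yield Result("nmdc_prefix_plugin", f"id {doc_id!r} lacks the 'nmdc:' prefix")


class MiniValidator:
    """Runs every plugin against a document and pools the results."""

    def __init__(self, plugins: list[Plugin]) -> None:
        self.plugins = plugins

    def validate(self, doc: dict) -> list[Result]:
        results: list[Result] = []
        for plugin in self.plugins:
            results.extend(plugin(doc))
        return results


validator = MiniValidator([required_id_plugin, nmdc_prefix_plugin])
for result in validator.validate({"id": "foo"}):
    print(result.source, result.message)
```

The point of the pattern is that callers see a single entry point and a single list of results, no matter how many kinds of checks sit behind it.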

Under the hood, JsonschemaValidationPlugin uses the jsonschema library. In #1152 Donny made a comment about performance. I assume that was also the inspiration for his experimenting (??) with the fastjsonschema package. In light of that, a few comments about performance:

- I used the built-in profiling tools in nmdc-runtime to confirm that for the /metadata/json:validate endpoint (which does the least other work besides validation), using the LinkML Validator class imposes very little overhead. The majority of the validation time is still spent down in the jsonschema package.
- Outside of nmdc-runtime, I tested validating all documents in all collections modeled by the Document class using a LinkML Validator and the same two plugins. That full validation scan took about 15 minutes on my laptop. Most of the time was spent on the ~20M functional_annotation_agg records. In reality, validation done through API requests will be against far, far fewer documents.
- I wish LinkML's JsonschemaValidationPlugin could use fastjsonschema, but fastjsonschema doesn't support recent versions of the JSON Schema spec, and LinkML's JSON Schema generator uses features of those later versions.
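For anyone who wants to repeat the "where does the time go" measurement outside of nmdc-runtime, the following is a rough sketch using Python's standard `cProfile` and `pstats` modules. It is not the profiling tooling built into nmdc-runtime, and `inner_validate` and `wrapper_validate` are hypothetical stand-ins for a low-level checker and a thin wrapper around it.

```python
import cProfile
import io
import pstats


def inner_validate(doc: dict) -> bool:
    """Stand-in for the expensive low-level check (does some busy work)."""
    return sum(i * i for i in range(20_000)) >= 0 and "id" in doc


def wrapper_validate(doc: dict) -> bool:
    """Stand-in for a thin wrapper layer that delegates to the low-level check."""
    return inner_validate(doc)


profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):
    wrapper_validate({"id": "nmdc:example"})
profiler.disable()

# Sort by time spent inside each function itself (excluding callees), so a thin
# wrapper shows up with a small "tottime" even though its cumulative time is large.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats()
report_text = buffer.getvalue()
print(report_text)
```

Comparing the `tottime` column for the wrapper against the functions it calls is one way to check that a wrapper layer adds little overhead of its own.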

At first I wasn't sure if we could get rid of the ID-pattern-less validation (point 2 above). That's why we did the half-step of #1271. We haven't been able to test that in dev yet because of Spin. But the more I looked at the code paths and at what's currently in MongoDB, the more convinced I became that it was fine to just go ahead and get rid of it.
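For context on what "pattern-less" validation means, here is a small sketch of removing regex constraints from a JSON Schema before validating against it. This is an assumption-laden illustration, not the code being removed: the thread only says that "certain" patterns were stripped, whereas `strip_patterns` (a hypothetical helper) removes every string-valued `pattern` keyword, and the example schema is invented.

```python
from typing import Any


def strip_patterns(schema: Any) -> Any:
    """Return a copy of a JSON Schema with string-valued `pattern` keywords removed.

    A property that merely happens to be named "pattern" is kept, because its
    value is a sub-schema (a dict), not a regex string.
    """
    if isinstance(schema, dict):
        return {
            key: strip_patterns(value)
            for key, value in schema.items()
            if not (key == "pattern" and isinstance(value, str))
        }
    if isinstance(schema, list):
        return [strip_patterns(item) for item in schema]
    return schema


# Invented example schema, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "pattern": "^nmdc:"},
        "pattern": {"type": "string"},
    },
    "required": ["id"],
}
relaxed = strip_patterns(schema)
print(relaxed["properties"]["id"])
```

The original schema is left untouched, so the strict and relaxed variants can coexist.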

There are a few places where the nmdc-runtime code still uses the nmdc-schema JSON Schema artifact to do a bit of schema introspection (e.g. finding ID patterns to determine typecodes). I think it would be better to directly inspect the LinkML schema (via a SchemaView instance) in those cases, but I'm leaving that as a future task.
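As a rough picture of the JSON Schema introspection mentioned above, the sketch below pulls a typecode out of each class's `id` pattern. Everything here is an assumption made for illustration: the `$defs` layout, the shape of the patterns (`^(nmdc):<typecode>-...`), the example schema fragment, and the helper names `typecode_from_pattern` and `typecodes_by_class`. It is not the nmdc-runtime implementation.

```python
import re
from typing import Optional

# Assumes ID patterns begin with a literal "^(nmdc):" followed by a lowercase
# typecode and a hyphen.
TYPECODE_RE = re.compile(r"^\^\(nmdc\):([a-z]+)-")


def typecode_from_pattern(pattern: str) -> Optional[str]:
    """Return the typecode embedded in an ID regex, or None if the shape differs."""
    match = TYPECODE_RE.match(pattern)
    return match.group(1) if match else None


def typecodes_by_class(json_schema: dict) -> dict[str, str]:
    """Map each class name to the typecode found in its `id` pattern, if any."""
    found: dict[str, str] = {}
    for class_name, class_def in json_schema.get("$defs", {}).items():
        pattern = class_def.get("properties", {}).get("id", {}).get("pattern")
        if isinstance(pattern, str):
            typecode = typecode_from_pattern(pattern)
            if typecode is not None:
                found[class_name] = typecode
    return found


# Invented schema fragment, for illustration only.
example_schema = {
    "$defs": {
        "Biosample": {
            "properties": {"id": {"pattern": r"^(nmdc):bsm-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"}}
        },
        "Study": {
            "properties": {"id": {"pattern": r"^(nmdc):sty-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"}}
        },
        "QuantityValue": {"properties": {"has_unit": {"type": "string"}}},
    }
}
typecodes = typecodes_by_class(example_schema)
print(typecodes)  # {'Biosample': 'bsm', 'Study': 'sty'}
```

Parsing regex source text like this is brittle, which is part of why inspecting the LinkML schema directly is the more appealing long-term option.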

Finally, I'm honestly not sure what some of the tests in tests/test_util.py are even trying to test. But I did my best to drop in the new validator as a replacement.

Related issue(s)

Fixes #1152

Related subsystem(s)

Testing

I tested these changes (explain below)
I did not test these changes

I tested these changes by...

Fixing up and running all existing automated tests
Manually testing the following endpoints:
- /metadata/json:validate
- /metadata/json:submit
- /metadata/changesheets:validate
- /metadata/changesheets:submit

Documentation

I have not checked for relevant documentation yet (e.g. in the docs directory)
I have updated all relevant documentation so it will remain accurate
Other (explain below)

N/A

Maintainability

Every Python function I defined includes a docstring (test functions are exempt from this)
Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
I used black to format all the Python files I created/modified
The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

eecavanna · 2025-10-17T18:28:59Z

Thank you for implementing this! I'm looking forward to reviewing the diff later today.

I agree about replacing direct JSON Schema accesses with SchemaView usage (and agree about that being done via a separate PR).

nmdc_runtime/api/core/metadata.py

sierra-moxon · 2025-10-17T19:02:06Z

nmdc_runtime/site/validation/util.py

+                errors = {doc["id"]: [r.message for r in report.results]}
            else:
-                errors = {f"missing id ({count})": [e.message for e in errors]}
+                errors = {f"missing id ({count})": [r.message for r in report.results]}


I think in the validator plugin architecture, I can specify low-level warnings and info, right? Do these still count as errors here?

pkalita-lbl

Yes, individual validation results can have various severities (info, warning, error, fatal). In practice, the two validation plugins we're using only ever produce results with the error severity level.
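If a plugin ever did start emitting lower-severity results, the list comprehension in the diff above would report them alongside real errors, since it collects every result's message. A hedged sketch of filtering by severity follows. `Severity`, `Result`, and `error_messages` are hypothetical stand-ins that mirror the four levels named in the reply; they are not LinkML's own classes, and the example messages are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Stand-in for the four severity levels named in the thread."""

    INFO = "INFO"
    WARNING = "WARNING"
    ERROR = "ERROR"
    FATAL = "FATAL"


@dataclass
class Result:
    """Hypothetical stand-in for one validation result."""

    severity: Severity
    message: str


# Severities that should make a document count as invalid (an assumption).
BLOCKING = {Severity.ERROR, Severity.FATAL}


def error_messages(results: list[Result]) -> list[str]:
    """Keep only the messages whose severity is ERROR or FATAL."""
    return [r.message for r in results if r.severity in BLOCKING]


results = [
    Result(Severity.WARNING, "slot 'foo' is deprecated"),
    Result(Severity.ERROR, "'id' is a required property"),
]
all_messages = [r.message for r in results]  # what an unfiltered comprehension collects
blocking_messages = error_messages(results)  # what a severity filter would keep
print(blocking_messages)
```

With today's plugins the two lists would be identical, so this only matters if warning or info results are introduced later.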

eecavanna

LGTM! Reviewed from iPhone. I'm happy about this evolution of the validation code.

pkalita-lbl and others added 3 commits October 17, 2025 09:59

Replace direct jsonschema use with LinkML validation tools

2723fb3

Use pre-built nmdc-schema JSON Schema artifact

d67c43a

style: reformat

5a1c970

pkalita-lbl requested review from aclum, eecavanna and sierra-moxon October 17, 2025 18:19

pkalita-lbl marked this pull request as ready for review October 17, 2025 18:20

sierra-moxon reviewed Oct 17, 2025

nmdc_runtime/api/core/metadata.py

sierra-moxon reviewed Oct 17, 2025

sierra-moxon approved these changes Oct 17, 2025

eecavanna approved these changes Oct 17, 2025

aclum approved these changes Oct 20, 2025

pkalita-lbl merged commit 301adba into main Oct 20, 2025

pkalita-lbl deleted the issue-1152-linkml-validator branch October 20, 2025 17:29
