@pkalita-lbl
Collaborator

On this branch, I replaced all existing uses of the jsonschema and fastjsonschema packages with the Validator class provided by LinkML, together with the custom validation plugin provided by nmdc-schema.

Details

Previously, nmdc-runtime used different validation approaches in different places in the code:

  1. Using the nmdc-schema JSON Schema artifact via the jsonschema library
  2. Using the nmdc-schema JSON Schema artifact with certain regex patterns stripped out via the jsonschema library
  3. Using the nmdc-schema JSON Schema artifact via the fastjsonschema library

These changes replace all of those with the use of LinkML's Validator class, providing it with the JsonschemaValidationPlugin from LinkML and the NmdcSchemaValidationPlugin from nmdc-schema.
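
The composition described above, where one validator object runs each document through an ordered list of plugins and gathers their findings into a single report, can be sketched roughly as follows. This is a stdlib-only stand-in, not LinkML's actual API: `Result`, `Report`, `RequiredFieldsPlugin`, `IdPrefixPlugin`, and `ComposedValidator` are hypothetical names invented for illustration, and the two toy plugins only play the roles that JsonschemaValidationPlugin and NmdcSchemaValidationPlugin play in this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Protocol


@dataclass
class Result:
    """One finding produced by a plugin (hypothetical stand-in)."""
    plugin: str
    message: str


@dataclass
class Report:
    """All findings for one document (hypothetical stand-in)."""
    results: List[Result] = field(default_factory=list)


class Plugin(Protocol):
    def process(self, doc: Dict[str, Any]) -> Iterator[Result]: ...


class RequiredFieldsPlugin:
    """Toy structural check, standing in for a JSON Schema based plugin."""

    def __init__(self, required: List[str]) -> None:
        self.required = required

    def process(self, doc: Dict[str, Any]) -> Iterator[Result]:
        for name in self.required:
            if name not in doc:
                yield Result("required-fields", f"'{name}' is a required property")


class IdPrefixPlugin:
    """Toy domain check, standing in for a schema-specific plugin."""

    def process(self, doc: Dict[str, Any]) -> Iterator[Result]:
        doc_id = doc.get("id")
        if isinstance(doc_id, str) and not doc_id.startswith("nmdc:"):
            yield Result("id-prefix", f"id '{doc_id}' does not start with 'nmdc:'")


class ComposedValidator:
    """Runs every plugin in order and merges their results into one report."""

    def __init__(self, plugins: List[Plugin]) -> None:
        self.plugins = plugins

    def validate(self, doc: Dict[str, Any]) -> Report:
        report = Report()
        for plugin in self.plugins:
            report.results.extend(plugin.process(doc))
        return report


validator = ComposedValidator([RequiredFieldsPlugin(["id", "type"]), IdPrefixPlugin()])
good = validator.validate({"id": "nmdc:bsm-11-abc123", "type": "nmdc:Biosample"})
bad = validator.validate({"id": "bsm-11-abc123"})
```

The point of the design is that callers see a single entry point and a single report shape, regardless of how many plugins are configured behind it.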

Under the hood, JsonschemaValidationPlugin uses the jsonschema library. In #1152 Donny made a comment about performance. I assume that was also what prompted his experiments with the fastjsonschema package. In light of that, a few comments about performance:

  • I used the built-in profiling tools in nmdc-runtime to confirm that for the /metadata/json:validate endpoint (which does the least work besides validation), using the LinkML Validator class adds very little overhead. The majority of the validation time is still spent down in the jsonschema package.
  • Outside of nmdc-runtime I ran a test that validated all documents in all collections modeled by the Document class, using a LinkML Validator and the same two plugins. That full validation scan took about 15 minutes on my laptop. Most of the time was spent on the ~20M functional_annotation_agg records. In reality, validation done through API requests will be against far, far fewer documents.
  • I wish LinkML's JsonschemaValidationPlugin could use fastjsonschema, but fastjsonschema doesn't support recent versions of the JSON Schema spec, and LinkML's JSON Schema generator uses features of those later versions.
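
As a rough back-of-the-envelope reading of the full-scan numbers above: if the whole scan took about 15 minutes and roughly 20M of the documents were functional_annotation_agg records, the scan sustained at least about 22,000 of those records per second on average. The figures below are the approximate ones quoted in this description, not new measurements, and the 1,000-document request size is a purely hypothetical illustration.

```python
# Approximate figures quoted in the PR description (not new measurements).
scan_seconds = 15 * 60            # "about 15 minutes"
agg_records = 20_000_000          # "~20M functional_annotation_agg records"

# The agg records took at most the whole scan, so this is a lower bound
# on the average validation rate for those records.
min_docs_per_second = agg_records / scan_seconds

# Hypothetical API request size, chosen only for illustration.
request_docs = 1_000
est_seconds = request_docs / min_docs_per_second

print(f"at least {min_docs_per_second:,.0f} docs/s")
print(f"about {est_seconds * 1000:.0f} ms for {request_docs:,} documents at that rate")
```

Documents in other collections may be larger and slower to validate than functional_annotation_agg records, so this is only an order-of-magnitude guide.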

At first I wasn't sure if we could get rid of the ID-pattern-less validation (point 2 above). That's why we did the half-step of #1271. We haven't been able to test that in dev yet because of Spin. But the more I looked at the code paths and at what's currently in MongoDB, the more convinced I became that it was fine to just go ahead and get rid of it.

There are a few places where the nmdc-runtime code still uses the nmdc-schema JSON Schema artifact to do a bit of schema introspection (e.g. finding ID patterns to determine typecodes). I think it would be better to directly inspect the LinkML schema (via a SchemaView instance) in those cases, but I'm leaving that as a future task.
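
For readers unfamiliar with that kind of introspection, here is a minimal sketch of deriving typecodes from ID patterns in a JSON Schema document. The schema fragment, the `typecodes_by_class` helper, and the assumption that the typecode is always a single literal in the second parenthesized group are all hypothetical; the real nmdc-schema artifact is much larger and its patterns may be shaped differently (for example, with alternations).

```python
import re
from typing import Dict

# Hypothetical, heavily trimmed JSON Schema fragment for illustration only.
SCHEMA = {
    "$defs": {
        "Biosample": {
            "properties": {
                "id": {"pattern": r"^(nmdc):(bsm)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"}
            }
        },
        "Study": {
            "properties": {
                "id": {"pattern": r"^(nmdc):(sty)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"}
            }
        },
        # A class with no id pattern is simply skipped.
        "QuantityValue": {"properties": {"has_unit": {"type": "string"}}},
    }
}

# Assumes the typecode is the group that directly follows "(nmdc):".
TYPECODE_RE = re.compile(r"^\^\(nmdc\):\(([a-z]+)\)")


def typecodes_by_class(schema: dict) -> Dict[str, str]:
    """Map each class name to the typecode found in its id pattern, if any."""
    found: Dict[str, str] = {}
    for class_name, definition in schema.get("$defs", {}).items():
        pattern = definition.get("properties", {}).get("id", {}).get("pattern")
        if pattern is None:
            continue
        match = TYPECODE_RE.match(pattern)
        if match:
            found[class_name] = match.group(1)
    return found


typecodes = typecodes_by_class(SCHEMA)
```

Parsing a regex with another regex is fragile, which is part of why reading the same information from the LinkML schema directly would likely be more robust.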

Finally, I'm honestly not sure what some of the tests in tests/test_util.py are even trying to test. But I did my best to drop in the new validator as a replacement.

Related issue(s)

Fixes #1152

Related subsystem(s)

  • Runtime API (except the Minter)
  • Minter
  • Dagster
  • Project documentation (in the docs directory)
  • Translators (metadata ingest pipelines)
  • MongoDB migrations
  • Other

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by...

  • Fixing up and running all existing automated tests
  • Manually testing the following endpoints:
    • /metadata/json:validate
    • /metadata/json:submit
    • /metadata/changesheets:validate
    • /metadata/changesheets:submit

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate
  • Other (explain below)

N/A

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@pkalita-lbl pkalita-lbl marked this pull request as ready for review October 17, 2025 18:20
@eecavanna
Collaborator

Thank you for implementing this! I'm looking forward to reviewing the diff later today.

I agree about replacing direct JSON Schema accesses with SchemaView usage (and agree about that being done via a separate PR).

      errors = {doc["id"]: [r.message for r in report.results]}
  else:
-     errors = {f"missing id ({count})": [e.message for e in errors]}
+     errors = {f"missing id ({count})": [r.message for r in report.results]}
Member

I think in the validator plugin architecture, I can specify low-level warnings and info, right? Do these still count as errors here?

Collaborator Author

Yes, individual validation results can have various severities (info, warning, error, fatal). In practice the two validation plugins we're using only ever produce results with the error severity level.
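
To make the exchange above concrete, here is a small sketch of how per-document messages could be keyed by `id` (or by a "missing id" placeholder) while counting only some severities as errors. It is an illustration under assumptions, not the merged code: per the diff, the merged code collects every result's message regardless of severity. `Severity`, `Result`, and `summarize` are hypothetical stand-ins, and the severity names are taken from the reply above rather than from LinkML's own enum.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Severity(Enum):
    """Severity levels named in the review reply (hypothetical stand-in)."""
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    FATAL = "fatal"


@dataclass
class Result:
    """One validation finding (hypothetical stand-in)."""
    message: str
    severity: Severity


def summarize(
    doc: dict,
    results: List[Result],
    count: int,
    blocking: FrozenSet[Severity] = frozenset({Severity.ERROR, Severity.FATAL}),
) -> Dict[str, List[str]]:
    """Return {key: messages} for blocking results, or {} if there are none."""
    messages = [r.message for r in results if r.severity in blocking]
    if not messages:
        return {}
    key = doc["id"] if "id" in doc else f"missing id ({count})"
    return {key: messages}


results = [
    Result("'type' is a required property", Severity.ERROR),
    Result("field 'x' is deprecated", Severity.WARNING),
]
with_id = summarize({"id": "nmdc:bsm-11-abc123"}, results, count=0)
without_id = summarize({}, results, count=3)
```

Since the two plugins in use only ever emit error-level results, a filter like this would not change behavior today; it would only matter if a plugin started emitting warnings or info.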

Collaborator

@eecavanna eecavanna left a comment

LGTM! Reviewed from iPhone. I'm happy about this evolution of the validation code.

@pkalita-lbl pkalita-lbl merged commit 301adba into main Oct 20, 2025
@pkalita-lbl pkalita-lbl deleted the issue-1152-linkml-validator branch October 20, 2025 17:29
Development

Successfully merging this pull request may close these issues.

Update data validation to account for storage_units annotations

5 participants