Extend minified ontology usage to Batch-based validation (SCP-5971) #392
BACKGROUND
Currently, all Batch API-based metadata validation is done directly against EBI OLS. While this does ensure that all terms are valid at their time of ingest, it has long represented a single point of failure that can block metadata ingest if the API is down, or if term updates are rolled out and users have not corrected their files. Recent updates in CSFV have moved to an offline minified ontology approach that precompiles the main ontologies used in the SCP convention and makes them available to the client for fast, on-demand validation. This has proven so successful that we are now extending it to all metadata validation.
CHANGES
Now, any ontology that has been minified and is included in the validation/ontologies folder will be used first when attempting to validate an ontology ID or label. The benefits are several: it is faster, much more reliable, and allows us to "freeze" ontologies and update them only when we publish new releases of the ingest pipeline. Two new ontologies are also being added to the minification framework: […] __unit labels

There are some drawbacks: not all supported ontologies are available in the condensed JSON format that minify_ontologies.py needs in order to process terms. However, all of our main metadata columns are covered, most notably all those used as search facets. So instead of continuing to rely on OLS for the remaining ontologies, a new version of the convention (3.0.0) is included in this release that removes validation for those columns. These include: […]

The data will still be written to BigQuery as normal; the only difference is that the terms will not be validated first. As none of these columns are used in dataset search, the risk is minimal. Lastly, the infrastructure for making OLS calls has largely been left in place in case we need to go back to remote validation. All that would be needed is to re-add the ontology URL in the JSON schema.
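For reviewers, the local-first lookup described in this section can be pictured roughly as follows. This is only a sketch under stated assumptions, not the code in this PR: the function name `validate_term`, the in-memory dictionary layout, and the `remote_lookup` callable are all hypothetical, and the sketch treats a minified ontology as authoritative once it is present locally.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

# Hypothetical in-memory form of minified ontologies:
# {ontology name: {term ID: label}}. The real on-disk format may differ.
MinifiedOntologies = Dict[str, Dict[str, str]]


def validate_term(
    ontology: str,
    term_id: str,
    label: str,
    local: MinifiedOntologies,
    remote_lookup: Optional[Callable[[str, str], Optional[str]]] = None,
) -> bool:
    """Return True if term_id exists and its label matches, checking local data first."""
    terms = local.get(ontology)
    if terms is not None:
        # Ontology has been minified: answer from the local copy, no remote call.
        return terms.get(term_id) == label
    if remote_lookup is None:
        # No local copy and no remote validator configured (assumption: treat as invalid).
        return False
    # Fallback path, standing in for the remote OLS call left in place by this PR.
    params = {"ontology": ontology, "term_id": term_id}
    logger.info("Using fallback EBI OLS call with %s", params)
    return remote_lookup(ontology, term_id) == label
```

With a local dictionary such as `{"cl": {"CL:0000236": "B cell"}}`, a call for the `cl` ontology never touches `remote_lookup`, which is the behavior the manual test below is meant to confirm.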
MANUAL TESTING
1. Go to the ingest/validation folder and minify the new ontologies so that no fallback calls to OLS are made: […]
2. Go to the ingest directory and run the command to ingest/validate metadata: […]
3. Confirm that "Using fallback EBI OLS call with {params}" does not appear anywhere in the logs (this message was left in to denote when remote calls are made)
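
The log check could also be scripted rather than done by eye. The helper below is a minimal sketch, not part of this PR: the function name `find_fallback_lines` is hypothetical, and it matches only the fixed prefix of the message because the `{params}` portion varies per call.

```python
from typing import List

# Fixed prefix of the fallback message; the trailing params differ per call.
FALLBACK_PREFIX = "Using fallback EBI OLS call with"


def find_fallback_lines(log_text: str) -> List[str]:
    """Return every log line that reports a fallback call to EBI OLS."""
    return [line for line in log_text.splitlines() if FALLBACK_PREFIX in line]
```

An empty list for the captured log output means no remote calls were made. Where the log output is read from depends on the local setup and is left to the tester.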