@bistline bistline commented Apr 8, 2025

BACKGROUND

Currently, all Batch API-based metadata validation is done directly against EBI OLS. While this does ensure that all terms are valid at their time of ingest, it has long represented a single point of failure that can block metadata ingest if the API is down, or if term updates are rolled out and users have not corrected their files. Recent updates in CSFV have moved to an offline minified ontology approach that precompiles the main ontologies used in the SCP convention and makes those available to the client for fast, on-demand validation. This has proven so successful that we are now extending it to all metadata validation.

CHANGES

Now, any ontology that has been minified and is included in the validation/ontologies folder will be used first when attempting to validate an ontology ID or label. The benefits are several: it is faster, much more reliable, and allows us to "freeze" ontologies and update them only when we publish new releases of the ingest pipeline. Two new ontologies are also being added to the minification framework:

  • ethnicity (hancestro)
  • organism_age (uo) # this actually covers many columns that use __unit labels
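
The "minified first, remote second" lookup order described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the names `load_minified_ontology`, `validate_term`, and `fake_remote` are hypothetical, and the file layout assumed here (one term per line, with ID and label as the first two tab-separated columns) is a guess at what a `.min.tsv.gz` file contains.

```python
import gzip
import os
import tempfile


def load_minified_ontology(path):
    """Read a gzipped TSV into {term_id: label}.

    Assumed layout (hypothetical): ID, label, then optional extra columns.
    """
    terms = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                terms[fields[0]] = fields[1]
    return terms


def validate_term(ontology, term_id, label, minified, remote_lookup):
    """Check an ID/label pair locally when the ontology has been minified;
    otherwise defer to the supplied remote lookup callable."""
    if ontology in minified:
        return minified[ontology].get(term_id) == label
    # Stand-in for a remote OLS request; only reached for non-minified ontologies
    return remote_lookup(ontology, term_id, label)


# Build a one-term stand-in for uo.min.tsv.gz so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "uo.min.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("UO_0000036\tyear\t\n")
    minified = {"uo": load_minified_ontology(path)}

remote_calls = []


def fake_remote(ontology, term_id, label):
    """Record the call instead of contacting a real service."""
    remote_calls.append((ontology, term_id))
    return True


local_ok = validate_term("uo", "UO_0000036", "year", minified, fake_remote)
local_bad = validate_term("uo", "UO_0000036", "month", minified, fake_remote)
# "chebi" is not in the minified set here, so this call goes to the fallback
remote_ok = validate_term("chebi", "CHEBI_15377", "water", minified, fake_remote)
```

In this sketch the two `uo` checks never touch `fake_remote`, which is the property the PR relies on for speed and reliability; only the non-minified ontology reaches the fallback.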

There are some drawbacks: not all supported ontologies are available in the condensed JSON format that minify_ontologies.py needs in order to process terms. However, all of our main metadata columns are covered, most notably all those used as search facets. So instead of continuing to rely on OLS for the remaining ontologies, a new version of the convention (3.0.0) is included in this release that removes validation for the columns that use them. These include:

  • development_stage (mmusdv)
  • gene_perturbation (ogg)
  • geographical_region (gaz)
  • geographical_region__ontology_label (pr)
  • mouse_strain & race (ncit)
  • small_molecule_perturbation (chebi)
  • vaccination (vo)

The data will still be written to BigQuery as normal; the only difference is that the terms will not be validated first. As none of these columns are used in dataset search, the risk is minimal. Lastly, the infrastructure for making OLS calls has largely been left in place in case we need to go back to remote validation. All that would be needed to do so is to re-add the ontology URL in the JSON schema.
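
To illustrate why re-adding the ontology URL would be enough to restore validation, here is a minimal sketch. It assumes that each ontology-validated property in the convention JSON schema carries an `ontology` key holding an OLS URL, and that columns without that key are skipped. The schema fragment, the key name, and the `ontology_columns` helper are illustrative assumptions and are not copied from the actual convention file.

```python
import json

# Hypothetical, heavily trimmed schema fragment for illustration only
schema_text = """
{
  "properties": {
    "disease": {
      "type": "array",
      "ontology": "https://www.ebi.ac.uk/ols/api/ontologies/mondo"
    },
    "vaccination": {"type": "array"},
    "donor_id": {"type": "string"}
  }
}
"""


def ontology_columns(schema):
    """Return {column: ontology URL} for properties that declare an ontology."""
    return {
        name: spec["ontology"]
        for name, spec in schema["properties"].items()
        if "ontology" in spec
    }


schema = json.loads(schema_text)
validated = ontology_columns(schema)

# Re-adding the URL opts the column back in to ontology validation
schema["properties"]["vaccination"]["ontology"] = (
    "https://www.ebi.ac.uk/ols/api/ontologies/vo"
)
revalidated = ontology_columns(schema)
```

Under these assumptions, `vaccination` is written without validation until its URL is restored, after which it would be picked up again alongside `disease`.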

MANUAL TESTING

  1. Initialize your dev environment as normal
  2. Go to the ingest/validation folder and minify the new ontologies so that no fallback calls to OLS are made:
$ cd ingest/validation
$ python minify_ontologies.py 
Fetch ontology: https://github.com/monarch-initiative/mondo/releases/latest/download/mondo.json
Fetch ontology: https://github.com/pato-ontology/pato/raw/master/pato.json
Fetch ontology: https://github.com/obophenotype/ncbitaxon/releases/latest/download/taxslim.json
Fetch ontology: https://github.com/EBISPOT/efo/releases/latest/download/efo.json
Fetch ontology: https://github.com/obophenotype/uberon/releases/latest/download/uberon.json
Fetch ontology: https://github.com/obophenotype/cell-ontology/releases/latest/download/cl.json
Fetch ontology: https://raw.githubusercontent.com/EBISPOT/hancestro/refs/heads/main/hancestro.json
Fetch ontology: https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/refs/heads/master/uo.json
Wrote ontologies/mondo.min.tsv.gz
Wrote ontologies/pato.min.tsv.gz
Wrote ontologies/ncbitaxon.min.tsv.gz
Wrote ontologies/efo.min.tsv.gz
Wrote ontologies/uberon.min.tsv.gz
Wrote ontologies/cl.min.tsv.gz
Wrote ontologies/hancestro.min.tsv.gz
Wrote ontologies/uo.min.tsv.gz
  3. Go back to the ingest directory and run the command to ingest/validate metadata:
$ python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_cell_metadata --cell-metadata-file ../tests/data/annotation/metadata/convention/valid_no_array_v2.0.0.txt --study-accession SCP123 --ingest-cell-metadata --validate-convention --bq-dataset cell_metadata_development --bq-table alexandria_convention
  4. You should see output in the log file that shows minified ontologies are being loaded:
populating minified ontology cl from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/cl.min.tsv.gz
populating minified ontology ncbitaxon from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/ncbitaxon.min.tsv.gz
populating minified ontology mondo from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/mondo.min.tsv.gz
populating minified ontology hancestro from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/hancestro.min.tsv.gz
populating minified ontology pato from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/pato.min.tsv.gz
populating minified ontology uberon from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/uberon.min.tsv.gz
populating minified ontology uo from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/uo.min.tsv.gz
populating minified ontology efo from /Users/bistline/Documents/Python/scp-ingest-pipeline/ingest/validation/ontologies/efo.min.tsv.gz
  5. Confirm the process succeeds, and that you do not see the message "Using fallback EBI OLS call with {params}" anywhere in the logs (this was left in to denote when remote calls are made)
  6. (Optional) You can clean up the rows in BigQuery by running the following command in the BQ Console:
DELETE FROM `cell_metadata_development.alexandria_convention`
WHERE file_id = '5dd5ae25421aa910a723a337'
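
The check that no fallback calls were made can be automated with a short script. This is a sketch only: the `find_fallback_calls` helper and the shortened sample log lines are hypothetical, and the only string taken from this PR is the `Using fallback EBI OLS call` message prefix. Point it at whichever log file your local run produces.

```python
FALLBACK_MARKER = "Using fallback EBI OLS call"


def find_fallback_calls(log_lines):
    """Return (line_number, text) for every line that reports a remote OLS call."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if FALLBACK_MARKER in line
    ]


# Made-up sample lines standing in for a real log file
sample_log = [
    "populating minified ontology cl from ontologies/cl.min.tsv.gz\n",
    "populating minified ontology uo from ontologies/uo.min.tsv.gz\n",
]
hits = find_fallback_calls(sample_log)

# With a real log, pass the open file handle instead (path is an assumption):
# with open("path/to/your/log.txt", encoding="utf-8") as handle:
#     hits = find_fallback_calls(handle)
```

An empty `hits` list means every term was resolved from the minified ontologies; any entries show which lines triggered a remote call.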

@bistline bistline requested a review from eweitz April 8, 2025 20:31

codecov bot commented Apr 8, 2025

Codecov Report

Attention: Patch coverage is 95.45455% with 2 lines in your changes missing coverage. Please review.

Project coverage is 75.25%. Comparing base (dad95b7) to head (147a836).
Report is 6 commits behind head on development.

Files with missing lines                  Patch %   Lines
ingest/validation/validate_metadata.py    95.12%    2 Missing ⚠️

@@               Coverage Diff               @@
##           development     #392      +/-   ##
===============================================
- Coverage        75.95%   75.25%   -0.71%     
===============================================
  Files               30       30              
  Lines             4538     4578      +40     
===============================================
- Hits              3447     3445       -2     
- Misses            1091     1133      +42     
Files with missing lines                  Coverage Δ
ingest/validation/minify_ontologies.py    85.36% <100.00%> (+0.36%) ⬆️
ingest/validation/validate_metadata.py    78.02% <95.12%> (-4.61%) ⬇️

@eweitz eweitz left a comment

Code looks good! This will be a nice robustness (and slight speed) improvement.

@bistline bistline merged commit efcd1f2 into development Apr 9, 2025
5 of 6 checks passed
@bistline bistline deleted the jb-minified-ontology-validation branch April 9, 2025 16:44