Intake-ESM Integration based on #1218 #2690

Draft: wants to merge 15 commits into main

charles-turner-1 commented Mar 13, 2025

Description

  • Add intake-dataset class to load datasets via intake.
  • Update config-developer.yml to include intake datasets.

TODO:

  • Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped. Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent? I've been struggling to find them.
  • Tests - presumably the obvious place to stick these is in tests/unit/test_dataset.py, or is it preferable to add a new test module? I'll hold off writing these until I work out the facets issue.
  • Structure: I've put this in an intake submodule, but I could move it into dataset if that's preferable? This also affects the previous point.

I've requested a review, but this is obviously nowhere near ready on the infrastructure side (tests, etc.). With a couple of pointers in the right direction, that part should come together quickly.

Closes #31

codecov bot commented Mar 13, 2025

Codecov Report

Attention: Patch coverage is 0% with 30 lines in your changes missing coverage. Please review.

Project coverage is 94.92%. Comparing base (217ebac) to head (59d0d02).
Report is 10 commits behind head on main.

Files with missing lines        Patch %   Lines
esmvalcore/config/_intake.py    0.00%     28 Missing ⚠️
esmvalcore/data/__init__.py     0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2690      +/-   ##
==========================================
- Coverage   95.11%   94.92%   -0.20%     
==========================================
  Files         255      257       +2     
  Lines       14999    15029      +30     
==========================================
- Hits        14267    14266       -1     
- Misses        732      763      +31     

Member

@bouweandela bouweandela left a comment

Great to see progress on this @charles-turner-1!

@@ -38,6 +38,34 @@ CMIP6:
SYNDA: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
NCI: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
catalogs:
Member

The plan was to not further extend config-developer, but rather move this to the new configuration that lives in ~/.config/esmvaltool. See #2371 for an example of what we thought the configuration should look like.

- /g/data/oi10/catalog/v2/esm/catalog.json
facets:
# mapping from recipe facets to intake-esm catalog facets
# TODO: Fix these when Gadi is back up
Member

You could also test on DKRZ Levante, the intake catalogs are located at /pool/data/Catalogs/dkrz_cmip6_disk.json

return ([_CACHE[cat_url] for cat_url in catalog_urls], facet_list)


class IntakeDataset(Dataset):
Member

@bouweandela bouweandela Mar 21, 2025

I'm having some reservations about subclassing the Dataset class for this purpose:

  • A typical use case for many of our users will be that they have most data available from a central catalog that is managed by a central administrator, but want to augment that with the ability to download some files themselves. In that case, it is really useful to have the ability to deduplicate (e.g. pick the latest version of a file). I'm not sure if this can be achieved by subclassing the Dataset object.
  • We will likely want to add support for other catalogs as well, e.g. intake-esgf, xcube, and STAC. If we need a new Dataset class for each of these, it may become confusing to users.
  • How will this work from the recipe?

As an alternative, would it be an option to load the available data sources from the configuration / Dataset.session and then make the Dataset.files method loop over the available sources and deduplicate input files?
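
A minimal sketch of that alternative, to make the idea concrete. This is not ESMValCore code: `FoundFile`, `Source`, and `find_files` are hypothetical names, and it assumes each data source can be reduced to a callable that takes facets and returns files tagged with a version string.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class FoundFile:
    """Hypothetical record for one input file reported by a data source."""

    name: str     # file name, used here as the deduplication key
    version: str  # dataset version, assumed to look like "v20200101"
    source: str   # label of the data source that reported the file


# Assumption: a data source is any callable mapping facets to files.
Source = Callable[[dict], Iterable[FoundFile]]


def find_files(facets: dict, sources: Iterable[Source]) -> list[FoundFile]:
    """Query every source and keep only the latest version of each file."""
    latest: dict[str, FoundFile] = {}
    for source in sources:
        for found in source(facets):
            current = latest.get(found.name)
            # Equal-length "vYYYYMMDD" strings order correctly as plain text.
            if current is None or found.version > current.version:
                latest[found.name] = found
    return sorted(latest.values(), key=lambda f: f.name)
```

With one source backed by a central catalog and another backed by locally downloaded files, the same file name reported by both would resolve to whichever copy has the newer version, which is the deduplication behaviour described above.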

Member

bouweandela commented Mar 21, 2025

> Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent?

ESMValCore is quite flexible with what facets it accepts. We have a translation between some of 'our' facets and the official ones in the esmvalcore.esgf.facets module (this is the subset that we use to search for files on ESGF). A few facets are used by ESMValCore for specific purposes such as CMOR checks and fixes (off the top of my head that would be dataset, project, mip, short_name), but others are entirely free-form and only used for finding input files and defining the output file names using the paths described in the config-developer.yml file.
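
As a rough illustration of the kind of translation involved, here is a sketch that maps recipe facets onto catalog column names. The `FACET_MAP` table and `to_catalog_query` helper are hypothetical; the column names on the right are an assumption based on the usual CMIP6 conventions and may not match what a given catalog (for example the one on Gadi) actually uses.

```python
# Hypothetical mapping from recipe facet names to catalog column names.
# Assumption: the catalog follows the usual CMIP6 column naming.
FACET_MAP = {
    "activity": "activity_id",
    "dataset": "source_id",
    "ensemble": "member_id",
    "exp": "experiment_id",
    "grid": "grid_label",
    "institute": "institution_id",
    "mip": "table_id",
    "short_name": "variable_id",
    "version": "version",
}


def to_catalog_query(facets: dict, facet_map: dict = FACET_MAP) -> dict:
    """Translate recipe facets into a catalog query, dropping unmapped facets."""
    return {
        facet_map[name]: value
        for name, value in facets.items()
        if name in facet_map
    }


query = to_catalog_query(
    {
        "project": "CMIP6",  # no catalog column in this sketch, so it is dropped
        "dataset": "ACCESS-ESM1-5",
        "mip": "Amon",
        "short_name": "tas",
    }
)
print(query)
```

The resulting dictionary could then be passed as keyword arguments to the catalog's search call. Free-form facets that have no catalog counterpart are silently ignored here; whether they should raise an error instead is a design choice this sketch leaves open.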

> Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped.

If these are completely determined by the other facets, you can add them automatically using the extra facets facility.

@bouweandela
Member

> Structure: I've put this in an intake submodule,

How about adding a new module called e.g. esmvalcore.data or esmvalcore.data_sources or something similar and adding it as a submodule there? We could also move the esmvalcore.local and esmvalcore.esgf modules there (does not have to be in this pull request). I foresee us adding multiple input data sources in the near future.

@charles-turner-1
Author

Thanks for the review, Bouwe, super helpful! I've only had a skim so far, but I'll get those suggestions incorporated next week.

Successfully merging this pull request may close these issues.

Consider using the intake-esm library