Intake-ESM Integration based on #1218 #2690

charles-turner-1 · 2025-03-13T02:03:36Z

Description

Add intake-dataset class to load datasets via intake.
Update config-developer.yml to include intake datasets.

TODO:

Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped. Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent? I've been struggling to find them.
Tests - presumably the obvious place to stick these is in tests/unit/test_dataset.py, or is it preferable to add a new test module? I'll hold off writing these until I work out the facets issue.
Structure: I've put this in an intake submodule, but I could move it into dataset if that's preferable? This also affects the previous point.

I have requested a review, but this is nowhere near ready on the infrastructure side (tests, etc.). With a couple of pointers in the right direction, that part should come together quickly.

Closes #31

Link to documentation:

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2025-03-13T06:12:34Z

Codecov Report

Attention: Patch coverage is 0% with 30 lines in your changes missing coverage. Please review.

Project coverage is 94.92%. Comparing base (217ebac) to head (59d0d02).
Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
esmvalcore/config/_intake.py	0.00%	28 Missing ⚠️
esmvalcore/data/__init__.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2690      +/-   ##
==========================================
- Coverage   95.11%   94.92%   -0.20%     
==========================================
  Files         255      257       +2     
  Lines       14999    15029      +30     
==========================================
- Hits        14267    14266       -1     
- Misses        732      763      +31

bouweandela

Great to see progress on this @charles-turner-1!

bouweandela · 2025-03-21T07:52:13Z

esmvalcore/config-developer.yml

@@ -38,6 +38,34 @@ CMIP6:
    SYNDA: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
    NCI: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
  input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
+  catalogs:


The plan was to not further extend config-developer, but rather move this to the new configuration that lives in ~/.config/esmvaltool. See #2371 for an example of what we thought the configuration should look like.

bouweandela · 2025-03-21T07:56:45Z

esmvalcore/config-developer.yml

+        - /g/data/oi10/catalog/v2/esm/catalog.json
+      facets:
+        # mapping from recipe facets to intake-esm catalog facets
+        # TODO: Fix these when Gadi is back up


You could also test on DKRZ Levante, where the intake catalogs are located at /pool/data/Catalogs/dkrz_cmip6_disk.json

bouweandela · 2025-03-21T08:04:02Z

esmvalcore/intake/_dataset.py

+    return ([_CACHE[cat_url] for cat_url in catalog_urls], facet_list)
+
+
+class IntakeDataset(Dataset):


I'm having some reservations about subclassing the Dataset class for this purpose:

A typical use case for many of our users will be that they have most data available from a central catalog that is managed by a central administrator, but want to augment that with the ability to download some files themselves. In that case, it is really useful to have the ability to deduplicate (e.g. pick the latest version of a file). I'm not sure if this can be achieved by subclassing the Dataset object.

We will likely want to add support for other catalogs as well, e.g. intake-esgf, xcube, and STAC. If we need a new Dataset class for each of these, it may become confusing to users.

How will this work from the recipe?

As an alternative, would it be an option to load the available data sources from the configuration / Dataset.session and then make the Dataset.files method loop over the available sources and deduplicate input files?
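To make the alternative above concrete, here is a minimal sketch of looping over several data sources and keeping only the latest version of each file. It is not ESMValCore code: `FoundFile`, `find_files`, and the two stand-in sources are hypothetical names, and it assumes that version strings such as `v20200101` sort correctly as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoundFile:
    """A file located by some data source (hypothetical record type)."""

    name: str     # file name, used as the deduplication key
    version: str  # e.g. "v20200101"; assumed to sort correctly as a string
    source: str   # label of the data source that found the file


def find_files(sources, facets):
    """Query every source and keep the latest version of each file name."""
    latest = {}
    for source in sources:
        for found in source(facets):
            current = latest.get(found.name)
            if current is None or found.version > current.version:
                latest[found.name] = found
    return sorted(latest.values(), key=lambda f: f.name)


# Stand-in for a centrally managed catalog (hypothetical).
def central_catalog(facets):
    short_name = facets["short_name"]
    return [
        FoundFile(f"{short_name}_1850-1949.nc", "v20190101", "catalog"),
        FoundFile(f"{short_name}_1950-2014.nc", "v20190101", "catalog"),
    ]


# Stand-in for files the user downloaded themselves (hypothetical).
def local_downloads(facets):
    short_name = facets["short_name"]
    return [FoundFile(f"{short_name}_1850-1949.nc", "v20200101", "local")]


files = find_files([central_catalog, local_downloads], {"short_name": "tas"})
for f in files:
    print(f.name, f.version, f.source)
```

With this shape, adding another kind of catalog means adding another callable to the list rather than another Dataset subclass.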

bouweandela · 2025-03-21T17:59:14Z

Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent?

ESMValCore is quite flexible with what facets it accepts. We have a translation between some of 'our' facets and the official ones in the esmvalcore.esgf.facets module (this is the subset that we use to search for files on ESGF). A few facets are used by ESMValCore for specific purposes such as CMOR checks and fixes (off the top of my head that would be dataset, project, mip, short_name), but others are entirely free-form and only used for finding input files and defining the output file names using the paths described in the config-developer.yml file.
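As an illustration of such a translation, the sketch below renames recipe facets to the column names commonly used in CMIP6 catalogs. The mapping is an assumption for illustration only: it is not copied from esmvalcore.esgf.facets, and a given intake-esm catalog may use different column names. `RECIPE_TO_CATALOG` and `to_catalog_query` are hypothetical names.

```python
# Illustrative mapping from recipe facet names to CMIP6-style catalog columns.
# Assumption: the catalog uses the standard CMIP6 attribute names.
RECIPE_TO_CATALOG = {
    "activity": "activity_id",
    "institute": "institution_id",
    "dataset": "source_id",
    "exp": "experiment_id",
    "ensemble": "member_id",
    "mip": "table_id",
    "short_name": "variable_id",
    "grid": "grid_label",
    "version": "version",
}


def to_catalog_query(facets, mapping=RECIPE_TO_CATALOG):
    """Rename known facets and drop those the catalog has no column for."""
    return {mapping[key]: value for key, value in facets.items() if key in mapping}


query = to_catalog_query(
    {
        "project": "CMIP6",  # no catalog column in this sketch, so it is dropped
        "dataset": "ACCESS-ESM1-5",
        "mip": "Amon",
        "short_name": "tas",
        "exp": "historical",
    }
)
print(query)
```

Free-form facets that have no entry in the mapping are simply ignored here; whether they should instead be passed through unchanged is a design choice this sketch does not settle.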

Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped.

If these are completely determined by the other facets, you can add them automatically using the extra facets facility.
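The idea of filling in facets that follow from other facets can be sketched as below. This only illustrates the concept and is not how the extra facets facility is implemented or configured; `DERIVED_FACETS`, `add_derived_facets`, and the example model and institute names are all hypothetical.

```python
# Hypothetical lookup table: facets that follow from (project, dataset).
DERIVED_FACETS = {
    ("CMIP6", "EXAMPLE-MODEL"): {"institute": "EXAMPLE-INSTITUTE"},
}


def add_derived_facets(facets, table=DERIVED_FACETS):
    """Return a copy of `facets` with derived values filled in.

    Values given explicitly in `facets` take precedence over derived ones.
    """
    derived = table.get((facets.get("project"), facets.get("dataset")), {})
    return {**derived, **facets}


completed = add_derived_facets({"project": "CMIP6", "dataset": "EXAMPLE-MODEL"})
print(completed)
```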

bouweandela · 2025-03-21T18:03:35Z

Structure: I've put this in an intake submodule,

How about adding a new module called e.g. esmvalcore.data or esmvalcore.data_sources or something similar and adding it as a submodule there? We could also move the esmvalcore.local and esmvalcore.esgf modules there (does not have to be in this pull request). I foresee us adding multiple input data sources in the near future.

charles-turner-1 · 2025-03-22T00:44:49Z

Thanks for the review, Bouwe, super helpful! I've only had a skim so far, but I'll get those suggestions incorporated next week.

charles-turner-1 added 10 commits February 4, 2025 10:02

Add recognised intake-esm datastores on NCI systems to config_develop…

c129966

…er.yml, skeleton of intake-esm inclusiion following #1218

Skeleton

b1b76fb

Playing around

dd73d1d

Almost at a working IntakeDataset.load()

ed1676b

Working intake-esm implementation - probably still some kinks to iron…

fa1ea2e

… out

Working with multiple catalogues per project

648f119

Cleanup - mypy & ruff errors

2b91fec

Remove WIP

c7b8ffb

Update depenencies & dev environment

31b35cb

Pre-commit modifications

a8532a5

charles-turner-1 requested a review from bouweandela March 13, 2025 02:03

charles-turner-1 added 3 commits March 13, 2025 11:45

Merge branch 'main' into intake-esm

7e56959

Fixed most of codacy (mypy-strict?) gripes

568cb8d

Fix typo

91fee56

charles-turner-1 requested a review from bettina-gier March 17, 2025 23:31

bouweandela reviewed Mar 21, 2025

charles-turner-1 added 2 commits April 2, 2025 13:19

Beginning to work on Bouwe's comments (WIP)

9d894b9

Updates - restructured esmvalcore/data/intake following Bouwe's sugge…

59d0d02

…stions
