Support wildcards in the recipe and improve support for ancillary variables and dataset versioning #1609

Merged 152 commits on Feb 24, 2023
Commits
12242e6
First draft of dataset class
bouweandela Sep 10, 2021
2e3b339
Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-r…
bouweandela Dec 7, 2021
f4c2d72
Work in progress
bouweandela Dec 14, 2021
696c9f2
Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-r…
bouweandela May 31, 2022
a116f88
Fix error in _dataset.py
bouweandela May 31, 2022
9b28044
Add ancillary variables
bouweandela May 31, 2022
ba2ba53
Make loading a cube work from the Python API
bouweandela May 31, 2022
dca7634
Add an integration test for API load
bouweandela Jun 2, 2022
e5dd013
Move stuff from recipe to dataset
bouweandela Jun 2, 2022
ccf44f1
Some progess
bouweandela Jun 12, 2022
1f3ab2f
Restore previous code
bouweandela Jun 13, 2022
6496f78
Make new recipe format for ancillaries work
bouweandela Jun 13, 2022
679f674
Work in progress
bouweandela Jun 15, 2022
55ad481
Progress
bouweandela Jun 16, 2022
bff4918
... is slow
bouweandela Jun 16, 2022
c5e2430
Improved addition of ancillary variables in recipe
bouweandela Jul 4, 2022
e5a1382
Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-r…
bouweandela Jul 5, 2022
f1e54a4
Fix things that got broken due to lack of tests
bouweandela Jul 5, 2022
9e95091
Remove old config-user.yml code
bouweandela Jul 8, 2022
3c36ecb
Update data finder tests
bouweandela Jul 8, 2022
5d54394
Update CMOR table load tests to use file instead of dict
bouweandela Jul 8, 2022
2d19798
Remove unnessary code
bouweandela Jul 8, 2022
8becb37
Update PreprocessorFile creation and remove test for no longer needed…
bouweandela Jul 8, 2022
7ad9c4e
Work in progress
bouweandela Jul 8, 2022
c0244a8
Fix EMAC and ICON tests
bouweandela Jul 22, 2022
185806b
Use session instead of config-user
bouweandela Jul 22, 2022
dd2cf66
Fix more tests
bouweandela Jul 30, 2022
71f6ee4
Move experimental.config code to esmvalcore._config
bouweandela Jul 31, 2022
c3f6084
Fix more tests
bouweandela Jul 31, 2022
b738a83
Fix more tests
bouweandela Jul 31, 2022
ee81fb9
Remove intermediate save step for derived variables
bouweandela Aug 1, 2022
521d921
Fix more tests
bouweandela Aug 4, 2022
ec4c6c3
Merge branch 'main' into split-recipe
bouweandela Aug 4, 2022
444608c
Fix more tests
bouweandela Aug 4, 2022
dc0ac31
Fix case where attributes is None
bouweandela Aug 4, 2022
7a46259
Fix more tests
bouweandela Aug 4, 2022
1a15bcb
Fix more tests, restore provenance of input files
bouweandela Aug 5, 2022
299d6d1
Update several ancillary variables related tests
bouweandela Aug 5, 2022
755351e
Fix more tests
bouweandela Aug 18, 2022
93d3e57
Fix tests related to legacy ancillary definitions
bouweandela Aug 19, 2022
e333878
Fix more tests
bouweandela Aug 19, 2022
5667eed
Fix dataset tests
bouweandela Aug 19, 2022
22a1086
Reduced risk of race conditions with session
bouweandela Aug 19, 2022
16f22a4
Modernize save unit tests
bouweandela Aug 19, 2022
80480d9
Add test for creating parent directory
bouweandela Aug 19, 2022
5786c5e
Add a test for disabling a preprocessor function
bouweandela Aug 19, 2022
8029478
Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-r…
bouweandela Aug 19, 2022
3daf20f
Improve check for missing ancillary data
bouweandela Aug 19, 2022
3906694
Change Dataset.files to a list of Path or ESGFFile
bouweandela Aug 19, 2022
bb2eff1
Avoid adding ancillaries both through new and legacy
bouweandela Aug 19, 2022
6bb9f3f
Improve missing data handling
bouweandela Aug 19, 2022
799b42b
Improve timerange handling
bouweandela Aug 19, 2022
4f72606
Add a function for reading facets from a path
bouweandela Aug 20, 2022
0aa77d3
Write out dataset version number to filled recipe
bouweandela Aug 29, 2022
449d232
Add support for finding specific versions of datasets
bouweandela Aug 29, 2022
8ed88a1
Nicer interface
bouweandela Aug 30, 2022
5b38d07
WIP: add globs
bouweandela Sep 19, 2022
6853af3
WIP
bouweandela Sep 20, 2022
98f052f
WIP: fix types and tests
bouweandela Oct 7, 2022
7c67836
Search local file versions and globs
bouweandela Oct 11, 2022
8e13b86
Nicer looking filled recipe
bouweandela Oct 11, 2022
ba92cd8
Small improvements
bouweandela Oct 13, 2022
b1b7c8d
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Oct 13, 2022
76a723f
Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-r…
bouweandela Oct 24, 2022
7c7d510
Make esmvalcore.config module public
bouweandela Nov 1, 2022
73c1897
Improve test coverage
bouweandela Nov 4, 2022
dcb21fb
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Nov 14, 2022
f697208
Add API docs
bouweandela Nov 15, 2022
2f5740d
Add a docstring to Dataset
bouweandela Nov 15, 2022
b3966e5
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Nov 22, 2022
83614bb
Add some docstrings
bouweandela Nov 22, 2022
19263bb
Fix documentation
bouweandela Nov 23, 2022
bb40cf5
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Nov 24, 2022
623cb22
Move all code from esmvalcore/_data_finder.py to esmvalcore/local.py
bouweandela Nov 25, 2022
43df242
Rename esmvalcore.types to esmvalcore.typing
bouweandela Nov 25, 2022
fe6b0ed
Add esmvalcore.local
bouweandela Nov 29, 2022
4904753
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into add-loc…
bouweandela Nov 29, 2022
66bc6be
Add tests
bouweandela Nov 29, 2022
fff4c8b
Add docs and remove "session" argument
bouweandela Dec 1, 2022
057f13b
Add docs for `esmvalcore.typing`
bouweandela Dec 1, 2022
952503f
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into add-loc…
bouweandela Dec 6, 2022
89df2b1
Return LocalFile objects from the download method
bouweandela Dec 6, 2022
53f22a9
Undo potentially backward incompatible change
bouweandela Dec 6, 2022
56d16b2
Rename "latestversion" to "version"
bouweandela Dec 6, 2022
787c748
Ignore versions called "latest" and add a test
bouweandela Dec 6, 2022
fe37e82
Add a note that facets are only read from directory structure
bouweandela Dec 6, 2022
971d542
Ignore line too long in docstring
bouweandela Dec 6, 2022
b9e224a
Add a note about the limitations of using timerange in find_files
bouweandela Dec 8, 2022
10625a9
Fix issues reported by @remi-kazeroni and ensure timerange is not set…
bouweandela Dec 8, 2022
098ca88
Smarter filtering of versions called "latest"
bouweandela Dec 9, 2022
4b56d79
Fixed small issues in IPSL-CM6 native model CMORizer
schlunma Dec 9, 2022
152473f
Merge branch 'add-local-module' of github.com:ESMValGroup/ESMValCore …
bouweandela Dec 12, 2022
22a4555
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Dec 13, 2022
94e20e9
Simplify use of session
bouweandela Dec 14, 2022
025085f
Split esmvalcore/_recipe.py
bouweandela Dec 16, 2022
59ca4a7
Remove outdated tests
bouweandela Dec 16, 2022
8cdd9cc
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Dec 16, 2022
c7ee519
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Jan 15, 2023
0c65f6a
Improve backward compatibility
bouweandela Jan 24, 2023
1d7761e
Improve backward compatibility, add docs, fix minor issues
bouweandela Jan 26, 2023
2d40767
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Jan 26, 2023
0cd1f1f
Add a few tests
bouweandela Jan 26, 2023
d3ff6b5
Add more tests
bouweandela Jan 27, 2023
d860d86
Fix test
bouweandela Jan 27, 2023
2ef57e3
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Jan 27, 2023
8ea3d6d
Add more tests and fix minor issues
bouweandela Jan 30, 2023
8bd18b7
Fix test
bouweandela Jan 30, 2023
2f902e1
Address Codacy issues
bouweandela Jan 31, 2023
d7444c8
Undo changes to untested code
bouweandela Jan 31, 2023
ec76d4e
Fix variable name Codacy issue
bouweandela Jan 31, 2023
dec0306
Fix issue with updating timerange and remove outdated code
bouweandela Feb 1, 2023
eae9591
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Feb 1, 2023
eb65960
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Feb 2, 2023
e2530f1
Move derive preprocessor step to after loading
bouweandela Feb 6, 2023
22886c0
Improve filled recipe
bouweandela Feb 6, 2023
d29f109
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Feb 6, 2023
2530969
Add forgotten file
bouweandela Feb 6, 2023
5df32a9
Fix type hints for older Python versions
bouweandela Feb 6, 2023
c814674
Automatically correct CMIP5 fx ensemble
bouweandela Feb 6, 2023
5667fd2
return GA tests on
valeriupredoi Feb 6, 2023
1ae8e03
Automatically fix ensemble for CMIP5 fx for derive input
bouweandela Feb 6, 2023
04a84fd
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Feb 6, 2023
fa8f392
Automatically add missing ancillaries
bouweandela Feb 7, 2023
e84159b
Ensure data is loaded before saving
bouweandela Feb 7, 2023
582bb8a
Ensure data is available at dataset load
bouweandela Feb 7, 2023
57d5782
Fix download tests
bouweandela Feb 8, 2023
56e8abe
Improve reading facets from ESGF search results
bouweandela Feb 8, 2023
a874d52
Fix derivation with custom preprocessor order
bouweandela Feb 8, 2023
21db805
Fix deletion of additional_datasets from variable when using YAML anc…
bouweandela Feb 8, 2023
1f6e811
Be quiet
bouweandela Feb 8, 2023
90185df
Add tests and fix issue with obs4MIPs dirs
bouweandela Feb 9, 2023
301903e
Automatically correct "exp" if it is a list for ancillary variables
bouweandela Feb 9, 2023
493aebf
Merge branch 'improve-esgf-facets' of github.com:ESMValGroup/ESMValCo…
bouweandela Feb 11, 2023
92fd87d
First create all datasets and then add ancillaries
bouweandela Feb 13, 2023
1d222c6
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Feb 13, 2023
8aa4942
Merge branch 'main' of github.com:ESMValGroup/ESMValCore into split-r…
bouweandela Feb 16, 2023
c3fd141
Add option to skip adding ancillaries automatically and make config-u…
bouweandela Feb 17, 2023
5eb7f4f
Fix documentation build
bouweandela Feb 17, 2023
ebd3809
Rename ancillary to supplementary
bouweandela Feb 17, 2023
74dcf4e
Fix documentation build
bouweandela Feb 17, 2023
195457a
Avoid attaching supplementaries from other datasets etc
bouweandela Feb 17, 2023
adf1d75
Fix flake8
bouweandela Feb 17, 2023
3886ef0
Save intermediate results on Dataset.load when enabled
bouweandela Feb 17, 2023
05b1e30
Improve saving of intermediary results
bouweandela Feb 17, 2023
8a9f7f2
Improve filled recipe
bouweandela Feb 17, 2023
7b332c7
Improve the way datasets with partially complete facets are handled
bouweandela Feb 21, 2023
924ed11
Do not complain if facets are missing that can be automatically popul…
bouweandela Feb 21, 2023
ef1d2a6
Improve automatic addition of supplementary datasets
bouweandela Feb 21, 2023
9cee2cf
Avoid inherting version on automatic addition of supplementaries
bouweandela Feb 22, 2023
3c27e2f
Clarify documentation
bouweandela Feb 23, 2023
3776dd0
Address review comments
bouweandela Feb 23, 2023
7004a9d
turn off GA tests
valeriupredoi Feb 24, 2023
174 changes: 167 additions & 7 deletions doc/recipe/overview.rst
@@ -73,8 +73,8 @@ the following:
Recipe section: ``datasets``
============================

-The ``datasets`` section includes dictionaries that, via key-value pairs, define standardized
-data specifications:
+The ``datasets`` section includes dictionaries that, via key-value pairs or
+"facets", define standardized data specifications:

- dataset name (key ``dataset``, value e.g. ``MPI-ESM-LR`` or ``UKESM1-0-LL``).
- project (key ``project``, value ``CMIP5`` or ``CMIP6`` for CMIP data,
@@ -114,6 +114,162 @@ For example, a datasets section could be:
- {dataset: HadGEM3-GC31-MM, project: CMIP6, exp: dcppA-hindcast, ensemble: r1i1p1f1, sub_experiment: s2000, grid: gn, start_year: 2000, end_year: 2002}
- {dataset: BCC-CSM2-MR, project: CMIP6, exp: dcppA-hindcast, ensemble: r1i1p1f1, sub_experiment: s2000, grid: gn, timerange: '*'}

.. _dataset_wildcards:

Automatically populating a recipe with all available datasets
-------------------------------------------------------------

It is possible to use :obj:`glob` patterns or wildcards for certain facet
values, to make it easy to find all available datasets locally and/or on ESGF.
Note that ``project`` cannot be a wildcard.

The facet values for local files are retrieved from the directory tree where the
directories represent the facet values.
Reading facet values from file names is not yet supported.
See :ref:`CMOR-DRS` for more information on this kind of file organization.

When (some) files are available locally, the tool will not automatically look
for more files on ESGF. To populate a recipe with all available datasets from
ESGF, ``offline`` should be set to ``false`` and ``always_search_esgf`` should
be set to ``true`` in the
:ref:`user configuration file<user configuration file>`.
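
As a minimal sketch, the relevant part of the user configuration file would
then contain the two settings named above (all other settings are omitted
here and keep whatever values they already have):

.. code-block:: yaml

  offline: false
  always_search_esgf: true
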

For more control over which datasets are selected, it is recommended to use
a Python script or `Jupyter notebook <https://jupyter.org/>`_ to compose
the recipe.
See :ref:`/notebooks/composing-recipes.ipynb` for an example.
This is particularly useful when specific relations are required between
datasets, e.g. when a dataset needs to be available for multiple variables
or experiments.

An example recipe that will use all CMIP6 datasets and all ensemble members
which have a ``'historical'`` experiment could look like this:

.. code-block:: yaml

  datasets:
    - project: CMIP6
      exp: historical
      dataset: '*'
      institute: '*'
      ensemble: '*'
      grid: '*'

After running the recipe, a copy specifying exactly which datasets were used
is available in the output directory in the ``run`` subdirectory.
The filename of this recipe will end with ``_filled.yml``.

For the ``timerange`` facet, special syntax is available.
See :ref:`timerange_examples` for more information.

If populating a recipe using wildcards does not work, this is because there
were either no files found that match those facets, or the facets could not be
read from the directory name or ESGF.
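
Wildcards can also be combined with fixed facet values.
As an illustrative sketch (``BCC-ESM1`` and ``gn`` are simply the values used
in other examples in this document, not a recommendation), the following entry
would select all available ensemble members of the ``historical`` experiment
for a single model:

.. code-block:: yaml

  datasets:
    - project: CMIP6
      exp: historical
      dataset: BCC-ESM1
      ensemble: '*'
      grid: gn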

.. _supplementary_variables:

Defining supplementary variables (ancillary variables and cell measures)
------------------------------------------------------------------------

It is common practice to store ancillary variables (e.g. land/sea/ice masks)
and cell measures (e.g. cell area, cell volume) in separate datasets that are
described by slightly different facets.
In ESMValCore, we call ancillary variables and cell measures "supplementary
variables".
Some :ref:`preprocessor functions <Preprocessors>` need this information to
work.
For example, the :ref:`area_statistics<area_statistics>` preprocessor function
needs to know the area of each grid cell in order to compute a correctly weighted
statistic.

To attach these variables to a dataset, the ``supplementary_variables`` keyword
can be used.
For example, to add cell area to a dataset, it can be specified as follows:

.. code-block:: yaml

  datasets:
    - dataset: BCC-ESM1
      project: CMIP6
      exp: historical
      ensemble: r1i1p1f1
      grid: gn
      supplementary_variables:
        - short_name: areacella
          mip: fx
          exp: 1pctCO2

Note that the supplementary variable will inherit the facet values from the main
dataset, so only those facet values that differ need to be specified.
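
To illustrate the inheritance, the supplementary dataset in the example above
would effectively be described by the facets listed below.
This listing is only an illustration of the resulting facet values, not recipe
syntax, and it assumes that the tool adds no further facets on its own:

.. code-block:: yaml

  # Specified under supplementary_variables
  short_name: areacella
  mip: fx
  exp: 1pctCO2      # overrides "historical" from the main dataset
  # Inherited from the main dataset
  dataset: BCC-ESM1
  project: CMIP6
  ensemble: r1i1p1f1
  grid: gn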

.. _supplementary_dataset_wildcards:

Automatically selecting the supplementary dataset
-------------------------------------------------

When using many datasets, it may be quite a bit of work to find out which facet
values are required to find the corresponding supplementary data.
The tool can automatically guess the best matching supplementary dataset.
To use this feature, the supplementary dataset can be specified as:

.. code-block:: yaml

  datasets:
    - dataset: BCC-ESM1
      project: CMIP6
      exp: historical
      ensemble: r1i1p1f1
      grid: gn
      supplementary_variables:
        - short_name: areacella
          mip: fx
          exp: '*'
          activity: '*'
          ensemble: '*'

With this syntax, the tool will search all available values of ``exp``,
``activity``, and ``ensemble`` and use the supplementary dataset that shares the
most facet values with the main dataset.
Note that this behaviour is different from
:ref:`using wildcards in the main dataset <dataset_wildcards>`,
where they will be expanded to generate all matching datasets.
The available datasets are shown in the debug log messages when running a recipe
with wildcards, so if a different supplementary dataset is preferred, these
messages can be used to see what facet values are available.
The facet values for local files are retrieved from the directory tree where the
directories represent the facet values.
Reading facet values from file names is not yet supported.
If wildcard expansion fails, this is because there were either no files found
that match those facets, or the facets could not be read from the directory
name or ESGF.

Automatic definition of supplementary variables
-----------------------------------------------

If an ancillary variable or cell measure is
:ref:`needed by a preprocessor function <preprocessors_using_supplementary_variables>`,
but it is not specified in the recipe, the tool will automatically make a best
guess using the syntax above.
Usually this will work fine, but if it does not, it is recommended to explicitly
define the supplementary variables in the recipe.

To disable this automatic addition, define the supplementary variable as usual,
but add the special facet ``skip`` with value ``true``.
See :ref:`preprocessors_using_supplementary_variables` for an example recipe.
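
A minimal sketch of what this could look like, reusing the dataset from the
examples above and assuming that ``areacella`` is the supplementary variable
that would otherwise be added automatically:

.. code-block:: yaml

  datasets:
    - dataset: BCC-ESM1
      project: CMIP6
      exp: historical
      ensemble: r1i1p1f1
      grid: gn
      supplementary_variables:
        - short_name: areacella
          skip: true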

Saving ancillary variables and cell measures
--------------------------------------------

By default, ancillary variables and cell measures will be removed
from the main variable before saving it to file because they can be as big as
the main variable.
To keep the supplementary variables, disable the preprocessor function that
removes them by setting ``remove_supplementary_variables: false`` in the
preprocessor profile in the recipe.
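
As a sketch, such a preprocessor profile could look as follows.
The profile name ``keep_supplementaries`` is a hypothetical name chosen for
this illustration; a real profile would usually also contain the preprocessor
functions that are actually needed:

.. code-block:: yaml

  preprocessors:
    keep_supplementaries:
      remove_supplementary_variables: false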

Concatenating data corresponding to multiple facets
---------------------------------------------------

It is possible to define the experiment as a list to concatenate two experiments.
Here is an example concatenating the ``historical`` experiment with ``rcp85``:

@@ -130,6 +286,9 @@ In this case, the specified datasets are concatenated into a single cube:
datasets:
- {dataset: CanESM2, project: CMIP5, exp: [historical, rcp85], ensemble: [r1i1p1, r1i2p1], start_year: 2001, end_year: 2004}

Short notation of ensemble members and sub-experiments
------------------------------------------------------

ESMValTool also supports a simplified syntax to add multiple ensemble members from the same dataset.
In the ``ensemble`` key, any element of the form ``(x:y)`` will be replaced with all numbers from x to y (both inclusive),
adding a dataset entry for each replacement. For example, to add ensemble members r1i1p1 to r10i1p1
@@ -152,7 +311,7 @@ Please, bear in mind that this syntax can only be used in the ensemble tag.
Also, note that the combination of multiple experiments and ensembles, like
``exp: [historical, rcp85], ensemble: [r1i1p1, "r(2:3)i1p1"]``, is not supported and will raise an error.

-The same simplified syntax can be used to add multiple sub-experiment ids:
+The same simplified syntax can be used to add multiple sub-experiments:

.. code-block:: yaml

@@ -161,6 +320,9 @@ The same simplified syntax can be used to add multiple sub-experiment ids:

.. _timerange_examples:

Time ranges
-----------

When using the ``timerange`` tag to specify the start and end points, possible values can be as follows:

- A start and end point specified with a resolution up to seconds (YYYYMMDDThhmmss)
@@ -262,17 +424,15 @@ section will include:
- a description of the diagnostic and lists of themes and realms that it applies to;
- an optional ``additional_datasets`` section.
- an optional ``title`` and ``description``, used to generate the title and description
-of the ``index.html`` output file.
+in the ``index.html`` output file.

.. _tasks:

The diagnostics section defines tasks
-------------------------------------
The diagnostic section(s) define the tasks that will be executed when running the recipe.
For each variable a preprocessing task will be defined and for each diagnostic script a
-diagnostic task will be defined. If variables need to be derived
-from other variables, a preprocessing task for each of the variables
-needed to derive that variable will be defined as well. These tasks can be viewed
+diagnostic task will be defined. These tasks can be viewed
in the ``main_log_debug.txt`` file that is produced on every run. Each task has a unique
name that defines the subdirectory where the results of that task are stored. Task
names start with the name of the diagnostic section followed by a '/' and then