More reproducible recipes in ESMValTool #3054

Open
bouweandela opened this issue Feb 27, 2023 · 2 comments

bouweandela commented Feb 27, 2023

It would be interesting to discuss in a wider circle including the @ESMValGroup/scientific-lead-development-team (maybe at a workshop?) where we want to go with these 2 new recipe formats to have a common understanding of which formats should be used for recipes in the main branch. Once this PR is merged (and v2.8 released), two new recipe formats will be allowed:

  • concise recipes with possibly plenty of wildcards;
  • very lengthy recipes containing one key-value pair per line.

I understand these two new formats would be extremely useful to compose recipes and to ease reproducibility by recording dataset versions as stated in the docs with this PR. But I wonder about the readability of the new format, particularly the second one. For example, recipe_impact.yml has 227 lines and recipe_impact_filled.yml 7137 lines. I wonder if we should clarify in our docs which formats could be accepted in the main branch of the Tool and have a policy about that.

Originally posted by @remi-kazeroni in ESMValGroup/ESMValCore#1609 (comment)

@bouweandela

We could allow recipes with wildcards in the ESMValTool repository, but I would strongly recommend having a copy without wildcards stored alongside it for reproducibility. It is impossible to tell what a wildcard recipe was supposed to do if some of the required input data is not available (any more).
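
To illustrate why a wildcard recipe is not reproducible on its own, here is a minimal Python sketch. It does not use ESMValCore; the `expand` helper, the model names `MODEL-A` and `MODEL-B`, and the version strings are all hypothetical. It only shows that the same wildcard request resolves to different datasets depending on which data happens to be available.

```python
from fnmatch import fnmatch


def expand(facets, available):
    """Return the available datasets matching a (possibly wildcarded) request.

    Hypothetical helper for illustration only, not the ESMValCore implementation.
    """
    return [
        d for d in available
        if all(fnmatch(str(d.get(k, "")), str(v)) for k, v in facets.items())
    ]


# A request with a wildcard in the dataset facet.
request = {"project": "CMIP6", "dataset": "*", "exp": "historical"}

# Hypothetical data holdings at two points in time.
available_2023 = [
    {"project": "CMIP6", "dataset": "MODEL-A", "exp": "historical", "version": "v1"},
    {"project": "CMIP6", "dataset": "MODEL-B", "exp": "historical", "version": "v1"},
]
available_later = available_2023[:1]  # MODEL-B is no longer available

names_2023 = [d["dataset"] for d in expand(request, available_2023)]
names_later = [d["dataset"] for d in expand(request, available_later)]
```

With the second set of holdings the recipe silently uses fewer datasets, and nothing in the wildcard recipe itself records that `MODEL-B` was ever part of the analysis. A copy without wildcards keeps that information.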

> But I wonder about the readability of the new format, particularly the second one. For example, recipe_impact.yml has 227 lines and recipe_impact_filled.yml 7137 lines.

I suspect there is room for improvement in the way the _filled recipe is written. The current recipe_impact.yml does not even have wildcards, so I would think that even with explicit versions and supplementary_variables, it should be possible to store it in a relatively compact way.
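
One possible direction, sketched in Python under loud assumptions: facets that are identical for every dataset could be written once, with only the differing facets (such as `dataset` and `version`) listed per entry. The `factor_common` function and the example facets are hypothetical and do not describe how ESMValCore writes the _filled recipe.

```python
def factor_common(datasets):
    """Split a list of facet dicts into (shared facets, per-dataset remainders).

    Hypothetical helper for illustration only.
    """
    if not datasets:
        return {}, []
    # A facet is shared if every dataset has it with the same value.
    common = {
        k: v for k, v in datasets[0].items()
        if all(k in d and d[k] == v for d in datasets[1:])
    }
    # Keep only the facets that differ between datasets.
    rest = [{k: v for k, v in d.items() if k not in common} for d in datasets]
    return common, rest


# Hypothetical fully expanded dataset entries.
datasets = [
    {"project": "CMIP6", "exp": "historical", "mip": "Amon",
     "dataset": "MODEL-A", "version": "v20190101"},
    {"project": "CMIP6", "exp": "historical", "mip": "Amon",
     "dataset": "MODEL-B", "version": "v20190202"},
]

common, rest = factor_common(datasets)

# The compact form loses nothing: merging it back restores the original list.
restored = [{**common, **r} for r in rest]
```

Here the three shared facets are stored once and each dataset entry shrinks to two keys, while the explicit versions are preserved.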

bouweandela commented Feb 27, 2023

In the longer term, it would be best if we could add the version (and supplementary_variables if applicable) to all datasets of all recipes in the ESMValTool repository. That will make the dataset part of the recipe reproducible. Of course, we would need to make sure that this process is (almost) automatic and that the recipe remains readable.

It will also make regression tests (comparing against a previous run of the same recipe) much easier, because then at least we are sure that the input data did not change. Having the filled recipe already helps with this: we can compare the filled recipes of the two runs to see whether the input data changed.
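
A minimal sketch of such a comparison, in Python. It assumes the dataset entries of each filled recipe have already been loaded into plain lists of dictionaries; a real filled recipe is a YAML file with datasets nested under diagnostics and variables, so this is a simplified stand-in. The `compare_versions` helper and the example values are hypothetical.

```python
def dataset_key(d):
    """Identify a dataset by all of its facets except the version."""
    return tuple(sorted((k, str(v)) for k, v in d.items() if k != "version"))


def compare_versions(run_a, run_b):
    """Return (facets, version_a, version_b) for every dataset whose version
    differs between two runs, or which is present in only one of them."""
    a = {dataset_key(d): d.get("version") for d in run_a}
    b = {dataset_key(d): d.get("version") for d in run_b}
    return [
        (dict(key), a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    ]


# Hypothetical dataset entries taken from the filled recipes of two runs.
run_a = [
    {"dataset": "MODEL-A", "mip": "Amon", "short_name": "tas", "version": "v20190101"},
    {"dataset": "MODEL-B", "mip": "Amon", "short_name": "tas", "version": "v20190202"},
]
run_b = [
    {"dataset": "MODEL-A", "mip": "Amon", "short_name": "tas", "version": "v20190101"},
    {"dataset": "MODEL-B", "mip": "Amon", "short_name": "tas", "version": "v20200303"},
]

changes = compare_versions(run_a, run_b)
```

An empty `changes` list would mean both runs used the same dataset versions, so any difference in the output has to come from somewhere other than the input data.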

bouweandela changed the title from "Reproducible recipes in ESMValTool" to "More reproducible recipes in ESMValTool" on Feb 27, 2023