Commit: Start of Dask tutorial

bouweandela committed May 29, 2024
1 parent dead7a3 commit 49dc092
Showing 2 changed files with 234 additions and 0 deletions.
167 changes: 167 additions & 0 deletions _episodes/11-dask-configuration.md
@@ -0,0 +1,167 @@
---
title: "Dask Configuration"
teaching: 10
exercises: 10
compatibility: ESMValCore v2.10.0

questions:
- What is the [Dask](https://www.dask.org/) configuration file and how should I use it?

objectives:
- Understand the contents of the ``dask.yml`` file
- Prepare a personalized ``dask.yml`` file
- Configure ESMValCore to use these Dask settings

keypoints:
- The ``dask.yml`` file tells ESMValCore how to configure Dask.
- "``client`` can be used to an already running Dask cluster."
- "``cluster`` can be used to start a new Dask cluster for each run."
- "The Dask default scheduler can be configured by editing the files in ~/.config/dask."

---

## The Dask configuration file

The preprocessor functions in ESMValCore use the
[Iris](https://scitools-iris.readthedocs.io) library, which in turn uses Dask
Arrays to be able to process datasets that are larger than the available memory.
It is not necessary to understand exactly how these work in order to use
ESMValTool, but if you are interested there is a
[Dask Array Tutorial](https://tutorial.dask.org/02_array.html) as well as a
[guide to "Lazy Data"](https://scitools-iris.readthedocs.io/en/stable/userguide/real_and_lazy_data.html)
available. "Lazy data" is the term the Iris library uses for Dask Arrays.

The most important concept to understand when using Dask Arrays is the concept
of a Dask "worker". With Dask, computations are run in parallel by Python
processes or threads called "workers". These could be on the
same machine that you are running ESMValTool on, or they could be on one or
more other computers. Dask workers typically require 2 to 4 gigabytes of
memory (RAM) each. To avoid running out of memory, it is important to use
only as many workers as your computer(s) have memory for. Both ESMValCore and
Dask provide configuration files in which you can set the number of workers.

In order to distribute the computations over the workers, Dask makes use of a
"scheduler". There are two different schedulers available. The default
scheduler can be good choice for smaller computations that can run
on a single computer, while the scheduler provided by the Dask Distributed
package is more suitable for larger computations.
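
The default scheduler is not configured through ESMValCore but through the
standard Dask configuration files in ``~/.config/dask``. As a minimal,
hypothetical sketch, assuming the standard Dask configuration keys and a file
name of your choice such as ``~/.config/dask/dask.yml``:

```yaml
# Example content of a file in ~/.config/dask/, read by the Dask default scheduler.
# The keys shown here are assumptions based on the regular Dask configuration system.
scheduler: threads  # run computations in a local thread pool
num_workers: 4      # number of threads to use for the computations
```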

> ## On using ``max_parallel_tasks``
>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> With the Dask Distributed scheduler, all the tasks running in parallel
> can use the same workers, but with the default scheduler each task will
> start its own workers. For recipes that process large datasets, it is usually
> beneficial to set ``max_parallel_tasks: 1`` (shown below), while for recipes
> that process many small datasets it can be beneficial to increase this number.
>
{: .callout}
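
For reference, the relevant line in ``~/.esmvaltool/config-user.yml`` could
look like this (a minimal sketch, using the conservative value recommended
above for recipes with large datasets):

```yaml
# In ~/.esmvaltool/config-user.yml
max_parallel_tasks: 1  # run one preprocessing task at a time
```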

## Starting a Dask distributed cluster

Let's start the tutorial by configuring ESMValCore so it runs its
computations using a single Dask worker with 2 threads.

We use a text editor called ``nano`` to edit the configuration file:

~~~bash
nano ~/.esmvaltool/dask.yml
~~~

Any other editor can be used, e.g. many systems have ``vi`` available.

This file contains the settings for:

- Starting a new cluster of Dask workers
- Or alternatively: connecting to an existing cluster of Dask workers (see the sketch below)
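
If you already have a Dask cluster running, the ``client`` section of this
file tells ESMValCore to connect to it instead of starting a new one. A
minimal sketch, where the address is a placeholder for that of your own
scheduler:

```yaml
client:
  address: '127.0.0.1:8786'  # placeholder: address of an already running Dask scheduler
```

In this tutorial, however, we will start a new cluster.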

Add the following content to the file ``~/.esmvaltool/dask.yml``:

```yaml
cluster:
  type: distributed.LocalCluster
  n_workers: 1
  threads_per_worker: 2
  memory_limit: 4GiB
```

This tells ESMValCore to start a cluster of one worker that can use 4
gigabytes (GiB) of memory and run computations using 2 threads. For a more
extensive description of the available arguments and their values, see
[``distributed.LocalCluster``](https://distributed.dask.org/en/stable/api.html#distributed.LocalCluster).

To see this configuration in action, we will run a version of
[recipe_easy_ipcc.yml](https://docs.esmvaltool.org/en/latest/recipes/recipe_examples.html)
with just two datasets. Download the recipe
[here](../files/recipe_easy_ipcc_short.yml) and run it with the command:

~~~bash
esmvaltool run recipe_easy_ipcc_short.yml
~~~

After finding and downloading all the required input files, this will start
the Dask scheduler and workers required for processing the data. A message
that looks like this will appear on the screen:

```
2024-05-29 12:52:38,858 UTC [107445] INFO Dask dashboard: http://127.0.0.1:8787/status
```

Open the Dashboard link in a browser to see the Dask Dashboard website.
When the recipe has finished running, the Dashboard website will stop working.
The top left panel shows the memory use of each of the workers, the panel on
the right shows one row for each thread that is doing work, and the panel at
the bottom shows the progress.

> ## Explore what happens if workers do not have enough memory
>
> Reduce the amount of memory that the workers are allowed to use to 2GiB and
> run the recipe again. Note that the bars representing the memory use turn
> orange as the worker reaches the maximum amount of memory it is
> allowed to use and starts 'spilling' (writing data temporarily) to disk.
> The red blocks in the top right panel represent time spent reading/writing
> to disk.
>
>> ## Solution
>>
>> We use the `memory_limit` entry in the `~/.esmvaltool/dask.yml` file to set
>> the amount of memory allowed to 2 gigabytes:
>>
>> ```yaml
>> cluster:
>>   type: distributed.LocalCluster
>>   n_workers: 1
>>   threads_per_worker: 2
>>   memory_limit: 2GiB
>> ```
>>
> {: .solution}
{: .challenge}


> ## Tune the configuration to your own computer
>
> Look at how much memory you have available on your machine (run the command
> ``grep MemTotal /proc/meminfo`` on Linux), set the ``memory_limit`` back to
> 4 GiB and increase the number of Dask workers so they use the total amount
> of memory available, minus a few gigabytes for your other work.
>
>> ## Solution
>>
>> For example, if your computer has 16 GiB of memory, it can comfortably use
>> 12 GiB of memory for Dask workers, so you can start 3 workers with 4 GiB
>> of memory each.
>> Use the `n_workers` entry in the `~/.esmvaltool/dask.yml` file to set the
>> number of workers to 3.
>>
>> ```yaml
>> cluster:
>>   type: distributed.LocalCluster
>>   n_workers: 3
>>   threads_per_worker: 2
>>   memory_limit: 4GiB
>> ```
>>
> {: .solution}
{: .challenge}

{% include links.md %}
67 changes: 67 additions & 0 deletions files/recipe_easy_ipcc_short.yml
@@ -0,0 +1,67 @@
documentation:
  title: Easy IPCC
  description: Reproduce part of IPCC AR6 figure 9.3a.
  references:
    - fox-kemper21ipcc
  authors:
    - kalverla_peter
    - andela_bouwe
  maintainer:
    - andela_bouwe

preprocessors:
  easy_ipcc:
    custom_order: true
    anomalies:
      period: month
      reference:
        start_year: 1950
        start_month: 1
        start_day: 1
        end_year: 1979
        end_month: 12
        end_day: 31
    area_statistics:
      operator: mean
    annual_statistics:
      operator: mean
    convert_units:
      units: 'degrees_C'
    ensemble_statistics:
      statistics:
        - operator: mean
    multi_model_statistics:
      statistics:
        - operator: mean
        - operator: percentile
          percent: 17
        - operator: percentile
          percent: 83
      span: full
      keep_input_datasets: false
      ignore_scalar_coords: true

diagnostics:
  AR6_Figure_9.3:
    variables:
      tos_ssp585:
        short_name: tos
        exp: ['historical', 'ssp585']
        project: CMIP6
        mip: Omon
        preprocessor: easy_ipcc
        timerange: '1850/2100'
      tos_ssp126:
        short_name: tos
        exp: ['historical', 'ssp126']
        project: CMIP6
        mip: Omon
        timerange: '1850/2100'
        preprocessor: easy_ipcc
    scripts:
      Figure_9.3a:
        script: examples/make_plot.py

datasets:
  - {dataset: ACCESS-CM2, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CESM2, ensemble: r4i1p1f1, grid: gn}
