inbo · falkmielke · Jan 28, 2025 · Jan 28, 2025 · Jan 28, 2025 · Jan 29, 2025
diff --git a/content/tutorials/development_docker/index.md b/content/tutorials/development_docker/index.md
@@ -1,10 +1,10 @@
 ---
-title: Building Containers with Docker and Podman
+title: Containers with Docker and Podman
 description: Introduction to containerization and the practical use of Docker-like tools.
-date: "2025-02-11"
+date: "2025-02-20"
 authors: [falkmielke]
 categories: ["development", "open science"]
-tags: ["development", "open science"]
+tags: ["development", "open science", "docker", "containers"]
 number-sections: false
 params:
   math: true
@@ -96,11 +96,14 @@ This is why the rest of this tutorial will focus on terminal access.
 On the Windows terminal or Linux shell, you can install `docker` as a terminal tool.
 
 {{% callout note %}}
-On Windows, this comes bundled with the App[^1]; the steps below are not necessary.
-However, note that you need to run a terminal *as administrator*.
+On Windows, this comes bundled with the App; the steps below are not necessary.
+There might be ways to get around the Desktop App and facilitate installation, either via WSL2 or using [a Windows package manager called Chocolatey](https://en.wikipedia.org/wiki/Chocolatey).
+
+Either way, note that you need to run the Docker app or `docker` in a terminal *as administrator*.
+
 {{% /callout %}}
 
-More info on the debian installation [can be found here](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository).
+More info about the installation on Debian-based or Ubuntu Linux systems [can be found here](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository).
 The procedure requires you to add an extra repository, [some caution is warranted](https://wiki.debian.org/DontBreakDebian).
 
 ``` sh
@@ -124,7 +127,7 @@ For this change to take effect, log off and log in again and restart the Docker
 Containers are managed by a system task ("service" and "socket") which need to be started.
 Most likely, your Linux uses `systemd`.
 Your system can start and stop that service automatically, by using `systemctl enable <...>`.  
-However, due to [diverse](https://docs.docker.com/engine/security) [security](https://github.com/moby/moby/issues/9976) [pitfalls](https://snyk.io/blog/top-ten-most-popular-docker-images-each-contain-at-least-30-vulnerabilities), it is good practice to **not keep it enabled** permanently on your system.
+However, due to [diverse](https://docs.docker.com/engine/security) [security](https://github.com/moby/moby/issues/9976) [pitfalls](https://snyk.io/blog/top-ten-most-popular-docker-images-each-contain-at-least-30-vulnerabilities), it is good practice to **not keep it enabled** permanently on your system (unless, of course, you use it all the time).
 
 On a `systemd` system, you can start and stop Docker on demand via the following commands (those will ask you for `sudo` authentification if necessary).
 
@@ -155,7 +158,14 @@ Docker is about assembling and working in containers.
 "Living" in containers.
 Or, rather, you can think of this as living in a ["tiny home", or "mobile home"](https://parametric-architecture.com/tiny-house-movement).
 Let's call it a fancy caravan.
-The good thing is that you get to pick a general design and to choose all details of the interior.
+The good thing is that at least you get to pick a general design and to choose all details of the interior.
+
+<figure>
+<img src="../../images/tutorials/development_docker/docker_metaphor_tiny_space.jpg" alt="Black/white image of a tiny home as a metaphor for software containerization." />
+<figcaption aria-hidden="true">A tiny home close to "Gare Maritime", Brussels, February 2025.</figcaption>
+</figure>
+
+
 
 The best thing: if you feel like you do not have the cash, time, or talent to build your own home, you can *of course* use someone else's.
 There are a gazillion **Docker images available for you** on [Docker Hub](https://hub.docker.com).
@@ -182,7 +192,7 @@ If it does not find the resources locally, Docker will download and extract the
 docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD rocker/rstudio
 ```
 
--   The `--rm` flag makes the Docker image non-permanent, i.e. disk space will be freed after you close the container (<a href="#sec-permanence" class="quarto-xref">Section 2.5</a>).
+-   The `--rm` flag makes the Docker container non-permanent, i.e. disk space will be freed after you close the container (<a href="#sec-permanence" class="quarto-xref">Section 2.5</a>).
 -   The port specified at `-p` is the one you use to access this local container server (the `-p` actually maps host- and container ports). You have to specify it explicitly, otherwise the host system will not let you pass (`:gandalf-meme:`).
 -   The `-e` flag allows you to specify environment variables, in this case used to set a password for the RStudio server. But if you do not specify one, a random password will be generated and displayed upon startup (read the terminal output).
 
@@ -231,12 +241,13 @@ This is a simple and quick way to run R and RStudio in a container.
 
 However, there are limitations:
 
-{{% callout emphasize %}}
+{{% callout note %}}
 -   You have to live with the R packages provided in the container, or otherwise install them each time you access it...
 -   ... unless you make your container permanent by omitting the `--rm` option. Note that this will cost considerable disk space, will not transfer to other computers (the original purpose of Docker), and demand occasional updates (<a href="#sec-permanence" class="quarto-xref">Section 2.5</a>).
 -   You could alternatively add `--pull always` to `docker run`, which will check and pull new versions.
 -   Speaking of updates: it is good practice to keep software up to date. Occasionally update or simply re-install your Docker image and R packages to get the latest versions.
 -   You should make sure that the containers are configured correctly and securely. This is especially important with server components which expose your machine to the internet.
+-   Because most containers contain a Linux system, user permissions are taken seriously, and the consequences might be confusing. There are guides online ([e.g. here](https://labex.io/tutorials/docker-how-to-handle-permissions-in-docker-415866)); there are example repositories (like the author's own struggle [here](https://github.com/inbo/containbo?tab=readme-ov-file#understanding-volumes) and [here](https://github.com/inbo/containbo/tree/main/emacs)); base images are well set up and one can normally get by with default users.
 -   There is a performance penalty from using containers: in inaccurate laymans' terms, they emulate (parts of a) "computer" inside your computer.
 {{% /callout %}}
 
@@ -309,10 +320,11 @@ cat ~/test.txt
 ```
 
 will return:
-\> cat: /root/test.txt: No such file or directory
+
+> cat: /root/test.txt: No such file or directory
 
 This behavior is desired (in the second workflow above): if you start up a fresh environment each time you work in Docker, you **assure that your work pipeline is independent of prior changes on the system**.
-Whether this makes sense as a workflow has to be evaluated with respect to with hard drive space requirement, updates, the option to build upon a customized Dockerfile, reproducibility potential.
+Whether this makes sense as a workflow has to be evaluated with respect to hard drive space requirement, updates, the option to build upon a customized Dockerfile, reproducibility potential.
 
 You can "link in" folders for working files (note how you have to specify the full path to `new_home`, and that this container uses the root user by default):
 
@@ -338,7 +350,7 @@ But it also pays off in complicated server setups and distributed computing.
 
 A standardized container from [Docker Hub](https://hub.docker.com) is a good start.
 However, you will probably require personalization.
-As a use case, imagine you would like to have an RStudio server which comes with relevant inbo packages pre-installed (e.g. [`inbodb`](https://inbo.github.io/inbodb), [`watina`](https://inbo.github.io/watina); *cf.* [contaINBO](https://github.com/inbo/contaINBO)).
+As a use case, imagine you would like to have an RStudio server which comes with relevant inbo packages pre-installed (e.g. [`inbodb`](https://inbo.github.io/inbodb), [`watina`](https://inbo.github.io/watina); *cf.* [the containbo repository](https://github.com/inbo/containbo)).
 
 I will return to this use case below.
 To explore the general workings of `docker build`, let us turn to more web-directed tasks for a change.
@@ -452,46 +464,81 @@ We have used an existing image and added `flask` on top of it.
 This works via writing a Dockerfile and building an image.
 {{% /callout %}}
 
-## Multiple Images: `compose` Versus `build`
+## Multiple Images: `compose` *versus* `build`
 
 The above works fine for most cases.
 However, if you want to assemble and combine multiple images, or build on base images from multiple sources, you need a level up.
 
 In that case `docker compose` is [the way to go](https://docs.docker.com/compose/gettingstarted).
-On Debian, this extra functionality comes with the `docker-compose-plugin`.
+On Debian or Ubuntu, this extra functionality comes with the `docker-compose-plugin`.
 I did not have the need to try this out, yet, but will return here if that changes.
 
-## Confusion with Version Control and Version Management
+## Relation to Version Control and Version Management
 
 Back to the initial paradigma of reproducibility:
 *What exactly is the Open Science aspect of containerization?*
 
 This question might have led to some confusion, and I would like to throw in a paragraph of clarification.
+A crucial distinction lies between the preparation of *Dockerfiles* (i.e. build instructions) and the preservation of *images* (i.e. end products of a build process).
+
 
-One purpose of a container may be that you document the exact components of your system environment.
-You might start at a base image (e.g. a `rocker`) and add all necessary software via a Dockerfile.
+One purpose of a Dockerfile may be that you document the exact components of your system environment.
+You start at a base image (e.g. a `rocker`) and add additional software via Dockerfile layers.
 This is good practice, and encouraged: if you publish an analysis, provide a tested container recipe with it.
 
-However, this does not solve the problem of version conflicts.
-Documenting the versions of packages you used is an extra step, for which [other tools are available](https://doi.org/10.1038/d41586-023-01469-0).
+However, this alone does not solve the problem of version conflicts and deprecation.
+Documenting the versions of packages you used is an extra step, for which [other tools are available](https://doi.org/10.1038/d41586-023-01469-0):
 
--   Version control such as `git` will track the changes within your own scripts and texts.
 -   It is good practice to report the exact versions of the software used upon publication ([see here, for example](https://arca-dpss.github.io/manual-open-science/requirements-chapter.html)).
+-   Version control such as `git` will track the changes within your own texts, scripts, even version snapshots and Dockerfiles.
+-   Finally, Docker images can serve as a snapshot of a (virtual) machine on which your code would run.
 
-The first point, **version control**, is a fantastic tool to enable open science, and avoid personal trouble.
+{{% callout emphasize %}}
+The simple rule of thumb is: use all three methods, ideally all the time.
+
+Virtual environments.
+Version control.
+Snapshots.
+
+Get used to them.
+They are easy.
+They will save you time and trouble almost immediately.
+{{% /callout %}}
+
+
+But unless you use them already, you might require some starting points and directions: here we go.
+The second point, **version control**, is a fantastic tool to enable open science, and avoid personal trouble.
 You will [find starting points and help in other tutorials on this website](https://tutorials.inbo.be/tags/git).
-The second point, version documentation, is ideally handled by **virtual environments**.
+It might have a steep learning curve, yet [there](https://rstudio.github.io/cheatsheets/git-github.pdf) [are](https://www.sourcetreeapp.com) [fantastic](https://magit.vc) [tools](https://www.sublimemerge.com) to get you started.
+The other point, version documentation, is trivially achieved by manual storage of currently installed versions via `sessionInfo()` in R, or `pip freeze > versions.txt` for Python.
+A small step towards more professionalism is the use of **virtual environments**.
 Those exist for R ([renv](https://rstudio.github.io/renv/articles/renv.html)) or Python ([venv](https://docs.python.org/3/library/venv.html)).
 The `pak` library in R can [handle lock files conveniently](https://pak.r-lib.org/reference/lockfile_install.html) with `pak::lockfile_install()`.
 Then there is the integration of R, Python and system packages in `conda`-like tools ([e.g. micromamba](https://mamba.readthedocs.io/en/latest)).
+There are even system level tools, for example [`nix` and `rix`](https://docs.ropensci.org/rix).
+
+The methods are not mutually exclusive:
+all Dockerfiles, build recipes and scripts to establish virtual environments should generally be subject to version control.
+
+
+However, documenting the exact tools and versions used in a project does not guarantee that these versions will be accessible to future investigators (like oneself, trying to reproduce an analysis five years later).
+This is where **Docker images** come in.
+Docker images are the actual build products which you create from the Dockerfile blueprints by the process of building; containers are then started from them.
+In the "tiny home" metaphor: your "image" is the physical (small, but real, DIY-achievement) home to live in, built from step-by-step instructions.
+Think of a Docker image as a virtual copy of your computer which you store for later re-activation.
+For example, a collection of images for specific analysis pipelines at INBO is preserved at [Docker Hub/inbobmk](https://hub.docker.com/u/inbobmk).
+We consider these "stable" versions because they could be re-activated no matter what crazy future updates will shatter the R community, which enables us to return to all details of previous analyses.
+
+
+Some confusion might arise from the fact that managing these image snapshots is achieved with the same vocabulary as version control: for example, you would ["commit"](https://docs.docker.com/reference/cli/docker/container/commit) updated versions and ["push"](https://docs.docker.com/reference/cli/docker/image/push) them to a container repository.
+
+Even more confusion might arise from the fact that you also find ready-made images online, e.g. on [Docker Hub](https://hub.docker.com), or [Quay](https://quay.io), or elsewhere.
+These provide images of (recent) versions of working environments, intended as starting points for derived containers.
+Hence, be aware of the dual use case of images: (i) the dynamic, universal base image which improves efficiency and (ii) the static, derived, bespoke image which you created for your analysis (shared with the world for reproducibility).
 
-A simple, less effective basic solution to version reproducibility is the manual storage of currently installed versions via `sessionInfo()` in R, or `pip freeze > versions.txt` for Python.
 
-You can find Docker images of (recent) older versions of working environments on Docker Hub.
-You might think that this is how Docker supports version reproducibility.
-However, those will fail to build once the binary dependencies get removed.
-Furthermore, Docker itself does not fix the versions of installed system components by default.
-Ideally, you want to implement **version control and virtual environments within the container**, to be a "full stack open science developer".
+And, once more, those images are not a "holy grail" solution: they are not entirely system independent (e.g. processor architecture), and they might occupy a considerable amount of hard disk space (Dockerfile optimization is warranted).
+Ideally, to be a "full stack open science developer", you want to implement **a mixed strategy** consisting of virtual environments and containers, wrapped in version control and stored in a backup image.
 
 
 <a id="sec-rootless"></a> 
@@ -813,7 +860,6 @@ Your head might be twisting in a swirl of containers by now.
 I hope you find this overview useful, nevertheless.
 Thank you for reading!
 
-[^1]: I saw several ways online to get around the Desktop App, either via WSL2 or using [a windows package manager called Chocolatey](https://en.wikipedia.org/wiki/Chocolatey).
 
 [^2]: I mostly follow [this tutorial](https://jsta.github.io/r-docker-tutorial/02-Launching-Docker.html).
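
The manual version record mentioned in the revised version-management section (`pip freeze > versions.txt`) could also be produced from within Python itself, for example at the end of an analysis script. The following is a minimal sketch using only the standard library module `importlib.metadata`; the function names `freeze_versions` and `write_versions` are hypothetical and not part of the tutorial, and the output only approximates what `pip freeze` prints (editable installs and URL requirements are not treated specially).

```python
from importlib import metadata


def freeze_versions():
    """Return sorted 'name==version' lines for all installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            # Skip installations with missing or unreadable metadata.
            continue
        if name and version:
            pins.add(f"{name}=={version}")
    # Case-insensitive ordering, with the raw string as a tie-breaker.
    return sorted(pins, key=lambda s: (s.lower(), s))


def write_versions(path="versions.txt"):
    """Write the version pins to `path` and return the number of entries."""
    lines = freeze_versions()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("".join(line + "\n" for line in lines))
    return len(lines)


if __name__ == "__main__":
    count = write_versions()
    print(f"recorded {count} packages in versions.txt")
```

Placing such a file under version control next to the Dockerfile would combine the three methods the section recommends; it does not replace a proper lock file from `renv`, `venv`-based workflows, or `pak`.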