add docker+CWL best practices #375

Open
wants to merge 7 commits into
base: main
from
50 changes: 49 additions & 1 deletion src/topics/best-practices.md
@@ -96,6 +96,55 @@ all are required.

- Software containers should be made to be conformant to the ["Recommendations for the packaging and containerizing of bioinformatics software"][containers] (also useful to other disciplines).

The following are recommended practices to keep in mind when running CWL workflows with Docker:

- Make sure you are using the latest version of both CWL and Docker,
so that you have access to the latest features and bug fixes.

- Use meaningful tags on your own Docker images
so you can tell versions of an image apart as it is updated over time.
A tag can reflect the version of the underlying software,
or a version you assign to the Dockerfile itself.
Tags can be manually assigned version numbers (e.g. 1.0, 1.1, 1.2, 2.0),
timestamps (e.g. YYYYMMDD like 20220126), or the hash of a Git commit.

- It is good practice to keep your Dockerfiles in Git, just like your workflow definitions,
because they are also scripts and should be managed and tracked with version control.

- When creating a Dockerfile, it is important to specify the exact version
of the software you want to install and the base image you want to use.
This helps ensure that your Docker image builds are consistent and reproducible.
Additionally, in the `FROM` instruction, specify a tag for the base image;
otherwise it defaults to `latest`, which can change at any time.

- Avoid relying on the `USER` instruction in the Dockerfile
to control which user runs the tool.
By default, cwltool passes the current user ID to `docker run --user`,
which overrides the `USER` instruction,
so the user specified in the Dockerfile
may not be the user that is actually used to run the tool.
If the tool must run as the user set in the image, use the `--no-match-user` cwltool flag
to disable passing the current user ID to `docker run --user`.

- Keep your container images as small as possible;
this speeds up downloads and consumes less storage space.
Also, when using bioinformatics tools, reference data should be supplied externally
(as workflow inputs), rather than including it in the container image.
This way, it is easier to update the reference data without the need to rebuild the Docker image.

- Avoid using the `ENTRYPOINT` instruction in your Dockerfile
because it changes the command line that runs inside the container.
This can cause confusion when the command line that is supplied to the container
and the command that actually runs are different.

- Docker's build cache can save you time during development by
reusing the layer produced by a previous build step, instead of running the step again.
However, this can also cause problems if a file being downloaded changes
but the command remains the same: the cached version of the file will be used
instead of the updated one. To avoid this, pass the `--no-cache` option to `docker build` to force Docker to re-run the steps.
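
Several of the points above can be checked mechanically. The following is a minimal Python sketch, not part of cwltool or Docker: `image_tags` builds the three tag styles described above, and `check_dockerfile` flags base images without a fixed tag as well as `ENTRYPOINT` and `USER` instructions. Both function names, the image name `example/mytool`, and the warning texts are hypothetical, and the parsing is deliberately simple (it ignores line continuations and build arguments).

```python
from datetime import date


def image_tags(image, version, build_date, commit):
    """Return the three tag styles for a (hypothetical) image name."""
    return [
        f"{image}:{version}",                        # manually assigned version
        f"{image}:{build_date.strftime('%Y%m%d')}",  # YYYYMMDD timestamp
        f"{image}:{commit[:7]}",                     # abbreviated Git commit hash
    ]


def check_dockerfile(text):
    """Return warnings for Dockerfile text; illustrative rules only."""
    warnings = []
    stages = set()  # names introduced by "FROM ... AS <name>"
    for number, raw in enumerate(text.splitlines(), start=1):
        parts = raw.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        instruction = parts[0].upper()
        if instruction == "FROM":
            # Drop option flags such as --platform=... to find the image.
            refs = [p for p in parts[1:] if not p.startswith("--")]
            image = refs[0] if refs else ""
            # Only the last path component can carry a tag; a colon earlier
            # in the reference would be a registry port.
            name = image.rsplit("/", 1)[-1]
            exempt = image == "scratch" or image in stages or "@" in image
            if not exempt:
                if ":" not in name:
                    warnings.append(
                        f"line {number}: base image '{image}' has no tag")
                elif name.endswith(":latest"):
                    warnings.append(
                        f"line {number}: base image '{image}' uses 'latest'")
            if len(refs) >= 3 and refs[1].upper() == "AS":
                stages.add(refs[2])
        elif instruction in ("ENTRYPOINT", "USER"):
            warnings.append(
                f"line {number}: avoid {instruction} in images used from CWL")
    return warnings


if __name__ == "__main__":
    # Example values only; substitute your own image name and commit hash.
    for tag in image_tags("example/mytool", "1.2", date.today(), "0123456789abcdef"):
        print(tag)
    for warning in check_dockerfile('FROM debian\nENTRYPOINT ["mytool"]\n'):
        print(warning)
```

A check like this could run in continuous integration next to the Dockerfiles kept in Git. It only reports findings, so deciding whether a particular `USER` or `ENTRYPOINT` is acceptable remains a manual judgment.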

To learn more about creating workflows with Docker,
see this [tutorial](https://doc.arvados.org/rnaseq-cwl-training/08-supplement-docker/index.html).

[containers]: https://doi.org/10.12688/f1000research.15140.1
[apache-license]: https://spdx.org/licenses/Apache-2.0.html
[license-example]: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/workflows/emg-assembly.cwl#L200
@@ -112,4 +161,3 @@ all are required.
%
% - Writing CWL workflows (include existing docs from https://github.com/common-workflow-library/cwl-patterns/blob/main/README.md)
% - FAIR best practices with CWL
% - Docker best practices with CWL - https://github.com/common-workflow-language/common-workflow-language/issues/347