This tutorial explores how containers can provide consistent environments for interactive JupyterLab sessions and non-interactive batch computing, enabling seamless workflows from development to production in HPC environments. The examples are tailored toward use in UVA's HPC environment although the concepts apply generally.
To follow along, you should:
- Have some experience with Linux command-line interfaces and file organization,
- Be familiar with Jupyter notebooks,
- Have basic knowledge of Python,
- Have a basic understanding of software containers. See Container Basics and Research Computing's Container tutorials.
Note: It is assumed that you have access to an HPC system. HPC access and user accounts are typically tied to an allocation, which is a grant of compute resources that allows you to submit and run jobs on the HPC cluster. At UVA, faculty can request allocations through Research Computing. Postdocs, staff, and students can be sponsored through a faculty allocation.
- Overview
- Containers in HPC Environments
- Interactive JupyterLab Sessions on an HPC System
- Running a non-interactive Job
- From Development to Production Environment
- References
Containers package application code/executables and all their dependencies needed to run them. They provide lightweight operating system-level virtualization (as opposed to hardware virtualization provided by virtual machines) and offer portability of applications across different environments. Several container projects are specifically targeted at HPC environments, addressing the unique requirements of high-performance computing systems.
When working with code it is helpful to distinguish between interactive and batch (non-interactive) workloads, and between the development and production phases of a software's life cycle.
| | Development | Production |
|---|---|---|
| Interactive | Prototyping (e.g., Jupyter notebooks, RStudio) | Data exploration and live analysis (e.g., Jupyter notebooks, RStudio) |
| Batch | Testing scripts and analysis at small scale (e.g., scheduled jobs) | Full-scale analysis (e.g., scheduled jobs) |
Development Phase
There is some tension between the fixed nature of container content and the necessary fluidity of program code as it evolves during the development process. A convenient approach is to containerize all dependencies and mount folders with the evolving code base into the container instance at runtime.
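As a minimal sketch of this pattern (the image name `mydeps.sif` and the directory `~/projects/myapp` are placeholders for illustration, not files from this tutorial):

```bash
# Mount an evolving code base from the host into a fixed dependency image.
# "mydeps.sif" and ~/projects/myapp are placeholder names.
module load apptainer

# The host directory ~/projects/myapp appears as /workspace inside the container;
# code edits on the host take effect immediately, with no image rebuild.
apptainer exec --bind ~/projects/myapp:/workspace mydeps.sif python /workspace/train.py
```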
Eventually, the version-controlled code base should be included in a new container image that uses the dependency image as its base layer. This will become the production image.
Production Phase
In the production phase, containers are deployed with the finalized code base included in the image. Both interactive and batch jobs in an HPC environment run on resources allocated by a scheduler, which manages compute node access and resource allocation. The containerized application can be executed consistently across different nodes, ensuring reproducibility and portability of the computational workflow.
Docker requires root-level daemon access, which is typically restricted in HPC environments for security reasons. Several container frameworks have been developed to address this limitation, including:
- Apptainer (formerly Singularity)
- Podman
- Shifter
- Charliecloud
These frameworks are OCI (Open Container Initiative) compatible and can wrap around Docker images at runtime, so existing Docker images can be used without modification. You can therefore develop containers with Docker on your local machine and then run them on HPC systems using these user-space frameworks, which do not require administrative privileges and are well suited for shared computing resources.
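The round trip looks roughly like this (a sketch; the image name is a placeholder):

```bash
# On your workstation: build and publish an image with Docker.
docker build -t myuser/myimage:1.0 .
docker push myuser/myimage:1.0

# On the HPC system: pull and convert the same image with a user-space runtime
# (Apptainer shown here); no root privileges are needed.
module load apptainer
apptainer pull myimage-1.0.sif docker://myuser/myimage:1.0
```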
Relevant for Parallel Computing and DL/ML Workloads:
All of the above frameworks support MPI for parallel computing workloads and provide abstractions that handle GPU hardware access on the host, making them suitable for both traditional HPC workloads and deep learning/machine learning applications that require GPU acceleration.
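As a quick illustration with Apptainer (assuming you are on a GPU node and have the PyTorch image pulled later in this tutorial), the `--nv` flag exposes the host GPU driver to the container:

```bash
module load apptainer

# Check that the GPUs on the host are visible from inside the container
apptainer exec --nv ~/pytorch-2.9.1.sif nvidia-smi

# And that PyTorch inside the container can use them
apptainer exec --nv ~/pytorch-2.9.1.sif python -c "import torch; print(torch.cuda.is_available())"
```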
JupyterLab is an interactive development environment that provides a web-based interface for working with notebooks, code, and data. Its Launcher lets users select a specific app/code environment for each interactive computing session. A code environment is defined by its software packages, which are isolated across different environments. These environments are defined and referred to as kernels. The isolation of kernels allows you to work with different programming languages, libraries, and tools within the same JupyterLab interface.
You can define your own kernels backed by custom environments, including container images. Here we use this specifically for containerized Python environments, but the concept extends to R, Julia, and other supported languages in the Jupyter ecosystem.
In the following steps, we will explore how to define your own kernels backed by custom container images of your choosing. These containerized kernels can then be used for both interactive work in JupyterLab and non-interactive batch job submissions, providing a consistent environment across your workflow.
It is assumed that you have an account and allocation on an HPC system. Log in to a terminal on the HPC login node. On UVA's HPC system you can also open a terminal session in your web browser via the Open OnDemand portal. See UVA HPC login options.
Let's say we want to train an image classifier with PyTorch. Docker Hub hosts a rich collection of PyTorch container images in a variety of versions: https://hub.docker.com/r/pytorch/pytorch/tags. Let's pull PyTorch 2.9.1 with CUDA support so we can utilize GPU acceleration.
```bash
cd    # go to your home directory
git clone https://github.com/UVADS/jlab-hpc-containers.git
module load apptainer
apptainer pull ~/pytorch-2.9.1.sif docker://pytorch/pytorch:2.9.1-cuda12.6-cudnn9-runtime
```
This will download the PyTorch Docker image and convert it into an Apptainer image file pytorch-2.9.1.sif in your home directory on the HPC system. The pytorch-2.9.1.sif file is self-contained and you can move it around like any other file.
Let's check the Python version.
```bash
apptainer exec ~/pytorch-2.9.1.sif python -V
```
And the PyTorch version:
```bash
apptainer exec ~/pytorch-2.9.1.sif python -c "import torch; print(torch.__version__)"
```
In order for JupyterLab to detect and run your Python environment as a kernel, you need to have the ipykernel package installed. Let's run this command to check:
```bash
apptainer exec ~/pytorch-2.9.1.sif python -m ipykernel -V
```
If installed, the ipykernel package will return an output with the version number, e.g.
```
9.6.0
```
In the case of this PyTorch image the ipykernel package is missing. We will fix that in the next steps.
In addition to the ipykernel package inside your environment (i.e. inside your container), we need to tell JupyterLab how to launch the kernel environment. These instructions are provided in a kernel spec file that must be present on the host system.
JupyterLab searches for kernels in the following order:
- User-level directories: `~/.local/share/jupyter/kernels` (or `~/Library/Jupyter/kernels` on macOS) - kernels specific to your user account
- System-wide directories: `/usr/local/share/jupyter/kernels` or `/usr/share/jupyter/kernels` - kernels available to all users
- Environment-specific directories: `share/jupyter/kernels` within active conda/virtual environments
User-level kernels take precedence over system-wide kernels, allowing you to customize your kernel selection without affecting other users. Custom kernels are typically placed in `~/.local/share/jupyter/kernels`.
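To check which kernels JupyterLab currently finds, and from which directory each one is picked up, you can list them (assuming a `jupyter` command is available on the host, e.g. via a loaded module or conda environment):

```bash
# Show all kernel specs visible to Jupyter and their on-disk locations
jupyter kernelspec list
```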
On UVA's HPC system you can run the jkrollout command to create a new JupyterLab kernel that is backed by a container image (courtesy Ruoshi Sun, UVA Research Computing). We provide an augmented version, jkrollout2, in this repo.
Run these commands:
```bash
cd ~/jlab-hpc-containers
bash jkrollout2 ~/pytorch-2.9.1.sif "PyTorch 2.9.1" gpu
```
This will create the kernel specification for the PyTorch container image in `~/.local/share/jupyter/kernels/pytorch-2.9.1`, consisting of two files:
- `kernel.json`
- `init.sh`
In addition, you will have the ipykernel package installed in ~/local/pytorch-2.9.1. Note that it is installed on the host filesystem, not in the container image; this directory will be dynamically mounted when the Jupyter kernel for PyTorch 2.9.1 is launched.
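You can quickly confirm what jkrollout2 created:

```bash
# kernel spec and launch script
ls ~/.local/share/jupyter/kernels/pytorch-2.9.1
# host-side location of the ipykernel installation
ls ~/local/pytorch-2.9.1
```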
Skip to step 4.
If you don't have access to the jkrollout script, you can create the kernel manually using the following steps:
```bash
mkdir -p ~/local/pytorch-2.9.1
apptainer exec --bind $HOME/local/pytorch-2.9.1:$HOME/.local ~/pytorch-2.9.1.sif python -m pip install --user ipykernel
```
The first line sets up a new directory in your home directory. This will be the destination where the ipykernel package will be installed. You can change the name; just make it unique for each container image.
The second line installs the ipykernel package.
Notes:
- The `--bind $HOME/local/pytorch-2.9.1:$HOME/.local` argument will mount the new `$HOME/local/pytorch-2.9.1` directory on the host filesystem and make it available as `$HOME/.local` inside the container instance.
- We choose `$HOME/.local` because that's where Python looks for additional packages by default, and that's also the destination for `pip install --user`.
Let's confirm that ipykernel is now available from within the container.
```bash
apptainer exec --bind $HOME/local/pytorch-2.9.1:$HOME/.local ~/pytorch-2.9.1.sif python -m ipykernel -V
```
Note that we have to include the `--bind $HOME/local/pytorch-2.9.1:$HOME/.local` argument in order for this to work.
Next we'll manually create the kernel following these steps:
```bash
KERNEL_DIR="$HOME/.local/share/jupyter/kernels/pytorch-2.9.1"
mkdir -p $KERNEL_DIR
cd $KERNEL_DIR
```
Inside this new directory we will set up two files, `kernel.json` and `init.sh`. The kernel spec file `kernel.json` is in JSON format and defines the launch command, display name, and display icon in JupyterLab.
kernel.json:
```json
{
  "argv": [
    "/home/mst3k/.local/share/jupyter/kernels/pytorch-2.9.1/init.sh",
    "-f",
    "{connection_file}"
  ],
  "display_name": "PyTorch 2.9.1",
  "language": "python"
}
```
Notes:
- You cannot use env variables and shell expansion for the `argv` arguments. In other words, `~` or `$HOME` will not work to indicate the path to the script.
- Replace `mst3k` with your own username on the system. On UVA's systems that's your UVA computing id.
- We use the `init.sh` script to bundle a few commands, see next step. You can change the name of this launch script.
- You can customize `display_name` as you wish. That's the name that will show up in the JupyterLab UI.
- On UVA's HPC system the connection file will be injected at runtime by the JupyterLab instance. You don't need to define that.
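A quick way to check that the file is valid JSON (assuming a Python interpreter is available on the host, e.g. via a loaded module):

```bash
# Prints the parsed JSON, or an error message if the syntax is broken
python -m json.tool ~/.local/share/jupyter/kernels/pytorch-2.9.1/kernel.json
```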
The `init.sh` file is a short bash script that wraps a few commands together for convenience.
```bash
#!/usr/bin/env bash
. /etc/profile.d/00-modulepath.sh
. /etc/profile.d/rci.sh
nvidia-modprobe -u -c=0
module purge
module load apptainer
apptainer exec --nv --bind $HOME/local/pytorch-2.9.1:$HOME/.local $HOME/pytorch-2.9.1.sif python -m ipykernel $@
```
We need to make the init.sh script executable.
```bash
chmod +x init.sh
```
Notes:
- Lines 2-4 ensure proper setup in case NVIDIA GPUs will be used.
- The `--nv` flag ensures that the apps and libraries inside the container have access to the GPU driver on the host.
- Replace `mst3k` with your username. On UVA's system your username is your computing id.
- Note the execution of `python -m ipykernel $@` inside the container, including the `$@` which will shell expand command line args you may want to pass on. (Not used here.)
- The `--bind $HOME/local/pytorch-2.9.1:$HOME/.local` argument can be dropped if the container image provides the ipykernel package.
When you now launch JupyterLab, you should see the new PyTorch 2.9.1 kernel as a new tile in the Launcher user interface. On UVA's HPC system you can start JupyterLab via Open OnDemand.
It should look similar to this.
For this tutorial we will use a pytorch-example.ipynb notebook to perform image classification on the MNIST dataset. It's been adapted from example code in the PyTorch example repository.
It is assumed that you have an account and allocation on an HPC system. Log in to a terminal on the HPC login node. On UVA's HPC system you can also open a terminal session in your web browser via the Open OnDemand portal. See UVA HPC login options.
We already pulled the Docker image for PyTorch 2.9.1 and saved it as pytorch-2.9.1.sif in our home directory.
For this tutorial we will use an example from the PyTorch example repository to perform image classification on the MNIST dataset.
You can download the code and save it as pytorch-example.py with the curl command:
```bash
curl -o pytorch-example.py https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py
```
In order to run our Python script on the HPC system, we need to write a shell script that specifies the hardware resources and max time limit for the job. On UVA's HPC system job scheduling and resource allocation are handled by Slurm. The Slurm resource directives start with #SBATCH.
pytorch-example.sh:
```bash
#!/bin/bash
#SBATCH --account=<your_allocation>   # replace with your allocation
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00               # 10 minutes

# get example Py script
curl -o pytorch-example.py https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py

module purge
module load apptainer
apptainer exec --nv --bind ~/local/pytorch-2.9.1:$HOME/.local ~/pytorch-2.9.1.sif python pytorch-example.py
```
To avoid cluttering our home directory, we'll change into a jobrun directory in our personal /scratch folder. We'll submit the job from there with these commands:
```bash
mkdir -p /scratch/$USER/jobrun                   # create new directory in scratch
cd /scratch/$USER/jobrun                         # change into that new directory
sbatch ~/jlab-hpc-containers/pytorch-example.sh  # submit the job script
```
We can track the status of the submitted job by running the sacct command. The job will create a directory /scratch/$USER/data into which the MNIST dataset will be downloaded.
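For example (a sketch; adjust the fields to your liking, and use the job ID that sbatch printed):

```bash
# Summary of your recent jobs (one line per job, no job steps)
sacct -X --format=JobID,JobName,Partition,State,Elapsed

# Jobs of yours that are still pending or running
squeue -u $USER
```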
In addition, all stdout and stderr output will be redirected to a slurm-<jobid>.out file in the jobrun directory (by default Slurm writes both streams to the same file).
The end of the slurm-<jobid>.out file should look like this, indicating successful training and testing with an accuracy of ~99%:
```
...
Train Epoch: 14 [55680/60000 (93%)]   Loss: 0.065189
Train Epoch: 14 [56320/60000 (94%)]   Loss: 0.000993
Train Epoch: 14 [56960/60000 (95%)]   Loss: 0.113047
Train Epoch: 14 [57600/60000 (96%)]   Loss: 0.029899
Train Epoch: 14 [58240/60000 (97%)]   Loss: 0.002069
Train Epoch: 14 [58880/60000 (98%)]   Loss: 0.036733
Train Epoch: 14 [59520/60000 (99%)]   Loss: 0.023359

Test set: Average loss: 0.0259, Accuracy: 9925/10000 (99%)
```
The above steps allow you to augment the packages provided by the container with additional packages installed in a dedicated directory on the host filesystem (e.g. ~/local/pytorch-2.9.1). We used this approach to enable Jupyter kernels for a container image that was missing the ipykernel package. In addition, the example notebook, Python script, and job script also existed outside of the container image. This avoids the need to rebuild images for each code iteration and can greatly accelerate the development cycle.
With the ultimate goal of reproducible computing, open science, portability and sharing in mind, it is generally recommended to create a new container image when the development cycle is completed. This entails:
- Version controlling the new code base in GitHub (e.g. the Jupyter notebook, Python scripts, job scripts, etc.)
- Exporting the package list from ~/local/pytorch-2.9.1 to `requirements.txt` and adding it to the version-controlled repository.
- Writing a new Dockerfile that uses the original image (in this example docker://pytorch/pytorch:2.9.1-cuda12.6-cudnn9-runtime) as its base. This Dockerfile should include statements to copy the requirements.txt file and install the needed packages from that list into the new image. The Dockerfile itself should also be added to the repository (see the sketch after this list).
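A minimal sketch of these last two steps, assuming the bind-mount setup used earlier in this tutorial (the Dockerfile contents are an illustration, not a tested recipe):

```bash
# Export the user-installed packages (the ones living under ~/local/pytorch-2.9.1).
# Running pip inside the container with the same bind mount makes them visible.
module load apptainer
apptainer exec --bind $HOME/local/pytorch-2.9.1:$HOME/.local \
    $HOME/pytorch-2.9.1.sif python -m pip freeze --user > requirements.txt

# Write a Dockerfile that layers the requirements and code on top of the original base image.
cat > Dockerfile <<'EOF'
FROM pytorch/pytorch:2.9.1-cuda12.6-cudnn9-runtime
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
COPY . /workspace
WORKDIR /workspace
EOF
```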
Kurtzer, G. M., Sochat, V., & Bauer, M. W. (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE, 12(5), e0177459. https://doi.org/10.1371/journal.pone.0177459 (Apptainer, formerly Singularity)
Priedhorsky, R., Randles, T., & Olivier, S. L. (2017). Charliecloud: Unprivileged containers for user-friendly HPC. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17), Article 50. Association for Computing Machinery. https://doi.org/10.1145/3126908.3126925
Gerhardt, L., Bhimji, W., Canon, S., Fasel, M., Jacobsen, D., Mustafa, M., ... & Yates, B. (2017). Shifter: Containers for HPC. In Cray User Group Conference (CUG '17). https://cug.org/proceedings/cug2017_proceedings/includes/files/pap115s2-file1.pdf
Walsh, D. (2023). Podman in Action. Manning Publications. https://www.manning.com/books/podman-in-action
Sun, R., & Siller, K. (2024). HPC Container Management at the University of Virginia. In Practice and Experience in Advanced Research Computing 2024: Human Powered Computing (PEARC '24), Article 73. Association for Computing Machinery. https://doi.org/10.1145/3626203.3670568
Additional Resources:
- Apptainer - User Guide and Documentation. Apptainer a Series of LF Projects LLC.
- Podman - Podman Documentation. Red Hat, Inc.
- Shifter - Shifter Container System. National Energy Research Scientific Computing Center (NERSC).
- Charliecloud - Charliecloud Documentation. Los Alamos National Laboratory.





